Friday, August 10, 2012

CPU SpeedStep Frequency Scaling Performance Impact due to Transition Latency

Working on HPC systems, I have found that applications which are I/O bound and perform small but consistent reads/writes to disk suffer a performance hit under the default Linux "ondemand" CPU frequency scaling governor. I observed this for Oracle DB queries running in CPU-affinity-based Virtuozzo Containers on SGI UV1000 systems, so I decided to simulate the I/O wait using the C code for Parallel Matrix Multiplication I wrote for the OpenMP Schedule Clause Performance Analysis, which can be found at the following link.


I have introduced nanosecond-scale delays in the OpenMP parallel for construct. Each thread therefore waits for the specified interval, which allows the CPU to context switch to a new demanding process. To make the CPU perform process context switches instead of thread context switches, I fired 3 single-threaded processes on a single core. The benchmark system is a cc-NUMA system, so to avoid distant memory access and to confine the 3 processes to a single core (increasing the processing pressure), I used the "numactl" tool. A small shell script fires the processes simultaneously with the NUMA localalloc policy and the command line arguments the code needs.

Code snippet for the artificial wait state in the #pragma construct :
#pragma omp parallel shared(mul,m1,m2) private(threads_id)
{
        /* Report thread number */
        threads_id = omp_get_thread_num();
        if (threads_id == 0)
        {
                /* Master thread reports the total number of threads invoked */
                tot_threads = omp_get_num_threads();
                printf("Total worker threads invoked = %d\n", tot_threads);
        }
        /* Parallel for loop directive with (dynamic, chunk) schedule policy.
           Private variables for the parallel loop construct.
           schedule options:
           1) schedule (static)
           2) schedule (static, chunk)
           3) schedule (dynamic)
           4) schedule (dynamic, chunk)
        */
#pragma omp for schedule (dynamic, chunk) nowait private(i,j,k,temp)
        /* Outer loop row parallelization */
        for (i = 0; i < mat_size; i++)
        {
                /* printf("Thread=%d row=%d completed\n", threads_id, i); */
                for (j = 0; j < mat_size; j++)
                {
                        temp = 0;
                        for (k = 0; k < mat_size; k++)
                        {
                                temp += m1[i][k] * m2[k][j];
                        }
                        mul[i][j] = temp;
                }
                /* Simulated I/O wait: sleep for the interval passed on the command line */
                nanosleep(&t, NULL);
        }
}

Complete modified source code can be found here.

Numactl wrapper script :
numactl --localalloc --physcpubind=8 ./a.out 1 700 5000000 &
numactl --localalloc --physcpubind=8 ./a.out 1 800 7000000 &
numactl --localalloc --physcpubind=8 ./a.out 1 900 9000000 &

Arguments :
  • localalloc : Allocate memory on the local NUMA node, i.e. the node of core 8 (socket 1).
  • physcpubind : Bind all of the processes to CPU core 8.
  • Process arguments : 1 = single thread, 700 = matrix size, 5000000 = nanosecond delay; a sketch of how the program might consume these arguments is shown below.
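The exact argument handling lives in the complete source linked above; the following is only a rough sketch, with assumed variable names (num_threads, mat_size, delay_ns), of how the three positional arguments could be consumed and how the delay could be turned into the timespec t used by nanosleep() in the snippet:

/* Sketch only: assumed handling for ./a.out <threads> <matrix size> <delay nsec> */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

static struct timespec t;                /* delay used by nanosleep() in the parallel for */

int main(int argc, char *argv[])
{
        if (argc != 4) {
                fprintf(stderr, "Usage: %s <threads> <matrix_size> <delay_nsec>\n", argv[0]);
                return 1;
        }

        int num_threads = atoi(argv[1]);   /* 1       -> single thread per process */
        int mat_size    = atoi(argv[2]);   /* 700     -> 700x700 matrices          */
        long delay_ns   = atol(argv[3]);   /* 5000000 -> 5 msec simulated I/O wait */

        omp_set_num_threads(num_threads);

        t.tv_sec  = delay_ns / 1000000000L;   /* whole seconds         */
        t.tv_nsec = delay_ns % 1000000000L;   /* remaining nanoseconds */

        /* ... allocate m1, m2, mul of size mat_size and run the parallel multiplication shown above ... */
        return 0;
}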

"cpufreq_stats" module to get Transition State statistics :
To get the statistics of frequency transition states, we need to load the stats module of cpufreq for benchmark purpose. It is not recommended to keep this module loaded all the time in production system as it uses significant amount of CPU cycles.

Load the module before executing the benchmark :
modprobe cpufreq_stats
Unload the module after the benchmark :
rmmod cpufreq_stats
Stats path for CPU core 8 :
/sys/devices/system/cpu/cpu8/cpufreq/stats
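To capture the counters around a benchmark run programmatically instead of cat-ing the files, a minimal sketch in C is shown below; it only assumes the standard total_trans and time_in_state entries that cpufreq_stats creates under the path above.

/* Sketch: dump cpufreq_stats counters for CPU core 8 */
#include <stdio.h>

static void dump(const char *path)
{
        char line[256];
        FILE *fp = fopen(path, "r");

        if (!fp) {
                perror(path);
                return;
        }
        printf("--- %s ---\n", path);
        while (fgets(line, sizeof(line), fp))
                fputs(line, stdout);
        fclose(fp);
}

int main(void)
{
        /* total number of frequency transitions since the module was loaded */
        dump("/sys/devices/system/cpu/cpu8/cpufreq/stats/total_trans");
        /* time spent in each frequency state (in units of 10 msec on typical kernels) */
        dump("/sys/devices/system/cpu/cpu8/cpufreq/stats/time_in_state");
        return 0;
}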

Results on RHEL 6.1 with the "cpuspeed" service controlling the governor :

CPU Transition Latency Impact Graph

Note : Sampling rate = 10000 usec. The maximum delay introduced, in the process multiplying the 900x900 matrices, is 9000000 nsec = 9000 usec, which is less than the sampling rate. Core utilization therefore drops, the average CPU utilization over the sampling period falls, and the governor lowers the frequency. This holds for a single process in isolation; another process can demand core power at the same instant, and when multiple processes are striving for CPU power the result is total chaos. More details on the sampling rate are given below.

Frequency transition counts from the trans_table file in stats (transitions from each frequency into the 2661000 KHz state) :

From Frequency (KHz)    Transitions to 2661000 KHz
2661000                 0
2660000                 19
2527000                 17
2394000                 28
2261000                 68
2128000                 110
1995000                 41
1862000                 48
1729000                 72
1596000                 49
1463000                 47
1330000                 22
1197000                 254
1064000                 40

Observations :
  • The graph corroborates that there is a performance increase when using the "performance" CPU frequency governor.
  • As you can see, we are using a single core under both the "ondemand" and "performance" governors. CPU core 8 is running 3 single-threaded processes with different processing requirements (different matrix sizes) and different simulated I/O delays.
  • The Linux scheduler performs a process context switch depending on how the processing power required by each of the three processes overlaps with the delays they introduce at that point in time.
  • The Intel CPU used for this benchmark supports the SpeedStep frequencies listed above, a total of 14 states. Depending on the processing power required by the process owning the CPU at that instant, the "ondemand" governor reduces/increases the frequency in steps.
  • This introduces a small latency each time the CPU transitions between states. Tweaking the parameters related to transition latency is beyond the scope of this article, but I will try to cover the most important points here.
  • It is possible to tweak the parameters of the "ondemand" governor to find the best combination of reduced power consumption and desired performance. HPC systems are performance hungry, so it is recommended to keep all cores clocked at the maximum supported frequency using the "performance" governor (see the sketch after this list).
  • Keeping the CPU clocked at maximum doesn't necessarily mean that the CPU heats up to its threshold level; the voltage level controls the frequency, and some instructions don't use much CPU power per tick, resulting in low overall utilization.
  • A CPU at a maxed-out clock with no instructions to execute runs the HLT instruction in great proportion, suspending parts of the CPU so that it uses less energy.
  • According to my observations, if a process needs the full processing power of a CPU core and no context switch is going to happen (a single process/thread is latched onto the core and the frequency is clocked to maximum by the "ondemand" governor), then we do not see much transition latency impact.
  • On the other hand, if the process is waiting for I/O or any other event, the scheduler context switches to another process; in the gap between saving the previous process's stack and loading the new demanding process's stack, the governor reduces the frequency in steps and raises it again if the new process demands it. When this sequence happens frequently, the governor keeps stepping the frequency down and up, incurring transition latency each time.
  • Under the "performance" governor there is no need to reduce/increase the frequency, as instructions are processed at the maximum clock rate throughout. A process that demands moderate CPU power intermittently can suffer under the "ondemand" governor; a completely processing-bound job will not, because the frequency is maxed out with a minimum of transitions, based on the utilization measured over the sampling rate period.
"ondemand" governor configuration on my benchmark server :
cpuinfo_transition_latency = 10000 nsec
sampling rate = 10000 usec
up_threshold = 95
  • Transition Latency : The time the CPU needs to switch between frequency states.
  • Sampling Rate : The interval at which the kernel checks CPU usage and decides whether to increase/decrease the frequency.
  • Up Threshold : The average CPU utilization, over the sampling rate period at the current frequency, above which the kernel decides to increase the frequency.
Note : Default Sampling Rate (usec) = Transition Latency (usec) * 1000
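These values come straight from sysfs. A small sketch for reading them back is given below; the global /sys/devices/system/cpu/cpufreq/ondemand/ location for the tunables is an assumption, as on some kernel versions they appear under each cpuN/cpufreq/ondemand/ directory instead.

/* Sketch: read the "ondemand" tunables and the hardware transition latency */
#include <stdio.h>

static long read_long(const char *path)
{
        long val = -1;
        FILE *fp = fopen(path, "r");

        if (fp) {
                if (fscanf(fp, "%ld", &val) != 1)
                        val = -1;
                fclose(fp);
        }
        return val;
}

int main(void)
{
        long latency_ns  = read_long("/sys/devices/system/cpu/cpu8/cpufreq/cpuinfo_transition_latency");
        long sampling_us = read_long("/sys/devices/system/cpu/cpufreq/ondemand/sampling_rate");
        long up_thresh   = read_long("/sys/devices/system/cpu/cpufreq/ondemand/up_threshold");

        printf("cpuinfo_transition_latency = %ld nsec\n", latency_ns);
        printf("sampling_rate              = %ld usec\n", sampling_us);
        printf("up_threshold               = %ld %%\n", up_thresh);

        /* default relationship: sampling_rate (usec) = transition latency (usec) * 1000 */
        printf("latency(usec) * 1000       = %ld usec\n", (latency_ns / 1000) * 1000);
        return 0;
}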

Monday, August 6, 2012

Infiniband Xmit and Rcv Link Counter Interpretation using Perfquery

Infiniband interconnect technology is preferred in the HPC domain because of its ultra-low latency (usec range), high bandwidth using multiple lanes, and RDMA (Remote DMA) capability. Just like Ethernet it has multiple layers, and it can carry data at the SRP (SCSI RDMA Protocol) layer as well as the IPoIB (IP over Infiniband) layer at the same time. I wanted to calculate the amount of data transferred on the 4 IB ports connected to my server. The following is a quick and dirty method to do so for QDR IB ports. The infiniband-diags package on RHEL 6.1 comes with the perfquery utility.

Perfquery wrapper script 1.sh to report the port counters for the Mellanox HCAs :
echo "mlx4_0 P 1 - ib0 port - IPoIB Bond Master Interface"
perfquery -C mlx4_0 -P 1
echo "mlx4_0 P 2 - ib1 port"
perfquery -C mlx4_0 -P 2
echo "mlx4_1 P 1 - ib2 port"
perfquery -C mlx4_1 -P 1
echo "mlx4_1 P 2 - ib3 port - IPoIB Bond Slave Interface"
perfquery -C mlx4_1 -P 2

Monitoring the ports for activity :
Grep the desired counters and put a watch on them, as follows.
watch -n 0.5 './1.sh | grep -E "mlx|XmitData|RcvData|XmitPkts|RcvPkts"'

2.sh script to clear the port counters :
perfquery -C mlx4_0 -P 1 -R 
perfquery -C mlx4_0 -P 2 -R 
perfquery -C mlx4_1 -P 1 -R 
perfquery -C mlx4_1 -P 2 -R 
echo "Counters Cleared"

How to interpret the actual data transferred on individual links of Infiniband :
  • PortXmitData/PortRcvData show the actual data transmitted/received on the port. The values reported by perfquery are divided by 4, because a QDR IB port spans data chunks across its 4 lanes. These counters exclude link packets. To get the actual data transferred in MBytes, follow the procedure given below. Suppose I do an un-buffered (oflag=direct) dd write of 200 MB to a CXFS storage vault through the IB network; the 200 MB of data is spanned across multiple IB ports and committed to the storage controller acting as the SRP target. The CXFS vault I am using for testing is a volume made up of multiple LUN slices. These LUN slices have different XVM preferred paths (failover2.conf) owned by distinct storage controllers. A direct commit to the disk platter is necessary because otherwise there are multiple layers of buffering in the stack to improve performance: data gets written into the RAM file cache first and is "sync"ed* to disk later, the IB HCAs have their own buffers, and the storage controllers have their own caches. We would not see the actual data transmitted on the IB ports if Linux were using its own file/buffer cache.
             * - "sync" is a literal command in Linux that commits all buffered data to disk; the related system calls are sync() and fsync(), which can be used in programs.

dd command :
dd if=/dev/zero of=/testvol1/testfile.dat bs=10M count=20 oflag=direct

Calculation for QDR :
XmitData on 2 of the 4 ports after the 200 MB write :
mlx4_1 P 1 =((26564832*4)/1024)/1024 = 101.33 MBytes
mlx4_1 P 2 =((26583984*4)/1024)/1024 = 101.40 MBytes
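The same arithmetic can be wrapped in a tiny helper; the sketch below simply repeats the calculation above (multiply the raw counter by 4, then convert bytes to MBytes) for a counter value taken from the perfquery output and passed on the command line.

/* Sketch: convert a raw PortXmitData/PortRcvData counter from perfquery into MBytes */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
        if (argc != 2) {
                fprintf(stderr, "Usage: %s <PortXmitData counter>\n", argv[0]);
                return 1;
        }

        unsigned long long counter = strtoull(argv[1], NULL, 10);
        /* perfquery reports the counter divided by 4, so multiply back and convert to MBytes */
        double mbytes = (counter * 4.0) / (1024.0 * 1024.0);

        printf("%llu counter units = %.2f MBytes\n", counter, mbytes);
        return 0;
}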
  • XmitPkts/RcvPkts show the total number of packets transferred on the links. The XmitData/RcvData above is encapsulated in these IB packets. Remember that this also excludes link packets.
  • All counters are maintained per InfiniBand lane (a single X in 1X/4X/12X), so divide/multiply accordingly.
  • The base data rate of IB is 1X, clocked at 2.5 Gbps and transmitted over two pairs (one transmit, one receive), yielding an effective data rate of 2 Gbps full duplex (2 Gbps transmit, 2 Gbps receive).
  • The IB 4X and 12X interfaces use the same base clock rate but use multiple pairs, where each transmit/receive pair set is commonly referred to as a lane.
  • 4X IB gets a signaling rate of 10 Gbps (8 Gbps data rate) using 4 lanes = 8 pairs.
InfiniBand Link    Signal Pairs    Signaling Rate    Data Rate    Full-Duplex Data Rate
1X                 2               2.5 Gbps          2.0 Gbps     4.0 Gbps
4X                 8               10 Gbps           8 Gbps       16 Gbps
12X                24              30 Gbps           24 Gbps      48 Gbps

            Note: Although the signaling rate is 2.5 Gbps, the effective data rate is limited to 2 Gbps due to the 8B/10B encoding scheme, i.e., (2.5*8)/10 = 2 Gbps (sketched at the end of this post).
  • The data shown in XmitData/RcvData/XmitPkts/RcvPkts covers all of the IB protocols, such as SRP, IPoIB, SDP (Sockets Direct Protocol), etc. It totals the amount of data transmitted on the IB stack regardless of the layered protocol.
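The rate table above follows directly from the 2.5 Gbps per-lane signaling rate and the 8B/10B encoding; a small sketch of the arithmetic:

/* Sketch: derive IB data rates from the 2.5 Gbps per-lane signaling rate and 8B/10B encoding */
#include <stdio.h>

int main(void)
{
        const double lane_signal_gbps = 2.5;        /* SDR per-lane signaling rate */
        const int widths[] = { 1, 4, 12 };          /* 1X, 4X and 12X link widths  */
        int i;

        for (i = 0; i < 3; i++) {
                double signal = lane_signal_gbps * widths[i];
                double data   = signal * 8.0 / 10.0;   /* 8B/10B: 8 data bits per 10 line bits */
                printf("%2dX: signaling %4.1f Gbps, data %4.1f Gbps, full duplex %4.1f Gbps\n",
                       widths[i], signal, data, 2.0 * data);
        }
        return 0;
}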