Friday, August 10, 2012

CPU SpeedStep Frequency Scaling Performance Impact due to Transition Latency

Working on HPC systems, I have found that applications which are I/O bound and perform small but consistent reads/writes on disk suffer a performance hit with Linux's default "ondemand" CPU frequency scaling governor. I observed this with Oracle DB queries running in CPU-affinity-based Virtuozzo Containers on SGI UV1000 systems, so I decided to simulate the I/O wait using the C code for parallel matrix multiplication I wrote for OpenMP Schedule Clause Performance Analysis, which can be found at the following link.


I introduced nanosecond delays in the OpenMP parallel for construct. Because of this, each thread waits for the specified interval, which allows the CPU to context switch to another demanding process. To force process context switches instead of thread context switches, I fired 3 single-threaded processes on a single core. The benchmark system is a cc-NUMA system, so to avoid distant memory access and to confine the 3 processes to a single core (increasing processing pressure), I used the "numactl" tool. A small shell script fires the processes simultaneously with the NUMA localalloc policy and the command line arguments the code needs.

Code snippet for the artificial wait state in the #pragma construct :
#pragma omp parallel shared(mul,m1,m2) private(threads_id)
{
        /* Report thread number */
        threads_id = omp_get_thread_num();
        if (threads_id == 0)
        {
                /* Master thread reports the total number of threads invoked */
                tot_threads = omp_get_num_threads();
                printf("Total worker threads invoked = %d\n", tot_threads);
        }
        /* Parallel for loop directive with dynamic, chunk size schedule policy.
           Private variables for the parallel loop construct.
           schedule options:
           1) schedule (static)
           2) schedule (static, chunk)
           3) schedule (dynamic)
           4) schedule (dynamic, chunk)
        */
#pragma omp for schedule (dynamic, chunk) nowait private(i,j,k,temp)
        /* Outer loop: row parallelization */
        for (i = 0; i < mat_size; i++)
        {
                /* printf("Thread=%d row=%d completed\n", threads_id, i); */
                for (j = 0; j < mat_size; j++)
                {
                        temp = 0;
                        for (k = 0; k < mat_size; k++)
                        {
                                temp += m1[i][k] * m2[k][j];
                        }
                        mul[i][j] = temp;
                }
                /* Artificial wait: simulate I/O latency after each row */
                nanosleep(&t, NULL);
        }
}

The complete modified source code can be found here.
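The timespec t passed to nanosleep() is not part of the snippet above. Below is a minimal sketch of how the delay argument (the third command line parameter, in nanoseconds) could be turned into that timespec; the argument order matches the wrapper script below, but the variable and function names are illustrative assumptions, not taken from the original source.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sketch only: build the timespec used by nanosleep() from the delay
   argument (in nanoseconds). Names are illustrative, not from the
   original source. */
struct timespec t;

static void set_delay(long delay_ns)
{
        t.tv_sec  = delay_ns / 1000000000L;   /* whole seconds */
        t.tv_nsec = delay_ns % 1000000000L;   /* remaining nanoseconds */
}

int main(int argc, char *argv[])
{
        if (argc < 4) {
                fprintf(stderr, "usage: %s <threads> <matrix_size> <delay_ns>\n", argv[0]);
                return 1;
        }
        set_delay(atol(argv[3]));   /* e.g. 5000000 nsec as in the wrapper script */
        /* ... matrix allocation and the OpenMP region shown above ... */
        return 0;
}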

Numactl wrapper script :
numactl --localalloc --physcpubind=8 ./a.out 1 700 5000000 &
numactl --localalloc --physcpubind=8 ./a.out 1 800 7000000 &
numactl --localalloc --physcpubind=8 ./a.out 1 900 9000000 &

Arguments :
  • localalloc : Allocate memory on the local NUMA node, i.e. the node that owns core 8 (socket 1).
  • physcpubind : Bind all processes to CPU core 8.
  • Process arguments : 1 = single thread, 700 = matrix size, 5000000 = nanosecond delay.
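To confirm that all three processes actually landed on core 8, their affinity can be checked while they run; a quick way, assuming taskset from util-linux is available :

for pid in $(pidof a.out); do taskset -cp $pid; done
# each process should report its current affinity list as "8"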

"cpufreq_stats" module to get Transition State statistics :
To collect statistics on frequency transition states, the cpufreq stats module needs to be loaded for the duration of the benchmark. It is not recommended to keep this module loaded permanently on a production system, as it consumes a noticeable amount of CPU cycles.

Load the module before executing the benchmark :
modprobe cpufreq_stats
Unload the module after the benchmark :
rmmod cpufreq_stats
Stats path for CPU core 8 :
/sys/devices/system/cpu/cpu8/cpufreq/stats
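The stats directory exposes a few read-only files that cover everything used in this benchmark; they can be read directly :

cat /sys/devices/system/cpu/cpu8/cpufreq/stats/total_trans     # total number of frequency transitions
cat /sys/devices/system/cpu/cpu8/cpufreq/stats/time_in_state   # time spent in each frequency (typically 10 ms units)
cat /sys/devices/system/cpu/cpu8/cpufreq/stats/trans_table     # From/To matrix of transition counts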

Results on RHEL 6.1 with the "cpuspeed" service controlling the governor :

CPU Transition Latency Impact Graph

Note : Sampling rate = 10000 usec. The maximum delay introduced, in the 900-matrix-multiplication process, is 9000000 nsec = 9000 usec, which is less than the sampling rate. The delay drags core utilization down, so the average CPU utilization over the sampling period drops and the governor lowers the frequency. This holds for a single process in isolation; another process can demand core power at the same instant, and with multiple processes striving for CPU power the frequency transitions become chaotic. More details on the sampling rate are given below.

Frequency transition counts from the trans_table file in stats (the column of transitions into the 2661000 kHz state, indexed by the originating frequency) :

From (kHz)    To 2661000 (kHz)
2661000       0
2660000       19
2527000       17
2394000       28
2261000       68
2128000       110
1995000       41
1862000       48
1729000       72
1596000       49
1463000       47
1330000       22
1197000       254
1064000       40

Observations :
  • The graph corroborates the performance increase obtained with the "performance" CPU frequency governor.
  • A single core is used with both the "ondemand" and "performance" governors. CPU core 8 runs 3 single-threaded processes with different processing requirements (different matrix sizes) and different simulated I/O delays.
  • The Linux scheduler performs a process context switch depending on how the processing demand of one of the three processes overlaps with the delays the others have introduced at that point in time.
  • The Intel CPU used for this benchmark supports the SpeedStep frequencies listed above, 14 states in total. Depending on the processing power required by the process owning the CPU at that instant, the "ondemand" governor reduces or increases the frequency in steps.
  • Each step introduces a small latency while the CPU transitions between states. Tweaking the parameters related to transition latency is beyond the scope of this article, but I will try to cover the most important points here.
  • It is possible to tweak the parameters of the "ondemand" governor to find the best trade-off between reduced power consumption and the desired performance. HPC systems are performance hungry, so it is recommended to keep all cores clocked at the maximum supported frequency using the "performance" governor (see the example after this list).
  • Keeping the CPU clocked at the maximum does not necessarily mean it heats up to its threshold level; the voltage level controls the frequency, and some instructions do not use much CPU power per tick, resulting in low overall utilization.
  • A CPU at the maximum clock with no instructions to execute spends a large proportion of its time in the HLT instruction, suspending parts of the CPU so that it uses less energy.
  • From my observation, if a process needs the full processing power of a core and no context switch is going to happen, because a single process/thread is latched onto the core and the frequency has been clocked to the maximum by the "ondemand" governor, then we do not see much transition latency impact.
  • On the other hand, if the process is waiting for I/O or some other event, the scheduler context switches to another process. In the interval between saving the previous process's stack and loading the stack of the newly demanding process, the governor reduces the frequency in steps, and it raises it again if the new process demands it. When this sequence happens frequently, the governor keeps lowering and raising the frequency, and the transition latency keeps being paid.
  • With the "performance" governor there is no need to raise or lower the frequency, since instructions are processed at the maximum clock rate throughout. A process that demands moderate CPU power intermittently can suffer under the "ondemand" governor; a completely processing-bound job will not, because the frequency is maxed out after a minimum number of transitions, depending on the utilization measured over the sampling rate period.
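For completeness, a minimal way to pin a core to the "performance" governor is to write to its scaling_governor file in sysfs. Making the change persistent through the cpuspeed service is distribution specific; the GOVERNOR setting mentioned below is an assumption about the RHEL 6 setup rather than something verified here :

echo performance > /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor
# persistent on RHEL 6 (assumption): set GOVERNOR=performance in /etc/sysconfig/cpuspeed, then
service cpuspeed restart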
"ondemand" governor configuration on my benchmark server :
cpuinfo_transition_latency = 10000 nsec
sampling rate = 10000 usec
up_threshold = 95
  • Transition Latency : The time the CPU needs to switch from one frequency to another.
  • Sampling Rate : The interval at which the kernel checks CPU usage and decides whether to increase or decrease the frequency.
  • Up Threshold : The average CPU utilization, over one sampling rate period at the current frequency, above which the kernel decides to increase the frequency.
Note : Sampling Rate = Transition Latency * 1000 (here 10000 nsec = 10 usec, and 10 usec * 1000 = 10000 usec).
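These values can be read back from sysfs. The ondemand tunables directory has moved between kernel versions (per-CPU on some, global on others), so the paths below are indicative rather than exact :

cat /sys/devices/system/cpu/cpu8/cpufreq/cpuinfo_transition_latency   # nanoseconds
cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate            # microseconds
cat /sys/devices/system/cpu/cpufreq/ondemand/up_threshold             # percent average utilization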
