“Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts... A graphic representation of data abstracted from banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data. Like city lights, receding..."
- William Gibson, Neuromancer

Thursday, February 7, 2013

Rest In Peace - Aaron Swartz

We will always remember you - Aaron Swartz

Early architect of Creative Commons License (Giving power to open knowledge and bloggers) & numerous other ventures which actually contributed to Open Knowledge community and "Internet Freedom".

Links -
Remembering Aaron Swartz - creativecommons.org
Remember Aaron Swartz - rememberaaronsw.com

Saturday, September 8, 2012

Infiniband Interconnect Terminologies Demystified

Hey Guys, just a short post about Infiniband (IB) terms and actual data rate possible through IB interconnects. I hate it when Marketing/Sales people in HPC field use signaling rate instead of actual/effective data rate possible through these interconnects at link layer. In context of this post I am dropping the other overheads in the stack, which decreases the speed further at OS layer. Days of pure technology are lost in the mist of Capitalism. So understand the terms clearly and give "In Your Face" responses to these people who try to mislead clients into black holes by blabbering about fake link speeds.

Infiniband switches perform cut-through method of switching to achieve the ultra-low latency. However, in some cases store-forward method is used we will focus on this at the end of the post. Consider the table in the following image, the base IB rate is 2.5 Gbps which is SDR. Remember always that link speed & link width goes hand in hand. Correct data rate can only be described with use of both the link speed and link width. As of today possible link speeds are 1X,4X & 12X. Never say "QDR" in conversation to avoid effective data rate confusions, always use "4X QDR" if you mean to communicate a signaling rate of 10 Gbps and effective data rate of 32 Gbps full duplex. Marketing/Sales people tend to use 40 Gbps for 4X QDR which is utterly baseless, QDR indicates a standard signaling rate of 10 Gbps, you don't multiply 4 and 10 to get 40 Gbps. 4X means 4 lanes of QDR capability, resulting in multiplexing the data over 4 lanes to achieve 32 Gbps effective speed after 8B/10B encoding. So in "4X QDR" you get 32 Gbits/sec of actual data rate in transmit and receive, in short 32 Gbps full duplex. So the core difference is of signaling rate in SDR,DDR,QDR & FDR connections. Signaling rate defines the quantity of bits which can signaled at single instance. FDR achieves more efficiency by using 64B/66B encoding as compared to 8B/10B encoding.

1X Transmit physical lane consist of one differential pair, i.e. two wires.
1X Recieve physical lane consist of one differential pair, i.e. two wires.

Infiniband Speed/Width/Encoding/Lanes/Wires

Infiniband protocol transfers data in serial fashion. Pure serial transmission is attempted on 1X Link. Speed of transmission depends upon the signaling rate or link width. Links above 1X capability multiplexes data to achieve parallel data transmission by transmitting chunks of frame in serial mode over multiple lanes.

Note: - Do not confuse "Lanes" over Virtual Lanes (VL's) in IB, this topic is out of scope for this post (someday I will post about it,don't worry). In short, VL is application layer abstraction to allow multiple applications to subscribe to Work Request Queue's in IB with same physical lanes.For the sake of this post, consider Lane as Physical Lane.

Image below gives a clear idea of how data is transmitted on 1X SDR link & 4X SDR link, same analogy applies to other combinations of link speeds and link widths.

Infiniband Data Transmission Pattern

Infiniband Differential Pairs in 1X & 4X

It is recommended to have a uniform link speed/width in one fabric domain in HPC environment to achieve optimum performance & low latency. Connecting 4X QDR capable HCA (Host Channel Adapter) to 1X QDR capable HCA results in Store-Forward switching as frame needs to assembled completely first in the buffer of one HCA and then forwarded to 1X QDR HCA in serial fashion. If IB network architecture & devices are capable of handling few lower link connections without affecting the latency of uniform speed connections, then it is fine to have mixed link speeds/widths.

Friday, August 10, 2012

CPU SpeedStep Frequency Scaling Performance Impact due to Transition Latency

Working on HPC systems, I have found that applications which are I/O bound and perform small but consistent reads/writes on disk suffer from performance hit with Linux default "ondemand" CPU frequency scaling governor. I observed this for Oracle DB queries on CPU affinity based Virtuozzo Containers on SGI UV1000 systems, so I decided to simulate the I/O wait using C code for Parallel Matrix Multiplication I wrote for OpenMP Schedule Clause Performance Analysis which can be found at following link.


I have introduced nanosecond delays in the pragma for construct of OpenMP. Because of this each thread waits for the specified interval which allows the CPU to do context switch to new demanding process. To allow CPU to do process context switch instead of thread context switch I fired 3 processes with single thread on single core. The system I am using for benchmark is cc-NUMA system so to avoid distant memory access and confine the 3 processes to single core to increase processing pressure I have used "numactl" tool. Small shell script allows to fire the processes simultaneously with NUMA Localalloc policy and required command line arguments for code to operate. 

Code snippet for artificial wait state in #pragma construct :
    91  #pragma omp parallel shared(mul,m1,m2) private(threads_id)
    92  {
    93          /*Report thread number*/
    94  threads_id = omp_get_thread_num();
    95  if (threads_id == 0)
    96          {
    97          /*Master thread will report total number of threads invoked*/
    98          tot_threads = omp_get_num_threads();
    99          printf("Total worker threads invoked = %d\n",tot_threads);
   100          }
   101          /*Parallel for loop directive with dynamic, chunk size schedule policy.
   102            Private variables for parallel loop construct.
   103            schedule options:
   104            1) schedule (static)
   105            2) schedule (static, chunk)
   106            3) schedule (dynamic)
   107            4) schedule (dynamic,chunk)
   108          */
   109  #pragma omp for schedule (dynamic, chunk) nowait private(i,j,k,temp)
   110          /*Outer loop Row parallelization*/
   111  for(i=0;i<mat_size;i++)
   112          {
   113          /*printf("Thread=%d row=%d completed\n",threads_id,i);*/
   114          for(j=0;j<mat_size;j++)
   115          {
   116          temp = 0;
   117          for(k=0;k<mat_size;k++)
   118                  {
   119                  temp+=m1[i][k]*m2[k][j];
   120                  }
   121          mul[i][j] = temp;
   122          }
   123          nanosleep(&t,NULL);
   124          }
   125  }

Complete modified source code can be found here.

Numactl wrapper script :
1  numactl --localalloc --physcpubind=8 ./a.out 1 700 5000000 &
2  numactl --localalloc --physcpubind=8 ./a.out 1 800 7000000 &
3  numactl --localalloc --physcpubind=8 ./a.out 1 900 9000000 &

Arguments :
  • localalloc : Allocate memory on local NUMA node, i.e. from node of core 8 (socket 1).
  • physcpubind : CPU core 8 bind of all processes.
  • Process argument : 1 = single thread, 700 = matrix Size, 5000000 = nano second delay.

"cpufreq_stats" module to get Transition State statistics :
To get the statistics of frequency transition states, we need to load the stats module of cpufreq for benchmark purpose. It is not recommended to keep this module loaded all the time in production system as it uses significant amount of CPU cycles.

Loading of module before executing benchmark :
modprobe cpufreq_stats
Unloading of module after benchmark :
rmmod cpufreq_stats
Stats for CPU core 8 path :
/sys/devices/system/cpu/cpu8/cpufreq/stats

Results on RHEL 6.1 with "cpuspeed" module to control governor:

CPU Transition Latency Impact Graph

Note : Sampling Rate = 10000 usec. Maximum delay introduced in the process of 900 matrix multiplication is 9000000 nsec = 9000 usec, this is less than sampling rate causing core utilization to go down, reducing average CPU utilization for the period of sampling rate, resulting in lowering of frequency. This is true in respect of single process, other process can demand core power at the same instance. So if multiple processes are striving for CPU power then it is total chaos. More details on Sampling Rate are specified below.

 Frequency Transition Counts from the trans_table file in stats :                                 
Frequency
No.
From
To 2661000
2661000
0
2660000
19
2527000
17
2394000
28
2261000
68
2128000
110
1995000
41
1862000
48
1729000
72
1596000
49
1463000
47
1330000
22
1197000
254
1064000
40

Observations :
  • Graph corroborates that there is performance increase while using "performance" CPU Frequency Governor. 
  • As you can see we are using single core with "ondemand" & "performance" governor. CPU core 8 is running 3 single-threaded processes with different processing requirements (different matrix sizes) & different simulated I/O delays. 
  • Linux scheduler performs a process context switch depending upon the overlap of processing power required by one of the three processes and delays introduced by them in that point of time. 
  • Intel CPU used by me for this benchmark has above mentioned SpeedStep frequencies. Total of 14 states are supported by the CPU. Depending upon the processing power required by process owning the CPU at that instance, "ondemand" governor reduces/increases frequency in steps. 
  • This introduces the small latency while CPU transitions state. Tweaking of parameters related to transition latency are beyond the scope of this article but I will try to cover most of the important matter here. 
  • It is possible to tweak the related parameters of "ondemand" mode to match best combination for reduced power consumption and desired performance. HPC systems are performance hungry, so it is recommended to keep all cores clocked up to the maximum supported frequency using "performance" governor. 
  • Keeping CPU clocked up at max doesn't necessarily mean that CPU is heating up to threshold level, the voltage level controls the frequency and some instructions don't use much CPU power per tick resulting in low overall utilization. 
  • CPU at maxed out clock with no instruction to execute runs HLT instruction in great proportion to suspend operation in part of CPU's using less energy.
  • According to my observation, if a process needs full CPU core processing power and no context switch is going to happen as single process/thread is latched on to the core and frequency is clocked to max by "ondemand" governor, then we don't not see much of the transition latency impact. 
  • On the other hand, if the process is waiting for I/O to happen or any other event, scheduler context switches to other process and for the time difference between saving previous process stack and loading new demanding process stack governor reduces frequency in steps and increases if new process demands it. This whole sequence happening frequently can cause governor to reduce/increase frequency frequently causing transition latency.
  • In "performance" governor there is no need to reduce/increase frequency as instructions gets processed at the same rate of maximum clock. Process demanding moderate CPU power intermittently can suffer from "ondemand" governor. What I mean to say is completely processing bound job will not suffer from "ondemand" governor, because frequency is maxed out in minimum transitions depending upon utilization in specified time of sampling rate.
"ondemand" governor configuration on my benchmark server :
cpuinfo_transition_latency = 10000 nsec
sampling rate = 10000 usec
up_threshold = 95
  • Transition Latency : Indicates transition time required.
  • Sampling Rate : Kernel checks CPU usage and makes decisions to increase/decrease frequency.
  • Up Threshold : This indicates the average CPU utilization in the time period of sampling rate with current frequency, above this kernel takes decision to increase frequency.
Note : Sampling Rate = Transition Latency * 1000