Saturday, September 8, 2012

Infiniband Interconnect Terminologies Demystified

Hey guys, just a short post about InfiniBand (IB) terms and the actual data rates possible through IB interconnects. I hate it when marketing/sales people in the HPC field quote the signaling rate instead of the actual/effective data rate available through these interconnects at the link layer. For this post I am ignoring the other overheads in the stack, which reduce throughput further at the OS layer. Days of pure technology are lost in the mist of capitalism. So understand the terms clearly and give "in your face" responses to people who try to mislead clients into black holes by blabbering about inflated link speeds.

InfiniBand switches use cut-through switching to achieve their ultra-low latency; however, in some cases store-and-forward switching is used instead, and we will come back to that at the end of the post. Consider the table in the following image: the base IB signaling rate is 2.5 Gbps, which is SDR. Remember that link speed & link width always go hand in hand; the correct data rate can only be described by stating both. The possible link widths today are 1X, 4X & 12X. Never say just "QDR" in conversation, to avoid effective-data-rate confusion; always say "4X QDR" if you mean a per-lane signaling rate of 10 Gbps and an effective data rate of 32 Gbps full duplex. Marketing/sales people tend to quote 40 Gbps for 4X QDR, which is misleading: 4 lanes at 10 Gbps signaling is indeed 40 Gbps on the wire, but you don't multiply 4 and 10 to get a data rate. 4X means 4 lanes of QDR capability, with the data multiplexed over those 4 lanes to achieve 32 Gbps effective speed after 8B/10B encoding. So with "4X QDR" you get 32 Gbits/sec of actual data rate in each of transmit and receive; in short, 32 Gbps full duplex. The core difference between SDR, DDR, QDR & FDR connections is the signaling rate, i.e. the raw rate at which bits are clocked onto each lane. FDR achieves higher efficiency by using 64B/66B encoding instead of 8B/10B.

A 1X transmit physical lane consists of one differential pair, i.e. two wires.
A 1X receive physical lane consists of one differential pair, i.e. two wires.

Infiniband Speed/Width/Encoding/Lanes/Wires

The InfiniBand protocol transfers data serially. Pure serial transmission happens on a 1X link. The speed of transmission depends on both the signaling rate and the link width. Links wider than 1X multiplex the data to achieve parallel transmission, sending chunks of the frame serially over multiple lanes.

Note: do not confuse "lanes" with Virtual Lanes (VL's) in IB; that topic is out of scope for this post (someday I will post about it, don't worry). In short, a VL is a link-layer abstraction that allows multiple applications to subscribe to work request queues in IB over the same physical lanes. For the sake of this post, consider a lane to be a physical lane.

The image below gives a clear idea of how data is transmitted on a 1X SDR link & a 4X SDR link; the same analogy applies to other combinations of link speeds and link widths.

Infiniband Data Transmission Pattern

Infiniband Differential Pairs in 1X & 4X

It is recommended to keep a uniform link speed/width within one fabric domain in an HPC environment to achieve optimum performance & low latency. Connecting a 4X QDR capable HCA (Host Channel Adapter) to a 1X QDR capable HCA results in store-and-forward switching, because the frame needs to be assembled completely in the buffer of one HCA first and then forwarded to the 1X QDR HCA serially. If the IB network architecture & devices can handle a few lower-speed links without affecting the latency of the uniform-speed connections, then it is fine to have mixed link speeds/widths.

Friday, August 10, 2012

CPU SpeedStep Frequency Scaling Performance Impact due to Transition Latency

Working on HPC systems, I have found that applications which are I/O bound and perform small but consistent reads/writes on disk suffer a performance hit under the Linux default "ondemand" CPU frequency scaling governor. I observed this for Oracle DB queries in CPU-affinity-based Virtuozzo containers on SGI UV1000 systems, so I decided to simulate the I/O wait using the C code for parallel matrix multiplication I wrote for the OpenMP Schedule Clause Performance Analysis, which can be found at the following link.

I have introduced nanosecond-scale delays in the OpenMP parallel for construct. Because of this, each thread waits for the specified interval, which allows the CPU to context switch to a new demanding process. To make the CPU do process context switches instead of thread context switches, I fired 3 single-threaded processes on a single core. The benchmark system is a cc-NUMA machine, so to avoid distant memory access and to confine the 3 processes to a single core (increasing processing pressure) I used the "numactl" tool. A small shell script fires the processes simultaneously with the NUMA localalloc policy and the command line arguments the code needs.

Code snippet for artificial wait state in #pragma construct:
#pragma omp parallel shared(mul,m1,m2) private(threads_id)
{
        /* Report thread number */
        threads_id = omp_get_thread_num();
        if (threads_id == 0)
        {
                /* Master thread will report total number of threads invoked */
                tot_threads = omp_get_num_threads();
                printf("Total worker threads invoked = %d\n", tot_threads);
        }
        /* Parallel for loop directive with (dynamic, chunk size) schedule policy.
           Private variables for parallel loop construct.
           schedule options:
           1) schedule (static)
           2) schedule (static, chunk)
           3) schedule (dynamic)
           4) schedule (dynamic, chunk)
        */
        #pragma omp for schedule (dynamic, chunk) nowait private(i,j,k,temp)
        /* Outer loop (row) parallelization */
        for (i = 0; i < mat_size; i++)
        {
                /* printf("Thread=%d row=%d completed\n", threads_id, i); */
                for (j = 0; j < mat_size; j++)
                {
                        temp = 0;
                        for (k = 0; k < mat_size; k++)
                        {
                                temp += m1[i][k] * m2[k][j];
                        }
                        mul[i][j] = temp;
                }
                nanosleep(&t, NULL);
        }
}

Complete modified source code can be found here.

Numactl wrapper script :
numactl --localalloc --physcpubind=8 ./a.out 1 700 5000000 &
numactl --localalloc --physcpubind=8 ./a.out 1 800 7000000 &
numactl --localalloc --physcpubind=8 ./a.out 1 900 9000000 &

Arguments :
  • localalloc : allocate memory on the local NUMA node, i.e. the node of core 8 (socket 1).
  • physcpubind : bind all processes to CPU core 8.
  • Process arguments : 1 = single thread, 700 = matrix size, 5000000 = nanosecond delay.

"cpufreq_stats" module to get transition state statistics :
To get statistics on the frequency transition states, we need to load the stats module of cpufreq for the benchmark. It is not recommended to keep this module loaded all the time on a production system, as it uses a significant amount of CPU cycles.

Loading of module before executing benchmark :
modprobe cpufreq_stats
Unloading of module after benchmark :
rmmod cpufreq_stats
Stats path for CPU core 8 :
/sys/devices/system/cpu/cpu8/cpufreq/stats/

Results on RHEL 6.1 with "cpuspeed" module to control governor:

CPU Transition Latency Impact Graph

Note : sampling rate = 10000 usec. The maximum delay introduced, in the 900-matrix process, is 9000000 nsec = 9000 usec, which is less than the sampling rate. The delay causes core utilization to drop, reducing average CPU utilization over the sampling period and so lowering the frequency. This holds for a single process; the other processes can demand core power at the same instant, so with multiple processes striving for CPU power it is total chaos. More details on the sampling rate are given below.

Frequency transition counts are available in the trans_table file in stats (table not reproduced here).

Observations :
  • The graph corroborates that there is a performance increase when using the "performance" CPU frequency governor.
  • We are using a single core under both the "ondemand" & "performance" governors. CPU core 8 runs 3 single-threaded processes with different processing requirements (different matrix sizes) & different simulated I/O delays.
  • The Linux scheduler performs a process context switch depending on the overlap between the processing power required by one of the three processes and the delays they introduce at that point in time.
  • The Intel CPU used for this benchmark supports the SpeedStep frequencies mentioned above, 14 states in total. Depending upon the processing power required by the process owning the CPU at that instant, the "ondemand" governor reduces/increases the frequency in steps.
  • This introduces a small latency each time the CPU transitions between states. Tweaking the parameters related to transition latency is beyond the scope of this article, but I will try to cover the most important points here.
  • It is possible to tweak the "ondemand" parameters to find the best combination of reduced power consumption and desired performance. HPC systems are performance hungry, so it is recommended to keep all cores clocked at the maximum supported frequency using the "performance" governor.
  • Keeping the CPU clocked at max doesn't necessarily mean the CPU heats up to its threshold level: the voltage level controls the frequency, and some instructions don't use much CPU power per tick, resulting in low overall utilization.
  • A CPU at a maxed-out clock with no instructions to execute runs the HLT instruction in great proportion, suspending parts of the CPU so it uses less energy.
  • In my observation, if a single process/thread is latched onto a core and needs its full processing power, no context switch happens and the frequency is clocked to max by the "ondemand" governor, so we don't see much transition latency impact.
  • On the other hand, if the process is waiting for I/O or some other event, the scheduler context switches to another process; in the window between saving the previous process's stack and loading the new demanding process's stack, the governor steps the frequency down, then raises it again if the new process demands it. When this sequence repeats frequently, the governor keeps stepping the frequency down and up, incurring transition latency each time.
  • Under the "performance" governor there is no need to change frequency, as instructions are processed at the maximum clock rate throughout. A process demanding moderate CPU power intermittently can suffer under the "ondemand" governor; a completely processing-bound job will not, because its frequency is maxed out with a minimum number of transitions based on utilization within the sampling period.
"ondemand" governor configuration on my benchmark server :
cpuinfo_transition_latency = 10000 nsec
sampling_rate = 10000 usec
up_threshold = 95
  • Transition latency : the time the CPU needs to switch between frequency states.
  • Sampling rate : the interval at which the kernel checks CPU usage and decides whether to increase/decrease the frequency.
  • Up threshold : the average CPU utilization, over one sampling period at the current frequency, above which the kernel decides to increase the frequency.
Note : Sampling Rate = Transition Latency * 1000

Monday, August 6, 2012

Infiniband Xmit and Rcv Link Counter Interpretation using Perfquery

InfiniBand interconnect technology is preferred in the HPC domain because of its ultra-low latency (microseconds), high bandwidth using multiple lanes, and RDMA (Remote DMA) capability. Just like Ethernet it has multiple layers and can carry data at the SRP (SCSI RDMA) layer as well as the IPoIB (IP over InfiniBand) layer at the same time. I wanted to calculate the amount of data transferred on the 4 IB ports connected to my server; following is a quick and dirty method to do so for QDR IB ports. The infiniband-diags package on RHEL 6.1 comes with the perfquery utility.

Perfquery wrapper script to report the port counters for Mellanox HCA's :
echo "mlx4_0 P 1 - ib0 port - IPoIB Bond Master Interface"
perfquery -C mlx4_0 -P 1
echo "mlx4_0 P 2 - ib1 port"
perfquery -C mlx4_0 -P 2
echo "mlx4_1 P 1 - ib2 port"
perfquery -C mlx4_1 -P 1
echo "mlx4_1 P 2 - ib3 port - IPoIB Bond Slave Interface"
perfquery -C mlx4_1 -P 2

Monitoring the ports for activity :
Grep the desired output and put a watch on it, as follows.
watch -n 0.5 './ | grep -E "mlx|XmitData|RcvData|XmitPkts|RcvPkts"'
Script to clear the port counters :
perfquery -C mlx4_0 -P 1 -R 
perfquery -C mlx4_0 -P 2 -R 
perfquery -C mlx4_1 -P 1 -R 
perfquery -C mlx4_1 -P 2 -R 
echo "Counters Cleared"

How to interpret the actual data transferred on individual links of Infiniband :
  • PortXmitData/PortRcvData show the actual data transmitted/received on the port, excluding link packets. The values perfquery reports are the octet counts divided by 4, i.e. the counters count 4-byte (32-bit) words, so multiply by 4 to get bytes. To get the actual data transferred in MBytes, follow the procedure below. Suppose I do an unbuffered (oflag=direct) dd write of 200 MB to a CXFS storage vault through the IB network: the 200 MB is spanned across multiple IB ports and committed to the storage controller acting as the SRP target. The CXFS vault I am using for testing is a volume made up of multiple LUN slices, and these LUN slices have different XVM preferred paths (failover2.conf) owned by distinct storage controllers. A direct commit to the disk platter is necessary because otherwise multiple layers of buffering in the stack get involved to improve performance: data is written into the RAM file cache first and later gets "sync"ed* to disk, IB HCA's have their own buffers, and the storage controllers have their own caches. We would not see the actual data transmitted on the IB ports if Linux were using its own file/buffer cache.
             * - "sync" is an actual Linux command to commit all buffered data to disk; the related system calls are sync()/fsync(), which can be used in programs.

dd command :
dd if=/dev/zero of=/testvol1/testfile.dat bs=10M count=20 oflag=direct

Calculation for QDR :
XmitData on 2 of the 4 ports:
mlx4_1 P 1 = ((26564832*4)/1024)/1024 = 101.33 MBytes
mlx4_1 P 2 = ((26583984*4)/1024)/1024 = 101.40 MBytes
  • XmitPkts/RcvPkts show the total number of packets transferred on the link; the XmitData/RcvData above is carried inside these IB packets. Remember that this also excludes link packets.
  • The data counters are kept in units of 4-byte words (octets divided by 4), so multiply/divide accordingly.
  • The base rate of IB is 1X clocked at 2.5 Gbps, transmitted over one transmit pair & one receive pair, yielding an effective data rate of 2 Gbps each way (2 Gbps transmit, 2 Gbps receive).
  • The 4X and 12X interfaces use the same base clock rate but multiple pairs, where each transmit/receive pair is commonly referred to as a lane.
  • 4X IB has a signaling rate of 10 Gbps (8 Gbps data rate) using 4 lanes = 8 pairs.
InfiniBand Link   Signal Pairs   Signaling Rate   Data Rate   Full-Duplex Data Rate
1X                2              2.5 Gbps         2.0 Gbps    4.0 Gbps
4X                8              10 Gbps          8 Gbps      16 Gbps
12X               24             30 Gbps          24 Gbps     48 Gbps

            Note: Although the signaling rate is 2.5 Gbps, the effective data rate is limited to 2 Gbps due to the 8B/10B encoding scheme; i.e., (2.5*8)/10 = 2 Gbps.
  • XmitData/RcvData/XmitPkts/RcvPkts count the traffic of all the IB upper-layer protocols (SRP, IPoIB, SDP (Sockets Direct Protocol), etc.); they total the data transmitted on the IB stack regardless of the layered protocol.

Tuesday, June 26, 2012

MPI + OpenMP Thread + CUDA Kernel Thoughts

Random thoughts -
  • Complex jobs require Shared Memory Model & Distributed Memory Model at the same time for optimum performance. 
  • Engineering and related jobs require serious number crunching capabilities, GPU's are useful here.
  • Above block diagram is a possible parallel system architecture. Head node does the job scheduling on various soft-partitioned cluster nodes. Head node administrator can choose scheduling algorithm depending upon the type of jobs running.
  • Standard scheduling algorithms:
    • FIFO (First-In First-Out) - Works when all jobs are of equal length and require the same CPU time to complete; irregular job lengths may leave small jobs stuck behind larger ones.
    • RR (Round Robin) - Ensures completion of smaller jobs but involves a significant number of context switches.
    • SJF (Shortest Job First) - Runs the shortest jobs first; in practice this favors I/O-bound processes, which have short CPU bursts.
    • SRTF (Shortest Remaining Time First) - Allows a process with little CPU time remaining to run first by preempting longer-running processes.
    • Priority based - the name itself says enough.
    • MLFQ (Multi-level Feedback Queue) - Adjusts priority according to CPU burst usage.
    • Linux O(1) & O(n) scheduler.
  • Custom scheduler can be constructed depending upon the job type and keeping hybrid-model in mind. 
  • Big Shared Memory Systems can be soft-partitioned (Cpusets) or affinity-partitioned (Kmp-affinity) to distribute jobs into respective partitions. cc-NUMA systems can take advantage of globally addressable memory between multiple partition and eventually inter-dependent jobs can share their data.
  • For each partitioned system, MPI process can invoke OpenMP threads to run on 32 cores.
  • Parallel construct with lot of test conditions performs well on CPU's rather than GPU's. So multiple partitions can be subscribed for typical job with significant amount of critical sections.
  • Number crunching or String comparison operations where rapid parallel kernel execution is necessary can be offloaded to GPU's.
  • Multiple GPU's can be subscribed from the pool, peer-to-peer memory access between GPU's improves computation/memory communication overlap.
  • Memory can be copied from one partition to another subscribed partition on different nodes using intra-cluster MPI interconnects.
  • Interconnect layer can be of ultra-low latency (microseconds) Infiniband (IB) stack. RDMA (Remote DMA) capabilities of IB network are of great use. MPI can leverage the capability of IB interconnect by using RDMA to send/recv data directly into the application memory without interrupting CPU.
  • OpenMP threads can handle multiple GPU devices based on the requirement.
  • If single job needs to be placed on these systems, partitions can be dissolved and all CPU/GPU resources will be available to single job.
  • Furthermore, all GPU's can be placed behind a GPU scheduler head node & GPU jobs will be offloaded to that node. Many customizations are possible for this hybrid architecture.
That's it for now, but there is a lot in the pipeline for HPC systems on my blog, so stay tuned. Happy threading!!!

Wednesday, May 30, 2012

TOE: TCP Offload Engine on NIC & Packet Capture Misinterpretations

A quick post about TOE (TCP Offload Engine), present these days in almost all NIC's. If enabled, TCP/IP processing of packets happens on the NIC without interrupting the CPU or consuming PCI bandwidth. On Linux systems, TOE features can be configured through the standard utility ethtool. Remember, all these parameters are interface specific: a process sending packets through the network stack inherits the properties of the interface, which determines which operations it must do itself and which are offloaded.

To check the status of the offload operations currently supported by the NIC, use the following ethtool switch.
ethtool -k ethX
server1 root [tmp] > ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
A detailed description of all offloaded operations is out of scope for this post; refer to online resources for that. I will provide one-liner descriptions of the important fields before proceeding.
  • Scatter-Gather I/O - Rather than passing one large contiguous buffer, several small buffers which together make up the large buffer are passed. This is more efficient than passing one large buffer.
  • TCP Segmentation Offload - The ability to segment data according to the MTU size, reusing the same IP header for all packets, with the segmentation into smaller chunks offloaded to the NIC. Useful when the buffer is much larger than the MTU of the link.
  • Generic Segmentation Offload - Postpones segmentation as long as possible, performing it just before entry into the driver's xmit routine. GSO & TSO are only significantly effective when the MTU is much smaller than the buffer size.
  • Generic Receive Offload - GSO only works for transmitted packets; GRO is the receive-side counterpart. Unlike LRO, which merges every packet, GRO merges with restrictions, keeping the important fields of the packets intact so they can be re-segmented on output. The NAPI API polls for new packets and processes them in batches before passing them to the OS.
  • Large Receive Offload - Combines multiple incoming packets into a single buffer before passing it up to the OS stack. The benefit is that the OS sees fewer packets & uses less CPU time.
Depending upon your NIC vendor, the names of these features may vary, and some vendors provide additional offloads. My test hardware has the features mentioned above. For the sake of the test, I have disabled GRO & LRO. UFO is generally off on all NIC's; the reason is that UDP acknowledgments, where they are used, are implemented at the application layer, so the CPU needs full visibility of all packets & replies. For TSO to work, RX, TX & SG need to be enabled. To enable/disable these operations, use ethtool as follows and customize it to your requirement. Note that I kept the same TOE configuration on both the FTP server & client.
ethtool -K ethX rx on/off tx on/off sg on/off tso on/off gso on/off gro on/off lro on/off
I have generated a 4KB random-data file using the dd utility to transfer through FTP.
dd if=/dev/urandom of=dat.file bs=1k count=4
The MTU of the interface under test on both server & client is 1500 bytes. Packet captures were performed with tcpdump on the client and later analyzed in Wireshark.
tcpdump -i eth1 -w TOE_test.pcap -n not port 22
The 4KB file is segmented into chunks of data and multiple packets flow through the link. This TCP segmentation is usually done by the CPU in the absence of TOE, but with TOE enabled the OS hands down packets of 2920 bytes directly. Now this looks very weird if you don't know about TOE and start wondering how 2920 bytes can be sent in an Ethernet frame over a 1500-byte MTU link. This is the difference between practical observation & theoretical understanding. In the screenshot, an FTP PUT operation from client to server sends the 4KB file in two packets; the highlighted one carries 2920 bytes of data, followed by a packet with the remaining bytes. Here the TCP stack is in the NIC's domain: these packets are handed over to the NIC's TOE to do the segmentation and sequencing. The packet capture hook sits exactly at the boundary of the OS stack, hence we cannot see the actual TCP segmentation happening inside the NIC. Acks of data depend on the size of the data. Disabling GSO/TSO results in normal operation of the OS TCP/IP stack: data packets become 1460 bytes, the regular size for 1500-byte MTU links, and responsibility for TCP operations shifts back to the OS.

TCP Segmentation Offload & Generic Segmentation Offload Enabled
TCP Segmentation Offload & Generic Segmentation Offload Disabled
Performance improvements from TOE are observed on servers handling a large number of concurrent sessions serving large, homogeneous data files. PCI bandwidth is conserved because there is less management overhead for TCP segmentation, which otherwise involves continuous communication between NIC & CPU. I am studying the performance implications of TOE & will post about it soon. This TOE behavior needs to be understood because it also affects IDS/IPS systems like Snort, which perform threat-signature matching on packet captures. That's it for now. :)

Friday, May 4, 2012

Soft-Partitioning of SMP Machines using the Cpuset Environment

The Cpuset VFS (Virtual File System) environment provides us a way to assign a set of CPU's and memory nodes to a group of tasks. This is especially useful on NUMA systems. When I say node I mean a physical CPU and its memory bank, so a 16-CPU machine has nodes 0-15, each with its own memory bank as local memory and the other CPU's memory banks as remote shared memory connected through the ccNUMA interconnect.

You will find Taskset to be a very similar Linux utility for isolating processing cores on SMP systems; however, after using Taskset the memory allocation policy lies completely with the operating system kernel's page allocator: the Linux kernel NUMA allocator decides where to allocate memory pages for a process run under Taskset. Taskset also applies only to the specific program you are running, which can be very time consuming. So I prefer Cpuset over Taskset, because you can bind memory in a Cpuset environment, and the best part is that a Cpuset applies to a shell environment and consequently to all processes invoked within that specific shell.

Cpuset is useful if you have a large SMP machine with no virtualization present. It gives you the capability to do efficient job placement by managing resources; think of it as 1980's virtualization-by-isolation, just kidding. Cpuset performs its operation by using the sched_setaffinity() system call to set the CPU affinity mask & the mbind(2) and set_mempolicy(2) system calls to tell the kernel page allocator to allocate pages on specific nodes. The whole process of creating a Cpuset construct can be automated using scripts invoked before starting a specific HPC job, so according to the nature of the jobs one can select from a pool of scripts to create a hierarchical construct of Cpusets.

The kernel Cpuset is mounted at /dev/cpuset; remember this is a virtual file system. From user space one can create a directory inside /dev/cpuset to create a new Cpuset. The control files present in /dev/cpuset are reflected inside the user-created Cpuset right after the directory is created, with no values set. A Cpuset can be marked exclusive, which ensures that no other Cpuset (except direct ancestors and descendants) may contain any overlapping CPU's or memory nodes. Remember that a Cpuset masks off, from the shell process & its descendants, the CPU's which are not specified for that Cpuset. The same analogy goes for memory: if the Cpuset says to use memory from a specific node only, processes in the shell bound to the user-created Cpuset will be allocated memory from that node, and if it isn't sufficient the system will start using the swap partition. Furthermore, if your swap fills up you are out of luck and the process will be terminated. So be careful when using Cpusets: plan your memory utilization accordingly and take NUMA latencies into consideration when choosing page locations. Check Cpus_allowed & Mems_allowed in /proc/<<pid>>/status to see the currently masked CPU's and memory nodes. If memory spreading is turned off, i.e. memory interleaved mode is off, the currently specified NUMA policy (1) --interleave 2) --preferred 3) --membind 4) --localalloc) applies to memory allocation. The implementation of memory spread is very simple and follows a round-robin algorithm when allocating memory pages across nodes; this may induce NUMA access latencies if the processing node & memory nodes are distant.

By default, the kernel has one scheduling domain. One more benefit of firing jobs inside Cpusets is that scheduling overhead is lower, resulting in fewer context switches; Cpuset also provides an option to enable or disable load balancing inside its domain. Jobs requiring a modest number of cores, say 32, benefit greatly from a Cpuset construct. These are just a few scenarios I have shed light on; configuration and management of Cpusets is a huge topic, so go explore.

Using Cpusets in SLES on ccNUMA systems -
1) cd /dev/cpuset
2) mkdir svp_env
3) cd svp_env
4) /bin/echo 0-15 > cpus
Use 0-15 Cores for svp_env construct.
5) /bin/echo 0 > mems
Use memory of Node 0 for svp_env construct.
6) /bin/echo 1 > cpu_exclusive
7) /bin/echo 1 > mem_exclusive
8) /bin/echo $$ > tasks
Attach current shell to svp_env construct.
9) cat /proc/self/cpuset
To check current cpuset environment.
10) sh
Fork new shell inside svp_env cpuset shell.
11) cat /proc/self/cpuset
You will find the same cpuset environment of parent shell.
12) you can dynamically add or remove cpus & mems by using same shell.
13) Removing Cpusets - Make sure no shell is attached to the Cpuset you want to remove. If one is, terminate it & use the rmdir command on the directory inside /dev/cpuset. "rm -rf" will not work on Cpuset vfs directories.

So, enough of this jabber. Now go & restrict evil processes from getting CPU's. :)

Wednesday, April 18, 2012

Transparent Huge Pages Performance Implications

Recently I have observed performance improvements for memory-intensive applications on RHEL 6.x systems due to THP (Transparent Huge Pages). THP's are different from static Huge Pages. THP is enabled by default in RHEL 6.x versions & available in kernel 2.6.38 onwards. Let's clear up some basic points:
  • Page size on x86 arch = 4kB.
  • Page size on x86_64 arch = 4kB as well (the base page size is 4kB on both).
  • Huge page size = 2MiB => 512 x 4kB page = 2MiB.
  • Physical page is called as Page Frame.
  • Virtual memory to physical memory address translations are stored into Page Tables.
  • Traversing these Page Tables for memory addresses is called as Page Walk.
  • Page tables can be linked hierarchically to address memory.
  • Every memory access needs to go through Page Walk.
  • Operating System decides about using Normal page or Huge Page for process memory allocation on brk() call or mmap() call.
  • Page walk entries are cached into TLB's (Translational Look-aside Buffers).
  • TLB Cache-Hit or Cache-Miss happens according to Page table size and access pattern associated.
  • TLB cache needs to be flushed with every context switch.
  • Typically a TLB can hold from 8 up to 1024 entries in cache.
  • Using Huge Pages lets each TLB entry map the equivalent of 512 normal pages, increasing the Cache-Hit ratio.
  • Huge Pages may cause fragmentation overhead due to re-segmenting excess data size (brk() call).
  • cat /proc/meminfo reports THP's as "AnonHugePages". As of now, THP's are used only for anonymous memory allocations.
  • No configuration required for applications to exploit THP's.
Enabling THP :
echo always > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Disabling THP :
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
I have written a tiny code to stress the TLB cache & perform page walks with normal pages & huge pages. This code performs a single-process, single-thread random operation on arrays utilizing 1GB-50GB of memory.

C code for 1GB memory :
/*Purpose - HugePages Benchmark on RHEL 6.x
Coder - Subodh Pachghare
Date Modified - 16-4-12
Compilation - gcc -O2 HugePages_test_rhel6_svp.c*/
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <strings.h>

#define access 500000000UL
int gar1[access];
clock_t time_start;

int main()
{
  //Modify size variable to change memory utilization of code
size_t size1 = 1024UL*1024*1024*1;
time_start = clock();
char *p = (char*)malloc(size1);
  //Zeros p (memory touch)
bzero(p, size1);
printf("[+]mem_alloc: %f sec\n", ((double)(clock() - time_start)) / CLOCKS_PER_SEC);
long i;
for (i = 0; i < access; i++)
    gar1[i] = random() % size1;

time_start = clock();
  //Each array access will need Page Table lookup
for (i = 0; i < access; i++)
    p[gar1[i]] = 2 * gar1[i];
printf("[+]random_op: %f sec\n", ((double)(clock() - time_start)) / CLOCKS_PER_SEC);
return 0;
}

Performance Graph :
THP Performance of Code
  • Code reports collective time of memory allocation (malloc() call) & memory touch operation (bzero() call) in first phase.
  • Second phase time reported by code is math operation performed on arrays.
  • Each access to array will go through TLB to do a Page Walk.
  • From the graph, the memory allocation & memory touch operation improves with THP's enabled. Performance improvements are greater at higher memory usages.
  • The execution phase shows significant performance improvement with THP's enabled. This indicates fewer TLB cache misses while traversing the page table during the array operation.
  • As I mentioned earlier, this is a single-process operation doing memory access on a shared memory system with ccNUMA architecture. Results may vary for a larger number of processes/threads depending upon the access pattern. 
THP's For Parallels Virtuozzo Users :
  • THP's in Base node are made available to all containers automatically. There is no need for any container specific configuration. THP's in Virtuozzo are container template agnostic, i.e. RHEL 6.x base node can provide THP's to SLES template based containers.
My Take - There is no point in disabling this feature if you are running RHEL 6.x with a memory intensive application. It will definitely give you a performance boost without any modification to the application.

Sunday, March 11, 2012

Analysis of OpenMP Schedule Clause in Parallel Matrix Multiplication

In this post, I will describe some findings on my first encounter with Parallel Programming using OpenMP 3.1 API. 
I discuss here the simple usage of OpenMP API to speedup the operation which otherwise would be single threaded taking loads of time to complete. Let's discuss few basics of OpenMP before proceeding to Matrix Multiplication.

OpenMP Thread Model

Few points regarding OpenMP:
  • OpenMP is designed keeping in mind the perfect UMA (Uniform Memory Architecture) system, where all memory access latencies are the same. However, a NUMA (Non-Uniform Memory Architecture) system has variable latencies depending upon the distance and the design topology used. I have used a NUMA system here for benchmarking.
    • Detailed discussion of NUMA is beyond scope for this post. You can refer to my earlier post
                    Shared Memory Performance in cc-NUMA Systems with Libnuma API
  • Thread placement can be controlled using various CPU affinity features of OS as well as OpenMP; in Linux NUMA systems threads are placed on CPU's closer to allocated memory region to avoid high memory access latencies. This is dynamically controlled by Linux kernel unless specified by user. OpenMP 4.x version will have better support for NUMA systems.
  • Code can contain both serial and parallel statements, all the ugly syntax of threading is hidden behind the OpenMP pragma directives.
  • OpenMP is best for loop level parallelism, without much critical sections.
  • Threads will be forked according to the pragma directives and joined after completion. Using OpenMP we can construct multiple parallel regions within the same code. Multiple threads can be forked on occurrence of every parallel region construct.
  • OpenMP allows us to fork threads once and then distribute nested parallel regions on these threads, reducing the overhead of thread creation at every parallel construct. Constructs such as the flush directive help us maintain integrity of data between nested threads.
  • OpenMP usually distributes 1 thread per core. More threads per core can cause decrease in performance by stealing CPU time, however this behavior depends upon the application and system configurations.
  • Thread creation involves certain overheads which must be taken into consideration while writing scalable parallel code.
Advantages of OpenMP:
  •  Easy to program. More readable code.
  •  Data layout is  handled automatically by OpenMP directives.
  •  Threads can move from one core to other, unless specifically bound.
  •  Easy modification of environment settings.
  •  Guaranteed to perform efficiently on Shared Memory Systems.
  •  Portable and widely available.
Disadvantages of OpenMP:
  • Performance may decrease after a finite number of threads, as all threads will compete for shared memory bandwidth. To reduce this, threads must be placed properly on nearer CPU's depending upon the application requirement.
  • Thread Synchronization is expensive.
  • Available on SMP (Symmetric Multiprocessing) systems only.
  • Not as scalable as MPI (Message Passing Interface), although Hybrid model of MPI + OpenMP + OpenMP CUDA is getting a lot of attention. We will discuss it in later posts.

Parallel Matrix Multiplication

As part of learning OpenMP I have written a code for Parallel Matrix Multiplication. The algorithm used here for Matrix Multiplication is the conventional one (we all learned it in school). Lots of optimized Matrix Multiplication algorithms exist, but let's leave those to another post.

2-Thread Parallel Matrix Multiplication Flow

System/Software Information:
  1. System - SGI Altix UV1000 (128 core 16 CPU Node 1 TB RAM Fat-Tree model cc-NUMA)
  2. OS - RHEL 6.1 x86_64
  3. GCC version - Red Hat 4.4.5-6
C Code Snippet:
Complete code can be downloaded here -
    /*Assigning matrix elements with static data, conditional pragma directive*/
printf("Assigning array elements with '1' for multiply op...\n");
#pragma omp parallel for private(i,j) if(mat_size>2000)

    /*Parallel Matrix Multiplication operation starts*/
printf("Matrix multiplication with openmp...\n");
printf("Chunk Size = %d\n",chunk);
fprintf(f1,"Chunk Size = %d Matrix size = %d Threads = %d\n",chunk,mat_size,threads);
    /*OpenMP time function to calculate real time required by operation*/
start_time = omp_get_wtime();
    /*Parallel construct with shared variables & private variable throughout the region*/
#pragma omp parallel shared(mul,m1,m2) private(threads_id)
{
    /*Report thread number*/
    threads_id = omp_get_thread_num();
    if (threads_id == 0)
    {
        /*Master thread will report total number of threads invoked*/
        tot_threads = omp_get_num_threads();
        printf("Total worker threads invoked = %d\n",tot_threads);
    }
    /*Parallel for loop directive with (dynamic, chunk) schedule policy.
      Private variables for parallel loop construct.
      schedule options:
      1) schedule (static)
      2) schedule (static, chunk)
      3) schedule (dynamic)
      4) schedule (dynamic, chunk)*/
    #pragma omp for schedule (dynamic, chunk) nowait private(i,j,k,temp)
    /*Outer loop Row parallelization*/
    for (i = 0; i < mat_size; i++)
    {
        /*printf("Thread=%d row=%d completed\n",threads_id,i);*/
        for (j = 0; j < mat_size; j++)
        {
            temp = 0;
            for (k = 0; k < mat_size; k++)
                temp += m1[i][k] * m2[k][j];
            mul[i][j] = temp;
        }
    }
}
end_time = omp_get_wtime();
printf("[+]multiply_op_omp_time %f sec\n", end_time-start_time);

Code Compilation and Output:
I have compiled the code with GCC optimization level 2. The sample code output shown below is produced for 128 threads, matrix size 3000 and chunk size 1.
server1 [root]> gcc -O2 omp_mmul_svp.c -o omp_mmul -fopenmp
server1 [root]> ./omp_mmul 128 3000 1
Multiply Array memory allocation...
Assigning array elements with '1' for multiply op...
Matrix multiplication with openmp...
Chunk Size = 1
Total worker threads invoked = 128
[+]multiply_op_omp_time 4.379983 sec
Single thread verification complete...OK

Thread Monitoring:
To monitor threads on system, start Top utility with Threads toggle option enabled.
top -H
Threads are part of the same process. Using the P column in top we can determine the core used for processing of each thread.

Code Information:
  • Frequently used shared and thread-private variables are declared as register variables, to suggest that the compiler keep them in CPU registers for faster access.
  • Number of worker threads to be invoked is provided through omp_set_num_threads() function. If not specified OpenMP will invoke 1 thread for each CPU core in the system.
  • 2D arrays are initialized, followed by memory allocation using the malloc() function.
  • Now it is necessary to analyze the result of the matrix multiplication because it could contain erroneous results due to race conditions, which are common in parallel programming. So I chose static data for the matrix elements to verify the results easily against serial code. Here I have assigned the m1 and m2 matrices with '1' as the element value, so that each element in the mul matrix should end up equal to the matrix size given by the user. For example, if the matrix size is 3000, then mul[0][0] should be 3000 if the operation is error-free.
  • Assigning values to the elements of the m1 and m2 matrices takes more time, especially if the matrix size is large, so to circumvent this I have used a conditional pragma directive here: if the matrix size provided is more than 2000, the for loop becomes a parallel for loop. Please note that the number of threads invoked will depend upon the value given by the user. Placement of the omp_set_num_threads() call in the code is very important.
  • Parallel assigning of values to m1 and m2 matrices improves performance because the assignment operator is independent of each other. Never use function calls in parallel unless they are parallel too.
  • Operation time calculated is only for multiplication section of code. OpenMP provides function omp_get_wtime() to report the current time. Difference between start time and end time is the time required to perform the parallel multiplication operation.
  • The parallel region with global shared variables for matrix data and a private variable for thread number is followed by a parallel for region. A pragma for statement applies only to one loop; nested for loop parallelism is achieved using multiple pragma omp directives. In a pragma for directive the loop variable 'i' becomes private by default, though I explicitly declared these variables private to make the code more understandable. The default data-sharing attribute in a pragma omp parallel construct is shared unless specified; you can force explicit scoping of every variable by using default(none) in the pragma omp parallel statement (C/C++ does not support default(private)).
  • The nowait clause in the pragma for statement removes the implied barrier at the end of the worksharing construct, so execution continues with the next statement in the pragma omp parallel region. Threads that complete their jobs early can proceed straight to the next statements in worksharing constructs without waiting for other threads to finish. This is typically used if more than one pragma for loop is present in a parallel region. The combination of the schedule clause (work load distribution) and the nowait clause must be adjusted to optimize the code.
  • Schedule clause divides iteration of pragma for loops into contiguous subsets known as chunk and distributes them on threads as per the types. These options can be hard-coded into code or can be specified from environment settings at run time. Here I am comparing two types of schedule clauses namely static and dynamic.
    • static - Iterations are equally divided and statically distributed among threads. Chunks are calculated automatically and assigned to threads in round-robin fashion depending on order of thread number.
    • static, chunk_size - Iterations are divided into N chunks specified and statically distributed among threads. Distribution remains same as in static, only chunk size is specified here.
    • dynamic - Iterations are divided according to the default chunk size of 1 and dynamically distributed among threads. Iteration chunks are handed to threads as they request them; a thread processes its chunk and then requests a new one until no chunks remain. Since the default chunk_size is 1, each thread is given 1 iteration at a time.
    • dynamic, chunk_size - Iterations are divided into N chunks specified and dynamically distributed among threads. Distribution policy remains same as in dynamic, only difference is each thread will be given N iterations to process. Efficient load balancing is achieved using dynamic mode. Dynamic mode also adds processing overhead.
  • There are more types of schedule clause which are not discussed here, such as guided, auto and runtime; refer to the OpenMP API guide for details.
  • Depending on your code, I suggest using the Gprof profiler to check for unexpected wait states in parallel code. I have explained detailed usage of Gprof in my earlier post 
          Code Profiling in Linux Using Gprof
Thread vs Chunk Size Performance graph:

  • As seen in the Thread vs Chunk Size performance graph, a single-thread run gives different execution times for different chunk sizes. With one thread on a single core, every chunk request is served by the same thread, so a higher chunk size means fewer dynamic-schedule allocations and better single-thread performance. Single-thread operation of the OpenMP code is slower than plain serial code due to the overheads involved.
  • For 2 thread operation, chunk size 2 takes more time to complete than all other chunk sizes. Chunk size 1 provides maximum performance.
  • For 4 thread operation, chunk size 4 yields maximum performance with approx. 188 chunks to be processed by each thread.
  • If we go on increasing the number of threads to the maximum cores present in the system, i.e. 128, the performance difference between different chunk sizes vanishes. The slight difference that remains with 128 cores is because of the overhead of the dynamic schedule option requesting chunks to process.
  • In dynamic mode, threads are loaded gradually as each thread requests a chunk of data to process. Since there is a possibility of job completion before all specified threads are loaded, it is necessary to choose the chunk size and number of threads wisely if the system is used for multiple parallel programs.
Dynamic vs Static Schedule Performance graph:

  • This benchmark is performed with no chunk size specified in the schedule clause. For the static type, the chunk size is calculated automatically with equal distribution in mind. For the dynamic type, if the chunk size is not specified it defaults to 1, so each thread is given 1 iteration of the for loop to process, and after completion it can request a new iteration. In a NUMA system, iterations processed closer to the allocated memory location complete faster than iterations processed on distant nodes, due to memory access latencies.
  • Static will load all specified threads immediately with calculated chunk sizes.
  • Up to 16 threads there is a significant difference between the performance of dynamic and static schedules; after that the difference vanishes.
  • Use of dynamic scheduling increases performance for lower number of threads.
  • OpenMP behavior is dynamic, tune parameters according to underlying system to get maximum performance.
  • Idling threads are of no use at all; try to load threads equally so that they complete their jobs at the same time.
  • Use OpenMP only with sufficient workload.
  • Avoid fork/join of threads at every parallel construct, reuse invoked threads.
  • Play with code and observe performance to match best results.
  • Parallel code is very hard to debug as it silently produces wrong results. Verification is mandatory for error-free parallel code.
This is my first code on OpenMP platform and I am very excited to learn more about OpenMP. Happy Parallelism!

Sunday, February 19, 2012

Shared Memory Performance in cc-NUMA Systems with Libnuma API

NUMA System Architecture
Recently I got involved in the High Performance Computing business. I studied the cc-NUMA (Cache Coherent Non-Uniform Memory Access) Shared Memory model as a part of it. In the Shared Memory model, all memory in a cc-NUMA system is globally addressable, unlike the Distributed Memory model of MPI (Message Passing Interface), generally known as the Cluster model. CPU's in cc-NUMA systems are interconnected through cache-coherent interconnects with topologies like 2D-Torus, Fat-Tree, 3D-Hypercube, etc.; detailed discussion of all these is out of scope for this post. The typical system I have studied has 16 8-core CPU's with a 64 GB RAM/CPU ratio and a Fat-Tree interconnect topology, which I will be discussing shortly. From now on I will refer to CPU0 as Node0 and to cc-NUMA as NUMA.

Cache Coherency - CPU's in NUMA systems are capable of using remote memory connected to other CPU's, so parallel processes/threads can run on multiple nodes modifying the same data stored in one particular node's memory. Inconsistent data present across memory locations can cause a program to generate false output (race conditions). NUMA interconnects take care of memory consistency using various CC protocols. Simultaneous reads/writes by threads on data in cache cause consistency overhead, so programmers must take the NUMA structure into consideration while writing code. Remember that such code is written for a specific NUMA system without taking portability into account. There are two approaches to lessen coherency overheads.
  1. Place data nearer to thread processing nodes.
  2. Craft thread processing node policies such that nodes nearer to data will process it, sort of Affinity-like structure.
In NUMA systems distant memory access is comparatively slower than local memory access, but far less slow than in typical MPI architectures. Choosing between Shared Memory & MPI depends on the applications you are going to use. Shared Memory/NUMA systems can run applications which are agnostic to the underlying architecture; the reason is that NUMA memory allocation policies are controlled by the Operating System layer by default. The Linux kernel is NUMA aware and provides the Libnuma library for programmers to design NUMA aware codes. Linux also incorporates the numactl and numastat commands to confine the CPU & memory nodes for given programs and to provide NUMA statistics. The NUMA SLIT distance matrix can be seen using the numactl command, which I will discuss at the end of this post. MPI applications need to be MPI aware, which requires modification of code.

Fat-Tree Model - 16 Nodes with 8 cores each are connected in Fat-Tree topology with each other. This is implemented in back-plane of system with interior switch nodes in upper layers. This internal NUMA can be linked with other internal NUMA using NUMA Routers making it inter-system NUMA supercomputer. SGI uses Fat-tree topology model for SGI UV1000 with more links in upper branches.

If a node connected at the lowest point in the tree wants to communicate with the memory of a lowest point in a different branch of the tree, it will block the upper line until its operation is over. To overcome this problem, the lines or interconnects get thicker as you go up the tree, providing more paths for simultaneous communication. Global communication will flow through higher switches with more interconnect links; interconnects increase as we approach the root switch. The lowest ends or leaves in the tree are nodes. Local communication between nodes on the same branch happens through local switches. HPC vendors have modified the fat-tree topology to suit their structure. Hops depend on the layers of switches used by the HPC vendor. Fat-tree model formulas are as follows:

N levels of switches - 4 levels in our case
2^N nodes - 16 Nodes
2^N - 1 switches - 15 Switches
1 connection per Node
2N greatest distance between Nodes - 8 hops

NUMA Performance benchmark Code using Linux NUMA API in C

Test Structure - We all know C is the 'Metal Language', best to deal with system level stuff, so I decided to use C. I have used CPU0 Core0, i.e. processor 0, as the reference CPU to perform all operations. Initially the source array is placed in source memory bound to the local memory of CPU0, then the local memory of CPU1 and finally the local memory of CPU2.

General overview of test performed on NUMA system

C Code
/*Subodh Pachghare*/
/*compile with "gcc ... -lnuma"*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <numa.h>
#include <numaif.h>
#include <numacompat1.h>
/*Global Variables*/
int src[10000000];
int dst[10000000];
int counter=0;
clock_t time_start;
char *dat_file = "dat_file";

int main(int argc, char *argv[])
{
    if (4 != argc)
    {
        fprintf (stderr, "Usage: %s <NUMA CPU Node> <NUMA SRC Mem Node> <NUMA DST Mem Node>\n", argv[0]);
        exit(1);
    }
    FILE *f1;
    f1 = fopen(dat_file, "a");
    int cpu_node_int,src_memo_node_int,dst_memo_node_int;
    if(numa_available() < 0)
    {
        printf("Your system does not support NUMA API\n");
        exit(1);
    }
    printf("NUMA Nodes available = %d\n",(int)numa_max_node()+1);
    /*char to int conversion for bitmask*/
    cpu_node_int = atoi(argv[1]);
    src_memo_node_int = atoi(argv[2]);
    dst_memo_node_int = atoi(argv[3]);
    printf("NUMA Distance CPU & SRC_mem = %d Units\n",numa_distance(cpu_node_int,src_memo_node_int));
    printf("NUMA Distance CPU & DST_mem = %d Units\n",numa_distance(cpu_node_int,dst_memo_node_int));
    printf("NUMA Distance SRC_mem & DST_mem = %d Units\n",numa_distance(src_memo_node_int,dst_memo_node_int));
    /*NUMA Node CPU Bind structure*/
    nodemask_t cpu_bind;
    nodemask_zero(&cpu_bind);
    nodemask_set(&cpu_bind, cpu_node_int);
    numa_run_on_node_mask(&cpu_bind);
    /*printf("CPU Bitmask value = %lu\n",cpu_bind.n[0]);*/
    /*use "numa_bind(&cpu_bind)" in case of node restriction policy without mem_bind()*/
    /*NUMA Node SRC Memory Bind structure*/
    nodemask_t src_memo_bind;
    nodemask_zero(&src_memo_bind);
    nodemask_set(&src_memo_bind, src_memo_node_int);
    numa_set_membind(&src_memo_bind);
    printf("SRC Memory Bitmask value = %lu\n",src_memo_bind.n[0]);
    /*Random function touches src array on SRC_mem node (uses CPU cache)*/
    time_start = clock();
    long i;
    for (i = 0; i < 10000000; i++)
        src[i] = rand();
    /*printf("src value in 0 = %d\n",src[9999999]);*/
    printf("[+]random_op: %f\n",(double)(clock() - time_start) / CLOCKS_PER_SEC);
    fprintf(f1,"[+]random_op: %f\n",(double)(clock() - time_start) / CLOCKS_PER_SEC);
    /*NUMA Node DST Memory Bind structure*/
    nodemask_t dst_memo_bind;
    nodemask_zero(&dst_memo_bind);
    nodemask_set(&dst_memo_bind, dst_memo_node_int);
    numa_set_membind(&dst_memo_bind);
    printf("DST Memory Bitmask value = %lu\n",dst_memo_bind.n[0]);
    /*Memcpy operations between src and dst arrays*/
    time_start = clock();
    for (counter = 0; counter < 4; counter++)
        memcpy(dst, src, sizeof(src));
    printf("[+]memcpy_op: %f\n",(double)(clock() - time_start) / CLOCKS_PER_SEC);
    fprintf(f1,"[+]memcpy_op: %f\n",(double)(clock() - time_start) / CLOCKS_PER_SEC);
    fclose(f1);
    return 0;
}

  • The nodemask_t structure is provided by the Linux NUMA API (Libnuma, originally developed at SuSE) to set nodes for memory & CPU binding.
  • The first structure binds the process to the given CPU node using the numa_run_on_node_mask(nodemask_t *mask) function.
  • The second & third structures bind source and destination memory to the given node using the numa_set_membind(nodemask_t *mask) function.
  • The source array is touched on SRC_mem by assigning random values from the rand() function.
  • 4 memcpy() calls are performed on the src and dst arrays in a loop.
  • CPU time spent in each block is calculated using the clock() function; the difference is divided by CLOCKS_PER_SEC (an OS/hardware specific value) to get the CPU time for the process.

Memcpy graph -

Gnuplot code for Memcpy graph -
set datafile separator ","
set grid x y2
set grid y y2
set xtics (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
set ytics auto
set title 'NUMA Memcpy Performance - Fat Tree'
set ylabel 'Execution Time (Sec)'
set xlabel 'DST_mem NUMA Nodes'
set style data linespoints
set arrow 1 from 8,43 to 8,46
set arrow 2 from 9,43 to 9,46
set label 1 'Max Time CPU0 SRC_mem2 DST_mem8,9' at 6.8,41
set terminal png
set output "Memcpy_perf.png"
plot 'src_mem_0.csv','src_mem_1.csv','src_mem_2.csv'

SLIT (System Locality Information Table) values -
Use following command on NUMA enabled Linux system to find out SLIT values provided by BIOS
numactl --hardware

NUMA SLIT Matrix on 16 Nodes system used for testing -
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  13  40  40  62  62  55  55  62  62  55  55  62  62  55  55
  1:  13  10  40  40  62  62  55  55  62  62  55  55  62  62  55  55
  2:  40  40  10  13  55  55  48  48  55  55  48  48  55  55  48  48
  3:  40  40  13  10  55  55  48  48  55  55  48  48  55  55  48  48
  4:  62  62  55  55  10  13  40  40  62  62  55  55  62  62  55  55
  5:  62  62  55  55  13  10  40  40  62  62  55  55  62  62  55  55
  6:  55  55  48  48  40  40  10  13  55  55  48  48  55  55  48  48
  7:  55  55  48  48  40  40  13  10  55  55  48  48  55  55  48  48
  8:  62  62  55  55  62  62  55  55  10  13  40  40  62  62  55  55
  9:  62  62  55  55  62  62  55  55  13  10  40  40  62  62  55  55
 10:  55  55  48  48  55  55  48  48  40  40  10  13  55  55  48  48
 11:  55  55  48  48  55  55  48  48  40  40  13  10  55  55  48  48
 12:  62  62  55  55  62  62  55  55  62  62  55  55  10  13  40  40
 13:  62  62  55  55  62  62  55  55  62  62  55  55  13  10  40  40
 14:  55  55  48  48  55  55  48  48  55  55  48  48  40  40  10  13
 15:  55  55  48  48  55  55  48  48  55  55  48  48  40  40  13  10

Values present in the table are relative latencies. Diagonal values remain constant. The most distant nodes have the maximum value of 62; for example, Node 0 is distant from nodes 4, 5, 8, 9, 12 & 13. The reference value of 10 is the cost of accessing the local memory of a node. According to the table it can be clearly seen that access to the memory of Node 1 from Node 0 will have 30% more latency (13 vs 10) compared to local memory access on Node 0.

Linux NUMA Memory Allocation policies -
NUMA aware Linux creates a logical tree structure of memory nodes from the SLIT tables and decides the memory allocation policy accordingly. The default policy is to use the local memory of a node for the operation, as local memory access has the least latency; if the local node is running low on memory then the next node's memory can be used with somewhat more latency, and so on, with the highest-latency nodes as the last option. Similarly, processes can be migrated to nodes with enough local memory, resulting in low access latencies. Linux does this load balancing internally; however, for some special application-specific conditions the user can provide input to the NUMA policy by using the numactl command. Some applications work perfectly out of the box on a NUMA system with improved performance, while some need tuning using numactl parameters or source modification. Some of the options available for the memory allocation policy are as follows:
1. Interleave - Memory will be allocated on all nodes using round-robin fashion.
2. Localalloc - Always allocate memory on local node.
3. Preferred - Allocate memory on given node & fallback to other node if not available.
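For reference, these policies map onto numactl invocations roughly as follows; "./my_app" is a placeholder program name, and this is a command-line sketch rather than something to run verbatim:

```shell
numactl --interleave=all ./my_app   # 1. Interleave: round-robin pages across all nodes
numactl --localalloc ./my_app       # 2. Localalloc: always allocate on the local node
numactl --preferred=1 ./my_app      # 3. Preferred: prefer node 1, fall back if it is full
```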

That's it for this post guys. I am working on OpenMP with NUMA systems and I will post about it, till then happy parallel computing.

Update 19-6-2012:
Query from reader -
"In your code, you have two global arrays src and dst. As I know, global variables are allocated by the compiler; and before the program executes, these two arrays are already allocated in the global data area. However, later in the code numa_set_membind(nodemast_t) is called and the src array is assumed in the node src_memo_node_int. I am really confused why the global arrays can be resided in different node according to src_memo_node_int when the compiler allocates them. So it would be appreciated if you could explain it for me."

Reply -
"There is a first touch policy in place, first CPU or to be more precise first Core accessing the page will be responsible for faulting the page in and assigning it on local CPU node memory. For Ex. If Core 9 touches memory first then page will be allocated on node memory of CPU socket having Core 9.There is a difference between assigning memory & allocating memory.
Memory gets allocated in node at the time of first access, which in my case is src array initialization with random value & dst array initialization with null value using memcpy call. Uninitialized global variables are termed as BSS(Block Started by Symbol) in process memory context.Use size command before executable and observe the BSS size with/without global variables. At the time when memory is being allocated on a Linux system no space in the physical memory is needed. Physical memory will be allocated when memory is accessed for the first time. This touching of memory has to be implemented in code running on Shared Memory Systems. Touches = Writes/Reads. There is round-robin policy in place instead of first touch, but all major SMS vendors like SGI provide First Touch policy by default.

Ex. For a global variable "int z" in the program, what happens when the process/thread on the respective core set by nodemask() tries to access it?
● Page fault occurs
● OS looks at the page and realizes it corresponds to a zero page
● Allocates a new physical frame in memory set by membind() node policy and sets all bytes to zero
● Maps the frame into the address space"