Sunday, February 19, 2012

Shared Memory Performance in cc-NUMA Systems with Libnuma API

NUMA System Architecture
Recently I got involved in the High Performance Computing business, and as part of it I studied the cc-NUMA (Cache Coherent Non-Uniform Memory Access) Shared Memory model. In the Shared Memory model, all memory in a cc-NUMA system is globally addressable, unlike the Distributed Memory model of MPI (Message Passing Interface), generally known as the Cluster model. CPUs in cc-NUMA systems are interconnected through cache-coherent interconnects with topologies like 2D-torus, Fat-Tree, 3D-hypercube, etc.; a detailed discussion of these is out of scope for this post. The typical system I have studied has 16 8-core CPUs with a 64 GB RAM/CPU ratio and a Fat-Tree interconnect topology, which I will discuss shortly. From now on I will refer to CPU0 as Node0 and to cc-NUMA as NUMA.

Cache Coherency - CPUs in NUMA systems are capable of using remote memory attached to other CPUs, so parallel processes/threads running on multiple nodes can modify the same data stored in one particular node's memory. Inconsistent copies of that data across memory locations can cause a program to generate false output (race conditions). The NUMA interconnect takes care of memory consistency using various cache-coherency protocols. Simultaneous reads/writes by threads on cached data cause coherency overhead, so programmers must take the NUMA structure into consideration while writing code. Remember that such code is written for a specific NUMA system without taking portability into account. There are two approaches to lessen coherency overheads:
  1. Place data nearer to thread processing nodes.
  2. Craft thread placement policies such that nodes nearer to the data will process it, a sort of affinity-like structure.
In NUMA systems remote memory access is slower than local memory access, but far less so than in typical MPI architectures. Choosing between Shared Memory and MPI depends on the applications you are going to use. Shared Memory/NUMA systems can run applications that are agnostic to the underlying architecture, because NUMA memory allocation policies are controlled by the operating system layer by default. The Linux kernel is NUMA aware and offers the libnuma library for programmers to design NUMA-aware code. Linux also provides the numactl and numastat commands to confine a given program to particular CPU and memory nodes and to report NUMA statistics. The NUMA SLIT distance matrix can be seen using the numactl command, which I will discuss at the end of this post. MPI applications need to be MPI aware, which requires modification of the code.

Fat-Tree Model - 16 nodes with 8 cores each are connected to each other in a Fat-Tree topology. This is implemented in the back-plane of the system, with interior switch nodes in the upper layers. This internal NUMA can be linked with other internal NUMAs using NUMA routers, making an inter-system NUMA supercomputer. SGI uses the Fat-Tree topology model, with more links in the upper branches, for the SGI UV1000.


If a node at the lowest point in the tree wants to communicate with the memory of the lowest point in a different branch of the tree, it will block the upper link until its operation is over. To overcome this problem, the links or interconnects get thicker as you go up the tree, providing more paths for simultaneous communication. Global communication flows through the higher switches with more interconnect links, and the number of links increases as we approach the root switch. The lowest ends, or leaves, of the tree are the nodes; local communication between nodes on the same branch happens through local switches. HPC vendors have modified the fat-tree topology to suit their structures, and the number of hops depends on the layers of switches used by the vendor. The fat-tree model formulas are as follows:

N levels of switches - 4 levels in our case
2^N nodes - 16 nodes
2^N - 1 switches - 15 switches
1 connection per node
2N greatest distance between nodes - 8 hops

NUMA Performance benchmark Code using Linux NUMA API in C

Test Structure - We all know C is the 'metal language', best suited for system-level work, so I decided to use it. I have used CPU0 Core0, i.e. processor 0, as the reference CPU to perform all operations. Initially the source array is placed in source memory bound to the local memory of CPU0, then the local memory of CPU1, and finally the local memory of CPU2.

General overview of test performed on NUMA system

C Code
/*Subodh Pachghare*/
/*compile with "gcc ... -lnuma"*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <numa.h>
#include <numaif.h>
#include <numacompat1.h>
/*Global Variable*/
int src[10000000];
int dst[10000000];
int counter=0;
clock_t time_start;
char *dat_file = "dat_file";
int main(int argc, char *argv[])
{
if (4 != argc)
    {
    fprintf (stderr, "Usage: %s <NUMA CPU Node> <NUMA SRC Mem Node> <NUMA DST Mem Node>\n", argv[0]);
    exit(1);
    }
FILE *f1;
f1 = fopen(dat_file, "a");
int cpu_node_int,src_memo_node_int,dst_memo_node_int=0;
if(numa_available() < 0) 
    {
    printf("Your system does not support NUMA API\n");
    exit(1);
    }    
printf("NUMA Nodes available = %d\n",(int)numa_max_node()+1);
/*char to int conversion for bitmask*/
cpu_node_int = atoi(argv[1]);
src_memo_node_int = atoi(argv[2]);
dst_memo_node_int = atoi(argv[3]);
printf("NUMA Distance CPU & SRC_mem = %d Units\n",numa_distance(cpu_node_int,src_memo_node_int));
printf("NUMA Distance CPU & DST_mem = %d Units\n",numa_distance(cpu_node_int,dst_memo_node_int));
printf("NUMA Distance SRC_mem & DST_mem = %d Units\n",numa_distance(src_memo_node_int,dst_memo_node_int));
/*NUMA Node CPU Bind structure*/
nodemask_t cpu_bind;
nodemask_zero(&cpu_bind);
nodemask_set(&cpu_bind,cpu_node_int);
numa_run_on_node_mask(&cpu_bind);
/*printf("CPU Bitmask value = %u\n",cpu_bind);*/
/*use "numa_bind(&cpu_bind)" in case of node restriction policy without mem_bind()*/
/*NUMA Node SRC Memory Bind structure*/
nodemask_t src_memo_bind;
nodemask_zero(&src_memo_bind);
nodemask_set(&src_memo_bind,src_memo_node_int);
numa_set_membind(&src_memo_bind);
printf("SRC Memory Bitmask value = %lu\n",src_memo_bind.n[0]);
/*Touch src array with random values (pages fault in on SRC node)*/
time_start = clock();
for(counter=0;counter<10000000;counter++)
    {
    src[counter]=rand();
    /*printf("%d\n",src[counter]);*/
    }
/*printf("src value in 0 = %d\n",src[9999999]);*/
printf("[+]random_op: %f\n",((double)(clock() - time_start)) / CLOCKS_PER_SEC);
fprintf(f1,"[+]random_op: %f\n",((double)(clock() - time_start)) / CLOCKS_PER_SEC);
/*NUMA Node DST Memory Bind structure*/
nodemask_t dst_memo_bind;
nodemask_zero(&dst_memo_bind);
nodemask_set(&dst_memo_bind,dst_memo_node_int);
numa_set_membind(&dst_memo_bind);
printf("DST Memory Bitmask value = %lu\n",dst_memo_bind.n[0]);
/*Memcpy function for Memory*/
time_start = clock();
for(counter=0;counter<200;counter++)
    {
    memcpy(dst,"",1);
    memcpy(dst,src,sizeof(src));
    memcpy(src,"",1);
    memcpy(src,dst,sizeof(dst));
    }
printf("[+]memcpy_op: %f\n",((double)(clock() - time_start)) / CLOCKS_PER_SEC);
fprintf(f1,"[+]memcpy_op: %f\n",((double)(clock() - time_start)) / CLOCKS_PER_SEC);
fclose(f1);
exit(0);
}

  • The nodemask_t structure is provided by the (SUSE) Linux NUMA API to set node masks for memory and CPU binding.
  • The first structure binds the process to the given CPU node using the numa_run_on_node_mask(nodemask_t *nodemask) function.
  • The second & third structures bind the source and destination memory to the given CPU nodes using the numa_set_membind(nodemask_t *nodemask) function.
  • The source array is touched on src_mem by assigning random values from the rand() function.
  • 4 memcpy() calls are performed on the src and dst arrays per loop iteration.
  • The CPU time spent in each block is measured using the clock() function; the tick difference is divided by CLOCKS_PER_SEC (an OS/hardware-specific value) to get the CPU time for the process.

Memcpy graph -


Gnuplot code for Memcpy graph -
set datafile separator ","
set grid x y2
set grid y y2
set xtics (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
set ytics auto
set title 'NUMA Memcpy Performance - Fat Tree'
set ylabel 'Execution Time (Sec)'
set xlabel 'DST_mem NUMA Nodes'
set style data linespoints
set arrow 1 from 8,43 to 8,46
set arrow 2 from 9,43 to 9,46
set label 1 'Max Time CPU0 SRC_mem2 DST_mem8,9' at 6.8,41
set terminal png
set output "Memcpy_perf.png"
plot 'src_mem_0.csv','src_mem_1.csv','src_mem_2.csv'

SLIT (System Locality Information Table) values -
Use the following command on a NUMA-enabled Linux system to find out the SLIT values provided by the BIOS:
numactl --hardware

NUMA SLIT Matrix on 16 Nodes system used for testing -
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  13  40  40  62  62  55  55  62  62  55  55  62  62  55  55
  1:  13  10  40  40  62  62  55  55  62  62  55  55  62  62  55  55
  2:  40  40  10  13  55  55  48  48  55  55  48  48  55  55  48  48
  3:  40  40  13  10  55  55  48  48  55  55  48  48  55  55  48  48
  4:  62  62  55  55  10  13  40  40  62  62  55  55  62  62  55  55
  5:  62  62  55  55  13  10  40  40  62  62  55  55  62  62  55  55
  6:  55  55  48  48  40  40  10  13  55  55  48  48  55  55  48  48
  7:  55  55  48  48  40  40  13  10  55  55  48  48  55  55  48  48
  8:  62  62  55  55  62  62  55  55  10  13  40  40  62  62  55  55
  9:  62  62  55  55  62  62  55  55  13  10  40  40  62  62  55  55
 10:  55  55  48  48  55  55  48  48  40  40  10  13  55  55  48  48
 11:  55  55  48  48  55  55  48  48  40  40  13  10  55  55  48  48
 12:  62  62  55  55  62  62  55  55  62  62  55  55  10  13  40  40
 13:  62  62  55  55  62  62  55  55  62  62  55  55  13  10  40  40
 14:  55  55  48  48  55  55  48  48  55  55  48  48  40  40  10  13
 15:  55  55  48  48  55  55  48  48  55  55  48  48  40  40  13  10

The values present in the table are relative latencies. The diagonal values are constant. The most distant nodes have the maximum value of 62; for example, Node 0 is most distant from nodes 4, 5, 8, 9, 12 & 13. The reference value of 10 is the cost of accessing a node's local memory. From the table it can be clearly seen that access to the memory of Node 1 from Node 0 will have 30% more latency compared to a local memory access on Node 0.

Linux NUMA Memory Allocation policies -
NUMA-aware Linux creates a logical tree structure of memory nodes from the SLIT tables and decides memory allocation policy accordingly. The default policy is to allocate on the node's local memory, since local access has the least latency; if the local node is running low on memory, the next-nearest node's memory is used with somewhat more latency, and so on, with the maximum-latency nodes as the last option. Similarly, processes can be migrated to nodes with enough free local memory, resulting in low access latencies. Linux does this load balancing internally, but for special application-specific conditions the user can influence the NUMA policy using the numactl command. Some applications work perfectly out of the box on a NUMA system with improved performance, while some need tuning using numactl parameters or source modification. Some of the options available for the memory allocation policy are as follows:
1. Interleave - Memory will be allocated across all nodes in round-robin fashion.
2. Localalloc - Always allocate memory on the local node.
3. Preferred - Allocate memory on the given node & fall back to other nodes if it is not available.

That's it for this post guys. I am working on OpenMP with NUMA systems and I will post about it, till then happy parallel computing.

Update 19-6-2012:
Query from reader -
"In your code, you have two global arrays src and dst. As I know, global variables are allocated by the compiler; and before the program executes, these two arrays are already allocated in the global data area. However, later in the code numa_set_membind(nodemast_t) is called and the src array is assumed in the node src_memo_node_int. I am really confused why the global arrays can be resided in different node according to src_memo_node_int when the compiler allocates them. So it would be appreciated if you could explain it for me."

Reply -
"There is a first-touch policy in place: the first CPU, or to be more precise the first core, that accesses a page is responsible for faulting the page in and having it allocated on its local CPU node's memory. For example, if Core 9 touches the memory first, the page will be allocated on the node memory of the CPU socket containing Core 9. There is a difference between assigning memory & allocating memory.
Memory gets allocated on a node at the time of first access, which in my case is the src array initialization with random values & the dst array initialization with a null value using the memcpy call. Uninitialized global variables live in the BSS (Block Started by Symbol) section in the process memory context. Run the size command on the executable and observe the BSS size with/without the global variables. At the time memory is assigned on a Linux system, no space in physical memory is needed; physical memory is allocated when the memory is accessed for the first time. This touching of memory has to be implemented in code running on Shared Memory systems. Touches = writes/reads. A round-robin policy is available instead of first touch, but all major SMS vendors like SGI provide the first-touch policy by default.

Ex. Take a global variable “int z” in the program; what happens when the process/thread of the respective core set by nodemask() tries to access it?
● Page fault occurs
● OS looks at the page and realizes it corresponds to a zero page
● Allocates a new physical frame in memory set by membind() node policy and sets all bytes to zero
● Maps the frame into the address space"