Sunday, March 11, 2012

Analysis of OpenMP Schedule Clause in Parallel Matrix Multiplication

In this post, I describe some findings from my first encounter with parallel programming using the OpenMP 3.1 API.
I discuss the simple use of the OpenMP API to speed up an operation that would otherwise run single-threaded and take a long time to complete. Let's cover a few basics of OpenMP before proceeding to matrix multiplication.

OpenMP Thread Model

A few points regarding OpenMP:
  • OpenMP is designed with an ideal UMA (Uniform Memory Access) system in mind, where all memory access latencies are equal. A NUMA (Non-Uniform Memory Access) system, however, has variable latencies depending on the distance and the design topology used. I have used a NUMA system here for benchmarking.
    • A detailed discussion of NUMA is beyond the scope of this post. You can refer to my earlier post
                    Shared Memory Performance in cc-NUMA Systems with Libnuma API
  • Thread placement can be controlled using the CPU affinity features of the OS as well as OpenMP; on Linux NUMA systems, threads are placed on CPUs closer to the allocated memory region to avoid high memory access latencies. This is controlled dynamically by the Linux kernel unless specified by the user. OpenMP 4.x will have better support for NUMA systems.
  • Code can contain both serial and parallel statements; all the ugly threading syntax is hidden behind the OpenMP pragma directives.
  • OpenMP is best suited for loop-level parallelism with few critical sections.
  • Threads are forked according to the pragma directives and joined after completion. Using OpenMP we can construct multiple parallel regions within the same code, and threads can be forked at every parallel region construct.
  • OpenMP allows us to fork threads once and then distribute nested parallel regions onto these threads, reducing the overhead of thread creation at every parallel construct (see the sketch after this list). Directives such as flush help maintain a consistent view of shared data between threads.
  • OpenMP usually runs one thread per core. More threads per core can decrease performance because threads steal CPU time from one another; however, this behavior depends on the application and system configuration.
  • Thread creation involves certain overheads which must be taken into consideration while writing scalable parallel code.
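
To make the thread-reuse point above concrete, here is a minimal sketch (not taken from the matrix multiplication code below) of a single parallel region whose thread team is reused for two worksharing loops, so threads are forked and joined only once:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
    static double a[N], b[N];
    int i;

    #pragma omp parallel private(i)   /* threads forked once here */
    {
        #pragma omp for               /* first worksharing loop, split across the team */
        for (i = 0; i < N; i++)
            a[i] = i * 0.5;

        #pragma omp for               /* same thread team reused, no new fork/join */
        for (i = 0; i < N; i++)
            b[i] = a[i] * 2.0;
    }                                 /* single implicit join at the end of the region */

    printf("b[%d] = %f\n", N - 1, b[N - 1]);
    return 0;
}

The implicit barrier at the end of the first worksharing loop guarantees that a[] is fully written before the second loop reads it.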
Advantages of OpenMP:
  •  Easy to program; more readable code.
  •  Data layout is handled automatically by OpenMP directives.
  •  Threads can move from one core to another unless explicitly bound.
  •  Easy modification of environment settings.
  •  Designed to perform efficiently on shared-memory systems.
  •  Portable and widely available.
Disadvantages of OpenMP:
  • Performance may drop beyond a certain number of threads, as all threads compete for shared memory bandwidth. To reduce this, threads must be placed on nearby CPUs according to the application's requirements.
  • Thread Synchronization is expensive.
  • Available on SMP (Symmetric Multiprocessing) systems only.
  • Not as scalable as MPI (Message Passing Interface), although the hybrid MPI + OpenMP + CUDA model is getting a lot of attention. We will discuss it in later posts.

Parallel Matrix Multiplication

As part of learning OpenMP I have written code for parallel matrix multiplication. The algorithm used here is the conventional one we all learned in school. Many optimized matrix multiplication algorithms exist, but let's leave those for another post.

2-Thread Parallel Matrix Multiplication Flow

System/Software Information:
  1. System - SGI Altix UV1000 (128 core 16 CPU Node 1 TB RAM Fat-Tree model cc-NUMA)
  2. OS - RHEL 6.1 x86_64
  3. GCC version - Red Hat 4.4.5-6
C Code Snippet:
Complete code can be downloaded here - omp_mmul_svp.zip
    /*Assigning matrix elements with static data, conditional pragma directive*/
printf("Assigning array elements with '1' for multiply op...\n");
#pragma omp parallel for private(i,j) if(mat_size>2000)
for(i=0;i<mat_size;i++)
    {
    for(j=0;j<mat_size;j++)
        {
        m1[i][j]=1;
        m2[i][j]=1;
        }
    }

    /*Parallel Matrix Multiplication operation starts*/
printf("Matrix multiplication with openmp...\n");
printf("Chunk Size = %d\n",chunk);
fprintf(f1,"Chunk Size = %d Matrix size = %d Threads = %d\n",chunk,mat_size,threads);
    /*OpenMP time function to calculate real time required by operation*/
start_time = omp_get_wtime();
    /*Parallel construct with shared variables & private variable throughout the region*/
#pragma omp parallel shared(mul,m1,m2) private(threads_id)
{
    /*Report thread number*/
threads_id = omp_get_thread_num();
if (threads_id == 0)
    {
    /*Master thread will report total number of threads invoked*/
    tot_threads = omp_get_num_threads();
    printf("Total worker threads invoked = %d\n",tot_threads);
    }
    /*Parallel for loop directive with dynamic, chunk size schedule policy.
      Private variables for parallel loop construct.
      schedule options:
      1) schedule (static)
      2) schedule (static, chunk)
      3) schedule (dynamic)
      4) schedule (dynamic,chunk)
    */
#pragma omp for schedule (dynamic, chunk) nowait private(i,j,k,temp)
    /*Outer loop Row parallelization*/
for(i=0;i<mat_size;i++)
    {
    /*printf("Thread=%d row=%d completed\n",threads_id,i);*/
    for(j=0;j<mat_size;j++)
           {
           temp = 0;
           for(k=0;k<mat_size;k++)
                  {
                  temp+=m1[i][k]*m2[k][j];
                  }
           mul[i][j] = temp;
           }
    }
}
end_time = omp_get_wtime();
printf("[+]multiply_op_omp_time %f sec\n", end_time-start_time);

Code Compilation and Output:
I have compiled the code with GCC optimization level 2. The sample output shown below was produced with 128 threads, matrix size 3000, and chunk size 1.
server1 [root]> gcc -O2 omp_mmul_svp.c -o omp_mmul -fopenmp
server1 [root]> ./omp_mmul 128 3000 1
Multiply Array memory allocation...
Assigning array elements with '1' for multiply op...
Matrix multiplication with openmp...
Chunk Size = 1
Total worker threads invoked = 128
[+]multiply_op_omp_time 4.379983 sec
Single thread verification complete...OK

Thread Monitoring:
To monitor the threads on the system, start the top utility with the thread-display toggle enabled:
top -H
The threads are part of the same process. Using the P column in top we can determine the core on which each thread last ran.
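
As an alternative to watching top, the short sketch below (not part of the original program) prints the core each OpenMP thread is currently running on, using the Linux-specific sched_getcpu() call:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* sched_getcpu() reports the CPU the calling thread last ran on */
        printf("Thread %d running on core %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}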

Code Information:
  • Frequently used shared and thread-private variables are declared as register variables to suggest that the compiler keep them in CPU registers for faster access.
  • The number of worker threads to invoke is set through the omp_set_num_threads() function. If it is not specified, OpenMP typically invokes one thread for each CPU core in the system.
  • The 2D array pointers are declared and the memory is allocated using the malloc() function (a minimal allocation sketch is shown after the code snippet above).
  • It is necessary to verify the result of the matrix multiplication, because race conditions, which are common in parallel programming, can produce erroneous results. I therefore chose static data for the matrix elements so that the result can be verified easily using serial code. Here I have assigned '1' to every element of m1 and m2, so every element of the mul matrix should equal the matrix size given by the user. For example, if the matrix size is 3000, then mul[0][0] should be 3000 if the operation is error-free.
  • Assigning values to the elements of m1 and m2 takes more time when the matrix size is large, so to mitigate this I have used a conditional pragma directive: the loop becomes a parallel for loop only if the matrix size is greater than 2000. Please note that the number of threads invoked depends on the value given by the user, and the placement of the omp_set_num_threads() call in the code is very important.
  • Assigning values to m1 and m2 in parallel improves performance because each assignment is independent of the others. Avoid calling functions inside parallel regions unless they are thread-safe (or parallelized themselves).
  • The time measured covers only the multiplication section of the code. OpenMP provides the omp_get_wtime() function to report the current wall-clock time; the difference between the end time and the start time is the time required for the parallel multiplication operation.
  • The parallel region, with shared variables for the matrix data and a private variable for the thread number, is followed by a parallel for construct. The pragma omp for directive applies only to the immediately following loop; nested loop parallelism is achieved with additional omp pragma directives. By default the loop variable 'i' of the pragma omp for directive becomes private, though I declared the loop variables private explicitly to make the code easier to understand. The default data-sharing attribute in a pragma omp parallel construct is shared unless specified otherwise; you can change this with default(none), which forces every variable's sharing attribute to be declared explicitly.
  • The nowait clause in the pragma omp for statement removes the implied barrier at the end of that worksharing construct (not the barrier at the end of the parallel region), so threads that complete their iterations early can proceed straight to the next statements in the parallel region without waiting for the other threads to finish. This is typically used when more than one pragma omp for loop is present in a parallel region. The combination of the schedule clause (workload distribution) and the nowait clause must be tuned to optimize the code.
  • The schedule clause divides the iterations of a pragma omp for loop into contiguous subsets known as chunks and distributes them to threads according to the schedule type. The schedule can be hard-coded or specified through environment settings at run time (see the sketch after this list). Here I compare two schedule types, static and dynamic.
    • static - Iterations are divided into roughly equal chunks and statically distributed among the threads. The chunk size is calculated automatically and chunks are assigned to threads in round-robin order of thread number.
    • static, chunk_size - Iterations are divided into chunks of the specified size and statically distributed among the threads. The distribution is the same as plain static; only the chunk size is specified explicitly.
    • dynamic - Iterations are distributed to threads dynamically as the threads request them. Each thread processes a chunk and then requests a new one until no chunks remain. The default chunk_size is 1, so each thread is given one iteration at a time.
    • dynamic, chunk_size - Iterations are divided into chunks of the specified size and distributed dynamically. The policy is the same as plain dynamic; the only difference is that each thread is given chunk_size iterations at a time. Dynamic mode achieves efficient load balancing but also adds scheduling overhead.
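
As mentioned above, the schedule can also be chosen at run time instead of being hard-coded. A minimal sketch, assuming the standard schedule(runtime) clause and the OMP_SCHEDULE environment variable:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
    static double a[N];
    int i;

    /* The schedule type and chunk size are read from OMP_SCHEDULE,
       e.g. OMP_SCHEDULE="dynamic,4" ./a.out */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++)
        a[i] = (double)i * i;

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}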
Note:
  • The schedule clause has more types that are not discussed here, such as guided, auto and runtime; refer to the OpenMP API guide for details.
  • Depending on your code, I suggest using the Gprof profiler to check for unexpected wait states in parallel code. I have explained the detailed usage of Gprof in my earlier post
          Code Profiling in Linux Using Gprof
Thread vs Chunk Size Performance graph:


Observations:
  • As the Thread vs Chunk Size performance graph shows, the single-thread run gives different execution times for different chunk sizes. With only one thread, every chunk is requested and processed by the same core, so a larger chunk size performs better because it reduces the overhead of dynamic schedule allocation. Single-threaded execution of the OpenMP code is slower than plain serial code because of these overheads.
  • For the 2-thread run, chunk size 2 takes more time to complete than all other chunk sizes, while chunk size 1 gives the best performance.
  • For the 4-thread run, chunk size 4 yields the best performance, with approximately 188 chunks processed by each thread.
  • As we increase the number of threads up to the maximum number of cores in the system, i.e. 128, the performance difference between chunk sizes vanishes. The slight differences that remain at 128 threads come from the overhead of the dynamic schedule as threads request chunks to process.
  • In dynamic mode, threads are put to work gradually as each one requests a chunk of data to process. Since the job may complete before all the specified threads have received work, it is necessary to choose the chunk size and the number of threads wisely if the system is shared by multiple parallel programs.
Dynamic vs Static Schedule Performance graph:


Observations:
  • This benchmark was performed with no chunk size specified in the schedule clause. For the static type, the chunk size is calculated automatically with equal distribution in mind. For the dynamic type, if no chunk size is specified it defaults to 1, so each thread is given one iteration of the for loop to process and requests a new iteration when it finishes. On a NUMA system, iterations processed close to the allocated memory complete faster than iterations processed on distant nodes because of memory access latencies.
  • The static schedule loads all the specified threads immediately with the calculated chunk sizes.
  • Up to 16 threads there is a significant difference between the performance of the dynamic and static schedules; beyond that the difference vanishes.
  • Dynamic scheduling improves performance at lower thread counts.
Conclusion:
  • OpenMP behavior is dynamic; tune the parameters to the underlying system to get maximum performance.
  • Idling threads are of no use; try to load threads equally so that they complete their work at the same time.
  • Use OpenMP only with sufficient workload.
  • Avoid forking/joining threads at every parallel construct; reuse the invoked threads.
  • Experiment with the code and observe the performance to find the best configuration.
  • Parallel code is very hard to debug because it can silently produce wrong results. Verification is mandatory for error-free parallel code (a minimal verification sketch follows this list).
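
A minimal serial verification sketch for the all-ones input described earlier, assuming an int** result matrix (the actual downloadable code may differ): every element of mul must equal mat_size, and any mismatch indicates a race or logic error.

/* Returns 1 if every element equals mat_size, 0 otherwise. */
int verify(int **mul, int mat_size)
{
    int i, j;
    for (i = 0; i < mat_size; i++)
        for (j = 0; j < mat_size; j++)
            if (mul[i][j] != mat_size)
                return 0;
    return 1;
}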
This is my first code on the OpenMP platform and I am very excited to learn more about OpenMP. Happy Parallelism!