Tuesday, June 26, 2012

MPI + OpenMP Thread + CUDA Kernel Thoughts

Random thoughts -
  • Complex jobs need the shared-memory model and the distributed-memory model at the same time for optimum performance.
  • Engineering and related jobs need serious number-crunching capability; GPUs are useful here.
  • The block diagram above shows a possible parallel system architecture. The head node schedules jobs onto the various soft-partitioned cluster nodes. The head node administrator can choose a scheduling algorithm depending on the type of jobs running.
  • Standard scheduling algorithms:
    • FIFO (First-In First-Out) - Works well if all jobs are of equal length and require the same CPU time to complete. With irregular jobs, small jobs can get stuck behind larger ones.
    • RR (Round Robin) - Lets smaller jobs complete without waiting behind long ones, but involves a significant number of context switches.
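    • SJF (Shortest Job First) - Runs the job with the shortest CPU burst first, which tends to favor I/O-bound processes.
    • SRTF (Shortest Remaining Time First) - The preemptive form of SJF: a process with a short remaining CPU time runs first by preempting one with a longer remaining time.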
    • Priority based - The name itself says enough.
    • MLFQ (Multi-level Feedback Queue) - Adjusts a job's priority according to its CPU burst usage.
    • Linux O(1) & O(n) schedulers.
  • A custom scheduler can be built depending on the job type, keeping the hybrid model in mind.
  • Big shared-memory systems can be soft-partitioned (cpusets) or affinity-partitioned (KMP_AFFINITY) to distribute jobs into their respective partitions. cc-NUMA systems can take advantage of memory that is globally addressable across multiple partitions, so inter-dependent jobs can share their data.
  • For each partition, an MPI process can spawn OpenMP threads to run on 32 cores.
  • Parallel constructs with a lot of test conditions perform better on CPUs than on GPUs, so a typical job with a significant number of critical sections can subscribe to multiple partitions.
  • Number crunching or string comparison operations, where rapid parallel kernel execution is necessary, can be offloaded to GPUs.
  • Multiple GPUs can be subscribed from the pool; peer-to-peer memory access between GPUs improves the overlap of computation and memory communication.
  • Memory can be copied from one partition to another subscribed partition on a different node over the cluster's MPI interconnect.
  • The interconnect layer can be an ultra-low-latency (microseconds) InfiniBand (IB) stack. The RDMA (Remote DMA) capabilities of an IB network are of great use: MPI can use RDMA to send and receive data directly into application memory without interrupting the CPU.
  • OpenMP threads can manage multiple GPU devices as required.
  • If a single job needs to be placed on these systems, the partitions can be dissolved so that all CPU/GPU resources are available to that job.
  • Furthermore, all GPUs can be placed behind a GPU scheduler head node, and GPU jobs will be offloaded to that GPU head node. Many customizations are possible for this hybrid architecture.
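
To make the scheduling trade-offs in the list above a bit more concrete, here is a rough Python sketch that compares FIFO, non-preemptive SJF and Round Robin on a batch of jobs. It is only a toy model under my own assumptions: every job arrives at time 0, a context switch costs nothing, and the function names (fifo_waits, sjf_waits, rr_stats) are made up for this illustration, not taken from any real scheduler.

```python
from collections import deque


def fifo_waits(bursts):
    """Waiting time of each job when jobs run in arrival order (all arrive at t=0)."""
    waits, clock = [], 0
    for burst in bursts:
        waits.append(clock)   # the job waits until everything ahead of it is done
        clock += burst
    return waits


def sjf_waits(bursts):
    """Waiting times under non-preemptive Shortest Job First, in original job order."""
    order = sorted(range(len(bursts)), key=lambda i: bursts[i])
    waits, clock = [0] * len(bursts), 0
    for i in order:
        waits[i] = clock
        clock += bursts[i]
    return waits


def rr_stats(bursts, quantum):
    """Round Robin: returns (waiting times, number of switches to a different job)."""
    remaining = list(bursts)
    finish = [0] * len(bursts)
    queue = deque(range(len(bursts)))
    clock, switches = 0, 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        clock += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)       # not finished: go to the back of the queue
        else:
            finish[i] = clock
        if queue and queue[0] != i:
            switches += 1         # the CPU is handed to a different job
    waits = [finish[i] - bursts[i] for i in range(len(bursts))]
    return waits, switches


if __name__ == "__main__":
    jobs = [24, 3, 3]             # one long job ahead of two short ones
    print("FIFO waits:", fifo_waits(jobs))
    print("SJF waits: ", sjf_waits(jobs))
    print("RR (q=4):  ", rr_stats(jobs, 4))
```

With one long job ahead of two short ones (bursts of 24, 3 and 3), FIFO gives an average wait of 17 time units while SJF brings it down to 3. Round Robin sits in between, and shrinking the quantum increases the switch count it reports.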
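
The partition ideas above (jobs subscribing to one or more partitions and to GPUs from a pool, and partitions being dissolved for a single big job) can also be sketched as simple head-node bookkeeping. This is a hypothetical model, not a real resource manager: the Cluster class, its method names and the example job names are all my own assumptions, and the 32 cores per partition is just the figure used in this post.

```python
from dataclasses import dataclass, field

CORES_PER_PARTITION = 32  # figure from this post; adjust for a real machine


@dataclass
class Cluster:
    """Toy head-node bookkeeping; every name here is hypothetical."""
    free_partitions: set
    free_gpus: set
    jobs: dict = field(default_factory=dict)

    def subscribe(self, job, n_partitions, n_gpus=0):
        """Reserve partitions and GPUs for a job, or raise if the pool is short."""
        if job in self.jobs:
            raise ValueError(f"{job} already holds resources")
        if n_partitions > len(self.free_partitions) or n_gpus > len(self.free_gpus):
            raise RuntimeError(f"not enough free resources for {job}")
        # Take the lowest-numbered free ids so the result is deterministic.
        parts = sorted(self.free_partitions)[:n_partitions]
        gpus = sorted(self.free_gpus)[:n_gpus]
        self.free_partitions -= set(parts)
        self.free_gpus -= set(gpus)
        self.jobs[job] = (parts, gpus)
        return parts, gpus

    def release(self, job):
        """Return a finished job's partitions and GPUs to the pool."""
        parts, gpus = self.jobs.pop(job)
        self.free_partitions |= set(parts)
        self.free_gpus |= set(gpus)

    def dissolve_for(self, job):
        """Hand every partition and GPU to one job; only valid when nothing else runs."""
        if self.jobs:
            raise RuntimeError("other jobs still hold resources")
        return self.subscribe(job, len(self.free_partitions), len(self.free_gpus))


if __name__ == "__main__":
    cluster = Cluster(free_partitions={0, 1, 2, 3}, free_gpus={0, 1})
    parts, gpus = cluster.subscribe("cfd", 2, n_gpus=1)
    print("cfd runs on", len(parts) * CORES_PER_PARTITION, "cores and GPUs", gpus)
    cluster.release("cfd")
    print("single big job gets", cluster.dissolve_for("big"))
```

A real head node would also have to map each partition id to a cpuset or affinity mask and launch the MPI ranks there; the sketch only tracks who owns what.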
 That's it for now, but there is a lot in the pipeline for HPC systems coming on my blog, so stay tuned. Happy threading!!!