Friday, May 4, 2012

Soft-Partitioning of SMP Machines using the Cpuset Environment

The Cpuset VFS (Virtual File System) environment gives us a way to assign a set of CPUs and memory nodes to a group of tasks. This is especially useful on NUMA systems. When I say node, I mean a physical CPU together with its memory bank, so a 16-CPU machine has nodes 0-15, each with its own memory bank as local memory and every other CPU's memory bank as remote shared memory reached over the ccNUMA interconnect.
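
Before carving the machine up, it helps to see what node topology the kernel actually exposes. Two quick checks (the numactl command assumes the numactl package is installed):

numactl --hardware              # lists NUMA nodes with their CPUs and memory sizes
ls /sys/devices/system/node/    # node0, node1, ... directories exported by the kernel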

You will find Taskset to be a very similar Linux utility for performing core isolation on SMP systems; however, after using taskset the memory allocation policy remains entirely with the kernel's page allocator. The Linux NUMA allocator decides where to allocate memory pages for a process running under taskset. Taskset also applies only to the specific program you launch with it, which can become time consuming. So I prefer Cpuset over Taskset: you can bind memory in the Cpuset environment, and the best part is that a Cpuset applies to the shell itself and consequently to all processes invoked from that shell.
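
For comparison, this is what plain taskset binding looks like (./my_app and the PID 1234 are placeholders for illustration); it pins CPUs only and says nothing about where memory pages land:

taskset -c 0-3 ./my_app     # run my_app on cores 0-3 only
taskset -p -c 0-3 1234      # change the affinity of an already running PID 1234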

Cpuset is useful if you have a large SMP machine with no virtualization present. It gives you the capability to do efficient job placement by managing resources. Think of it as a 1980s virtualization-by-isolation technique, just kidding. Cpuset performs its operation by using the sched_setaffinity() system call to set the CPU affinity mask, and the mbind(2) and set_mempolicy(2) system calls to tell the kernel page allocator to allocate pages on specific nodes. The whole process of creating a Cpuset construct can be automated with scripts invoked before starting a specific HPC job, so according to the nature of the job one can select from a pool of scripts to create a hierarchical construct of Cpusets.
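
A rough sketch of such a script is below; the cpuset name hpc_job, the CPU and node numbers, and the ./solver binary are placeholders, and it assumes /dev/cpuset is already mounted:

#!/bin/sh
# Minimal sketch: carve out a cpuset for one job, bind this shell to it, run the job.
CS=/dev/cpuset/hpc_job
mkdir -p $CS
/bin/echo 8-15 > $CS/cpus      # CPUs reserved for this job
/bin/echo 1    > $CS/mems      # allocate memory from node 1 only
/bin/echo $$   > $CS/tasks     # bind this shell (and its children) to the cpuset
./solver                       # the HPC job inherits the cpuset placement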

The kernel Cpuset hierarchy is mounted at /dev/cpuset. Remember, this is a virtual file system. From user space you create a directory inside /dev/cpuset to create your own cpuset; the control files present in /dev/cpuset are replicated inside the new cpuset right after the directory is created, with no values set. A cpuset can be marked exclusive, which ensures that no other cpuset (except direct ancestors and descendants) may contain any overlapping CPUs or memory nodes. Remember that a cpuset masks off from the shell process and its descendants every CPU that is not listed for that cpuset. The same analogy goes for memory: if the cpuset says to use memory from a specific node only, processes in the shell bound to that cpuset will be allocated memory from that node, and if it isn't sufficient the system will start using the swap partition. Furthermore, if your swap fills up you are out of luck and the process will be terminated. So be careful when using cpusets: plan your memory utilization accordingly and take NUMA latencies into consideration when deciding where pages should be placed.

Check Cpus_allowed and Mems_allowed in /proc/<pid>/status to see the currently masked CPUs and memory nodes. If memory spreading, i.e. memory interleave mode, is turned off, the currently specified NUMA policy (--interleave, --preferred, --membind or --localalloc) applies to memory allocation. The implementation of memory spread is very simple: it round-robins page allocations across the nodes, which may induce NUMA access latencies if the processing node and the memory nodes are distant.
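
A couple of quick checks, plus the corresponding numactl policy switches (./my_app is again a placeholder binary):

grep -E 'Cpus_allowed|Mems_allowed' /proc/$$/status   # masks for the current shell
numactl --interleave=all ./my_app   # round-robin pages across all allowed nodes
numactl --preferred=1 ./my_app      # prefer node 1, fall back to other nodes
numactl --membind=0 ./my_app        # allocate from node 0 only, fail otherwise
numactl --localalloc ./my_app       # allocate on the node the task is running on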

By default the kernel has one scheduling domain. One more benefit of firing jobs inside cpusets is that scheduling overhead is lower inside a cpuset, resulting in fewer context switches. Cpuset also provides an option to enable or disable load balancing inside the cpuset's scheduling domain. Jobs requiring fewer cores, say 32, benefit greatly from a cpuset construct. These are just a few scenarios I have shed light on; configuration and management of cpusets is a huge topic, so go explore.
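
On kernels that expose the sched_load_balance flag inside each cpuset directory, toggling that option is a one-liner (my_cpuset is a placeholder for your own construct):

/bin/echo 0 > /dev/cpuset/my_cpuset/sched_load_balance   # ask the scheduler not to load balance across this cpuset's CPUs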

Using Cpusets in SLES on ccNUMA systems -
1) cd /dev/cpuset
2) mkdir svp_env
3) cd svp_env
4) /bin/echo 0-15 > cpus
Use cores 0-15 for the svp_env construct.
5) /bin/echo 0 > mems
Use memory of node 0 for the svp_env construct.
6) /bin/echo 1 > cpu_exclusive
7) /bin/echo 1 > mem_exclusive
8) /bin/echo $$ > tasks
Attach the current shell to the svp_env construct.
9) cat /proc/self/cpuset
Check the current cpuset environment.
10) sh
Fork a new shell inside the svp_env cpuset shell.
11) cat /proc/self/cpuset
You will find the same cpuset environment as the parent shell.
12) You can dynamically add or remove cpus & mems from the same shell.
13) Removing Cpusets - Make sure no shell is attached to the Cpuset you want to remove. If one is, terminate it (or move it back to the root cpuset) and then use rmdir on the directory inside /dev/cpuset; "rm -rf" will not work on cpuset VFS directories. See the sketch after this list.
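
For example, to tear down the svp_env construct built above (assuming the shell from step 8 is the only task left in it):

/bin/echo $$ > /dev/cpuset/tasks   # move the current shell back to the root cpuset
rmdir /dev/cpuset/svp_env          # the now-empty cpuset directory can be removed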

So, enough of this jabber. Now go and restrict evil processes from getting CPUs. :)
