Wednesday, May 30, 2012

TOE: TCP Offload Engine on NIC & Packet Capture Misinterpretations

A quick post about TOE (TCP Offload Engine), present these days in almost all NICs. When it is enabled, TCP/IP operations on packets are processed on the NIC without interrupting the CPU or consuming PCI bandwidth. On Linux systems, TOE can be configured through the standard utility Ethtool. Remember that all these parameters are interface specific: a process that asks the network stack to send a packet inherits the properties of the interface it uses, and those properties determine which operations the stack does itself and which are offloaded.

To check which offload operations are currently enabled on a NIC, use the following Ethtool switch.
ethtool -k ethX
server1 root [tmp] > ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
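To pull only the enabled offloads out of a listing like the one above, something along these lines should do. This is a minimal sketch: the function name offloads_on is made up for this post, and it assumes nothing beyond the "name: on/off" layout shown above, read from standard input.

```shell
# offloads_on: hypothetical helper. Reads `ethtool -k` output on stdin and
# prints the name of every feature whose value starts with "on".
# The header line has no ": " separator, so it never matches.
offloads_on() {
    awk -F': ' '$2 ~ /^on/ { sub(/^[ \t]+/, "", $1); print $1 }'
}

# Usage on a live system (needs ethtool and a real interface name):
#   ethtool -k eth1 | offloads_on
```

Fed the listing above, it would print the five features that are switched on and skip UFO, GRO and LRO.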
A detailed description of every offloaded operation is out of scope for this post; refer to online resources for that. I will give one-line descriptions of the important fields so we can proceed.
  • Scatter-Gather I/O - Rather than passing one large buffer, several small buffers that together make up the large buffer are passed. This is more efficient than passing a single large buffer.
  • TCP Segmentation Offload - The ability to cut data into frames that fit the MTU, using the same IP header for all packets. Useful when the buffer is much larger than the MTU of the link. The segmentation into smaller packets is offloaded to the NIC.
  • Generic Segmentation Offload - Used to postpone segmentation as long as possible; the segmentation is performed just before entry into the driver's xmit routine. GSO & TSO are significantly effective only when the MTU is much smaller than the buffer size.
  • Generic Receive Offload - GSO only works for transmission of packets; GRO is the receive-side counterpart. Unlike LRO, which merges every packet it can, GRO merges with restrictions that keep the important fields of the packets intact, which allows the merged packets to be re-segmented at output. The NAPI API polls for new packets and processes them in batches before passing them to the OS.
  • Large Receive Offload - Used for combining multiple incoming packets into a single buffer before passing it up the OS stack. The benefit is that the OS sees fewer packets & uses less CPU time.
Depending on your NIC vendor, the names of these operations may vary, and some vendors provide additional offload operations. My test hardware has the features listed above. For the sake of the test, I have disabled GRO & LRO. UFO is generally off on all NICs; the reason is that UDP acknowledgments, if they are used at all, are implemented at the application layer, so the CPU needs visibility of all packets & replies. For TSO to work, RX, TX & SG need to be enabled. To enable or disable these operations, use Ethtool as follows and customize it to your requirement. Note that I kept the same TOE configuration on the FTP server & client.
ethtool -K ethX rx on/off tx on/off sg on/off tso on/off gso on/off gro on/off lro on/off
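For the test in this post only TSO and GSO get toggled, so here is a small sketch that prints the exact command for that case. The helper name toe_cmd is hypothetical; it only builds the command line from the tso and gso keywords shown above and does not touch the NIC itself.

```shell
# toe_cmd: hypothetical helper that only PRINTS the ethtool command for
# switching segmentation offload (TSO and GSO) on or off on one interface.
toe_cmd() {
    iface=$1
    state=$2
    case $state in
        on|off) ;;
        *) echo "usage: toe_cmd IFACE on|off" >&2; return 1 ;;
    esac
    echo "ethtool -K $iface tso $state gso $state"
}

toe_cmd eth1 off    # prints: ethtool -K eth1 tso off gso off
```

Printing instead of executing keeps it safe to try on a production box; run the printed line as root once it looks right.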
I generated a 4KB file of random data with the dd utility to transfer over FTP.
dd if=/dev/urandom of=dat.file bs=1k count=4
The MTU of the interface under test is 1500 bytes on both server & client. Packet captures are taken with Tcpdump on the client machine and later analyzed in Wireshark.
tcpdump -i eth1 -w TOE_test.pcap -n not port 22
The 4KB file is segmented into chunks of data and multiple packets flow over the link. In the absence of TOE this TCP segmentation is done by the CPU, but if TOE is enabled the data is encapsulated at the OS layer directly as a single 2920-byte packet, which is two 1460-byte segments' worth. This looks very weird if you don't know about TOE, and you start wondering how 2920 bytes can be sent in an Ethernet frame with a 1500-byte MTU. This is the difference between practical observation & theoretical understanding. In the screenshot, the FTP PUT operation from client to server sends the 4KB file in two packets; the highlighted one carries 2920 bytes of data and is followed by a packet with the remaining bytes. Here the TCP stack is in the NIC's domain: these packets are handed over to the TOE of the NIC, which does the segmentation and sequencing. The packet capture hook sits exactly at the boundary of the OS stack, hence we cannot see the actual TCP segmentation happening inside the NIC. The ACKs depend on the size of the data. Disabling GSO/TSO results in normal operation of the OS TCP/IP stack: data packets become 1460 bytes, which is the regular payload size for 1500-byte MTU links. Responsibility for TCP operations shifts back to the OS stack when TOE is disabled.
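The numbers are easy to check by hand. The sketch below works out the segment sizes on the wire for the 4KB file, assuming a 20-byte IP header and a 20-byte TCP header with no options (which is what gives the 1460-byte payload mentioned above). The function name segment_sizes is made up for this post.

```shell
# segment_sizes: hypothetical helper. Prints the payload size of each TCP
# segment needed to carry TOTAL bytes with a maximum segment size of MSS.
segment_sizes() {
    total=$1
    mss=$2
    sizes=""
    while [ "$total" -gt 0 ]; do
        if [ "$total" -gt "$mss" ]; then chunk=$mss; else chunk=$total; fi
        sizes="$sizes${sizes:+ }$chunk"
        total=$((total - chunk))
    done
    echo "$sizes"
}

mtu=1500
mss=$((mtu - 20 - 20))      # assumed: 20-byte IP + 20-byte TCP header, no options
segment_sizes 4096 "$mss"   # prints: 1460 1460 1176
```

With TSO/GSO on, the capture shows the first two segments as one 2920-byte packet (1460 + 1460) followed by the 1176-byte remainder; with them off, all three show up individually.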

Screenshot: TCP Segmentation Offload & Generic Segmentation Offload Enabled
Screenshot: TCP Segmentation Offload & Generic Segmentation Offload Disabled
Performance improvements from TOE are observed on servers that handle a large number of concurrent sessions serving homogeneous large data files. PCI bandwidth is conserved because there is less management overhead for TCP segmentation, which otherwise involves continuous communication between NIC & CPU. I am studying the performance implications of TOE & will post about it soon. This TOE behavior needs to be understood because it also affects IDS/IPS systems like Snort, which perform threat signature matching on captured packets. That's it for now. :)

Friday, May 4, 2012

Soft-Partitioning of SMP Machines using Cpuset Environment

The Cpuset VFS (Virtual File System) environment provides a way to assign a set of CPUs and Memory Nodes to a bunch of tasks. This is especially useful on NUMA systems. When I say Node I mean a physical CPU and its memory bank, so a 16-CPU machine has Nodes 0-15, each with its own memory bank as local memory and the other CPUs' memory banks as remote shared memory connected through the ccNUMA interconnect.

You will find Taskset to be a very similar Linux utility for isolating processing cores on SMP systems; however, with Taskset the memory allocation policy lies completely with the kernel's page allocator. The Linux kernel NUMA allocator decides where to allocate memory pages for a process running in a Taskset environment. Taskset also applies only to the specific program you are running, which can be very time consuming. So I prefer Cpuset over Taskset: you can bind memory in a Cpuset environment, and the best part is that a Cpuset applies to the shell environment and consequently to all processes invoked within that shell.

Cpuset is useful if you have a large SMP machine with no virtualization present. It gives you the capability to do efficient job placement by managing resources. Think of it as a 1980s virtualization-by-isolation technique, just kidding. Cpuset works together with the sched_setaffinity(2) system call, which sets the CPU affinity mask, and with the mbind(2) and set_mempolicy(2) system calls, which tell the kernel page allocator to allocate pages on specific nodes; none of these can reach a CPU or Memory Node outside the task's Cpuset. The whole process of creating a Cpuset construct can be automated using scripts that are invoked before starting a specific HPC job. So, according to the nature of the jobs, one can select from a pool of scripts to create a hierarchical construct of Cpusets.

The kernel Cpuset is mounted at /dev/cpuset. Remember that this is a Virtual File System. From user space you create your own Cpuset by creating a directory inside /dev/cpuset. The files present in /dev/cpuset appear inside the user-created Cpuset right after the directory is created, with no values set. A Cpuset can be marked exclusive, which ensures that no other Cpuset (except direct ancestors and descendants) may contain any overlapping CPUs or Memory Nodes.

Remember that a Cpuset masks off, from the shell process & its descendants, every CPU that is not specified for that Cpuset. The same goes for memory: if the Cpuset says to use memory from a specific node only, a process in a shell bound to that Cpuset is allocated memory from that node, & if that isn't sufficient the system starts using the swap partition. Furthermore, if your swap is full then you are out of luck and the process will terminate. So be careful when using Cpusets: plan your memory utilization accordingly, and take NUMA latencies into consideration when choosing the nodes memory pages are allocated on.

Check Cpus_allowed & Mems_allowed in /proc/<pid>/status to see the CPUs and memory nodes currently allowed. If memory spreading is turned off, i.e. memory interleaved mode is turned off, then the currently specified NUMA policy (1) --interleave 2) --preferred 3) --membind 4) --localalloc) applies to memory allocation. The implementation of memory spread is very simple and follows a Round Robin algorithm in allocating memory pages across nodes, though this may induce NUMA access latencies if the processing node & memory nodes are distant.
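The Cpus_allowed value in /proc/<pid>/status is a hexadecimal bit mask, which is not the friendliest thing to read on a big machine. Here is a rough sketch of decoding it into CPU numbers; the function name mask_to_cpus is made up for this post, and it assumes the usual layout where the rightmost hex digit covers CPUs 0-3 and commas only group the digits.

```shell
# mask_to_cpus: hypothetical helper. Turns a hex CPU mask such as "ffff" or
# "00000001,00000000" into a space-separated list of CPU numbers.
mask_to_cpus() {
    mask=$(printf '%s' "$1" | tr -d ',')    # commas only group the digits
    cpu=0
    out=""
    while [ -n "$mask" ]; do
        digit=${mask#"${mask%?}"}           # last hex digit = lowest CPUs left
        mask=${mask%?}
        val=$((0x$digit))
        bit=0
        while [ "$bit" -lt 4 ]; do
            if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
                out="$out${out:+ }$cpu"
            fi
            cpu=$((cpu + 1))
            bit=$((bit + 1))
        done
    done
    echo "$out"
}

mask_to_cpus 00ff    # prints: 0 1 2 3 4 5 6 7

# On a live system:
#   mask_to_cpus "$(awk '/^Cpus_allowed:/ {print $2}' /proc/self/status)"
```

Working digit by digit avoids overflowing shell arithmetic on machines with more than 64 CPUs.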

By default, the kernel has one scheduling domain. One more benefit of firing jobs inside Cpusets is that scheduling overheads are lower inside a Cpuset, resulting in fewer context switches. Cpuset also provides an option to enable or disable scheduler load balancing inside a Cpuset's domain. Jobs that need only a small number of cores, say 32, benefit greatly from a Cpuset construct. These are just a few scenarios I have shed light on; configuration and management of Cpusets is a huge topic, so go explore.

Using Cpusets in SLES on ccNUMA systems -
1) cd /dev/cpuset
2) mkdir svp_env
3) cd svp_env
4) /bin/echo 0-15 > cpus
Use 0-15 Cores for svp_env construct.
5) /bin/echo 0 > mems
Use memory of Node 0 for svp_env construct.
6) /bin/echo 1 > cpu_exclusive
7) /bin/echo 1 > mem_exclusive
8) /bin/echo $$ > tasks
Attach current shell to svp_env construct.
9) cat /proc/self/cpuset
To check current cpuset environment.
10) sh
Fork new shell inside svp_env cpuset shell.
11) cat /proc/self/cpuset
You will find the same cpuset environment of parent shell.
12) You can dynamically add or remove cpus & mems from the same shell.
13) Removing Cpusets - Make sure no shell is attached to the Cpuset you want to remove. If one is, terminate it, then use the rmdir command on the directory inside /dev/cpuset. "rm -rf" will not work on Cpuset VFS directories.
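Steps 1 to 8 are the part worth scripting before a job starts. A minimal sketch follows, assuming the same file names as in the steps above (cpus, mems, cpu_exclusive, mem_exclusive, tasks); the function name make_cpuset and its argument order are made up for this post.

```shell
# make_cpuset: hypothetical helper wrapping steps 1-8 above.
#   $1 = cpuset name, $2 = CPU list, $3 = memory node list,
#   $4 = cpuset mount point (optional, defaults to /dev/cpuset)
make_cpuset() {
    name=$1
    cpus=$2
    mems=$3
    root=${4:-/dev/cpuset}
    dir="$root/$name"

    mkdir "$dir" || return 1
    # cpus and mems must be filled in before any task can be attached.
    /bin/echo "$cpus" > "$dir/cpus" || return 1
    /bin/echo "$mems" > "$dir/mems" || return 1
    /bin/echo 1 > "$dir/cpu_exclusive" || return 1
    /bin/echo 1 > "$dir/mem_exclusive" || return 1
    # $$ is the shell that defined this function, so source the file
    # into your shell instead of executing it as a separate script.
    /bin/echo "$$" > "$dir/tasks" || return 1
}

# Usage as root, equivalent to the svp_env example above:
#   make_cpuset svp_env 0-15 0
#   cat /proc/self/cpuset
```

Stopping at the first failed write matters here, because attaching the shell to a half-configured Cpuset is exactly the kind of thing that is painful to debug later.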

So, enough of this jabber. Now go & restrict evil processes from getting CPUs. :)