Monday, August 6, 2012

Infiniband Xmit and Rcv Link Counter Interpretation using Perfquery

Infiniband interconnect technology is preferred in the HPC domain because of its ultra-low latency (on the order of microseconds), high bandwidth across multiple lanes, and RDMA (Remote DMA) capability. Like Ethernet, it has multiple layers and can carry data at the SRP (SCSI RDMA Protocol) layer as well as the IPoIB (IP over Infiniband) layer at the same time. I wanted to calculate the amount of data transferred on the 4 IB ports connected to my server. The following is a quick and dirty method to do so for QDR IB ports. The infiniband-diags package on RHEL 6.1 comes with the perfquery utility.

Perfquery wrapper script to report the port counters for Mellanox HCAs :
echo "mlx4_0 P 1 - ib0 port - IPoIB Bond Master Interface"
perfquery -C mlx4_0 -P 1
echo "mlx4_0 P 2 - ib1 port"
perfquery -C mlx4_0 -P 2
echo "mlx4_1 P 1 - ib2 port"
perfquery -C mlx4_1 -P 1
echo "mlx4_1 P 2 - ib3 port - IPoIB Bond Slave Interface"
perfquery -C mlx4_1 -P 2

Monitoring the ports for activity :
Grep the desired counters and put a watch on the output, as follows.
watch -n 0.5 './ | grep -E "mlx|XmitData|RcvData|XmitPkts|RcvPkts"'

Script to clear the port counters :
perfquery -C mlx4_0 -P 1 -R 
perfquery -C mlx4_0 -P 2 -R 
perfquery -C mlx4_1 -P 1 -R 
perfquery -C mlx4_1 -P 2 -R 
echo "Counters Cleared"
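
The grep filter used for monitoring can also be turned into a small helper that prints only the counter names and their raw values. This is a rough sketch, not part of infiniband-diags: the function name ib_data_counters is made up, and it assumes perfquery prints each counter as a name, a colon, dot padding, and then the value (for example "PortXmitData:....................26564832").

```shell
#!/bin/sh
# ib_data_counters (hypothetical helper): read perfquery output on stdin and
# print "name value" for the data and packet counters only.
# Assumption: each counter line looks like "Name:......value".
ib_data_counters() {
    awk -F: '/(Xmit|Rcv)(Data|Pkts):/ {
        gsub(/[^0-9]/, "", $2)   # drop the dot padding, keep the digits
        print $1, $2
    }'
}

# Demonstration on a single line in the assumed format; on a real system
# the input would come from: perfquery -C mlx4_0 -P 1 | ib_data_counters
printf 'PortXmitData:....................26564832\n' | ib_data_counters
```

The values printed are still raw counter values; converting them to bytes is covered in the interpretation section.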

How to interpret the actual data transferred on individual links of Infiniband :
  • PortXmitData/PortRcvData show the actual data transmitted/received on the port, excluding link packets. perfquery reports these values in units of 4 octets (the byte count divided by 4), so multiply by 4 to get bytes. This factor comes from the definition of the counter, not from the 4 lanes of a 4X link. To get the actual data transferred in MBytes, follow the procedure below.
    Suppose I do an unbuffered (oflag=direct) dd write of 200 MB to a CXFS storage vault over the IB network. The 200 MB is spread across multiple IB ports and committed to the storage controller acting as an SRP target. The CXFS vault I am using for testing is a volume made up of multiple LUN slices, which have different XVM preferred paths (failover2.conf) owned by distinct storage controllers.
    A direct commit to disk is necessary because otherwise several layers of buffering in the stack get in the way: data is written into the RAM file cache first and only later "sync"ed* to disk, IB HCAs have their own buffers, and storage controllers have their own caches. If Linux is using its own file/buffer cache, the counters do not reflect the data actually transmitted on the IB ports for that write.
             * - "sync" is the literal Linux command to commit all buffered data to disk; the related system calls are sync() and, for a single file from within a program, fsync().

dd command :
dd if=/dev/zero of=/testvol1/testfile.dat bs=10M count=20 oflag=direct

Calculation for QDR :
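XmitData on 2 out of 4 ports is 26564832 and 26583984 respectively.
mlx4_1 P 1 =((26564832*4)/1024)/1024 = 101.34 MBytes
mlx4_1 P 2 =((26583984*4)/1024)/1024 = 101.41 MBytes
Together that is about 202.75 MBytes, in line with the 200 MB written by dd.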
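
The arithmetic above can be wrapped in a small shell helper. This is only a sketch: counter_to_mbytes is a made-up name, not something shipped with infiniband-diags, and it assumes awk is available. It multiplies the raw counter by 4 to get bytes and divides by 1024 twice, rounding to two decimals (so the last digit can differ from a truncated figure).

```shell
#!/bin/sh
# counter_to_mbytes (hypothetical helper): convert a raw XmitData/RcvData
# counter value, as printed by perfquery, into MBytes (1 MByte = 1024*1024 bytes).
counter_to_mbytes() {
    awk -v c="$1" 'BEGIN { printf "%.2f\n", (c * 4) / 1024 / 1024 }'
}

# The two XmitData values from the dd test above
counter_to_mbytes 26564832   # mlx4_1 P 1
counter_to_mbytes 26583984   # mlx4_1 P 2
```

Dividing the result by the number of seconds between clearing and reading the counters gives a rough MBytes/sec figure.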
  • XmitPkts/RcvPkts show the total number of packets transferred on the link. The XmitData/RcvData above is carried inside these IB packets. Remember that this count also excludes link packets.
  • All counters are maintained per port, not per lane (the single X in 1X/4X/12X). The data counters count 4-octet units regardless of link width, so multiply by 4 to get bytes.
  • The base (SDR) IB link is 1X, clocked at 2.5 Gbps. It is transmitted over two signal pairs (one transmit, one receive) and yields an effective data rate of 2 Gbps full duplex (2 Gbps transmit, 2 Gbps receive).
  • IB 4X and 12X interfaces use the same base clock rate but use multiple lanes, where each lane is one transmit pair plus one receive pair.
  • 4X IB gets a signalling rate of 10 Gbps (8 Gbps data rate) using 4 lanes = 8 pairs.
InfiniBand Link   Signal Pairs   Signaling Rate   Data Rate   Full-Duplex Data Rate
1X                2              2.5 Gbps         2.0 Gbps    4.0 Gbps
4X                8              10 Gbps          8 Gbps      16 Gbps
12X               24             30 Gbps          24 Gbps     48 Gbps

            Note: Although the signaling rate is 2.5 Gbps, the effective data rate is limited to 2 Gbps due to the 8B/10B encoding scheme, i.e., (2.5*8)/10 = 2 Gbps.
  • XmitData/RcvData/XmitPkts/RcvPkts cover all the IB upper-layer protocols, such as SRP, IPoIB, and SDP (Sockets Direct Protocol). They total the data transmitted on the IB stack regardless of the layered protocol.
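
The 8B/10B arithmetic from the note above extends to any link width. The following is a minimal sketch of it; ib_data_rate_gbps is a made-up helper name, and the default of 2.5 Gbps per lane is the base signaling rate quoted above.

```shell
#!/bin/sh
# ib_data_rate_gbps LANES [SIGNAL_GBPS_PER_LANE] (hypothetical helper):
# one-way effective data rate after 8B/10B encoding (8 data bits per 10 line bits).
ib_data_rate_gbps() {
    awk -v lanes="$1" -v sig="${2:-2.5}" \
        'BEGIN { printf "%g\n", lanes * sig * 8 / 10 }'
}

# Reproduce the "Data Rate" column for 1X, 4X and 12X links
for lanes in 1 4 12; do
    echo "${lanes}X: $(ib_data_rate_gbps "$lanes") Gbps"
done
```

Doubling each result gives the full-duplex figure from the last column of the table.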

1 comment:

  1. Thank you for your article.
    So if I use these commands in sequence:
    perfquery -R;sleep 1;perfquery -x
    to reset, wait 1 sec, read port counters and the result will be
    # perfquery -R;sleep 1;perfquery -x
    # Port extended counters: Lid 35 port 1 (CapMask: 0x200)
    [root@wn005 ~]#
    I can derive that
    the transmit BW on that port is:
    601*4/1024/1024 MByte/sec
    and receive BW is 594*4/1024/1024 MByte/sec
    And what about Unicast and Multicast packets?

    Fedele Stabile