Monitoring InfiniBand Fabrics
This document covers strategies for monitoring InfiniBand fabrics.
Discovery
Before we start monitoring InfiniBand traffic on a host, we first need to understand its hardware configuration.
Things we are interested in:
- How many Host Channel Adapters (HCAs) are there on the host?
- How many ports does each HCA have?
- What are the Local IDs (LIDs) of each HCA?
To view this information in a concise, human-readable format, run ibstat on the host as root. You should get output similar to the following:
root@o186i221:~# ibstat
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.33.1048
    Hardware version: 0
    Node GUID: 0x88e9a4ffff317104
    System image GUID: 0x88e9a4ffff317104
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 152
        LMC: 0
        SM lid: 30
        Capability mask: 0xa651e848
        Port GUID: 0x88e9a4ffff317104
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.33.1048
    Hardware version: 0
    Node GUID: 0x88e9a4ffff3161a8
    System image GUID: 0x88e9a4ffff3161a8
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 147
        LMC: 0
        SM lid: 30
        Capability mask: 0xa651e848
        Port GUID: 0x88e9a4ffff3161a8
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.33.1048
    Hardware version: 0
    Node GUID: 0x88e9a4ffff31714c
    System image GUID: 0x88e9a4ffff31714c
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 154
        LMC: 0
        SM lid: 30
        Capability mask: 0xa651e848
        Port GUID: 0x88e9a4ffff31714c
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.33.1048
    Hardware version: 0
    Node GUID: 0x88e9a4ffff4299a4
    System image GUID: 0x88e9a4ffff4299a4
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x8ae9a4fffe4299a4
        Link layer: Ethernet
CA 'mlx5_4'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.33.1048
    Hardware version: 0
    Node GUID: 0x88e9a4ffff317168
    System image GUID: 0x88e9a4ffff317168
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 155
        LMC: 0
        SM lid: 30
        Capability mask: 0xa651e848
        Port GUID: 0x88e9a4ffff317168
        Link layer: InfiniBand
In this example, our system has five single-port HCAs: four ConnectX-6 (MT4123) adapters with InfiniBand link layers, and one ConnectX-5 (MT4119) adapter, mlx5_3, whose port runs Ethernet and therefore reports a LID of 0 (LIDs only apply to InfiniBand). The LIDs are:
- mlx5_0: 152
- mlx5_1: 147
- mlx5_2: 154
- mlx5_3: 0
- mlx5_4: 155
If everything was configured correctly when MOFED was installed, you should also see the cards appearing under /sys/class/infiniband:
root@o186i221:~# ls -l /sys/class/infiniband
total 0
lrwxrwxrwx 1 root root 0 Sep 6 13:17 mlx5_0 -> ../../devices/pci0000:00/0000:00:01.1/0000:03:00.0/0000:04:04.0/0000:0e:00.0/infiniband/mlx5_0
lrwxrwxrwx 1 root root 0 Sep 6 13:17 mlx5_1 -> ../../devices/pci0000:40/0000:40:01.1/0000:43:00.0/0000:44:00.0/0000:45:00.0/infiniband/mlx5_1
lrwxrwxrwx 1 root root 0 Sep 6 13:17 mlx5_2 -> ../../devices/pci0000:80/0000:80:01.1/0000:83:00.0/0000:84:00.0/0000:85:00.0/infiniband/mlx5_2
lrwxrwxrwx 1 root root 0 Sep 6 13:17 mlx5_3 -> ../../devices/pci0000:a0/0000:a0:03.1/0000:a3:00.0/0000:a4:02.0/0000:b0:00.0/infiniband/mlx5_3
lrwxrwxrwx 1 root root 0 Sep 6 13:17 mlx5_4 -> ../../devices/pci0000:c0/0000:c0:01.1/0000:c3:00.0/0000:c4:00.0/0000:c5:00.0/infiniband/mlx5_4
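Each port directory under these device paths also exposes attributes such as state, rate, and lid. As a minimal sketch (assuming the standard sysfs layout shown above, and that the lid file reports the LID in hexadecimal), you can enumerate every HCA's LID straight from sysfs:

for port in /sys/class/infiniband/*/ports/*; do
    dev=$(basename "$(dirname "$(dirname "$port")")")
    # printf's %d converts the hex LID (e.g. 0x98) to decimal (152)
    printf "%s port %s: LID %d\n" "$dev" "$(basename "$port")" "$(cat "$port/lid")"
done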
Underneath each device's directory, you can find counter files that track real-time statistics for each port. For example:
/sys/class/infiniband/mlx5_0/ports/1/counters
├── excessive_buffer_overrun_errors
├── link_downed
├── link_error_recovery
├── local_link_integrity_errors
├── multicast_rcv_packets
├── multicast_xmit_packets
├── port_rcv_constraint_errors
├── port_rcv_data
├── port_rcv_errors
├── port_rcv_packets
├── port_rcv_remote_physical_errors
├── port_rcv_switch_relay_errors
├── port_xmit_constraint_errors
├── port_xmit_data
├── port_xmit_discards
├── port_xmit_packets
├── port_xmit_wait
├── symbol_error
├── unicast_rcv_packets
├── unicast_xmit_packets
└── VL15_dropped
Cracking open one of these files reveals a port counter value:
root@o186i221:~/# cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_packets
280929405
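To dump every counter for a port at once, a quick loop over the counters directory works; here's a minimal sketch using the mlx5_0 paths from above:

for f in /sys/class/infiniband/mlx5_0/ports/1/counters/*; do
    printf "%-40s %s\n" "$(basename "$f")" "$(cat "$f")"
done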
Port Counter Descriptions
| Counter | Description | InfiniBand Spec Name |
|---|---|---|
| port_rcv_data | The total number of data octets, divided by 4 (counting in 32-bit double words), received on all VLs at the port. | PortRcvData |
| port_rcv_packets | Total number of packets received; this may include packets containing errors. This is a 64-bit counter. | PortRcvPkts |
| port_xmit_data | The total number of data octets, divided by 4 (counting in 32-bit double words), transmitted on all VLs from the port. | PortXmitData |
| port_xmit_packets | Total number of packets transmitted on all VLs from this port. This may include packets with errors. | PortXmitPkts |
| port_xmit_wait | The number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). | PortXmitWait |
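Since port_xmit_data and port_rcv_data count 32-bit words, multiply their values by 4 to get bytes. A minimal throughput sketch, again assuming mlx5_0 port 1:

COUNTER=/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
before=$(cat "$COUNTER")
sleep 1
after=$(cat "$COUNTER")
# Multiply by 4: the counter is in 32-bit (4-byte) words
echo "TX throughput: $(( (after - before) * 4 )) bytes/sec"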
Monitoring Counters
As time goes on, these counters increment as packets are received or transmitted on the HCA port. We could write our own tool to read these counters over time, but fortunately there are already tools that handle reading, aggregating, and resetting these counters, presenting the results in a nice format.
Perfquery
One of these tools, perfquery, comes with the infiniband-diags package, which gets installed when you install the rdma-core module. It allows you to query port counters and reset them after reading, making interval-based queries much easier.
Perfquery Usage
Usage: perfquery [options] [<lid|guid> [[port(s)] [reset_mask]]]
Options:
--extended, -x show extended port counters
--xmtsl, -X show Xmt SL port counters
--rcvsl, -S show Rcv SL port counters
--xmtdisc, -D show Xmt Discard Details
--rcverr, -E show Rcv Error Details
--extended_speeds, -T show port extended speeds counters
--oprcvcounters show Rcv Counters per Op code
--flowctlcounters show flow control counters
--vloppackets show packets received per Op code per VL
--vlopdata show data received per Op code per VL
--vlxmitflowctlerrors show flow control update errors per VL
--vlxmitcounters show ticks waiting to transmit counters per VL
--swportvlcong show sw port VL congestion
--rcvcc show Rcv congestion control counters
--slrcvfecn show SL Rcv FECN counters
--slrcvbecn show SL Rcv BECN counters
--xmitcc show Xmit congestion control counters
--vlxmittimecc show VL Xmit Time congestion control counters
--smplctl, -c show samples control
--all_ports, -a show aggregated counters
--loop_ports, -l iterate through each port
--reset_after_read, -r reset counters after read
--Reset_only, -R only reset counters
--config, -z <config> use config file, default: /etc/infiniband-diags/ibdiag.conf
--Ca, -C <ca> Ca name to use
--Port, -P <port> Ca port number to use
--Lid, -L use LID address argument
--Guid, -G use GUID address argument
--timeout, -t <ms> timeout in ms
--sm_port, -s <lid> SM port lid
--m_key, -y <key> M_Key to use in request
--errors, -e show send and receive errors
--verbose, -v increase verbosity level
--debug, -d raise debug level
--help, -h help message
--version, -V show version
Examples:
perfquery # read local port's performance counters
perfquery 32 1 # read performance counters from lid 32, port 1
perfquery -x 32 1 # read extended performance counters from lid 32, port 1
perfquery -a 32 # read performance counters from lid 32, all ports
perfquery -r 32 1 # read performance counters and reset
perfquery -x -r 32 1 # read extended performance counters and reset
perfquery -R 0x20 1 # reset performance counters of port 1 only
perfquery -x -R 0x20 1 # reset extended performance counters of port 1 only
perfquery -R -a 32 # reset performance counters of all ports
perfquery -R 32 2 0x0fff # reset only error counters of port 2
perfquery -R 32 2 0xf000 # reset only non-error counters of port 2
perfquery -a 32 1-10 # read performance counters from lid 32, port 1-10, aggregate output
perfquery -l 32 1-10 # read performance counters from lid 32, port 1-10, output each port
perfquery -a 32 1,4,8 # read performance counters from lid 32, port 1, 4, and 8, aggregate output
perfquery -l 32 1,4,8 # read performance counters from lid 32, port 1, 4, and 8, output each port
Just executing perfquery by itself will return counter results for the first HCA device/port it finds on the host. This isn't super helpful if we have multiple cards; instead, we'll want to specify a LID and port:
perfquery <device_lid> <device_port>
Example:
root@o186i221:~# perfquery 155 1
# Port counters: Lid 155 port 1 (CapMask: 0x5A00)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............0
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................0
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
QP1Dropped:......................0
VL15Dropped:.....................0
PortXmitData:....................282509
PortRcvData:.....................350990
PortXmitPkts:....................4225
PortRcvPkts:.....................5420
PortXmitWait:....................0
A more powerful usage is --reset_after_read, which resets the counters back to 0 after we've read them. This lets us run perfquery in an interval-based loop, resetting after each read, to get statistics for just the last interval. In this example, we read the counters every second, filtering the results to just the received and transmitted packets and data:
watch -n 1 'perfquery -r 152 1 | grep -P "Port(Xmit|Rcv)(Data|Pkts)"'
Example:
Every 1.0s: perfquery -r 152 1 | grep -P "Port... o186i221: Thu Sep 7 20:26:06 2023
PortXmitData:....................1008
PortRcvData:.....................1452
PortXmitPkts:....................14
PortRcvPkts:.....................20
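The same read-and-reset pattern works for capturing samples in a machine-readable form. Here's a minimal sketch that appends one timestamped row per second for LID 152, port 1 (the awk field handling assumes the dotted perfquery output format shown above):

while true; do
    # Read and reset, then strip the dot padding from the PortXmitData line
    xmit=$(perfquery -r 152 1 | awk -F: '/^PortXmitData/ {gsub(/\./, "", $2); print $2}')
    echo "$(date +%s),$xmit" >> xmit_data.csv
    sleep 1
done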
This is great for monitoring immediate activity on a single card, but on its own it doesn't track activity across multiple cards or aggregate information. That's why I've built the following tool, IBmon, which uses perfquery under the hood and captures the information to CSV files in a more post-processing-friendly way.
IBmon
Clone the omniscient repository so you can use the ibmon.sh and aggregate.py scripts:
git clone https://github.com/inf0rmatiker/omniscient
cd omniscient/monitoring/ibmon/
IBmon collects interval-based snapshots of the port counters into a concise CSV file that you can load into a Pandas DataFrame and graph using Matplotlib.
IBmon Usage
To run the ibmon.sh script, you'll need to provide it with a few pieces of information as positional arguments:
ibmon.sh <output_dir> <monitor_id> <snapshot_seconds> <total_snapshots> <ib_devices> <device_port>
- output_dir: The directory you want the snapshot results to go to.
- monitor_id: The unique ID of the monitoring session. This could be a number, a date string, or a UUID; really, whatever you want.
- snapshot_seconds: The interval between snapshots, in seconds. Default is usually 1.
- total_snapshots: The total number of snapshots you wish to take before exiting.
- ib_devices: A comma-separated string of mlx5 device names you wish to monitor.
- device_port: The port ID on each device you wish to monitor. Only 1 port per device is currently supported for monitoring. Default is 1, for the first port.
Example:
# ./ibmon.sh /tmp/omniscient/ root-20230911-200447 1 5 "mlx5_0,mlx5_1,mlx5_2,mlx5_4" 1
Port LID for mlx5_0, port 1: 152
Port LID for mlx5_1, port 1: 147
Port LID for mlx5_2, port 1: 154
Port LID for mlx5_4, port 1: 155
Resulting output files:
# tree /tmp/omniscient/
/tmp/omniscient/
├── o186i221_1694189342.ibmon.pid
├── o186i221_mlx5_0_152_root-20230911-200447.ibmon.csv
├── o186i221_mlx5_1_147_root-20230911-200447.ibmon.csv
├── o186i221_mlx5_2_154_root-20230911-200447.ibmon.csv
└── o186i221_mlx5_4_155_root-20230911-200447.ibmon.csv
0 directories, 5 files
Then you can use aggregate.py to aggregate the CSV results:
python3 aggregate.py example/ root-20230911-200447
This outputs two files:
- example/o186i221_root-20230911-200447_host_aggregate.ibmon.csv: Aggregates all the device captures into a single file for a given host.
- example/root-20230911-200447_total_aggregate.ibmon.csv: Aggregates all the host (and underlying device) captures into a single file for a given monitoring session, which may span multiple nodes in the cluster.
You can process or graph these files however you like (Microsoft Excel, Matplotlib, etc.). Personally, I like to use a Jupyter Notebook to plot the results. You can see an example here:
Omniscient
ibmon.sh sits in the omniscient repo, which is a set of utility scripts and a CLI for monitoring a cluster in a distributed fashion. This is easier than using the individual ibmon.sh and aggregate.py scripts directly; everything is handled for you by your omni configuration settings. View the Omniscient documentation below for how to use this tool.