Lustre Networking (LNET)

Original Lustre documentation is linked below.

LNET Utilities

  • lctl: Control Lustre via the ioctl interface

  • lnetctl: Manage LNET configurations
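
For example, assuming the lnet module is loaded and LNET is configured (see the next section), either tool can show the local NIDs and the running configuration:

lctl list_nids
lnetctl net show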

LNET Configuration

This section assumes you’re using an already-configured InfiniBand fabric with IP over InfiniBand (IPoIB). For this prerequisite step, see the InfiniBand Documentation.
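
As a quick sanity check (a minimal sketch; the peer address 192.168.0.102 is only an example), confirm that ib0 has an IPoIB address and can reach another node before configuring LNET:

ip addr show ib0
ping -c 3 192.168.0.102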

Configure LNET and add the ib0 physical interface as the o2ib network.

Load the lnet kernel module and apply the configuration:

modprobe lnet
lnetctl lnet configure
lnetctl net add --net o2ib --if ib0

Bring up the LNET network using lctl:

[root@mawenzi-06 ~]# lctl network up
LNET configured

Show the network using lnetctl:

[root@mawenzi-01 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 192.168.0.101@o2ib
          status: up
          interfaces:
              0: ib0

Client LNET Optimizations

  • Network Checksums: Checksums protect against data corruption on the network, but on a reliable, high-speed fabric like Slingshot or InfiniBand they mostly add CPU overhead. They are enabled (1) by default in most Lustre releases, so disable them explicitly on the client:

    lctl set_param osc.<filesystem_name>*.checksums=0
    • You can check the checksum settings using the following example:

      [ccarlson@n01 ~]$ sudo lctl get_param osc.aiholus1*.checksums
      osc.aiholus1-OST0000-osc-ffffa1584d9b1800.checksums=0
      osc.aiholus1-OST0001-osc-ffffa1584d9b1800.checksums=0
      osc.aiholus1-OST0002-osc-ffffa1584d9b1800.checksums=0
      osc.aiholus1-OST0003-osc-ffffa1584d9b1800.checksums=0
      osc.aiholus1-OST0004-osc-ffffa1584d9b1800.checksums=0
      osc.aiholus1-OST0005-osc-ffffa1584d9b1800.checksums=0
  • Max RPCs In-Flight: The maximum number of remote procedure calls (RPCs) the client keeps in flight to each server target. The default is 8; for high-speed networks, increase this to 64 on both the OSC and MDC devices:

    lctl set_param osc.<filesystem_name>*.max_rpcs_in_flight=64
    lctl set_param mdc.<filesystem_name>*.max_rpcs_in_flight=64
    • You can check the max RPCs in-flight using the following example:

      [ccarlson@n01 ~]$ sudo lctl get_param osc.aiholus1*.max_rpcs_in_flight
      osc.aiholus1-OST0000-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
      osc.aiholus1-OST0001-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
      osc.aiholus1-OST0002-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
      osc.aiholus1-OST0003-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
      osc.aiholus1-OST0004-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
      osc.aiholus1-OST0005-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
      [ccarlson@n01 ~]$ sudo lctl get_param mdc.aiholus1*.max_rpcs_in_flight
      mdc.aiholus1-MDT0000-mdc-ffffa1584d9b1800.max_rpcs_in_flight=64
      mdc.aiholus1-MDT0001-mdc-ffffa1584d9b1800.max_rpcs_in_flight=64
  • Max Pages per RPC: Defines the maximum RPC size sent from the client to the server, in pages. The default depends on the installed Lustre version: roughly 256 pages (1MB) for Lustre 2.12 and 4096 pages (16MB) for Lustre 2.15. Set the OSC value to 1024 pages (4MB) and leave the MDC at 256:

    lctl set_param osc.<filesystem_name>*.max_pages_per_rpc=1024
    lctl set_param mdc.<filesystem_name>*.max_pages_per_rpc=256
  • Max Dirty MB: Defines the maximum amount of dirty data (in MB) held in client memory that has not yet been written back; this can be increased based on the client’s memory capacity. The default is 2000:

    lctl set_param osc.<filesystem_name>*.max_dirty_mb=2000
    lctl set_param mdc.<filesystem_name>*.max_dirty_mb=2000
  • Max Read-Ahead MB: Defines the maximum amount of data the client may prefetch when it detects a sequential read on a file. The default is 64MiB; set this to 512MiB. (A script that applies all of these client settings in one pass is sketched after this list.)

    lctl set_param llite.<filesystem_name>*.max_read_ahead_mb=512
    lctl set_param llite.<filesystem_name>*.max_read_ahead_per_file_mb=512
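
The following is a minimal sketch that applies all of the client settings above in one pass. The filesystem name aiholus1 is only an example (it matches the outputs shown above); substitute your own. Note that lctl set_param changes do not survive a remount or reboot, so either re-run a script like this at boot or set the parameters persistently from the MGS with lctl set_param -P.

#!/bin/bash
# Apply the client-side Lustre tunings described above.
# FSNAME is an example; replace it with your filesystem name.
FSNAME="aiholus1"

lctl set_param osc.${FSNAME}*.checksums=0
lctl set_param osc.${FSNAME}*.max_rpcs_in_flight=64
lctl set_param mdc.${FSNAME}*.max_rpcs_in_flight=64
lctl set_param osc.${FSNAME}*.max_pages_per_rpc=1024
lctl set_param mdc.${FSNAME}*.max_pages_per_rpc=256
lctl set_param osc.${FSNAME}*.max_dirty_mb=2000
lctl set_param mdc.${FSNAME}*.max_dirty_mb=2000
lctl set_param llite.${FSNAME}*.max_read_ahead_mb=512
lctl set_param llite.${FSNAME}*.max_read_ahead_per_file_mb=512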

Delete LNET Network

Example existing LNET configuration:

net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.10.5.1@tcp
          status: up

Delete the tcp network:

lnetctl net del --net tcp

Persist LNET Configuration Between Boots

If you want your LNET configuration to persist across reboots, write it to /etc/lnet.conf.

Export an existing LNET configuration to /etc/lnet.conf:

lnetctl export > /etc/lnet.conf
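
If your Lustre packages ship the lnet systemd service (an assumption; check your distribution), enabling it will re-apply /etc/lnet.conf at boot. You can also re-import the file by hand:

systemctl enable lnet
lnetctl import /etc/lnet.conf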

Multirail Configuration

If you have multiple InfiniBand channel adapters, you’ll want to configure LNET to use them in a multirail configuration. The example below shows all four ConnectX-6 cards in use:

[ccarlson@n01 ior-4.0.0rc1]$ sudo lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib
      local NI(s):
        - nid: 10.10.5.1@o2ib
          status: up
          interfaces:
              0: ib0
        - nid: 10.10.5.21@o2ib
          status: up
          interfaces:
              0: ib1
        - nid: 10.10.5.41@o2ib
          status: up
          interfaces:
              0: ib2
        - nid: 10.10.5.61@o2ib
          status: up
          interfaces:
              0: ib3
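
To build a configuration like this, lnetctl accepts a comma-separated interface list (a sketch matching the interface names above; you can also run lnetctl net add once per interface):

lnetctl net add --net o2ib --if ib0,ib1,ib2,ib3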

IB Device ARP Settings

Multiple network interfaces on the same node can cause the OS to return the wrong hardware address for a requested IP. Because o2iblnd uses IPoIB, this can hand LNET the wrong address and degrade performance.

Below is an example where a system has four Mellanox InfiniBand ConnectX-6 cards, each with its own IP address. We’ll need to adjust the ARP and routing behavior on these interfaces:

[ccarlson@n01 ~]$ ip a | grep ib
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:60:d4:9a brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.10.5.1/16 brd 10.10.255.255 scope global noprefixroute ib0
15: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:60:74:66 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.10.5.21/16 brd 10.10.255.255 scope global noprefixroute ib1
16: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:60:74:2e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.10.5.41/16 brd 10.10.255.255 scope global noprefixroute ib2
17: ib3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:60:c4:7e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.10.5.61/16 brd 10.10.255.255 scope global noprefixroute ib3

Use the following script to apply the ARP and source-based routing settings on all four cards:

#!/bin/bash

# Check we are running as root
USER_ID=$(id -u)
if [[ $USER_ID -ne 0 ]]; then
  echo "Must run as root user"
  exit 1
fi

SUBNET="10.10"
SUBNET_MASK="16"
IB0_IP="10.10.5.1"
IB1_IP="10.10.5.21"
IB2_IP="10.10.5.41"
IB3_IP="10.10.5.61"

# ARP and rp_filter settings: each interface replies only for addresses configured on it
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0

sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0

sysctl -w net.ipv4.conf.ib2.arp_ignore=1
sysctl -w net.ipv4.conf.ib2.arp_filter=0
sysctl -w net.ipv4.conf.ib2.arp_announce=2
sysctl -w net.ipv4.conf.ib2.rp_filter=0

sysctl -w net.ipv4.conf.ib3.arp_ignore=1
sysctl -w net.ipv4.conf.ib3.arp_filter=0
sysctl -w net.ipv4.conf.ib3.arp_announce=2
sysctl -w net.ipv4.conf.ib3.rp_filter=0

# Flush any stale neighbor (ARP) entries on the IB interfaces
ip neigh flush dev ib0
ip neigh flush dev ib1
ip neigh flush dev ib2
ip neigh flush dev ib3

# Create one routing table per interface (append only if not already present,
# so the script can be re-run safely)
grep -q "^200 ib0$" /etc/iproute2/rt_tables || echo "200 ib0" >> /etc/iproute2/rt_tables
grep -q "^201 ib1$" /etc/iproute2/rt_tables || echo "201 ib1" >> /etc/iproute2/rt_tables
grep -q "^202 ib2$" /etc/iproute2/rt_tables || echo "202 ib2" >> /etc/iproute2/rt_tables
grep -q "^203 ib3$" /etc/iproute2/rt_tables || echo "203 ib3" >> /etc/iproute2/rt_tables

# Route the IPoIB subnet out of each interface via its own table
ip route add $SUBNET/$SUBNET_MASK dev ib0 proto kernel scope link src $IB0_IP table ib0
ip route add $SUBNET/$SUBNET_MASK dev ib1 proto kernel scope link src $IB1_IP table ib1
ip route add $SUBNET/$SUBNET_MASK dev ib2 proto kernel scope link src $IB2_IP table ib2
ip route add $SUBNET/$SUBNET_MASK dev ib3 proto kernel scope link src $IB3_IP table ib3

# Select the routing table based on the source address of outgoing traffic
ip rule add from $IB0_IP table ib0
ip rule add from $IB1_IP table ib1
ip rule add from $IB2_IP table ib2
ip rule add from $IB3_IP table ib3

ip route flush cache
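
These settings do not survive a reboot. One approach (a sketch; the script path /usr/local/sbin/lnet-ipoib-arp.sh is arbitrary and assumes you saved the script above there) is a oneshot systemd unit ordered after the network is up, since the per-interface sysctl keys only exist once the IB interfaces are present:

cat > /etc/systemd/system/lnet-ipoib-arp.service <<'EOF'
[Unit]
Description=IPoIB ARP and routing settings for LNET multirail
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/lnet-ipoib-arp.sh

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable lnet-ipoib-arp.service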

PCIe Relaxed Ordering

If you’re using multiple IB CAs in a multirail configuration, you’ll need to set the PCI device ordering to relaxed instead of the default, which is strict. Relaxed ordering "allows switches in the path between the Requester and Completer to reorder some transactions just received before others that were previously enqueued" [1]. See [1] for more detail on the PCI relaxed ordering mechanism.

The following script sets relaxed ordering on all Mellanox devices discovered through the Mellanox Software Tools (mst).

#!/bin/bash

# Sets device ordering to relaxed for multirail InfiniBand.

# Check we are running as root
USER_ID=$(id -u)
if [[ $USER_ID -ne 0 ]]; then
  echo "Must run as root user"
  exit 1
fi

# Start Mellanox Software Tools (MST)
mst start

# See what cards are available
MST_DEVICES=$(find /dev/mst/ -name "*pciconf*" | sort)
MADE_CHANGE=0
for MST_DEVICE in $MST_DEVICES; do
  echo "Checking $MST_DEVICE..."
  ORDERING=$(mlxconfig -d $MST_DEVICE q | grep "PCI_WR_ORDERING" | xargs | awk '{print $2}')
  echo "$MST_DEVICE PCI write ordering: $ORDERING"
  if [[ $ORDERING == *"0"* ]]; then
    echo "Ordering set to strict. Setting to relaxed..."
    mlxconfig -y -d $MST_DEVICE s PCI_WR_ORDERING=1
    MADE_CHANGE=1
  else
    echo "Ordering already set to relaxed. Skipping."
  fi
done

[[ $MADE_CHANGE -eq 1 ]] && echo "Made changes to PCI device ordering. Reboot the system for them to take effect." || echo "No changes made."

Example output:

[ccarlson@n01 ~]$ sudo ./set_relaxed_ordering.sh
[sudo] password for ccarlson:
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success
Checking /dev/mst/mt4123_pciconf0...
/dev/mst/mt4123_pciconf0 PCI write ordering: force_relax(1)
Ordering already set to relaxed. Skipping.
Checking /dev/mst/mt4123_pciconf1...
/dev/mst/mt4123_pciconf1 PCI write ordering: per_mkey(0)
Ordering set to strict. Setting to relaxed...

Device #1:
----------

Device type:    ConnectX6
Name:           MCX653105A-HDA_HPE_Ax
Description:    HPE InfiniBand HDR/Ethernet 200Gb 1-port MCX653105A-HDAT QSFP56 x16 Adapter
Device:         /dev/mst/mt4123_pciconf1

Configurations:                              Next Boot       New
         PCI_WR_ORDERING                     per_mkey(0)     force_relax(1)

 Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Checking /dev/mst/mt4123_pciconf2...
/dev/mst/mt4123_pciconf2 PCI write ordering: per_mkey(0)
Ordering set to strict. Setting to relaxed...

Device #1:
----------

Device type:    ConnectX6
Name:           MCX653105A-HDA_HPE_Ax
Description:    HPE InfiniBand HDR/Ethernet 200Gb 1-port MCX653105A-HDAT QSFP56 x16 Adapter
Device:         /dev/mst/mt4123_pciconf2

Configurations:                              Next Boot       New
         PCI_WR_ORDERING                     per_mkey(0)     force_relax(1)

 Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Checking /dev/mst/mt4123_pciconf3...
/dev/mst/mt4123_pciconf3 PCI write ordering: per_mkey(0)
Ordering set to strict. Setting to relaxed...

Device #1:
----------

Device type:    ConnectX6
Name:           MCX653105A-HDA_HPE_Ax
Description:    HPE InfiniBand HDR/Ethernet 200Gb 1-port MCX653105A-HDAT QSFP56 x16 Adapter
Device:         /dev/mst/mt4123_pciconf3

Configurations:                              Next Boot       New
         PCI_WR_ORDERING                     per_mkey(0)     force_relax(1)

 Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Made changes to PCI device ordering. Reboot the system for them to take effect.
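
After the reboot, you can confirm the new value took effect by querying the firmware again (the device path matches the example above; adjust it for your system):

mst start
mlxconfig -d /dev/mst/mt4123_pciconf1 q | grep PCI_WR_ORDERING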

Tracing lctl ping Code Path for o2iblnd

lctl ping starts in lctl.c, where the command table maps the ping command to the jt_ptl_ping function. That function is declared in obdctl.h and defined in portals.c.

In jt_ptl_ping, a struct libcfs_ioctl_data is initialized. Here’s the structure of libcfs_ioctl_data:

struct libcfs_ioctl_data {
	struct libcfs_ioctl_hdr ioc_hdr;

	__u64 ioc_nid;
	__u64 ioc_u64[1];

	__u32 ioc_flags;
	__u32 ioc_count;
	__u32 ioc_net;
	__u32 ioc_u32[7];

	__u32 ioc_inllen1;
	char *ioc_inlbuf1;
	__u32 ioc_inllen2;
	char *ioc_inlbuf2;

	__u32 ioc_plen1; /* buffers in userspace */
	void __user *ioc_pbuf1;
	__u32 ioc_plen2; /* buffers in userspace */
	void __user *ioc_pbuf2;

	char ioc_bulk[0];
};

Lastly, l_ioctl is called.

l_ioctl first opens a file descriptor with open_ioc_dev(LNET_DEV_ID). It then calls ioctl(fd, opc, buf), where fd is the file descriptor for the LNET device, opc is IOC_LIBCFS_PING, and buf is our libcfs_ioctl_data struct.
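
You can watch this happen from userspace with strace (a rough illustration; the NID is just an example, and the opcode appears as a raw ioctl number on the /dev/lnet descriptor rather than as the IOC_LIBCFS_PING symbol):

strace -e trace=openat,ioctl lctl ping 192.168.0.101@o2ib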

Notes on ioctl registration

Before this ioctl call is made, libcfs should already have registered a device file capable of receiving ioctl calls; this registration happens when the modules are loaded.

There’s a struct ioc_dev internal to libcfs:

struct ioc_dev {
	const char *dev_name;
	int dev_fd;
};

This is filled out and added to a static struct ioc_dev ioc_dev_list[10]; when register_ioc_dev() is called.

When libcfs starts up as a kernel module, it creates a struct miscdevice, with a pointer to struct file_operations libcfs_fops, which in turn has a pointer to the libcfs_psdev_ioctl function:

static const struct file_operations libcfs_fops = {
	.owner			    = THIS_MODULE,
	.unlocked_ioctl	= libcfs_psdev_ioctl,
};

static struct miscdevice libcfs_dev = {
	.minor		= MISC_DYNAMIC_MINOR,
	.name			= "lnet",
	.fops			= &libcfs_fops,
};

This miscdevice struct is registered via misc_register in the __init libcfs_init function that runs when the module is being initialized.

Essentially, this just registers a misc device with the Linux kernel; see the kernel documentation on misc devices for background.
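
Once the modules are loaded, you can see this registration from userspace; the misc device appears in /proc/misc and as a character device node (a quick check, assuming the lnet and libcfs modules are loaded):

grep lnet /proc/misc
ls -l /dev/lnet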

Back to the actual call chain…

libcfs_psdev_ioctl is registered as the unlocked ioctl handler. This calls libcfs_ioctl, again with cmd being the IOC_LIBCFS_PING opcode, and *uparam being a void pointer to the libcfs_ioctl_data struct from earlier.

libcfs_ioctl converts that libcfs_ioctl_data struct into a struct libcfs_ioctl_hdr *hdr that is used from here on. It then looks at the cmd opcode: debug-related opcodes are handled directly, but otherwise it calls blocking_notifier_call_chain, a Linux kernel function that walks the list of ioctl handlers (notifier_block entries) on libcfs_ioctl_list and calls each one with cmd and hdr. Before this point, our notifier lnet_ioctl_handler, which references the lnet_ioctl function, should already have been registered on that list.

The chain now crosses over into LNet territory, in the lnet_ioctl function.

lnet_ioctl is a wrapper that handles a few opcode types directly, such as IOC_LIBCFS_CONFIGURE, IOC_LIBCFS_UNCONFIGURE, and IOC_LIBCFS_ADD_NET. Our IOC_LIBCFS_PING is not among these, so the default case is used, which calls LNetCtl(cmd, hdr). That function lives in api-ni.c: https://github.hpe.com/hpe/hpc-lus-filesystem/blob/cray-2.15/lnet/lnet/api-ni.c#L4080

LNetCtl has cases for many more opcode types, and this is where our IOC_LIBCFS_PING opcode is finally handled. The handler zero-initializes a struct lnet_process_id id4 and fills it in:

id4.nid = data->ioc_nid;
id4.pid = data->ioc_u32[0];

Then it calls lnet_ping():

rc = lnet_ping(id4, &LNET_ANY_NID, timeout, data->ioc_pbuf1,
			       data->ioc_plen1 / sizeof(struct lnet_process_id));

lnet_ping initializes and registers a "free-floating" Memory Descriptor (MD) struct (struct lnet_md md) and a ping buffer (struct lnet_ping_buffer *pbuf). Part of the MD initialization sets md.handler to the lnet_ping_event_handler function, and lnet_ping calls LNetMDBind() to register the MD.

To read more about Memory Descriptors (MD) and Matching Entries (ME), see the Lustre Internals PDF on page 53: https://wiki.old.lustre.org/images/d/da/Understanding_Lustre_Filesystem_Internals.pdf

Finally, a call to LNetGet() is made:

rc = LNetGet(lnet_nid_to_nid4(src_nid), pd.mdh, id,
		     LNET_RESERVED_PORTAL,
		     LNET_PROTO_PING_MATCHBITS, 0, false);

On the receiving side:

See lnet_ping_target_setup() for how the ping target is created.

The GET arrives on the LNet reserved portal with the ping match bits, which match it to the ping target MD. The reply is simply the contents of that memory.
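
To exercise the whole path end to end, ping a peer NID from a client; on success, lctl prints the list of NIDs the peer responds with (the NID below is just an example):

lctl ping 192.168.0.101@o2ib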

References

[1] T. Shanley, D. Anderson, R. Budruk, MindShare, Inc, PCI Express System Architecture, Addison-Wesley Professional, 2003. [E-book] Available: https://learning.oreilly.com/library/view/pci-express-system/0321156307/