Lustre Networking (LNET)
Original Lustre documentation is linked below.
LNET Configuration
This section assumes you’re using an already-configured InfiniBand fabric with IP over InfiniBand (IPoIB). To see how to complete this prerequisite step, view the InfiniBand Documentation.
Configure LNET and add the ib0 physical interface as the o2ib network. Load the lnet kernel module, then configure LNET and add the interface:
modprobe lnet
lnetctl lnet configure
lnetctl net add --net o2ib --if ib0
Bring up the LNET network using lctl
[root@mawenzi-06 ~]# lctl network up
LNET configured
Show the network using lnetctl
[root@mawenzi-01 ~]# lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: o2ib
local NI(s):
- nid: 192.168.0.101@o2ib
status: up
interfaces:
0: ib0
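To verify LNET connectivity between nodes, you can ping a peer’s NID once both sides are configured. This is a sketch; the peer NID below is hypothetical and should be replaced with one from your fabric.
# Ping a peer NID over the o2ib network (hypothetical NID; a successful ping lists the responding NID)
lctl ping 192.168.0.102@o2ib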
Client LNET Optimizations
- Network Checksums: These are used to protect against network corruption, but on a reliable, high-speed network like Slingshot or InfiniBand there is little use for them. Disable them (set to 0) if they are not already disabled:
lctl set_param osc.<filesystem_name>*.checksums=0
You can check the checksum settings using the following example:
[ccarlson@n01 ~]$ sudo lctl get_param osc.aiholus1*.checksums
osc.aiholus1-OST0000-osc-ffffa1584d9b1800.checksums=0
osc.aiholus1-OST0001-osc-ffffa1584d9b1800.checksums=0
osc.aiholus1-OST0002-osc-ffffa1584d9b1800.checksums=0
osc.aiholus1-OST0003-osc-ffffa1584d9b1800.checksums=0
osc.aiholus1-OST0004-osc-ffffa1584d9b1800.checksums=0
osc.aiholus1-OST0005-osc-ffffa1584d9b1800.checksums=0
- Max RPCs In-Flight: The maximum number of concurrent remote procedure calls (RPCs) the client keeps in flight to each target. By default, this is 8. For high-speed networks, increase this to 64 (the example below shows both the OSC and MDC devices tuned to 64):
lctl set_param mdc.<filesystem_name>*.max_rpcs_in_flight=64
You can check the max RPCs in-flight using the following example:
[ccarlson@n01 ~]$ sudo lctl get_param osc.aiholus1*.max_rpcs_in_flight
osc.aiholus1-OST0000-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
osc.aiholus1-OST0001-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
osc.aiholus1-OST0002-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
osc.aiholus1-OST0003-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
osc.aiholus1-OST0004-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
osc.aiholus1-OST0005-osc-ffffa1584d9b1800.max_rpcs_in_flight=64
[ccarlson@n01 ~]$ sudo lctl get_param mdc.aiholus1*.max_rpcs_in_flight
mdc.aiholus1-MDT0000-mdc-ffffa1584d9b1800.max_rpcs_in_flight=64
mdc.aiholus1-MDT0001-mdc-ffffa1584d9b1800.max_rpcs_in_flight=64
- Max Pages per RPC: Defines the maximum RPC size sent from the client to the server. The default depends on the installed Lustre version, but is around 256 (1MB) for Lustre 2.12 and 4096 (16MB) for Lustre 2.15. We’ll want the OSC value at 1024:
lctl set_param osc.<filesystem_name>*.max_pages_per_rpc=1024
lctl set_param mdc.<filesystem_name>*.max_pages_per_rpc=256
- Max Dirty MB: Defines the maximum amount of dirty data (MB) held in client memory that hasn’t yet been written; this can be increased based on the client’s memory capacity. The default is 2000:
lctl set_param osc.<filesystem_name>*.max_dirty_mb=2000
lctl set_param mdc.<filesystem_name>*.max_dirty_mb=2000
- Max Read-Ahead MB: Defines the maximum amount of data that can be prefetched by the client when a sequential read is detected on a file. The default is 64MiB. We’ll want to set this to 512MiB:
lctl set_param llite.<filesystem_name>*.max_read_ahead_mb=512
lctl set_param llite.<filesystem_name>*.max_read_ahead_per_file_mb=512
A consolidated script applying all of these settings is sketched after this list.
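The settings above can be applied together in one pass. Below is a minimal sketch, assuming a hypothetical filesystem named aiholus1 to match the examples above; adjust the name and values for your site. Note that plain lctl set_param values do not survive a reboot, so they need to be reapplied (or set persistently) after each boot.
#!/bin/bash
# Minimal sketch: apply the client-side tunables described above.
# FSNAME is hypothetical; substitute your filesystem's name.
FSNAME="aiholus1"

# Disable network checksums
lctl set_param "osc.${FSNAME}*.checksums=0"

# Increase max RPCs in flight (the OSC line matches the check output shown above)
lctl set_param "osc.${FSNAME}*.max_rpcs_in_flight=64"
lctl set_param "mdc.${FSNAME}*.max_rpcs_in_flight=64"

# Maximum pages per RPC
lctl set_param "osc.${FSNAME}*.max_pages_per_rpc=1024"
lctl set_param "mdc.${FSNAME}*.max_pages_per_rpc=256"

# Maximum dirty data held in client memory
lctl set_param "osc.${FSNAME}*.max_dirty_mb=2000"
lctl set_param "mdc.${FSNAME}*.max_dirty_mb=2000"

# Read-ahead limits
lctl set_param "llite.${FSNAME}*.max_read_ahead_mb=512"
lctl set_param "llite.${FSNAME}*.max_read_ahead_per_file_mb=512"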
Delete LNET Network
Example existing LNET configuration:
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: tcp
local NI(s):
- nid: 10.10.5.1@tcp
status: up
Delete the tcp network:
lnetctl net del --net tcp
Persist LNET Configuration Between Boots
If you want your LNET configuration to persist after a reboot, you’ll need to write it to a persistent file, /etc/lnet.conf.
Export an existing LNET configuration to /etc/lnet.conf:
lnetctl export >> /etc/lnet.conf
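Many Lustre installations also ship an lnet systemd service that imports /etc/lnet.conf at boot; whether it is present depends on your packaging. As a sketch, the saved configuration can be re-applied manually and the service enabled if available:
# Re-apply a saved LNET configuration (assumes the file was previously exported)
lnetctl import /etc/lnet.conf
# Enable the LNET service if your Lustre packages provide one (service name may vary)
systemctl enable lnet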
Multirail Configuration
If you have multiple InfiniBand Channel Adapters, you’ll want to configure LNET to use them in a multirail configuration. The following example shows a multirail LNET configuration with all four ConnectX-6 cards in use:
[ccarlson@n01 ior-4.0.0rc1]$ sudo lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: o2ib
local NI(s):
- nid: 10.10.5.1@o2ib
status: up
interfaces:
0: ib0
- nid: 10.10.5.21@o2ib
status: up
interfaces:
0: ib1
- nid: 10.10.5.41@o2ib
status: up
interfaces:
0: ib2
- nid: 10.10.5.61@o2ib
status: up
interfaces:
0: ib3
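A multirail configuration like the one above can be created by adding each interface to the o2ib network; lnetctl accepts a comma-separated interface list. This is a sketch assuming the interface names from this example and that the lnet module is already loaded and configured:
lnetctl net add --net o2ib --if ib0,ib1,ib2,ib3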
IB Device ARP Settings
Multiple network interfaces on the same node may cause issues with the OS returning the wrong hardware address for a requested IP. Because o2iblnd uses IPoIB, we can get the wrong address, degrading performance.
Below is an example where we have four Mellanox InfiniBand ConnectX-6 cards on a system, each with its own IP address. We’ll need to turn off ARP broadcasting on these:
[ccarlson@n01 ~]$ ip a | grep ib
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:60:d4:9a brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.10.5.1/16 brd 10.10.255.255 scope global noprefixroute ib0
15: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:60:74:66 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.10.5.21/16 brd 10.10.255.255 scope global noprefixroute ib1
16: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:60:74:2e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.10.5.41/16 brd 10.10.255.255 scope global noprefixroute ib2
17: ib3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 00:00:10:29:fe:80:00:00:00:00:00:00:88:e9:a4:ff:ff:60:c4:7e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.10.5.61/16 brd 10.10.255.255 scope global noprefixroute ib3
Use the following script to disable ARP broadcasting on all four cards:
#!/bin/bash
# Check we are running as root
USER_ID=$(id -u)
if [[ $USER_ID -ne 0 ]]; then
echo "Must run as root user"
exit 1
fi
SUBNET="10.10"
SUBNET_MASK="16"
IB0_IP="10.10.5.1"
IB1_IP="10.10.5.21"
IB2_IP="10.10.5.41"
IB3_IP="10.10.5.61"
# Set ARP parameters so interfaces do not answer ARP for addresses configured on other interfaces
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0
sysctl -w net.ipv4.conf.ib2.arp_ignore=1
sysctl -w net.ipv4.conf.ib2.arp_filter=0
sysctl -w net.ipv4.conf.ib2.arp_announce=2
sysctl -w net.ipv4.conf.ib2.rp_filter=0
sysctl -w net.ipv4.conf.ib3.arp_ignore=1
sysctl -w net.ipv4.conf.ib3.arp_filter=0
sysctl -w net.ipv4.conf.ib3.arp_announce=2
sysctl -w net.ipv4.conf.ib3.rp_filter=0
ip neigh flush dev ib0
ip neigh flush dev ib1
ip neigh flush dev ib2
ip neigh flush dev ib3
echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
echo 202 ib2 >> /etc/iproute2/rt_tables
echo 203 ib3 >> /etc/iproute2/rt_tables
ip route add $SUBNET/$SUBNET_MASK dev ib0 proto kernel scope link src $IB0_IP table ib0
ip route add $SUBNET/$SUBNET_MASK dev ib1 proto kernel scope link src $IB1_IP table ib1
ip route add $SUBNET/$SUBNET_MASK dev ib2 proto kernel scope link src $IB2_IP table ib2
ip route add $SUBNET/$SUBNET_MASK dev ib3 proto kernel scope link src $IB3_IP table ib3
ip rule add from $IB0_IP table ib0
ip rule add from $IB1_IP table ib1
ip rule add from $IB2_IP table ib2
ip rule add from $IB3_IP table ib3
ip route flush cache
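After running the script, you can spot-check what it applied; for example, the interface name and routing table below match this system:
# Confirm the per-interface ARP behavior
sysctl net.ipv4.conf.ib0.arp_ignore net.ipv4.conf.ib0.arp_announce
# Confirm the source-based routing rules and the per-interface routing table
ip rule show
ip route show table ib0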
PCIe Relaxed Ordering
If you’re using multiple IB CAs in a multirail configuration, you’ll need to set the PCI device ordering to relaxed instead of the default, which is strict. This "allows switches in the path between the Requester and Completer to reorder some transactions just received before others that were previously enqueued" [1]. Read more about PCI Relaxed Ordering mechanisms in [1].
The following script can be run to set relaxed ordering on all Mellanox devices discovered through the Mellanox Software Tools (mst).
#!/bin/bash
# Sets device ordering to relaxed for multirail InfiniBand.
# Check we are running as root
USER_ID=$(id -u)
if [[ $USER_ID -ne 0 ]]; then
echo "Must run as root user"
exit 1
fi
# Start Mellanox Software Tools (MST)
mst start
# See what cards are available
MST_DEVICES=$(find /dev/mst/ -name "*pciconf*" | sort)
MADE_CHANGE=0
for MST_DEVICE in $MST_DEVICES; do
echo "Checking $MST_DEVICE..."
ORDERING=$(mlxconfig -d $MST_DEVICE q | grep "PCI_WR_ORDERING" | xargs | awk '{print $2}')
echo "$MST_DEVICE PCI write ordering: $ORDERING"
if [[ $ORDERING == *"0"* ]]; then
echo "Ordering set to strict. Setting to relaxed..."
mlxconfig -y -d $MST_DEVICE s PCI_WR_ORDERING=1
MADE_CHANGE=1
else
echo "Ordering already set to relaxed. Skipping."
fi
done
[[ $MADE_CHANGE -eq 1 ]] && echo "Made changes to PCI device ordering. Reboot the system for them to take effect." || echo "No changes made."
Example output:
[ccarlson@n01 ~]$ sudo ./set_relaxed_ordering.sh
[sudo] password for ccarlson:
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success
Checking /dev/mst/mt4123_pciconf0...
/dev/mst/mt4123_pciconf0 PCI write ordering: force_relax(1)
Ordering already set to relaxed. Skipping.
Checking /dev/mst/mt4123_pciconf1...
/dev/mst/mt4123_pciconf1 PCI write ordering: per_mkey(0)
Ordering set to strict. Setting to relaxed...
Device #1:
----------
Device type: ConnectX6
Name: MCX653105A-HDA_HPE_Ax
Description: HPE InfiniBand HDR/Ethernet 200Gb 1-port MCX653105A-HDAT QSFP56 x16 Adapter
Device: /dev/mst/mt4123_pciconf1
Configurations: Next Boot New
PCI_WR_ORDERING per_mkey(0) force_relax(1)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Checking /dev/mst/mt4123_pciconf2...
/dev/mst/mt4123_pciconf2 PCI write ordering: per_mkey(0)
Ordering set to strict. Setting to relaxed...
Device #1:
----------
Device type: ConnectX6
Name: MCX653105A-HDA_HPE_Ax
Description: HPE InfiniBand HDR/Ethernet 200Gb 1-port MCX653105A-HDAT QSFP56 x16 Adapter
Device: /dev/mst/mt4123_pciconf2
Configurations: Next Boot New
PCI_WR_ORDERING per_mkey(0) force_relax(1)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Checking /dev/mst/mt4123_pciconf3...
/dev/mst/mt4123_pciconf3 PCI write ordering: per_mkey(0)
Ordering set to strict. Setting to relaxed...
Device #1:
----------
Device type: ConnectX6
Name: MCX653105A-HDA_HPE_Ax
Description: HPE InfiniBand HDR/Ethernet 200Gb 1-port MCX653105A-HDAT QSFP56 x16 Adapter
Device: /dev/mst/mt4123_pciconf3
Configurations: Next Boot New
PCI_WR_ORDERING per_mkey(0) force_relax(1)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
Made changes to PCI device ordering. Reboot the system for them to take effect.
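After rebooting, the new ordering can be confirmed with the same query the script uses, for example:
mst start
for DEV in $(find /dev/mst/ -name "*pciconf*" | sort); do
    mlxconfig -d $DEV q | grep PCI_WR_ORDERING
done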
Tracing lctl ping Code Path for o2iblnd
lctl ping begins in lctl.c, which defines that command to call the jt_ptl_ping function. That function is declared in obdctl.h and defined in portals.c.
In jt_ptl_ping, a struct libcfs_ioctl_data is initialized. Here’s the structure of libcfs_ioctl_data:
struct libcfs_ioctl_data {
struct libcfs_ioctl_hdr ioc_hdr;
__u64 ioc_nid;
__u64 ioc_u64[1];
__u32 ioc_flags;
__u32 ioc_count;
__u32 ioc_net;
__u32 ioc_u32[7];
__u32 ioc_inllen1;
char *ioc_inlbuf1;
__u32 ioc_inllen2;
char *ioc_inlbuf2;
__u32 ioc_plen1; /* buffers in userspace */
void __user *ioc_pbuf1;
__u32 ioc_plen2; /* buffers in userspace */
void __user *ioc_pbuf2;
char ioc_bulk[0];
};
Lastly, l_ioctl is called (defined here). l_ioctl first opens a file descriptor using open_ioc_dev(LNET_DEV_ID). Then it calls ioctl(fd, opc, buf), where fd is our file descriptor for the LNET device, opc is IOC_LIBCFS_PING, and buf is our libcfs_ioctl_data struct.
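If you want to watch this happen from userspace, strace can show the ioctl being issued against the LNET device. This is a sketch; the NID is hypothetical and the output formatting will vary.
# Trace only ioctl syscalls made by lctl while pinging a (hypothetical) NID
sudo strace -e trace=ioctl lctl ping 10.10.5.1@o2ib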
Notes on ioctl registration
Before this ioctl call is made, libcfs should have already been registered as a device/file capable of receiving ioctl calls when the modules were loaded.
There’s a struct ioc_dev internally in libcfs:
struct ioc_dev {
const char *dev_name;
int dev_fd;
};
This is filled out and added to a static array, struct ioc_dev ioc_dev_list[10], when register_ioc_dev() is called.
When libcfs starts up as a kernel module, it creates a struct miscdevice with a pointer to struct file_operations libcfs_fops, which in turn has a pointer to the libcfs_psdev_ioctl function:
static const struct file_operations libcfs_fops = {
.owner = THIS_MODULE,
.unlocked_ioctl = libcfs_psdev_ioctl,
};
static struct miscdevice libcfs_dev = {
.minor = MISC_DYNAMIC_MINOR,
.name = "lnet",
.fops = &libcfs_fops,
};
This miscdevice struct is registered via misc_register in the __init libcfs_init function that runs when the module is being initialized.
There is some information on Misc Devices available, but essentially this is just registering a misc device with the Linux kernel.
Back to the actual call-chain…
libcfs_psdev_ioctl is registered as the unlocked ioctl handler. This calls libcfs_ioctl, again with cmd being the IOC_LIBCFS_PING opcode and *uparam being a void pointer to the libcfs_ioctl_data struct from earlier.
libcfs_ioctl turns that libcfs_ioctl_data struct into a new struct libcfs_ioctl_hdr *hdr usable going forward. It then looks at the cmd opcode; if it’s a DEBUG-related opcode it does some extra handling, but normally it just ends up calling blocking_notifier_call_chain, another Linux kernel function that walks the list of ioctl handlers (notifier_block) on libcfs_ioctl_list and calls them with cmd and hdr. Prior to this, our notifier lnet_ioctl_handler, containing a reference to the lnet_ioctl function, should have been registered on this list.
Here’s a link to the lnet_ioctl function (note how this is now over in lnet territory).
lnet_ioctl is a wrapper that handles some opcode types, like IOC_LIBCFS_CONFIGURE, IOC_LIBCFS_UNCONFIGURE, IOC_LIBCFS_ADD_NET, etc. Our IOC_LIBCFS_PING is not among these, so the default case is used, which calls LNetCtl(cmd, hdr). Here’s a link to that function, in api-ni.c: https://github.hpe.com/hpe/hpc-lus-filesystem/blob/cray-2.15/lnet/lnet/api-ni.c#L4080
It’s got cases for many more opcode types, and we finally see our opcode being handled here. This fills out a struct lnet_process_id id4 = {}:
id4.nid = data->ioc_nid;
id4.pid = data->ioc_u32[0];
Then it calls lnet_ping():
rc = lnet_ping(id4, &LNET_ANY_NID, timeout, data->ioc_pbuf1,
data->ioc_plen1 / sizeof(struct lnet_process_id));
That initializes and registers a "free-floating" Memory Descriptor (MD) struct (struct lnet_md md) and a ping buffer (struct lnet_ping_buffer *pbuf). Part of the MD initialization sets md.handler to the lnet_ping_event_handler function. lnet_ping calls LNetMDBind() to register this MD.
To read more about Memory Descriptors (MD) and Matching Entries (ME), see the Lustre Internals PDF on page 53: https://wiki.old.lustre.org/images/d/da/Understanding_Lustre_Filesystem_Internals.pdf
Finally, a call to LNetGet() is made:
rc = LNetGet(lnet_nid_to_nid4(src_nid), pd.mdh, id,
LNET_RESERVED_PORTAL,
LNET_PROTO_PING_MATCHBITS, 0, false);
On the receiving side, look into lnet_ping_target_setup(). The GET comes in on the LNet reserved portal with the match bits, which then gets matched to the MD. The reply is just the contents of that memory.
References
[1] T. Shanley, D. Anderson, R. Budruk, MindShare, Inc., PCI Express System Architecture, Addison-Wesley Professional, 2003. [E-book]. Available: https://learning.oreilly.com/library/view/pci-express-system/0321156307/