Lustre Client

Configure and Mount Lustre Filesystem

Using the Lustre Client Repo

All of the publicly-available Lustre builds, for both server and client, can be found at https://downloads.whamcloud.com/public/lustre/.

Here we’ll be using Rocky Linux 8.6 as a base.

First, set your HTTP/HTTPS proxy information in your environment so you can reach the greater internet:

cat >> /etc/environment<< EOF
#Proxies for LR1
http_proxy="http://proxy.houston.hpecorp.net:8080/"
https_proxy="http://proxy.houston.hpecorp.net:8080/"
ftp_proxy="http://proxy.houston.hpecorp.net:8080/"
no_proxy="localhost,127.0.0.1,.us.cray.com,.americas.cray.com,.dev.cray.com,.eag.rdlabs.hpecorp.net"
EOF
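
/etc/environment is normally only read at login (by PAM), so new sessions pick these values up automatically. If you need them in the current shell, one minimal sketch (assuming the simple KEY="value" format used above) is to source the file with allexport enabled:

# Mirror /etc/environment into the current shell
set -a
source /etc/environment
set +a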

Set the dnf proxy configuration and disable GPG key checking for ease of use:

cat >> /etc/dnf/dnf.conf<< EOF
[main]
gpgcheck=0
installonly_limit=3
clean_requirements_on_remove=True
best=True
skip_if_unavailable=False
proxy=http://proxy.houston.hpecorp.net:8080
EOF

Next, create a new repo file providing the following:

  • Lustre client pieces

    • Modules built with MOFED Infiniband support

cat >> /etc/yum.repos.d/lustre.repo<< EOF
[lustre-client]
name=rl8.6-ib - Lustre
baseurl=https://downloads.whamcloud.com/public/lustre/lustre-2.15.1-ib/MOFED-5.6-2.0.9.0/el8.6/client/
gpgcheck=0
EOF
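
Before installing, you can optionally confirm that dnf sees the new repo and can resolve the package, for example:

# The lustre-client repo should be listed and the package metadata should resolve
dnf repolist
dnf info lustre-client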

Now that you’ve added this repo, install the Lustre client using dnf:

dnf install epel-release lustre-client -y --allowerasing
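
Before rebooting, it may be worth confirming that the client packages and kernel modules actually landed; a quick check might look like:

# List installed Lustre packages and confirm the lustre kernel module is available
rpm -qa | grep -i lustre
modinfo lustre | head -n 5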

Reboot so that the changes take effect and the newly installed Lustre kernel modules are loaded.

reboot

Enable Infiniband Card

Load the IP over Infiniband (ipoib) module, allowing us to assign our Infiniband device an IP address.

modprobe ib_ipoib

Assign a static IP address to the ib0 device, and set the link state to UP

ip addr add 192.168.0.103/24 dev ib0
ip link set dev ib0 up
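
You can verify that the interface is up with the expected address, for example:

# Brief view of ib0 link state and its assigned address
ip -br link show ib0
ip -br addr show ib0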

Load the LNET module

modprobe -v lnet

A better way to make the ib0 IP assignment persistent is to set the following fields in /etc/sysconfig/network-scripts/ifcfg-ib0:

ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.0.103
NETMASK=255.255.255.0

A prerequisite for this is to have the ib_ipoib module loaded, which can be done by adding an entry to /etc/modules-load.d/. While we’re here, we can also add on-boot loading of lnet.

echo ib_ipoib > /etc/modules-load.d/ipoib.conf
echo lnet > /etc/modules-load.d/lnet.conf

Configure LNET, and add the ib0 physical interface as the o2ib network

lnetctl lnet configure
lnetctl net add --net o2ib --if ib0
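
To confirm the o2ib network was added and to see this client's NID, you can run:

# Show configured LNET networks and the local NID(s)
lnetctl net show
lctl list_nids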

Bring up the LNET network using lctl

[root@mawenzi-06 ~]# lctl network up
LNET configured

Mount the Lustre filesystem using the Lustre client

mkdir -p /mnt/lustre
mount -t lustre 192.168.0.101@o2ib:/lustre /mnt/lustre
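
If you want the mount to come back after a reboot, one option (a sketch using the same MGS NID and mount point as above) is an /etc/fstab entry with _netdev so the mount waits for networking:

# /etc/fstab entry (sketch): mount the Lustre filesystem once the network is up
192.168.0.101@o2ib:/lustre  /mnt/lustre  lustre  defaults,_netdev  0 0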

Lustre Client Usage

Show Lustre mounts on the system

[root@mawenzi-06 ~]# mount -t lustre
192.168.0.101@o2ib:/lustre on /mnt/lustre type lustre (rw,seclabel,checksum,flock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,encrypt)

Show Lustre client version

  • By looking at /proc/fs/lustre/version

k8s-worker1-shira-mercury01:~ # cat /proc/fs/lustre/version
lustre: 2.15.0.7_rc2_cray_3_g412d1c5

  • By using lctl get_param version

root@o186i221:~# lctl get_param version
version=2.15.2.2_cray_189_gb367a17
version=lustre: 2.15.2.2_cray_189_gb367a17

Using the Lustre Client to Check the Filesystem

Use the lfs utility to view information about the filesystem.

lfs check: Display the status of MGTs, MDTs or OSTs (as specified in the command) or all the servers (MGTs, MDTs and OSTs).

k8s-worker1-shira-mercury01:~ # lfs check all
testfs-OST0000-osc-ffff9ea20bb2d800 active.
testfs-OST0001-osc-ffff9ea20bb2d800 active.
testfs-MDT0000-mdc-ffff9ea20bb2d800 active.
testfs-MDT0001-mdc-ffff9ea20bb2d800 active.
MGC7@kfi active.

lfs df: Report filesystem disk space usage or inode usage of each MDT and OST, or of a subset of OSTs belonging to a specific pool.

k8s-worker1-shira-mercury01:~ # lfs df /lus
UUID                   1K-blocks        Used   Available Use% Mounted on
testfs-MDT0000_UUID  10037371136      518656 10036850432   1% /lus[MDT:0]
testfs-MDT0001_UUID  10037534976      184064 10037348864   1% /lus[MDT:1]
testfs-OST0000_UUID  14645113856  5222995968  9422115840  36% /lus[OST:0]
testfs-OST0001_UUID  14645118976   111712256 14533404672   1% /lus[OST:1]

filesystem_summary:  29290232832  5334708224 23955520512  19% /lus

lctl pool_list <filesystem>: Show pools for a Lustre filesystem.

root@o186i221:~/ccarlson/experiments# lctl pool_list cstor1
Pools from cstor1:
cstor1.disk
cstor1.flash

lfs setstripe -c <count> -p <pool> <directory>: Set the default striping of a directory so that new files are placed only on OSTs in the given Lustre pool (the directory itself is created with mkdir first).

mkdir /mnt/cstor1/ccarlson/flash
lfs setstripe -c 1 -p cstor1.flash /mnt/cstor1/ccarlson/flash

lfs getstripe <directory>: Show the striping of a file or directory on the Lustre filesystem.

root@o186i221:~/ccarlson/experiments# lfs getstripe /mnt/cstor1/ccarlson/flash
/mnt/cstor1/ccarlson/flash
stripe_count:  1 stripe_size:   1048576 pattern:       raid0 stripe_offset: -1 pool:          flash

Client Connectivity

Viewing client connectivity to the MGS:

52a33fef-e9df-417c-98de-a811c4f36816:~ # for snid in $(lctl list_nids | xargs echo); do for dnid in 2586@kfi 2650@kfi 2651@kfi 2696@kfi ; do echo "$snid -> $dnid" ; lnetctl ping --source $snid --timeout 127 $dnid ; done ; done
2079@kfi -> 2586@kfi
ping:
    - primary nid: 2586@kfi
      Multi-Rail: True
      peer ni:
        - nid: 2586@kfi
        - nid: 2650@kfi
2079@kfi -> 2650@kfi
ping:
    - primary nid: 2586@kfi
      Multi-Rail: True
      peer ni:
        - nid: 2586@kfi
        - nid: 2650@kfi
2079@kfi -> 2651@kfi
ping:
    - primary nid: 2586@kfi
      Multi-Rail: True
      peer ni:
        - nid: 2651@kfi
        - nid: 2696@kfi
2079@kfi -> 2696@kfi
ping:
    - primary nid: 2586@kfi
      Multi-Rail: True
      peer ni:
        - nid: 2651@kfi
        - nid: 2696@kfi
2270@kfi -> 2586@kfi
ping:
    - primary nid: 2586@kfi
      Multi-Rail: True
      peer ni:
        - nid: 2586@kfi
        - nid: 2650@kfi
2270@kfi -> 2650@kfi
ping:
    - primary nid: 2586@kfi
      Multi-Rail: True
      peer ni:
        - nid: 2586@kfi
        - nid: 2650@kfi
2270@kfi -> 2651@kfi
ping:
    - primary nid: 2586@kfi
      Multi-Rail: True
      peer ni:
        - nid: 2651@kfi
        - nid: 2696@kfi
2270@kfi -> 2696@kfi
ping:
    - primary nid: 2586@kfi
      Multi-Rail: True
      peer ni:
        - nid: 2651@kfi
        - nid: 2696@kfi

And viewing a single peer connection in high detail:

52a33fef-e9df-417c-98de-a811c4f36816:~ # lnetctl peer show -v 4 --nid 2586@kfi
peer:
    - primary nid: 2586@kfi
      Multi-Rail: True
      peer state: 273
      peer ni:
        - nid: 2586@kfi
          udsp info:
              net priority: -1
              nid priority: -1
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 127
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          refcount: 1
          statistics:
              send_count: 51
              recv_count: 51
              drop_count: 0
          sent_stats:
              put: 47
              get: 4
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 46
              get: 0
              reply: 4
              ack: 1
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
              ping_count: 0
              next_ping: 0
        - nid: 2650@kfi
          udsp info:
              net priority: -1
              nid priority: -1
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 127
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          refcount: 1
          statistics:
              send_count: 49
              recv_count: 48
              drop_count: 0
          sent_stats:
              put: 47
              get: 2
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 45
              get: 0
              reply: 2
              ack: 1
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
              ping_count: 0
              next_ping: 0
        - nid: 2651@kfi
          udsp info:
              net priority: -1
              nid priority: -1
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 127
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          refcount: 1
          statistics:
              send_count: 49
              recv_count: 3
              drop_count: 0
          sent_stats:
              put: 46
              get: 3
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 3
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
              ping_count: 0
              next_ping: 0
        - nid: 2696@kfi
          udsp info:
              net priority: -1
              nid priority: -1
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 127
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          refcount: 1
          statistics:
              send_count: 49
              recv_count: 3
              drop_count: 0
          sent_stats:
              put: 46
              get: 3
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 3
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
              ping_count: 0
              next_ping: 0

Replace Existing Lustre Client Installation

On Ubuntu 22.04:

  1. Show Lustre filesystem mounts

    root@o186i221:~# mount -t lustre
    172.22.184.42@o2ib:172.22.184.43@o2ib:/seagate on /cstor type lustre (rw,checksum,flock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt)
    172.22.187.183@o2ib,172.22.187.184@o2ib:172.22.187.185@o2ib,172.22.187.186@o2ib:/cstor1 on /mnt/cstor1 type lustre (rw,checksum,flock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt)
  2. Unmount Lustre filesystems

    root@o186i221:~# umount -t lustre /mnt/cstor1
    umount: /mnt/cstor1: target is busy.
    • Looks like something is using the filesystem. Find the PID of the processes using it and kill them:

      root@o186i221:~# lsof /mnt/cstor1
      COMMAND     PID USER   FD   TYPE     DEVICE SIZE/OFF               NODE NAME
      tmux:\x20 46615 root  cwd    DIR 778,293024     4096 144116096797005002 /mnt/cstor1/ssamar/flash/results/resnet50/host4_run1
      bash      46616 root  cwd    DIR 778,293024     4096 144116077788400128 /mnt/cstor1/ssamar/flash/resnet50/HPE/benchmarks/resnet/implementations/mxnet
      bash      48218 root  cwd    DIR 778,293024     4096 144116077788400128 /mnt/cstor1/ssamar/flash/resnet50/HPE/benchmarks/resnet/implementations/mxnet
      bash      48325 root  cwd    DIR 778,293024     4096 144116077788400128 /mnt/cstor1/ssamar/flash/resnet50/HPE/benchmarks/resnet/implementations/mxnet
      root@o186i221:~# kill 46615 46616 48218 48325
    • Now, retry the unmount operation and double check no more Lustre-typed filesystems are mounted.

      root@o186i221:~# umount -t lustre /mnt/cstor1
      root@o186i221:~# umount -t lustre /cstor
      root@o186i221:~# mount -t lustre
      root@o186i221:~#


Building the Lustre Client

Persistent Client Cache (PCC)

PCC Prerequisites

Make sure you have the Lustre client modules installed and that LNET is up and running.

lnetctl lnet configure
lnetctl net add --net o2ib --if ib0
lctl network up
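
Optionally verify LNET-level connectivity to the server before mounting; assuming 192.168.0.101@o2ib is the server NID (as in the mount command below), something like:

# LNET ping of the server NID; a list of its NIDs indicates it is reachable
lnetctl ping 192.168.0.101@o2ib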

Make sure you have the Lustre filesystem mounted

mount -t lustre 192.168.0.101@o2ib:/lustre /mnt/lustre

PCC Installation

Create a clean ext4 partition on an NVMe drive. This is where the PCC cache data will live.

Here, I’m using fdisk /dev/nvme1n1 to create a new partition spanning the full disk.
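
If you would rather not drive fdisk interactively, a scripted alternative (a sketch; this is destructive, so double-check the device name) is a single parted call:

# Destructive: write a GPT label and create one partition spanning the whole disk
parted -s /dev/nvme1n1 mklabel gpt mkpart primary ext4 0% 100%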

[root@mawenzi-07 ~]# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0          11:0    1  2.1G  0 rom
nvme1n1     259:0    0  1.5T  0 disk
└─nvme1n1p1 259:9    0  1.5T  0 part
nvme0n1     259:1    0  1.5T  0 disk
├─nvme0n1p1 259:2    0  600M  0 part /boot/efi
├─nvme0n1p2 259:3    0    1G  0 part /boot
└─nvme0n1p3 259:4    0   74G  0 part
  ├─rl-root 253:0    0   70G  0 lvm  /
  └─rl-swap 253:1    0    4G  0 lvm  [SWAP]
nvme2n1     259:5    0  1.5T  0 disk
nvme3n1     259:6    0  1.5T  0 disk
nvme4n1     259:7    0  1.5T  0 disk

Then, make an ext4 filesystem on that partition:

[root@mawenzi-07 ~]# mkfs -t ext4 /dev/nvme1n1p1
mke2fs 1.45.6 (20-Mar-2020)
Discarding device blocks: done
Creating filesystem with 390703190 4k blocks and 97681408 inodes
Filesystem UUID: 792ae761-b8cb-4e60-91e4-ab991b3a9f0b
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
	102400000, 214990848

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

Mount the partition at /mnt/pcc (creating the mount point first with mkdir -p /mnt/pcc if it does not already exist):

[root@mawenzi-07 ~]# mount -t ext4 /dev/nvme1n1p1 /mnt/pcc

Launch a Hierarchical Storage Management (HSM) copytool daemon with the next available archive ID. In this case there are already two other PCC clients, so we use an ID of 3.

lhsmtool_posix --daemon --hsm-root /mnt/pcc --archive=3 /mnt/lustre < /dev/null > /tmp/copytool_log 2>&1

Use lctl to add the /mnt/pcc PCC backend to the client. Here we specify a parameter list using --param:

  • uid={0} means auto-cache anything written by the root user.

  • rwid=3 means use the archive with ID 3, which is what we just created using lhsmtool_posix.

lctl pcc add /mnt/lustre /mnt/pcc --param "uid={0} rwid=3"
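
You can confirm the backend was registered with lctl pcc list (available in Lustre releases that include PCC):

# Show PCC backends attached to this Lustre mount point
lctl pcc list /mnt/lustre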

Now, test PCC by creating a new file with some junk text:

[root@mawenzi-07 ~]# echo "QQQQQ" > /mnt/lustre/test2
[root@mawenzi-07 ~]# lfs pcc state /mnt/lustre/test2
file: /mnt/lustre/test2, type: readwrite, PCC file: /0002/0000/13aa/0000/0002/0000/0x2000013aa:0x2:0x0, user number: 0, flags: 0

You can view the PCC file by looking under the PCC path /mnt/pcc:

[root@mawenzi-07 ~]# xxd /mnt/pcc/0002/0000/13aa/0000/0002/0000/0x2000013aa\:0x2\:0x0
00000000: 5151 5151 510a                           QQQQQ.
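
To release the cached copy, or re-attach it to the same archive, the lfs pcc subcommands can be used; a sketch:

# Detach the file from PCC, re-attach it to archive ID 3 (the rwid configured
# above), and check its state again
lfs pcc detach /mnt/lustre/test2
lfs pcc attach -i 3 /mnt/lustre/test2
lfs pcc state /mnt/lustre/test2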

Client Benchmarks

Preliminary experimental benchmarks involving both PCC and non-PCC Lustre clients by Abhinav Vemulapalli:

Talk by John Fragalla regarding Lustre benchmarking:

Non-PCC Benchmarks

dd

See the dd documentation for a fuller overview of this tool.

Create a script dd_benchmark.sh with the following contents

#!/bin/bash

# Write 50 GiB of zeroes to the Lustre mount in 4 KiB blocks, five times over,
# removing each test file after it is written
for aa in {1..5}; do
    dd if=/dev/zero of=/mnt/lustre/file$aa bs=4k iflag=fullblock,count_bytes count=50G
    rm -f /mnt/lustre/file$aa
done

Each iteration copies 50 GiB of zeroes to /mnt/lustre/fileN in 4 KiB blocks, then removes the file.

Running this should produce output similar to the following:

[root@mawenzi-06 ~]# ./dd_benchmark.sh
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 118.528 s, 453 MB/s
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 146.544 s, 366 MB/s
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 125.689 s, 427 MB/s
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 138.86 s, 387 MB/s
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 136.06 s, 395 MB/s

fio

fio --name benchmark1 --filename=/lus/aiholus1/disk/ccarlson/testfile --rw=read --size=128g --blocksize=1024k --ioengine=libaio --direct=1 --numjobs=1
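
This runs a 128 GiB sequential read of a single file with 1 MiB blocks, direct I/O, and the libaio engine. A corresponding write test is the same invocation with --rw=write (a sketch; the path is just an example and the file will be overwritten):

fio --name benchmark1-write --filename=/lus/aiholus1/disk/ccarlson/testfile --rw=write --size=128g --blocksize=1024k --ioengine=libaio --direct=1 --numjobs=1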

IOzone

/opt/iozone/bin/iozone -Ra -g 150G -b pcc-iozone-output.wks -i 0 -f /mnt/lustre/iozone-benchmarking

Lustre IOR

/usr/lib64/openmpi/bin/mpirun --allow-run-as-root -n 8 /usr/local/bin/ior -v -t 1m -b 32g -o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"` -F -C -e

IOR options

  • -t: Transfer size

  • -v: Verbose

  • -b: Block size (how big each file is that gets created)

  • -o: Output file name/path

  • -F: File-per-process, instead of single shared file

  • -C: Reorder tasks so each MPI process reads back data written by a neighboring node rather than its own, defeating client-side read caching

  • -e: Issue an fsync() call immediately after all of the write()s return to force the dirty pages we just wrote to flush out to Lustre

Lustre Client Tunings

Here’s a script to tune a Lustre client for a ClusterStor E1000 filesystem.

#!/bin/bash

# mdc: metadata client
# osc: object storage client

# Disable checksums on mdc and osc
lctl set_param osc.cstor1*.checksums=0
lctl set_param mdc.cstor1*.checksums=0

# Increase RPCs in flight limit for mdc and osc
lctl set_param osc.cstor1*.max_rpcs_in_flight=256
lctl set_param mdc.cstor1*.max_rpcs_in_flight=256

# Enable 16MB RPCs for osc, and 1MB RPCs for mdc
lctl set_param osc.cstor1*.max_pages_per_rpc=4096
lctl set_param mdc.cstor1*.max_pages_per_rpc=256

# Set ~2GB limit on max dirty (unwritten) cached data for osc and mdc
lctl set_param osc.cstor1*.max_dirty_mb=2000
lctl set_param mdc.cstor1*.max_dirty_mb=2000

# Set read-ahead tunings
lctl set_param llite.cstor1*.max_read_ahead_mb=512
lctl set_param llite.cstor1*.max_read_ahead_per_file_mb=512
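
Note that lctl set_param changes do not survive a client reboot. You can re-run this script at boot, or (a sketch, run on the MGS rather than on the client) push parameters persistently with lctl set_param -P, for example:

# On the MGS: persistently set a client tuning for all clients of cstor1
lctl set_param -P osc.cstor1*.max_rpcs_in_flight=256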

To see the current client tunings:

#!/bin/bash

# mdc: metadata client
# osc: object storage client

# Get checksums on mdc and osc
lctl get_param osc.cstor1*.checksums
lctl get_param mdc.cstor1*.checksums

# Get RPCs in flight limit for mdc and osc
lctl get_param osc.cstor1*.max_rpcs_in_flight
lctl get_param mdc.cstor1*.max_rpcs_in_flight

# Get max pages per RPC for osc and mdc
lctl get_param osc.cstor1*.max_pages_per_rpc
lctl get_param mdc.cstor1*.max_pages_per_rpc

# Get limit on max dirty cached data for osc and mdc
lctl get_param osc.cstor1*.max_dirty_mb
lctl get_param mdc.cstor1*.max_dirty_mb

# Get read-ahead tunings
lctl get_param llite.cstor1*.max_read_ahead_mb
lctl get_param llite.cstor1*.max_read_ahead_per_file_mb