Lustre Client
Configure and Mount Lustre Filesystem
Using the Lustre Client Repo
All of the publicly available Lustre builds, for both server and client, can be found on the Whamcloud downloads site: https://downloads.whamcloud.com/public/lustre/
Here we’ll be using Rocky Linux 8.6 as a base.
First, set your HTTP/HTTPS proxy information in your environment so the node can reach the public internet:
cat >> /etc/environment<< EOF
#Proxies for LR1
http_proxy="http://proxy.houston.hpecorp.net:8080/"
https_proxy="http://proxy.houston.hpecorp.net:8080/"
ftp_proxy="http://proxy.houston.hpecorp.net:8080/"
no_proxy="localhost,127.0.0.1,.us.cray.com,.americas.cray.com,.dev.cray.com,.eag.rdlabs.hpecorp.net"
EOF
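The /etc/environment file is normally only picked up at login; to use the proxy variables in the current shell right away, one option is to export them directly (a minimal convenience sketch):
set -a
source /etc/environment
set +a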
Set the dnf proxy configuration, and disable GPG key checking for ease of use:
cat >> /etc/dnf/dnf.conf<< EOF
[main]
gpgcheck=0
installonly_limit=3
clean_requirements_on_remove=True
best=True
skip_if_unavailable=False
proxy=http://proxy.houston.hpecorp.net:8080
EOF
Next, create a new repo file for the following:
- Lustre client pieces
- Modules built with MOFED InfiniBand support
cat >> /etc/yum.repos.d/lustre.repo<< EOF
[lustre-client]
name=rl8.6-ib - Lustre
baseurl=https://downloads.whamcloud.com/public/lustre/lustre-2.15.1-ib/MOFED-5.6-2.0.9.0/el8.6/client/
gpgcheck=0
EOF
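Before installing, it can be worth a quick sanity check that the new repo resolves through the proxy:
dnf clean all
dnf repolist
dnf info lustre-client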
Now that the repo is in place, install the Lustre client using dnf:
dnf install epel-release lustre-client -y --allowerasing
Reboot for changes to take effect, new patched kernel to be loaded, etc.
reboot
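After the reboot, it's worth confirming that the client packages and kernel modules are actually present (a quick check; exact package names depend on the repo build):
rpm -qa | grep -i lustre        # installed Lustre client packages
modinfo lustre | head -n 5      # confirm the lustre kernel module is available
lsmod | grep -E 'lustre|lnet'   # modules currently loaded (may be empty before mounting)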
Enable InfiniBand Card
Load the IP over InfiniBand (ipoib) module, allowing us to assign our InfiniBand device an IP address:
modprobe ib_ipoib
Assign a static IP address to the ib0 device, and set the link state to UP:
ip addr add 192.168.0.103/24 dev ib0
ip link set dev ib0 up
Load the LNET module
modprobe -v lnet
A better way to do this persistently is to set the following fields in /etc/sysconfig/network-scripts/ifcfg-ib0:
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.0.103
NETMASK=255.255.255.0
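On Rocky Linux 8 the same persistent configuration can also be done through NetworkManager rather than editing the ifcfg file by hand; a sketch, assuming a connection named ib0:
nmcli connection add type infiniband con-name ib0 ifname ib0 \
    ipv4.method manual ipv4.addresses 192.168.0.103/24 \
    connection.autoconnect yes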
A prerequisite for this is to have the ib_ipoib module loaded, which can be done by adding an entry to /etc/modules-load.d/. While we're here, we can also add on-boot modprobing for lnet.
echo ib_ipoib > /etc/modules-load.d/ipoib.conf
echo lnet > /etc/modules-load.d/lnet.conf
Configure LNET, and add the ib0 physical interface as the o2ib network:
lnetctl lnet configure
lnetctl net add --net o2ib --if ib0
Bring up the LNET network using lctl
[root@mawenzi-06 ~]# lctl network up
LNET configured
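To carry this LNET configuration across reboots, the running configuration can be exported and replayed at boot by the lnet systemd service that ships with the Lustre packages (a sketch; verify the service exists on your install):
lnetctl export --backup > /etc/lnet.conf   # dump the current LNET config in importable form
systemctl enable lnet                      # replays /etc/lnet.conf at boot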
Mount the lustre filesystem using the Lustre client:
mkdir -p /mnt/lustre
mount -t lustre 192.168.0.101@o2ib:/lustre /mnt/lustre
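To remount automatically at boot, an /etc/fstab entry along these lines can be used (a sketch; _netdev delays the mount until the network is up):
# /etc/fstab
192.168.0.101@o2ib:/lustre  /mnt/lustre  lustre  defaults,_netdev  0 0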
Lustre Client Usage
Show Lustre mounts on the system
[root@mawenzi-06 ~]# mount -t lustre
192.168.0.101@o2ib:/lustre on /mnt/lustre type lustre (rw,seclabel,checksum,flock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,encrypt)
Show Lustre client version
- By looking at /proc/fs/lustre/version:
k8s-worker1-shira-mercury01:~ # cat /proc/fs/lustre/version
lustre: 2.15.0.7_rc2_cray_3_g412d1c5
- By using lctl get_param version:
root@o186i221:~# lctl get_param version
version=2.15.2.2_cray_189_gb367a17
Using Lustre client to check filesystem
Use the lfs utility to view information about the filesystem.
lfs check: Display the status of MGTs, MDTs, or OSTs (as specified in the command), or all of the servers (MGTs, MDTs, and OSTs).
k8s-worker1-shira-mercury01:~ # lfs check all
testfs-OST0000-osc-ffff9ea20bb2d800 active.
testfs-OST0001-osc-ffff9ea20bb2d800 active.
testfs-MDT0000-mdc-ffff9ea20bb2d800 active.
testfs-MDT0001-mdc-ffff9ea20bb2d800 active.
MGC7@kfi active.
lfs df: Report filesystem disk space usage or inode usage for each MDT and OST, or for a subset of targets belonging to a specific pool.
k8s-worker1-shira-mercury01:~ # lfs df /lus
UUID 1K-blocks Used Available Use% Mounted on
testfs-MDT0000_UUID 10037371136 518656 10036850432 1% /lus[MDT:0]
testfs-MDT0001_UUID 10037534976 184064 10037348864 1% /lus[MDT:1]
testfs-OST0000_UUID 14645113856 5222995968 9422115840 36% /lus[OST:0]
testfs-OST0001_UUID 14645118976 111712256 14533404672 1% /lus[OST:1]
filesystem_summary: 29290232832 5334708224 23955520512 19% /lus
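The same command accepts -i to report inode usage and -h for human-readable sizes:
lfs df -i /lus    # inode usage per MDT/OST
lfs df -h /lus    # block usage in human-readable units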
lctl pool_list <filesystem>: Show pools for a Lustre filesystem.
root@o186i221:~/ccarlson/experiments# lctl pool_list cstor1
Pools from cstor1:
cstor1.disk
cstor1.flash
lfs setstripe -c <count> -p <pool> <directory>: Set the stripe count for a directory and restrict it to a specific Lustre pool (here, after creating the directory first):
mkdir /mnt/cstor1/ccarlson/flash
lfs setstripe -c 1 -p cstor1.flash /mnt/cstor1/ccarlson/flash
lfs getstripe <directory>: Show the striping of a file or directory on the Lustre filesystem.
root@o186i221:~/ccarlson/experiments# lfs getstripe /mnt/cstor1/ccarlson/flash
/mnt/cstor1/ccarlson/flash
stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1 pool: flash
Client Connectivity
Viewing client connectivity to MGS:
52a33fef-e9df-417c-98de-a811c4f36816:~ # for snid in $(lctl list_nids | xargs echo); do for dnid in 2586@kfi 2650@kfi 2651@kfi 2696@kfi ; do echo "$snid -> $dnid" ; lnetctl ping --source $snid --timeout 127 $dnid ; done ; done
2079@kfi -> 2586@kfi
ping:
- primary nid: 2586@kfi
Multi-Rail: True
peer ni:
- nid: 2586@kfi
- nid: 2650@kfi
2079@kfi -> 2650@kfi
ping:
- primary nid: 2586@kfi
Multi-Rail: True
peer ni:
- nid: 2586@kfi
- nid: 2650@kfi
2079@kfi -> 2651@kfi
ping:
- primary nid: 2586@kfi
Multi-Rail: True
peer ni:
- nid: 2651@kfi
- nid: 2696@kfi
2079@kfi -> 2696@kfi
ping:
- primary nid: 2586@kfi
Multi-Rail: True
peer ni:
- nid: 2651@kfi
- nid: 2696@kfi
2270@kfi -> 2586@kfi
ping:
- primary nid: 2586@kfi
Multi-Rail: True
peer ni:
- nid: 2586@kfi
- nid: 2650@kfi
2270@kfi -> 2650@kfi
ping:
- primary nid: 2586@kfi
Multi-Rail: True
peer ni:
- nid: 2586@kfi
- nid: 2650@kfi
2270@kfi -> 2651@kfi
ping:
- primary nid: 2586@kfi
Multi-Rail: True
peer ni:
- nid: 2651@kfi
- nid: 2696@kfi
2270@kfi -> 2696@kfi
ping:
- primary nid: 2586@kfi
Multi-Rail: True
peer ni:
- nid: 2651@kfi
- nid: 2696@kfi
And viewing a single peer connection in high detail:
52a33fef-e9df-417c-98de-a811c4f36816:~ # lnetctl peer show -v 4 --nid 2586@kfi
peer:
- primary nid: 2586@kfi
Multi-Rail: True
peer state: 273
peer ni:
- nid: 2586@kfi
udsp info:
net priority: -1
nid priority: -1
state: NA
max_ni_tx_credits: 128
available_tx_credits: 128
min_tx_credits: 127
tx_q_num_of_buf: 0
available_rtr_credits: 128
min_rtr_credits: 128
refcount: 1
statistics:
send_count: 51
recv_count: 51
drop_count: 0
sent_stats:
put: 47
get: 4
reply: 0
ack: 0
hello: 0
received_stats:
put: 46
get: 0
reply: 4
ack: 1
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
ping_count: 0
next_ping: 0
- nid: 2650@kfi
udsp info:
net priority: -1
nid priority: -1
state: NA
max_ni_tx_credits: 128
available_tx_credits: 128
min_tx_credits: 127
tx_q_num_of_buf: 0
available_rtr_credits: 128
min_rtr_credits: 128
refcount: 1
statistics:
send_count: 49
recv_count: 48
drop_count: 0
sent_stats:
put: 47
get: 2
reply: 0
ack: 0
hello: 0
received_stats:
put: 45
get: 0
reply: 2
ack: 1
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
ping_count: 0
next_ping: 0
- nid: 2651@kfi
udsp info:
net priority: -1
nid priority: -1
state: NA
max_ni_tx_credits: 128
available_tx_credits: 128
min_tx_credits: 127
tx_q_num_of_buf: 0
available_rtr_credits: 128
min_rtr_credits: 128
refcount: 1
statistics:
send_count: 49
recv_count: 3
drop_count: 0
sent_stats:
put: 46
get: 3
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 3
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
ping_count: 0
next_ping: 0
- nid: 2696@kfi
udsp info:
net priority: -1
nid priority: -1
state: NA
max_ni_tx_credits: 128
available_tx_credits: 128
min_tx_credits: 127
tx_q_num_of_buf: 0
available_rtr_credits: 128
min_rtr_credits: 128
refcount: 1
statistics:
send_count: 49
recv_count: 3
drop_count: 0
sent_stats:
put: 46
get: 3
reply: 0
ack: 0
hello: 0
received_stats:
put: 0
get: 0
reply: 3
ack: 0
hello: 0
dropped_stats:
put: 0
get: 0
reply: 0
ack: 0
hello: 0
health stats:
health value: 1000
dropped: 0
timeout: 0
error: 0
network timeout: 0
ping_count: 0
next_ping: 0
Replace Existing Lustre Client Installation
On Ubuntu 22.04:
- Show Lustre filesystem mounts
root@o186i221:~# mount -t lustre
172.22.184.42@o2ib:172.22.184.43@o2ib:/seagate on /cstor type lustre (rw,checksum,flock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt)
172.22.187.183@o2ib,172.22.187.184@o2ib:172.22.187.185@o2ib,172.22.187.186@o2ib:/cstor1 on /mnt/cstor1 type lustre (rw,checksum,flock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt)
- Unmount Lustre filesystems
root@o186i221:~# umount -t lustre /mnt/cstor1
umount: /mnt/cstor1: target is busy.
- Looks like something is using the filesystem. Find the PIDs of the processes using it and kill them:
root@o186i221:~# lsof /mnt/cstor1
COMMAND     PID USER FD  TYPE     DEVICE SIZE/OFF               NODE NAME
tmux:\x20 46615 root cwd  DIR 778,293024     4096 144116096797005002 /mnt/cstor1/ssamar/flash/results/resnet50/host4_run1
bash      46616 root cwd  DIR 778,293024     4096 144116077788400128 /mnt/cstor1/ssamar/flash/resnet50/HPE/benchmarks/resnet/implementations/mxnet
bash      48218 root cwd  DIR 778,293024     4096 144116077788400128 /mnt/cstor1/ssamar/flash/resnet50/HPE/benchmarks/resnet/implementations/mxnet
bash      48325 root cwd  DIR 778,293024     4096 144116077788400128 /mnt/cstor1/ssamar/flash/resnet50/HPE/benchmarks/resnet/implementations/mxnet
root@o186i221:~# kill 46615 46616 48218 48325
- Now, retry the unmount operation and double-check that no more Lustre filesystems are mounted (the package swap itself is sketched after this list).
root@o186i221:~# umount -t lustre /mnt/cstor1
root@o186i221:~# umount -t lustre /cstor
root@o186i221:~# mount -t lustre
root@o186i221:~#
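With all Lustre mounts gone, the old client packages can be removed and replaced; a minimal sketch, assuming Whamcloud-style Debian package names (substitute whatever dpkg -l | grep -i lustre actually reports on your node):
# See what is currently installed
dpkg -l | grep -i lustre
# Remove the old client packages (names are an assumption; use the ones listed above)
apt-get remove --purge lustre-client-utils lustre-client-modules-$(uname -r)
# Install the replacement client from the new repo
apt-get update
apt-get install lustre-client-utils lustre-client-modules-$(uname -r)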
Persistent Client Cache (PCC)
PCC Prerequisites
Make sure you have Lustre client modules installed and LNET is up and running.
lnetctl lnet configure
lnetctl net add --net o2ib --if ib0
lctl network up
Make sure you have the Lustre filesystem mounted
mount -t lustre 192.168.0.101@o2ib:/lustre /mnt/lustre
PCC Installation
Create a clean ext4 partition on an NVMe drive. This is where the PCC cache will live.
Here, I'm using fdisk /dev/nvme1n1 to create a new partition spanning the full size of the disk.
[root@mawenzi-07 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 2.1G 0 rom
nvme1n1 259:0 0 1.5T 0 disk
└─nvme1n1p1 259:9 0 1.5T 0 part
nvme0n1 259:1 0 1.5T 0 disk
├─nvme0n1p1 259:2 0 600M 0 part /boot/efi
├─nvme0n1p2 259:3 0 1G 0 part /boot
└─nvme0n1p3 259:4 0 74G 0 part
├─rl-root 253:0 0 70G 0 lvm /
└─rl-swap 253:1 0 4G 0 lvm [SWAP]
nvme2n1 259:5 0 1.5T 0 disk
nvme3n1 259:6 0 1.5T 0 disk
nvme4n1 259:7 0 1.5T 0 disk
Then, make an ext4 filesystem on that partition:
[root@mawenzi-07 ~]# mkfs -t ext4 /dev/nvme1n1p1
mke2fs 1.45.6 (20-Mar-2020)
Discarding device blocks: done
Creating filesystem with 390703190 4k blocks and 97681408 inodes
Filesystem UUID: 792ae761-b8cb-4e60-91e4-ab991b3a9f0b
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
Mount the partition to /mnt/pcc
[root@mawenzi-07 ~]# mount -t ext4 /dev/nvme1n1p1 /mnt/pcc
Launch a new Hierarchical Storage Manager (HSM) copytool daemon with the serially next-available archive ID. In this case, we already have two other PCC clients, so we need to use an ID of 3.
lhsmtool_posix --daemon --hsm-root /mnt/pcc --archive=3 /mnt/lustre < /dev/null > /tmp/copytool_log 2>&1
Use lctl to add the /mnt/pcc PCC backend to the client. Here we specify a parameter list using --param:
- uid={0} means auto-cache anything written by the root user.
- rwid=3 means use the archive with ID 3, which is what we just created using lhsmtool_posix.
lctl pcc add /mnt/lustre /mnt/pcc --param "uid={0} rwid=3"
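On recent Lustre releases you can confirm the backend was registered with lctl pcc list:
lctl pcc list /mnt/lustre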
Now, test PCC by creating a new file with some junk text:
[root@mawenzi-07 ~]# echo "QQQQQ" > /mnt/lustre/test2
[root@mawenzi-07 ~]# lfs pcc state /mnt/lustre/test2
file: /mnt/lustre/test2, type: readwrite, PCC file: /0002/0000/13aa/0000/0002/0000/0x2000013aa:0x2:0x0, user number: 0, flags: 0
You can view the PCC file by looking under the PCC path /mnt/pcc:
[root@mawenzi-07 ~]# xxd /mnt/pcc/0002/0000/13aa/0000/0002/0000/0x2000013aa\:0x2\:0x0
00000000: 5151 5151 510a QQQQQ.
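For cleanup, a cached file can be detached back to Lustre, or the whole backend removed from the client (a sketch of the teardown path):
lfs pcc detach /mnt/lustre/test2    # drop the cached copy for this file
lctl pcc del /mnt/lustre /mnt/pcc   # remove the /mnt/pcc backend from this client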
Client Benchmarks
Preliminary experimental benchmarks involving both PCC and non-PCC Lustre clients by Abhinav Vemulapalli:
Talk by John Fragalla regarding Lustre benchmarking:
Non-PCC Benchmarks
dd
See the dd documentation for a better overview of this tool.
Create a script dd_benchmark.sh with the following contents:
#!/bin/bash
for aa in {1..5}; do
    dd if=/dev/zero of=/mnt/lustre/file$aa bs=4k iflag=fullblock,count_bytes count=50G
    rm -f /mnt/lustre/file$aa
done
This copies 50 GiB of zeros to /mnt/lustre/fileX in 4k blocks.
Running this should produce the following:
[root@mawenzi-06 ~]# ./dd_benchmark.sh
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 118.528 s, 453 MB/s
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 146.544 s, 366 MB/s
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 125.689 s, 427 MB/s
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 138.86 s, 387 MB/s
13107200+0 records in
13107200+0 records out
53687091200 bytes (54 GB, 50 GiB) copied, 136.06 s, 395 MB/s
fio
fio --name benchmark1 --filename=/lus/aiholus1/disk/ccarlson/testfile --rw=read --size=128g --blocksize=1024k --ioengine=libaio --direct=1 --numjobs=1
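A matching write pass is the same command with --rw=write (a hypothetical variant of the read test above, against the same test file):
fio --name benchmark1-write --filename=/lus/aiholus1/disk/ccarlson/testfile \
    --rw=write --size=128g --blocksize=1024k --ioengine=libaio --direct=1 --numjobs=1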
IOzone
/opt/iozone/bin/iozone -Ra -g 150G -b pcc-iozone-output.wks -i 0 -f /mnt/lustre/iozone-benchmarking
Lustre IOR
/usr/lib64/openmpi/bin/mpirun --allow-run-as-root -n 8 /usr/local/bin/ior -v -t 1m -b 32g -o /mnt/lustre/test.`date +"%Y%m%d.%H%M%S"` -F -C -e
IOR options
- -t: Transfer size
- -v: Verbose
- -b: Block size (how big each file is that gets created)
- -o: Output file name/path
- -F: File-per-process, instead of a single shared file
- -C: Reorder tasks so each MPI process reads the data written by its neighboring node, defeating client-side read caching
- -e: Issue an fsync() call immediately after all of the write()s return, to force the dirty pages we just wrote to flush out to Lustre
Lustre Client Tunings
Here’s a script to tune a Lustre client for an E1000 filesystem.
#!/bin/bash
# mdc: metadata client
# osc: object storage client
# Disable checksums on mdc and osc
lctl set_param osc.cstor1*.checksums=0
lctl set_param mdc.cstor1*.checksums=0
# Increase RPCs in flight limit for mdc and osc
lctl set_param osc.cstor1*.max_rpcs_in_flight=256
lctl set_param mdc.cstor1*.max_rpcs_in_flight=256
# Enable 16MB RPCs for osc, and 1MB RPCs for mdc
lctl set_param osc.cstor1*.max_pages_per_rpc=4096
lctl set_param mdc.cstor1*.max_pages_per_rpc=256
# Set ~2GB limit on max dirty (unwritten) data for osc and mdc
lctl set_param osc.cstor1*.max_dirty_mb=2000
lctl set_param mdc.cstor1*.max_dirty_mb=2000
# Set read-ahead tunings
lctl set_param llite.cstor1*.max_read_ahead_mb=512
lctl set_param llite.cstor1*.max_read_ahead_per_file_mb=512
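These set_param calls only last until the client unmounts or reboots. To make them persistent across the whole filesystem, they can instead be set with -P on the MGS node (a sketch showing two of the parameters; the rest follow the same pattern):
# Run on the MGS node; -P stores the setting persistently and
# pushes it to all current and future clients of the filesystem.
lctl set_param -P osc.cstor1*.max_rpcs_in_flight=256
lctl set_param -P mdc.cstor1*.max_rpcs_in_flight=256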
To see the current client tunings:
#!/bin/bash
# mdc: metadata client
# osc: object storage client
# Get checksums on mdc and osc
lctl get_param osc.cstor1*.checksums
lctl get_param mdc.cstor1*.checksums
# Get RPCs in flight limit for mdc and osc
lctl get_param osc.cstor1*.max_rpcs_in_flight
lctl get_param mdc.cstor1*.max_rpcs_in_flight
# Get max pages per RPC for osc and mdc
lctl get_param osc.cstor1*.max_pages_per_rpc
lctl get_param mdc.cstor1*.max_pages_per_rpc
# Get limit on max dirty data for osc and mdc
lctl get_param osc.cstor1*.max_dirty_mb
lctl get_param mdc.cstor1*.max_dirty_mb
# Get read-ahead tunings
lctl get_param llite.cstor1*.max_read_ahead_mb
lctl get_param llite.cstor1*.max_read_ahead_per_file_mb