GPUDirect Storage (GDS)

This document provides notes for installing and using Nvidia’s GPUDirect Storage.

Nvidia / CUDA Installation with GDS

To use NVIDIA CUDA on your system, you will need the CUDA Toolkit installed (available at https://developer.nvidia.com/cuda-downloads).

It’s easiest to install the CUDA toolkit and drivers via apt, along with the nvidia-fs and nvidia-gds packages, which provide GDS.

If you’re using InfiniBand, you need to have Mellanox OFED (MOFED) installed prior to installing GDS.
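
A quick way to confirm MOFED is present before you proceed (assuming a standard MLNX_OFED install, which ships the ofed_info utility that gdscheck.py also reports on later):

# Print the installed MOFED version
ofed_info -s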

Ubuntu 22.04

#!/bin/bash

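# Pinned package set (driver 535.104.05, GDS/CUDA 12.2) known to work together;
# adjust the version pins to match what your configured repository provides.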
apt-get install \
	cuda-drivers-535="535.104.05-1" \
	cuda-drivers-fabricmanager-535="535.104.05-1" \
	cuda-drivers-fabricmanager="535.104.05-1" \
	libnvidia-cfg1-535="535.104.05-0ubuntu1" \
	libnvidia-common-535="535.104.05-0ubuntu1" \
	libnvidia-compute-535="535.104.05-0ubuntu1" \
	libnvidia-container-tools="1.14.1-1" \
	libnvidia-container1="1.14.1-1" \
	libnvidia-decode-535="535.104.05-0ubuntu1" \
	libnvidia-encode-535="535.104.05-0ubuntu1" \
	libnvidia-extra-535="535.104.05-0ubuntu1" \
	libnvidia-fbc1-535="535.104.05-0ubuntu1" \
	libnvidia-gl-535="535.104.05-0ubuntu1" \
	nvidia-compute-utils-535="535.104.05-0ubuntu1" \
	nvidia-container-toolkit-base="1.14.1-1" \
	nvidia-container-toolkit="1.14.1-1" \
	nvidia-dkms-535="535.104.05-0ubuntu1" \
	nvidia-docker2="2.13.0-1" \
	nvidia-driver-535="535.104.05-0ubuntu1" \
	nvidia-fabricmanager-535="535.104.05-1" \
	nvidia-fs-dkms="2.17.5-1" \
	nvidia-fs="2.17.5-1" \
	nvidia-gds-12-2="12.2.2-1" \
	nvidia-gds="12.2.2-1" \
	nvidia-kernel-common-535="535.104.05-0ubuntu1" \
	nvidia-kernel-source-535="535.104.05-0ubuntu1" \
	nvidia-modprobe="535.104.05-0ubuntu1" \
	nvidia-prime="0.8.17.1" \
	nvidia-settings="535.104.05-0ubuntu1" \
	nvidia-utils-535="535.104.05-0ubuntu1" \
	xserver-xorg-video-nvidia-535="535.104.05-0ubuntu1"
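
Once that completes, a quick check that the GDS-related packages actually landed (a plain dpkg query, nothing GDS-specific assumed):

# Confirm nvidia-fs and nvidia-gds are installed
dpkg -l | grep -E "nvidia-gds|nvidia-fs"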

Installing Open-Source Nvidia Drivers

For Ubuntu 22.04, follow the NVIDIA CUDA Toolkit Installation Guide.

Base installer installation instructions for CUDA 12.3:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-ubuntu2204-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3

NVIDIA driver installation instructions:

sudo apt-get install -y nvidia-kernel-open-545
sudo apt-get install -y cuda-drivers-545
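
After installing (and rebooting if a previous driver was loaded), a couple of standard checks confirm the driver is active and that the open kernel module is in use, which gdscheck also reports later under "Nvidia Driver Info Status":

# Confirm the driver loads and reports the expected version
nvidia-smi
# Confirm the open GPU kernel module is the one loaded
cat /proc/driver/nvidia/version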

RHEL 8.6

You’ll need the appropriate CUDA toolkit and NVIDIA/CUDA drivers installed as a prerequisite.

Below is a list of all the Nvidia-related bits installed on a working GDS-enabled Rocky Linux 8.6 system:

[carlsonc@aixl675dn04 ~]$ dnf list --installed "*nvidia*" "*cuda*" "*gds*"
Installed Packages
cuda-crt-12-3.x86_64                                                             12.3.103-1                                                    @cuda-rhel8-x86_64
cuda-driver-devel-12-1.x86_64                                                    12.1.105-1                                                    @cuda-rhel8-12-1-local
cuda-driver-devel-12-3.x86_64                                                    12.3.101-1                                                    @cuda-rhel8-x86_64
cuda-drivers.x86_64                                                              545.23.08-1                                                   @cuda-rhel8-x86_64
cuda-nvcc-12-3.x86_64                                                            12.3.103-1                                                    @cuda-rhel8-x86_64
cuda-nvvm-12-3.x86_64                                                            12.3.103-1                                                    @cuda-rhel8-x86_64
cuda-repo-rhel8-12-1-local.x86_64                                                12.1.1_530.30.02-1                                            @System
cuda-toolkit-12-3-config-common.noarch                                           12.3.101-1                                                    @cuda-rhel8-x86_64
cuda-toolkit-12-config-common.noarch                                             12.1.105-1                                                    @cuda-rhel8-12-1-local
cuda-toolkit-config-common.noarch                                                12.1.105-1                                                    @cuda-rhel8-12-1-local
dnf-plugin-nvidia.noarch                                                         2.0-1.el8                                                     @nvidia_cuda-rhel8-11.8.0_13
gds-partners.x86_64                                                              1.8.0-1                                                       @System
gds-tools-12-3.x86_64                                                            1.8.1.2-1                                                     @cuda-rhel8-x86_64
kmod-nvidia-latest-dkms.x86_64                                                   3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
libnvidia-container-tools.x86_64                                                 1.14.3-1                                                      @cuda-rhel8-x86_64
libnvidia-container1.x86_64                                                      1.14.3-1                                                      @cuda-rhel8-x86_64
nvidia-container-toolkit.x86_64                                                  1.14.3-1                                                      @cuda-rhel8-x86_64
nvidia-container-toolkit-base.x86_64                                             1.14.3-1                                                      @cuda-rhel8-x86_64
nvidia-driver.x86_64                                                             3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-driver-NVML.x86_64                                                        3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-driver-NvFBCOpenGL.x86_64                                                 3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-driver-cuda.x86_64                                                        3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-driver-cuda-libs.x86_64                                                   3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-driver-devel.x86_64                                                       3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-driver-libs.x86_64                                                        3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-fabric-manager.x86_64                                                     545.23.08-1                                                   @cuda-rhel8-x86_64
nvidia-fs.x86_64                                                                 2.18.3-1                                                      @cuda-rhel8-x86_64
nvidia-fs-dkms.x86_64                                                            2.18.3-1                                                      @cuda-rhel8-x86_64
nvidia-gds.x86_64                                                                12.3.1-1                                                      @cuda-rhel8-x86_64
nvidia-gds-12-3.x86_64                                                           12.3.1-1                                                      @cuda-rhel8-x86_64
nvidia-kmod-common.noarch                                                        3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-libXNVCtrl.x86_64                                                         3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-libXNVCtrl-devel.x86_64                                                   3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-modprobe.x86_64                                                           3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-persistenced.x86_64                                                       3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-settings.x86_64                                                           3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia-xconfig.x86_64                                                            3:545.23.08-1.el8                                             @cuda-rhel8-x86_64
nvidia_peer_memory.x86_64                                                        1.1-0                                                         @@System

NOTE: This might be more than the minimum you need to get GDS up and running, but we know for certain that things work with all this installed.

This is what’s in the cuda-rhel8-x86_64 repofile:

[carlsonc@aixl675dn04 ~]$ cat /etc/yum.repos.d/cuda-rhel8.repo
[cuda-rhel8-x86_64]
name=cuda-rhel8-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/D42D0685.pub
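
Starting from that repo, a minimal sketch of how you might pull in a comparable set of packages with dnf (package names are taken from the listing above; your exact versions and extras will differ, and depending on your repo setup the driver may come from an nvidia-driver module stream instead):

# Install the driver stack, fabric manager, nvidia-fs, and GDS from the CUDA repo
sudo dnf install -y nvidia-driver nvidia-driver-cuda nvidia-fabric-manager \
    nvidia-fs nvidia-gds-12-3 gds-tools-12-3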

Verifying Installation

Verify your GDS installation works on your platform with Lustre by running /usr/local/cuda-<version>/gds/tools/gdscheck -p:

Example
ccarlson@o186i225:~$ /usr/local/cuda-12.4/gds/tools/gdscheck -p
 GDS release version: 1.9.0.20
 nvidia_fs version:  2.19 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 CUFILE_ENV_PATH_JSON : /home/ccarlson/gds/cufile.json
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Supported
 IBM Spectrum Scale : Unsupported
 NFS                : Supported
 BeeGFS             : Unsupported
 WekaFS             : Supported
 Userspace RDMA     : Supported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Loaded (libcufile_rdma.so)
 --rdma devices        : Configured
 --rdma_device_status  : Up: 4 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : false
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4096
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobinMaxMin
 properties.rdma_dynamic_routing : 1
 properties.rdma_dynamic_routing_order : GPU_MEM_NVLINKS GPU_MEM SYS_MEM P2P
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.lustre.mount_table :
 /e1000  dev_id 1199770272 : 172.22.186.225 172.22.194.225 172.22.202.225 172.22.210.225
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 2 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 3 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 4 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 5 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 6 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 7 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12040
 Platform: ProLiant XL675d Gen10 Plus, Arch: x86_64(Linux 5.15.0-100-generic)
 Platform verification succeeded

  • Note the "DDN EXAScaler : Supported" line: this indicates Lustre is supported.

You can check the GDS filesystem support by running /usr/local/cuda-<version>/gds/tools/gdscheck.py -V:

Example
FILESYSTEM VERSION CHECK:
Pre-requisite:
nvidia_peermem is loaded as required
nvme module is loaded
nvme module is not patched or not loaded
nvme-rdma module is not loaded
ScaleFlux module is not loaded
NVMesh module is not loaded
Lustre module is loaded
Lustre module is correctly patched
BeeGFS module is loaded
BeeGFS module is not patched or not loaded
GPFS module is not loaded
rpcrdma module is loaded
rpcrdma module is correctly patched
Lustre:
current version: 2.15.4 (Supported)
min version supported: 2.12.3_ddn28
ofed_info:
current version: MLNX_OFED_LINUX-24.01-0.3.3.1: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
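
If either check looks off, it is worth confirming that the nvidia-fs kernel module itself is loaded (a quick sanity check; the /proc path below assumes the standard nvidia-fs driver):

# Check that the nvidia_fs module is loaded
lsmod | grep nvidia_fs
# nvidia-fs exposes runtime statistics under /proc when loaded
cat /proc/driver/nvidia-fs/stats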

Removing CUDA Toolkit and Drivers
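
There is no GDS-specific uninstall step beyond removing the packages. For the Ubuntu install above, a minimal sketch along the lines of NVIDIA’s documented package-removal procedure (double-check the wildcards against what is actually installed before running anything):

# Remove the CUDA toolkit, GDS, and related libraries
sudo apt-get --purge remove "*cuda*" "*cufile*" "*gds-tools*"
# Remove the NVIDIA drivers and supporting packages
sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
# Clean up anything left behind
sudo apt-get autoremove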

Lustre Filesystem Benchmarks

We’ll be benchmarking GDS against a Lustre filesystem over RDMA.

Make sure you have tuned your Lustre client before starting experiments; otherwise performance will suffer.
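
As an illustration of what "tuned" typically means for large sequential I/O, these are the standard client-side lctl tunables people adjust; the values shown are placeholders rather than recommendations, so use whatever your Lustre vendor suggests for your hardware:

# Allow more concurrent RPCs per OST and more dirty data per OSC (example values, run as root)
lctl set_param osc.*.max_rpcs_in_flight=16
lctl set_param osc.*.max_dirty_mb=512
# Increase client read-ahead for large sequential reads (example value)
lctl set_param llite.*.max_read_ahead_mb=1024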

Configuration

Lustre Filesystem

The Lustre filesystem is served over four InfiniBand HDR Host Channel Adapters (HCAs):

ccarlson@o186i225:~$ mount -t lustre
172.22.184.42@o2ib:172.22.184.43@o2ib:/seagate on /cstor type lustre (rw,nochecksum,flock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt)
172.22.187.183@o2ib,172.22.187.184@o2ib:172.22.187.185@o2ib,172.22.187.186@o2ib:/cstor1 on /e1000 type lustre (rw,nochecksum,noflock,nouser_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,encrypt)

The filesystem server layout is as follows:

  • There are two MDTs, both flash-based

  • There are two OSTs that are flash-based; the remaining OSTs are disk-based.

ccarlson@o186i225:~$ lfs df /e1000
UUID                   1K-blocks        Used   Available Use% Mounted on
cstor1-MDT0000_UUID  12680735564     4607868 12505326916   1% /e1000[MDT:0]
cstor1-MDT0001_UUID  12680735564      424632 12509510152   1% /e1000[MDT:1]
cstor1-OST0000_UUID  65092016776 43759881420 20675704416  68% /e1000[OST:0]
cstor1-OST0001_UUID  65092016776 44857958320 19577627516  70% /e1000[OST:1]
cstor1-OST0002_UUID  624268859304 23013721776 594959810400   4% /e1000[OST:2]
cstor1-OST0003_UUID  624268859304 23178814712 594794717464   4% /e1000[OST:3]
cstor1-OST0004_UUID  624268859304 22864999736 595108532440   4% /e1000[OST:4]
cstor1-OST0005_UUID  624268859304 23018661112 594954871064   4% /e1000[OST:5]
cstor1-OST0006_UUID  624268859304 24916151264 593057380912   5% /e1000[OST:6]
cstor1-OST0007_UUID  624268859304 25205560176 592767972000   5% /e1000[OST:7]
cstor1-OST0008_UUID  624268859304 24378180764 593595351412   4% /e1000[OST:8]
cstor1-OST0009_UUID  624268859304 24641586624 593331945552   4% /e1000[OST:9]
cstor1-OST000a_UUID  32551608660  9894559412 22328769268  31% /e1000[OST:10]
cstor1-OST000b_UUID  32551608660  9907463744 22315864936  31% /e1000[OST:11]

filesystem_summary:  5189438125304 299637539060 4837468547380   6% /e1000

We’ve got a directory, /e1000/ccarlson, whose benchmark subdirectories (fio, ior, gds) have their stripe set to the flash pool:

ccarlson@o186i225:~$ lfs getstripe /e1000/ccarlson
/e1000/ccarlson
stripe_count:  1 stripe_size:   1048576 pattern:       0 stripe_offset: -1

/e1000/ccarlson/fio
stripe_count:  1 stripe_size:   1048576 pattern:       raid0 stripe_offset: -1 pool:          flash

/e1000/ccarlson/ior
stripe_count:  1 stripe_size:   1048576 pattern:       raid0 stripe_offset: -1 pool:          flash

/e1000/ccarlson/gds
stripe_count:  1 stripe_size:   1048576 pattern:       raid0 stripe_offset: -1 pool:          flash

You can set the stripe to flash on a directory using the following command:

lfs setstripe -c 1 -p cstor1.flash /e1000/ccarlson/fio

Here cstor1 is the filesystem name, flash is the OST pool, and /e1000/ccarlson/fio is the target directory.
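
If you’re not sure which pools are defined on the filesystem, they can be listed from any client that has the mount available:

# List the OST pools defined for the cstor1 filesystem
lfs pool_list cstor1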

cufile.json

We’ve defined a custom /home/ccarlson/gds/cufile.json, using the default /etc/cufile.json as a starting point, and populated it with the following parameters:

cufile.json
{
  // NOTE : Application can override custom configuration via export CUFILE_ENV_PATH_JSON=<filepath>
  // e.g : export CUFILE_ENV_PATH_JSON="/home/<xxx>/cufile.json"

  "logging": {
    // log directory, if not enabled will create log file under current working directory
    "dir": "/home/ccarlson/gds/cufile_logs",
    // NOTICE|ERROR|WARN|INFO|DEBUG|TRACE (in decreasing order of severity)
    "level": "INFO"
  },

  "profile": {
    // nvtx profiling on/off
    "nvtx": false,
    // cufile stats level(0-3)
    "cufile_stats": 0
  },

  "execution" : {
    // max number of workitems in the queue;
    "max_io_queue_depth": 128,
    // max number of host threads per gpu to spawn for parallel IO
    "max_io_threads" : 4,
    // enable support for parallel IO
    "parallel_io" : true,
    // minimum IO threshold before splitting the IO
    "min_io_threshold_size_kb" : 8192,
    // maximum parallelism for a single request
    "max_request_parallelism" : 4
  },

  "properties": {
    // max IO chunk size (parameter should be multiples of 64K) used by cuFileRead/Write internally per IO request
    "max_direct_io_size_kb" : 16384,
    // device memory size (parameter should be 4K aligned) for reserving bounce buffers for the entire GPU
    "max_device_cache_size_kb" : 131072,
    // limit on maximum device memory size (parameter should be 4K aligned) that can be pinned for a given process
    "max_device_pinned_mem_size_kb" : 33554432,
    // true or false (true will enable asynchronous io submission to nvidia-fs driver)
    // Note : currently the overall IO will still be synchronous
    "use_poll_mode" : false,
    // maximum IO request size (parameter should be 4K aligned) within or equal to which library will use polling for IO completion
    "poll_mode_max_size_kb": 4096,
    // allow compat mode, this will enable use of cuFile posix read/writes
    "allow_compat_mode": false,
    // enable GDS write support for RDMA based storage
    "gds_rdma_write_support": true,
    // GDS batch size
    "io_batchsize": 128,
    // enable io priority w.r.t compute streams
    // valid options are "default", "low", "med", "high"
    "io_priority": "high",
    // client-side rdma addr list for user-space file-systems(e.g ["10.0.1.0", "10.0.2.0"])
    "rdma_dev_addr_list": [ "172.22.186.225", "172.22.194.225", "172.22.202.225", "172.22.210.225" ],
    // load balancing policy for RDMA memory registration(MR), (RoundRobin, RoundRobinMaxMin)
    // In RoundRobin, MRs will be distributed uniformly across NICs closest to a GPU
    // In RoundRobinMaxMin, MRs will be distributed across NICs closest to a GPU
    // with minimal sharing of NICs across GPUs
    "rdma_load_balancing_policy": "RoundRobinMaxMin",
    //32-bit dc key value in hex
    //"rdma_dc_key": "0xffeeddcc",
    //To enable/disable different rdma OPs use the below bit map
    //Bit 0 - If set enables Local RDMA WRITE
    //Bit 1 - If set enables Remote RDMA WRITE
    //Bit 2 - If set enables Remote RDMA READ
    //Bit 3 - If set enables REMOTE RDMA Atomics
    //Bit 4 - If set enables Relaxed ordering.
    //"rdma_access_mask": "0x1f",
    // In platforms where IO transfer to a GPU will cause cross Root Port PCIe transfers, enabling this feature
    // might help improve overall BW provided there exists a GPU(s) with Root Port common to that of the storage NIC(s).
    // If this feature is enabled, please provide the ip addresses used by the mount either in file-system specific
    // section for mount_table or in the rdma_dev_addr_list property in properties section
    "rdma_dynamic_routing": true,
    // The order describes the sequence in which a policy is selected for dynamic routing for cross Root Port transfers
    // If the first policy is not applicable, it will fallback to the next and so on.
    // policy GPU_MEM_NVLINKS: use GPU memory with NVLink to transfer data between GPUs
    // policy GPU_MEM: use GPU memory with PCIe to transfer data between GPUs
    // policy SYS_MEM: use system memory with PCIe to transfer data to GPU
    // policy P2P: use P2P PCIe to transfer directly between the NIC and GPU
    "rdma_dynamic_routing_order": [ "GPU_MEM_NVLINKS", "GPU_MEM", "SYS_MEM", "P2P" ]
  },

  "fs": {
    "generic": {
      // for unaligned writes, setting this to true makes cuFileWrite use posix write internally instead of a regular GDS write
      "posix_unaligned_writes" : false
    },
    "beegfs" : {},
    "lustre": {
      // IO threshold for read/write (param should be 4K aligned) equal to or below which cuFile will use posix read/write
      "posix_gds_min_kb" : 0,
      // To restrict the IO to selected IP list, when dynamic routing is enabled
      // if using a single lustre mount, provide the ip addresses here (use : sudo lnetctl net show)
      // if using multiple lustre mounts, provide ip addresses used by respective mount here
      "mount_table" : {
        "/e1000" : {
	  "rdma_dev_addr_list": ["172.22.186.225", "172.22.194.225", "172.22.202.225", "172.22.210.225"]
	}
      }
    },
    "nfs": {},
    "gpfs": {},
    "weka": {}
  },

  "denylist": {
    // specify list of vendor driver modules to deny for nvidia-fs (e.g. ["nvme" , "nvme_rdma"])
    "drivers":  [],

    // specify list of block devices to prevent IO using cuFile (e.g. [ "/dev/nvme0n1" ])
    "devices": [],

    // specify list of mount points to prevent IO using cuFile (e.g. ["/mnt/test"])
    "mounts": [],

    // specify list of file-systems to prevent IO using cuFile (e.g ["lustre", "wekafs"])
    "filesystems": []
  },

  "miscellaneous": {
    // enable only for enforcing strict checks at API level for debugging
    "api_check_aggressive": false
  }
}

Some things to note about the above cufile.json:

  • We’ll be using this cuFile spec, overriding the default one, by exporting CUFILE_ENV_PATH_JSON="/home/ccarlson/gds/cufile.json" before our gdsio execution.

  • I’ve turned logging up to INFO and am writing logs to /home/ccarlson/gds/cufile_logs.
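
Before a run, it’s easy to confirm the custom config is actually being picked up: gdscheck prints the active CUFILE_ENV_PATH_JSON in its ENVIRONMENT section (as in the output earlier), so a quick check looks like this (the gdscheck path matches the benchmark script below):

# Point cuFile at the custom config and confirm gdscheck sees it
export CUFILE_ENV_PATH_JSON="/home/ccarlson/gds/cufile.json"
/usr/local/cuda-12.3/gds/tools/gdscheck -p | grep CUFILE_ENV_PATH_JSON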

GDSIO Jobfile

Next up, we have our write.gdsio job file:

write.gdsio
#
# sample config file gdsio.
# config file rules :
#   -provide a global section defined with [global]
#   -provide a job(s) must follow this signature [job-name-xxx]
#   -use newline to mark end of each section except last
#   -for comments, add # to the start of a line
#
[global]
name=gds-write
#0 - Storage->GPU (GDS)
#1 - Storage->CPU
#2 - Storage->CPU->GPU
#3 - Storage->CPU->GPU_ASYNC
#4 - Storage->PAGE_CACHE->CPU->GPU
#5 - Storage->GPU_ASYNC
#6 - Storage->GPU_BATCH
#7 - Storage->GPU_BATCH_STREAM
xfer_type=0
#IO type; one of rw=read, rw=write, rw=randread, rw=randwrite
rw=write
#block size, for variable block size can specify range e.g. bs=1M:4M:1M, (1M : start block size, 4M : end block size, 1M :steps in which size is varied)
bs=4M
#file-size
size=4G
#secs
runtime=60
#use 1 for enabling verification
do_verify=0
#skip cufile buffer registration, ignored in cpu mode
skip_bufregister=0
#set up NVlinks, recommended if p2p traffic is cross node
enable_nvlinks=1
#use random seed
random_seed=0
#fill request buffer with random data
fill_random=0
#refill io buffer after every write
refill_buffer=0
#use random offsets which are not page-aligned
unaligned_random=0
#file offset to start read/write from
start_offset=0
#alignment size for random IO
#alignment_size=64K


[job0]
#numa node
#numa_node=0
#gpu device index (check nvidia-smi)
gpu_dev_id=0
#For Xfer mode 6, num_threads will be used as batch_size
num_threads=$NUM_GDS_THREADS_PER_GPU
#enable either directory or filename or url
directory=/e1000/ccarlson/gds/o186i221/gpu0
#filename=/mnt/test0/gds-01
#rdma_url=sockfs://192.186.0.1:18515
#The following parameter can be used to specify per job start offset. If not defined global section's start offset would be used.
#start_offset=0
#The following parameter can be used to define the size of IO for this job. If not defined, the global size parameter would be used.
#For Xfer mode 6, this is per batch i.e. for 1MB size with a batch size of 4 would
#do 4 MB of I/O.
#size = 8M

[job1]
gpu_dev_id=1
num_threads=$NUM_GDS_THREADS_PER_GPU
directory=/e1000/ccarlson/gds/o186i221/gpu1

[job2]
gpu_dev_id=2
num_threads=$NUM_GDS_THREADS_PER_GPU
directory=/e1000/ccarlson/gds/o186i221/gpu2

[job3]
gpu_dev_id=3
num_threads=$NUM_GDS_THREADS_PER_GPU
directory=/e1000/ccarlson/gds/o186i221/gpu3

[job4]
gpu_dev_id=4
num_threads=$NUM_GDS_THREADS_PER_GPU
directory=/e1000/ccarlson/gds/o186i221/gpu4

[job5]
gpu_dev_id=5
num_threads=$NUM_GDS_THREADS_PER_GPU
directory=/e1000/ccarlson/gds/o186i221/gpu5

[job6]
gpu_dev_id=6
num_threads=$NUM_GDS_THREADS_PER_GPU
directory=/e1000/ccarlson/gds/o186i221/gpu6

[job7]
gpu_dev_id=7
num_threads=$NUM_GDS_THREADS_PER_GPU
directory=/e1000/ccarlson/gds/o186i221/gpu7

Here we have:

  • a [global] section which defines key/val pairs for all jobs.

  • a [job] section for each GPU; all jobs are run in parallel.

    • Each job inherits the values in the [global] section

    • Each job specifies which GPU (by index) it will run on and how many threads to use.

    • Each job writes to its own directory, so as not to run into the 256-files-per-directory limit on this Lustre setup (the per-GPU directories can be created up front, as sketched below).
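
As a convenience, the per-GPU output directories referenced in the job file can be created in one go (a hypothetical helper; the paths match write.gdsio above, so adjust the node name and GPU count for your system):

# Create one output directory per GPU for node o186i221
for i in $(seq 0 7); do
  mkdir -p /e1000/ccarlson/gds/o186i221/gpu${i}
done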

Benchmark Entrypoint Script

Lastly, we have a benchmark.sh shell script which kicks off gdsio using the write.gdsio job file:

benchmark.sh
#!/bin/bash

# Benchmark script to run gdsio using a .gdsio job file
[[ $# -ne 1 ]] && echo "Usage: ./benchmark.sh write.gdsio" && exit 1

# Use a custom cufile.json instead of the default /etc/cufile.json
export CUFILE_ENV_PATH_JSON="/home/ccarlson/gds/cufile.json"

GDSIO="/usr/local/cuda-12.3/gds/tools/gdsio"
JOBFILE=$1
JOB_NAME=${JOBFILE%.gdsio}  # remove .gdsio suffix
THREADS_ARRAY=(8)
RESULTS_DIR="single_node_results"
HOSTNAME=$(hostname)
RESULTS_FILE="${RESULTS_DIR}/${HOSTNAME}_${JOB_NAME}_results.out"

mkdir -p "$RESULTS_DIR"  # make sure the results directory exists before writing to it
rm -f "$RESULTS_FILE"
touch "$RESULTS_FILE"
for THREADS in ${THREADS_ARRAY[@]}; do
  echo "Results for $THREADS threads per GPU:" >> $RESULTS_FILE
  export NUM_GDS_THREADS_PER_GPU=$THREADS
  $GDSIO $JOBFILE >> $RESULTS_FILE
done

All in all, this is what the directory structure looks like for my benchmarks:

ccarlson@o186i225:~/gds$ tree .
.
├── benchmark.sh
├── cufile.json
├── cufile_logs
│   ├── cufile_116312_2023-12-14.09:19:31.log
│   ├── cufile_117028_2023-12-14.09:27:48.log
│   ├── cufile_117227_2023-12-14.09:29:42.log
│   ├── cufile_118851_2023-12-14.10:22:11.log
│   └── cufile_121697_2023-12-14.11:47:20.log
└── write.gdsio

1 directory, 8 files

Running Benchmarks

With the above configuration in place, we can run ./benchmark.sh write.gdsio:

ccarlson@o186i225:~/gds$ ./benchmark.sh write.gdsio
IoType: WRITE XferType: GPUD Threads: 64 DataSetSize: 261804032/134217728(KiB) IOSize: 1024(KiB) Throughput: 6.930582 GiB/sec, Avg_Latency: 9004.751902 usecs ops: 255668 total_time 36.025224 secs