Machine Learning Benchmarks

System Tuning

The binding configuration for the GPUs needs to be correct.

The goal is to minimize the latency of CPU-to-GPU transfers. Typically, each GPU is bound to the CPU cores in its closest NUMA domain. In our system only half of the PCIe lanes are populated, so the mapping has to be made based on core ranges.

See the current mapping using the following:

nvidia-smi topo -m

Example:

root@o186i221:~/ccarlson/experiments# nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	CPU Affinity	NUMA Affinity
GPU0	 X 	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PXB	SYS	SYS	SYS	SYS	48-63	3
GPU1	PXB	 X 	SYS	SYS	SYS	SYS	SYS	SYS	PXB	SYS	SYS	SYS	SYS	48-63	3
GPU2	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS	SYS	PXB	SYS	SYS	SYS	16-31	1
GPU3	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS	SYS	PXB	SYS	SYS	SYS	16-31	1
GPU4	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS	PXB	SYS	SYS	112-127	7
GPU5	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS	PXB	SYS	SYS	112-127	7
GPU6	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PXB	SYS	SYS	SYS	SYS	PXB	80-95	5
GPU7	SYS	SYS	SYS	SYS	SYS	SYS	PXB	 X 	SYS	SYS	SYS	SYS	PXB	80-95	5
NIC0	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS	SYS
NIC1	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS
NIC2	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	 X 	SYS	SYS
NIC3	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS
NIC4	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
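
Based on the CPU Affinity and NUMA Affinity columns above, a process that drives a given GPU can be pinned to that GPU's local cores and memory. A minimal sketch using numactl (core ranges and NUMA nodes are taken from the table above; the application name is a placeholder):

# GPU0/GPU1 are local to cores 48-63 and NUMA node 3
CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=48-63 --membind=3 <your_gpu_application>

# GPU2/GPU3 are local to cores 16-31 and NUMA node 1
CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=16-31 --membind=1 <your_gpu_application>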

When building the benchmark images with Docker, enable BuildKit by setting the DOCKER_BUILDKIT environment variable:

DOCKER_BUILDKIT=1 docker build

MLPerf

MLPerf™ benchmarks - developed by MLCommons, a consortium of AI leaders from academia, research labs, and industry - are designed to provide unbiased evaluations of training and inference performance for hardware, software, and services. They’re all conducted under prescribed conditions. To stay on the cutting edge of industry trends, MLPerf continues to evolve, holding new tests at regular intervals and adding new workloads that represent the state of the art in AI.

MLPerf Training

MLPerf Training presents three unique benchmarking challenges missing from other domains: optimizations that improve training throughput can increase time to solution; training is stochastic, so time to solution exhibits high variance; and software and hardware systems are diverse enough that they are hard to benchmark fairly with the same binary, code, or even hyperparameters. MLPerf Training tests eight different workloads across use cases such as computer vision, large language models, and recommenders.

While the repository mlcommons/training is intended to serve as a valid starting point for benchmark implementations, it is not meant to be used for real performance measurements of software frameworks and hardware. The following steps need to be executed in order to run a benchmark (a consolidated shell sketch follows the list). Speeds will vary based on the hardware being used.

  1. Clone the repo mlcommons/training to the host machine.

  2. Run the install_cuda_docker.sh script on the host machine. This ensures that CUDA and Docker are installed; the script also sets up the correct Docker dependencies.

  3. On the host machine, outside of Docker, run ./download_dataset.sh for the dataset used by the benchmark. The script should be run from within the benchmark's directory.

  4. Once the download completes, verify it by running ./verify_dataset.sh in the same manner.

  5. Build and run the Docker image. Each benchmark documents its own command for this.

  6. Once the target quality is reached, the benchmark stops and produces timing results.
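
The steps above, consolidated into a shell sketch (the benchmark directory name is a placeholder; each benchmark under mlcommons/training documents its exact download, verify, and Docker commands in its own README, and install_cuda_docker.sh is assumed to sit at the repository root):

# Clone the reference implementations
git clone https://github.com/mlcommons/training.git
cd training

# Install CUDA and Docker prerequisites on the host
./install_cuda_docker.sh

# Enter a benchmark directory (placeholder), then download and verify its dataset
cd <benchmark_directory>
./download_dataset.sh
./verify_dataset.sh

# Build and run the Docker image using the command documented in that benchmark's README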

MLPerf Storage

MLPerf Storage is a benchmark suite to characterize the performance of storage systems that support machine learning workloads.

The suite helps bridge the gap between storage and compute utilization, making sure both are used efficiently for ML workloads.

It measures the sustained performance a storage system can deliver for MLPerf Training and HPC workloads on both PyTorch and TensorFlow, without requiring expensive accelerators. The dlio_benchmark code is used to emulate the I/O patterns of deep learning workloads.

Accelerator Utilization

Accelerator Utilization (AU) and samples per second are the benchmark's output metrics for each workload. To calculate the AU, the total compute time and the total benchmark running time are needed. The total compute time is calculated by:

total_compute_time = (records/file * total_files)/simulated_accelerators/batch_size * sleep_time

The formula below calculates the AU:

AU (percentage) = (total_compute_time/total_benchmark_running_time) * 100
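
A quick numeric sketch of the two formulas above, using made-up values purely to show the arithmetic:

# All values below are hypothetical
records_per_file=1024
total_files=500
simulated_accelerators=8
batch_size=32
sleep_time=0.1                      # emulated per-batch compute time, seconds
total_benchmark_running_time=250    # measured wall-clock time, seconds

# total_compute_time = (records/file * total_files)/simulated_accelerators/batch_size * sleep_time
total_compute_time=$(echo "$records_per_file * $total_files / $simulated_accelerators / $batch_size * $sleep_time" | bc -l)
# AU (percentage) = (total_compute_time/total_benchmark_running_time) * 100
au=$(echo "$total_compute_time / $total_benchmark_running_time * 100" | bc -l)
echo "total_compute_time=${total_compute_time}s AU=${au}%"   # here: 200s and 80%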

MLPerf Storage Installation

First, the mpich package is needed for MPI support. Install it with:

sudo apt-get install mpich

Next, clone the MLCommons Storage repo.

git clone https://github.com/mlcommons/storage.git

The requirements.txt dependencies for dlio_benchmark need to be installed. Before that, set up a Python virtual environment:

python3 -m venv ~/myvenv
source ~/myvenv/bin/activate

Then install the requirements:

pip install -r dlio_benchmark/requirements.txt

To launch the dlio_benchmark, execute the benchmark.sh script:

./benchmark.sh -h

RESNET-50

The RESNET-50 neural network is a well-known image classification network that can be used with the ImageNet dataset. It is computationally intensive and a good indicator of whether meaningful storage I/O is being driven. To keep the training benchmark and the overall run time on track, the storage system has to keep up with the read bandwidth demands of a complete training job. The training benchmark is measured at Epoch 0, which is the most I/O-intensive portion of the MLPerf benchmark run. The RESNET-50 test therefore verifies that the storage system is not a bottleneck for the workload and can deliver the expected images/second at Epoch 0.

RESNET-50 Installation

In order to run the MLPerf RESNET-50 test, the Collective Mind automation language (CM), also known as the MLCommons CM language, needs to be installed. It is part of the MLCommons Collective Knowledge (CK) project and is powered by Python, JSON, YAML, and a unified CLI. More detailed information can be found in the Collective Mind documentation. The following installation steps assume the host machine is running Red Hat Enterprise Linux. Python 3+, pip, git, and wget need to be installed beforehand.

The following will install CM:

sudo dnf update
sudo dnf install python3 python3-pip git wget curl
python3 -m pip install --user cmind

Once CM is installed, the next step is to install the MLCommons CK repository with automation workflows for MLPerf:

cm pull repo mlcommons@ck

The following command pre-processes the dataset for a given backend, builds the loadgen, runs the inference for all scenarios and modes, and creates a submission folder with the test results.

cmr "run mlperf inference generate-run-cmds _submission" \
    --quiet --submitter="MLCommons" --hw_name=default --model=resnet50 --implementation=reference \
    --backend=onnxruntime --device=cpu --scenario=Offline --adr.compiler.tags=gcc  --target_qps=1 \
    --category=edge --division=open

The target_qps value needs to be updated according to the system's performance for a valid submission. The following options can be adjusted depending on the target run (the command after this list combines them into a closed-division datacenter run on a GPU):

  • Use --device=cuda to run the inference on an NVIDIA GPU

  • Use --division=closed to run all scenarios for the closed division (compliance tests are skipped for _find-performance mode)

  • Use --category=datacenter to run datacenter scenarios

  • Use --backend=tf or --backend=tvm-onnx to use the TensorFlow and TVM-ONNX backends, respectively
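
For example, combining the options above into a closed-division datacenter run on an NVIDIA GPU (target_qps remains a placeholder to be tuned for the system under test):

cmr "run mlperf inference generate-run-cmds _submission" \
    --quiet --submitter="MLCommons" --hw_name=default --model=resnet50 --implementation=reference \
    --backend=onnxruntime --device=cuda --scenario=Offline --adr.compiler.tags=gcc --target_qps=1 \
    --category=datacenter --division=closed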

Credit goes to Sakib Samar for a large portion of these notes.

NCCL Tests

These tests check both the performance and the correctness of NCCL operations (inter-GPU communication).

GitHub repositories with documentation: https://github.com/NVIDIA/nccl and https://github.com/NVIDIA/nccl-tests (both are cloned in the steps below).

Installation

First, you need NCCL installed on the nodes you want to include in the benchmark.

Go ahead and clone the nccl repo:

git clone https://github.com/NVIDIA/nccl.git

Change directory into the nccl repo and build it against your system:

cd nccl
make -j src.build
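
If CUDA is not installed in the default location, or to shorten the build by targeting only the A100 architecture, the NCCL makefile accepts CUDA_HOME and NVCC_GENCODE overrides (the path and gencode below are assumptions for this system; adjust to match your install):

make -j src.build CUDA_HOME=/usr/local/cuda-12.2 NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"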

Now, install it (here we’re using Ubuntu 22.04):

# Install tools to create debian packages
sudo apt install build-essential devscripts debhelper fakeroot
# Build NCCL deb package
make pkg.debian.build
ls build/pkg/deb/

Your .deb packages should now be in build/pkg/deb/. You can install them using:

apt install /root/ccarlson/nccl/build/pkg/deb/libnccl-dev_2.18.5-1+cuda12.2_amd64.deb /root/ccarlson/nccl/build/pkg/deb/libnccl2_2.18.5-1+cuda12.2_amd64.deb
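
To confirm the packages installed and the libraries are visible to the dynamic linker:

dpkg -l | grep nccl
ldconfig -p | grep libnccl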

Next, clone your nccl-tests repo:

git clone https://github.com/NVIDIA/nccl-tests.git

Now, change directory into it and build it against the MPI installation you have on the system:

cd nccl-tests
make MPI=1 MPI_HOME=/usr/mpi/gcc/openmpi-4.1.5rc2

You should now have nccl-tests binaries available under build/:

root@o186i221:~/ccarlson/nccl-tests# ls build/
all_gather_perf  alltoall_perf   gather_perf     reduce_perf          scatter_perf   timer.o
all_reduce_perf  broadcast_perf  hypercube_perf  reduce_scatter_perf  sendrecv_perf  verifiable

Usage

all_reduce_perf options:

USAGE: all_reduce_perf
	[-t,--nthreads <num threads>]
	[-g,--ngpus <gpus per thread>]
	[-b,--minbytes <min size in bytes>]
	[-e,--maxbytes <max size in bytes>]
	[-i,--stepbytes <increment size>]
	[-f,--stepfactor <increment factor>]
	[-n,--iters <iteration count>]
	[-m,--agg_iters <aggregated iteration count>]
	[-w,--warmup_iters <warmup iteration count>]
	[-p,--parallel_init <0/1>]
	[-c,--check <check iteration count>]
	[-o,--op <sum/prod/min/max/avg/mulsum/all>]
	[-d,--datatype <nccltype/all>]
	[-r,--root <root>]
	[-z,--blocking <0/1>]
	[-y,--stream_null <0/1>]
	[-T,--timeout <time in seconds>]
	[-G,--cudagraph <num graph launches>]
	[-C,--report_cputime <0/1>]
	[-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>]
	[-h,--help]

Running the NCCL tests on a single node is pretty straightforward:

/root/ccarlson/nccl-tests/build/all_reduce_perf --ngpus 8 --minbytes=128M --maxbytes=1024M --stepfactor=2 --nthreads=1 --iters=1

Running across multiple nodes should be done with MPI (mpirun):

mpirun --allow-run-as-root -np 2 --mca btl_tcp_if_include eth0 --machinefile machinefile.txt --map-by "node" /root/ccarlson/nccl-tests/build/all_reduce_perf --ngpus 8 --minbytes=128M --maxbytes=1024M --stepfactor=2 --nthreads=1 --iters=1

My machinefile for two nodes looks like:

o186i221 slots=64
o186i222 slots=64

This launches two tasks via MPI (-np 2), one per node, each of which runs the all_reduce_perf binary with the specified options.

We have to include the eth0 interface for MPI to communicate over; otherwise it will try to send TCP over the IB devices, which won't work.
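
If a multi-node run hangs or appears to pick the wrong fabric, NCCL's debug variables help show which interfaces and transports were selected. A hedged sketch (the interface and HCA names below are assumptions for this cluster; check ip addr and ibstat output, and pass the variables through mpirun with -x):

# Print NCCL's initialization and network transport decisions
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# Keep NCCL's bootstrap/socket traffic on the Ethernet interface
export NCCL_SOCKET_IFNAME=eth0
# Optionally restrict RDMA traffic to the GPU-local HCAs from the topology table
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_4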

Example output for multiple nodes:

/build/all_reduce_perf --ngpus 8 --minbytes=128M --maxbytes=1024M --stepfactor=2 --nthreads=1 --iters=1
# nThread 1 nGpus 8 minBytes 134217728 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 1 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 256912 on   o186i221 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 256912 on   o186i221 device  1 [0x0b] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 256912 on   o186i221 device  2 [0x48] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 256912 on   o186i221 device  3 [0x4c] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 256912 on   o186i221 device  4 [0x88] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 256912 on   o186i221 device  5 [0x8b] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 256912 on   o186i221 device  6 [0xc8] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 256912 on   o186i221 device  7 [0xcb] NVIDIA A100-SXM4-80GB
#  Rank  8 Group  0 Pid 540236 on   o186i222 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  9 Group  0 Pid 540236 on   o186i222 device  1 [0x0b] NVIDIA A100-SXM4-80GB
#  Rank 10 Group  0 Pid 540236 on   o186i222 device  2 [0x48] NVIDIA A100-SXM4-80GB
#  Rank 11 Group  0 Pid 540236 on   o186i222 device  3 [0x4c] NVIDIA A100-SXM4-80GB
#  Rank 12 Group  0 Pid 540236 on   o186i222 device  4 [0x88] NVIDIA A100-SXM4-80GB
#  Rank 13 Group  0 Pid 540236 on   o186i222 device  5 [0x8b] NVIDIA A100-SXM4-80GB
#  Rank 14 Group  0 Pid 540236 on   o186i222 device  6 [0xc8] NVIDIA A100-SXM4-80GB
#  Rank 15 Group  0 Pid 540236 on   o186i222 device  7 [0xcb] NVIDIA A100-SXM4-80GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   134217728      33554432     float     sum      -1   5619.6   23.88   44.78      0   5662.5   23.70   44.44      0
   268435456      67108864     float     sum      -1    10796   24.86   46.62      0    10836   24.77   46.45      0
   536870912     134217728     float     sum      -1    14128   38.00   71.25      0    13863   38.73   72.61      0
  1073741824     268435456     float     sum      -1    24419   43.97   82.45      0    24255   44.27   83.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 61.4514
#

GPU Direct Storage (GDS)

You can verify the installation by running:

/usr/local/cuda-12.2/gds/tools/gdscheck.py -p

Example output:

 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 2 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 3 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 4 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 5 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 6 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 7 NVIDIA A100-SXM4-80GB bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded

GDS Benchmarking

There are various parameters and settings that will factor into delivered performance. In addition to storage/filesystem-specific parameters, there are system settings and GDS-specific parameters defined in /etc/cufile.json. Default configuration file:

{
    // NOTE : Application can override custom configuration via export CUFILE_ENV_PATH_JSON=<filepath>
    // e.g : export CUFILE_ENV_PATH_JSON="/home/<xxx>/cufile.json"

            "logging": {
                            // log directory, if not enabled will create log file under current working directory
                            //"dir": "/home/<xxxx>",

                            // NOTICE|ERROR|WARN|INFO|DEBUG|TRACE (in decreasing order of severity)
                            "level": "ERROR"
            },

            "profile": {
                            // nvtx profiling on/off
                            "nvtx": false,
                            // cufile stats level(0-3)
                            "cufile_stats": 0
            },

            "execution" : {
                    // max number of workitems in the queue;
                    "max_io_queue_depth": 128,
                    // max number of host threads per gpu to spawn for parallel IO
                    "max_io_threads" : 4,
                    // enable support for parallel IO
                    "parallel_io" : true,
                    // minimum IO threshold before splitting the IO
                    "min_io_threshold_size_kb" : 8192,
		    // maximum parallelism for a single request
		    "max_request_parallelism" : 4
            },

            "properties": {
                            // max IO chunk size (parameter should be multiples of 64K) used by cuFileRead/Write internally per IO request
                            "max_direct_io_size_kb" : 16384,
                            // device memory size (parameter should be 4K aligned) for reserving bounce buffers for the entire GPU
                            "max_device_cache_size_kb" : 131072,
                            // limit on maximum device memory size (parameter should be 4K aligned) that can be pinned for a given process
                            "max_device_pinned_mem_size_kb" : 33554432,
                            // true or false (true will enable asynchronous io submission to nvidia-fs driver)
                            // Note : currently the overall IO will still be synchronous
                            "use_poll_mode" : false,
                            // maximum IO request size (parameter should be 4K aligned) within or equal to which library will use polling for IO completion
                            "poll_mode_max_size_kb": 4,
                            // allow compat mode, this will enable use of cuFile posix read/writes
                            "allow_compat_mode": true,
                            // enable GDS write support for RDMA based storage
                            "gds_rdma_write_support": true,
                            // GDS batch size
                            "io_batchsize": 128,
                            // enable io priority w.r.t compute streams
                            // valid options are "default", "low", "med", "high"
                            "io_priority": "default",
                            // client-side rdma addr list for user-space file-systems(e.g ["10.0.1.0", "10.0.2.0"])
                            "rdma_dev_addr_list": [ ],
                            // load balancing policy for RDMA memory registration(MR), (RoundRobin, RoundRobinMaxMin)
                            // In RoundRobin, MRs will be distributed uniformly across NICS closest to a GPU
                            // In RoundRobinMaxMin, MRs will be distributed across NICS closest to a GPU
                            // with minimal sharing of NICS acros GPUS
                            "rdma_load_balancing_policy": "RoundRobin",
			    //32-bit dc key value in hex
			    //"rdma_dc_key": "0xffeeddcc",
			    //To enable/disable different rdma OPs use the below bit map
			    //Bit 0 - If set enables Local RDMA WRITE
			    //Bit 1 - If set enables Remote RDMA WRITE
			    //Bit 2 - If set enables Remote RDMA READ
			    //Bit 3 - If set enables REMOTE RDMA Atomics
			    //Bit 4 - If set enables Relaxed ordering.
			    //"rdma_access_mask": "0x1f",

                            // In platforms where IO transfer to a GPU will cause cross RootPort PCie transfers, enabling this feature
                            // might help improve overall BW provided there exists a GPU(s) with Root Port common to that of the storage NIC(s).
                            // If this feature is enabled, please provide the ip addresses used by the mount either in file-system specific
                            // section for mount_table or in the rdma_dev_addr_list property in properties section
                            "rdma_dynamic_routing": false,
                            // The order describes the sequence in which a policy is selected for dynamic routing for cross Root Port transfers
                            // If the first policy is not applicable, it will fallback to the next and so on.
                            // policy GPU_MEM_NVLINKS: use GPU memory with NVLink to transfer data between GPUs
                            // policy GPU_MEM: use GPU memory with PCIe to transfer data between GPUs
                            // policy SYS_MEM: use system memory with PCIe to transfer data to GPU
                            // policy P2P: use P2P PCIe to transfer across between NIC and GPU
                            "rdma_dynamic_routing_order": [ "GPU_MEM_NVLINKS", "GPU_MEM", "SYS_MEM", "P2P" ]
            },

            "fs": {
                    "generic": {

                            // for unaligned writes, setting it to true will, cuFileWrite use posix write internally instead of regular GDS write
                            "posix_unaligned_writes" : false
                    },

		    "beegfs" : {
                            // IO threshold for read/write (param should be 4K aligned)) equal to or below which cuFile will use posix read/write
                            "posix_gds_min_kb" : 0

                            // To restrict the IO to selected IP list, when dynamic routing is enabled
                            // if using a single BeeGFS mount, provide the ip addresses here
                            //"rdma_dev_addr_list" : []

                            // if using multiple lustre mounts, provide ip addresses used by respective mount here
                            //"mount_table" : {
                            //                    "/beegfs/client1" : {
                            //                                    "rdma_dev_addr_list" : ["172.172.1.40", "172.172.1.42"]
                            //                    },

                            //                    "/beegfs/client2" : {
                            //                                    "rdma_dev_addr_list" : ["172.172.2.40", "172.172.2.42"]
                            //                    }
                            //}

		    },
                    "lustre": {

                            // IO threshold for read/write (param should be 4K aligned)) equal to or below which cuFile will use posix read/write
                            "posix_gds_min_kb" : 0

                            // To restrict the IO to selected IP list, when dynamic routing is enabled
                            // if using a single lustre mount, provide the ip addresses here (use : sudo lnetctl net show)
                            //"rdma_dev_addr_list" : []

                            // if using multiple lustre mounts, provide ip addresses used by respective mount here
                            //"mount_table" : {
                            //                    "/lustre/ai200_01/client" : {
                            //                                    "rdma_dev_addr_list" : ["172.172.1.40", "172.172.1.42"]
                            //                    },

                            //                    "/lustre/ai200_02/client" : {
                            //                                    "rdma_dev_addr_list" : ["172.172.2.40", "172.172.2.42"]
                            //                    }
                            //}
                    },

                    "nfs": {

                           // To restrict the IO to selected IP list, when dynamic routing is enabled
                           //"rdma_dev_addr_list" : []

                           //"mount_table" : {
                           //                     "/mnt/nfsrdma_01/" : {
                           //                                     "rdma_dev_addr_list" : []
                           //                     },

                           //                     "/mnt/nfsrdma_02/" : {
                           //                                     "rdma_dev_addr_list" : []
                           //                     }
                           //}
                    },

		    "gpfs": {
                           //allow GDS writes with GPFS
                           "gds_write_support": false

                           //"rdma_dev_addr_list" : []

                           //"mount_table" : {
                           //                     "/mnt/gpfs_01" : {
                           //                                     "rdma_dev_addr_list" : []
                           //                     },

                           //                     "/mnt/gpfs_02/" : {
                           //                                     "rdma_dev_addr_list" : []
                           //                     }
                           //}
                    },

                    "weka": {

                            // enable/disable RDMA write
                            "rdma_write_support" : false
                    }
            },

            "denylist": {
                            // specify list of vendor driver modules to deny for nvidia-fs (e.g. ["nvme" , "nvme_rdma"])
                            "drivers":  [ ],

                            // specify list of block devices to prevent IO using cuFile (e.g. [ "/dev/nvme0n1" ])
                            "devices": [ ],

                            // specify list of mount points to prevent IO using cuFile (e.g. ["/mnt/test"])
                            "mounts": [ ],

                            // specify list of file-systems to prevent IO using cuFile (e.g ["lustre", "wekafs"])
                            "filesystems": [ ]
            },

            "miscellaneous": {
                            // enable only for enforcing strict checks at API level for debugging
                            "api_check_aggressive": false
	    }
}
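
Rather than editing /etc/cufile.json in place, a per-run copy can be pointed to through CUFILE_ENV_PATH_JSON, as noted in the comments at the top of the file. A minimal sketch (the file name is arbitrary):

# Copy the system default and edit the copy (e.g. raise "cufile_stats" to 3 for per-IO statistics)
cp /etc/cufile.json ~/cufile-bench.json
# Point cuFile-based tools such as gdsio at the edited copy for this shell
export CUFILE_ENV_PATH_JSON=~/cufile-bench.json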

gdsio

Usage help:

gdsio version :1.11
Usage [using config file]: gdsio rw-sample.gdsio
Usage [using cmd line options]:/usr/local/cuda-12.2/gds/tools/gdsio
         -f <file name>
         -D <directory name>
         -d <gpu_index (refer nvidia-smi)>
         -n <numa node>
         -m <memory type(0 - (cudaMalloc), 1 - (cuMem), 2 - (cudaMallocHost), 3 - (malloc) 4 - (mmap))>
         -w <number of threads for a job>
         -s <file size(K|M|G)>
         -o <start offset(K|M|G)>
         -i <io_size(K|M|G)> <min_size:max_size:step_size>
         -p <enable nvlinks>
         -b <skip bufregister>
         -V <verify IO>
         -x <xfer_type> [0(GPU_DIRECT), 1(CPU_ONLY), 2(CPU_GPU), 3(CPU_ASYNC_GPU), 4(CPU_CACHED_GPU), 5(GPU_DIRECT_ASYNC), 6(GPU_BATCH), 7(GPU_BATCH_STREAM)]
         -B <batch size>
         -I <(read) 0|(write)1| (randread) 2| (randwrite) 3>
         -T <duration in seconds>
         -k <random_seed> (number e.g. 3456) to be used with random read/write>
         -U <use unaligned(4K) random offsets>
         -R <fill io buffer with random data>
         -F <refill io buffer with random data during each write>
         -a <alignment size in case of random IO>
         -P <rdma url>
         -J <per job statistics>

xfer_type:
0 - Storage->GPU (GDS)
1 - Storage->CPU
2 - Storage->CPU->GPU
3 - Storage->CPU->GPU_ASYNC
4 - Storage->PAGE_CACHE->CPU->GPU
5 - Storage->GPU_ASYNC
6 - Storage->GPU_BATCH
7 - Storage->GPU_BATCH_STREAM

Note:
read test (-I 0) with verify option (-V) should be used with files written (-I 1) with -V option
read test (-I 2) with verify option (-V) should be used with files written (-I 3) with -V option, using same random seed (-k),
same number of threads(-w), offset(-o), and data size(-s)
write test (-I 1/3) with verify option (-V) will perform writes followed by read

Example run:

root@o186i221:~# gdsio -I 1 -i 512K -s 16M -w 128 -d 0 -x 0 -D /mnt/cstor1/ccarlson/flash
IoType: WRITE XferType: GPUD Threads: 128 DataSetSize: 1630208/2097152(KiB) IOSize: 512(KiB) Throughput: 7.804225 GiB/sec, Avg_Latency: 6595.108546 usecs ops: 3184 total_time 0.199211 secs
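
To read the same files back through GDS, the write example above can be repeated with -I 0 and the other options left unchanged (add -V to both the write and the read if data verification is wanted, per the note above):

gdsio -I 0 -i 512K -s 16M -w 128 -d 0 -x 0 -D /mnt/cstor1/ccarlson/flash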

Here’s a script that helps with the runs and has comments explaining what each option does.

#!/bin/bash

set -ex
POOL_DIR="/mnt/cstor1/ccarlson/flash"
GDSIO_BIN="/usr/local/cuda-12.2/gds/tools/gdsio"

rm -rf $POOL_DIR/*

# Set up a directory per GPU on the host
for GPU_INDEX in {0..7}; do
  mkdir -p $POOL_DIR/$(hostname)/$GPU_INDEX
done

echo "Created the following directory structure"
tree $POOL_DIR

# Optionally, mpirun gdsio instances
#mpirun --machinefile machinefile.txt --allow-run-as-root \
#	-np 1 $GDSIO_BIN -I 1 -i 512K -s 128M -p -x 0 -d 0 -w 128 -D /mnt/cstor1/ccarlson/flash/$(hostname)/0 : \
#	-np 1 $GDSIO_BIN -I 1 -i 512K -s 128M -p -x 0 -d 1 -w 128 -D /mnt/cstor1/ccarlson/flash/$(hostname)/1 : \
#	-np 1 $GDSIO_BIN -I 1 -i 512K -s 128M -p -x 0 -d 2 -w 128 -D /mnt/cstor1/ccarlson/flash/$(hostname)/2 : \
#	-np 1 $GDSIO_BIN -I 1 -i 512K -s 128M -p -x 0 -d 3 -w 128 -D /mnt/cstor1/ccarlson/flash/$(hostname)/3 : \
#	-np 1 $GDSIO_BIN -I 1 -i 512K -s 128M -p -x 0 -d 4 -w 128 -D /mnt/cstor1/ccarlson/flash/$(hostname)/4 : \
#	-np 1 $GDSIO_BIN -I 1 -i 512K -s 128M -p -x 0 -d 5 -w 128 -D /mnt/cstor1/ccarlson/flash/$(hostname)/5 : \
#	-np 1 $GDSIO_BIN -I 1 -i 512K -s 128M -p -x 0 -d 6 -w 128 -D /mnt/cstor1/ccarlson/flash/$(hostname)/6 : \
#	-np 1 $GDSIO_BIN -I 1 -i 512K -s 128M -p -x 0 -d 7 -w 128 -D /mnt/cstor1/ccarlson/flash/$(hostname)/7 \

# Run gdsio directly
gdsio -I 1 -i 1024K -s 256M -p -x 0 -d 0 -w 256 -D /mnt/cstor1/ccarlson/flash/$(hostname)/0