GPUDirect Storage on chili10-101d

Overview
This report documents enabling GPUDirect Storage on chili10-101d (dual EPYC 9754, 4× L40S, 4× Solidigm D7-PS1010 NVMe, ConnectX-7 400 GbE) and benchmarking both the local NVMe path (raid0 XFS) and the remote path (NFSoRDMA to a VAST cluster), across six runs in total.
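Every GDS path measured below, whether driven by gdsio or nixlbench, ultimately goes through the cuFile API. The following is a minimal sketch of that read path, assuming a file on the raid0 mount; the path, sizes, and error handling are illustrative only:

```c
/* Minimal cuFile read: DMA file data straight into GPU memory.
 * Build (typical paths, adjust to your install):
 *   gcc gds_read.c -o gds_read -I/usr/local/cuda/include \
 *       -L/usr/local/cuda/lib64 -lcufile -lcudart
 */
#define _GNU_SOURCE               /* for O_DIRECT */
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t len = 8 << 20;                /* 8 MiB, the Run 2 block size */
    const char *path = "/mnt/raid0/testfile";  /* hypothetical mount point */

    cuFileDriverOpen();                        /* attaches to nvidia-fs */

    int fd = open(path, O_RDONLY | O_DIRECT);  /* O_DIRECT is required for GDS */
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    CUfileError_t st = cuFileHandleRegister(&fh, &descr);
    if (st.err != CU_FILE_SUCCESS) { fprintf(stderr, "handle register failed\n"); return 1; }

    void *gpu_buf;
    cudaMalloc(&gpu_buf, len);
    cuFileBufRegister(gpu_buf, len, 0);        /* pin the device buffer for DMA */

    /* storage -> GPU, no host bounce buffer */
    ssize_t n = cuFileRead(fh, gpu_buf, len, /*file_offset=*/0, /*devPtr_offset=*/0);
    printf("read %zd bytes into GPU memory\n", n);

    cuFileBufDeregister(gpu_buf);
    cudaFree(gpu_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```

The point of the cuFileRead() call is that data moves from storage into the registered device buffer by DMA; the host bounce buffer that a POSIX read plus cudaMemcpy path would need never exists.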
| Headline | Result | Configuration | Context |
|---|---|---|---|
| Local NVMe peak read | 53.0 GiB/s | 4 GPUs concurrent, 8 MiB block, raid0 over 4× Solidigm D7-PS1010 | 91% of 4-drive theoretical (58 GB/s) |
| NFSoRDMA peak read | 43.4 GiB/s | 4 GPUs concurrent, 64 workers each, 1 MiB block, VAST | 96% of 400 GbE line rate; single NIC saturated |
| Single-GPU NFSoRDMA | 26.4 GiB/s | 1 GPU, 128 workers, 1 MiB block | 94% of PCIe 4.0 x16 practical ceiling |
| Stack pinned | vastnfs-dkms 4.5.5 | MOFED 24.10 · nvidia-fs 2.28.4 · CUDA 12.9 | |
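A quick way to confirm that this pinned stack is the one actually in use is to query the driver from userspace. A minimal sketch, assuming the cufile.h shipped with the CUDA 12.9 toolkit (the nvfs version fields are as declared there; check your header if the toolkit differs):

```c
/* Sanity check: cuFileDriverOpen() fails unless the nvidia-fs kernel
 * module is loaded, and the driver properties report the nvidia-fs
 * version the userspace library sees. */
#include <cufile.h>
#include <stdio.h>

int main(void) {
    CUfileError_t st = cuFileDriverOpen();
    if (st.err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFileDriverOpen failed: is nvidia-fs loaded?\n");
        return 1;
    }
    CUfileDrvProps_t props;
    cuFileDriverGetProperties(&props);
    printf("nvidia-fs driver version: %u.%u\n",
           props.nvfs.major_version, props.nvfs.minor_version);
    cuFileDriverClose();
    return 0;
}
```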
Peak read bandwidth (all runs)

[Chart: peak read bandwidth per run. Series: Local NVMe (raid0 XFS), NFSoRDMA (VAST).]

Local NVMe (raid0 XFS) scales with drive count; NFSoRDMA on a single ConnectX-7 saturates at ~43 GiB/s against the 400 GbE line rate.
Run index
Each run is covered on a detail page with the full charts, configuration, and raw data.
| Run | Configuration | Peak write | Peak read |
|---|---|---|---|
| Run 1 | gdsio block-size sweep: 1 GPU · 4 threads · 64 KiB→16 MiB · raid0 | 24.2 GiB/s | 24.3 GiB/s |
| Run 2 | gdsio multi-GPU concurrent: 4 GPUs · 8 MiB · raid0 | 39.5 GiB/s | 53.0 GiB/s |
| Run 3 | nixlbench, GDS vs POSIX vs GDS_MT: 1 thread · batch 64 · 64 KiB→16 MiB | POSIX 20.2 / GDS 13.2 GiB/s | POSIX 21.6 / GDS 12.6 GiB/s |
| Run 4 | kvbench, Llama 70B/8B + DeepSeek R1: GDS vs POSIX · r=1,10 | — | DeepSeek R1: GDS 16.4 > POSIX 14.8 GiB/s |
| Run 5 | NFSoRDMA single-GPU thread sweep: 1 GPU · 16→256 workers · 1 MiB & 8 MiB | 23.9 GiB/s (64 workers, 8 MiB) | 26.4 GiB/s (128 workers, 1 MiB) |
| Run 6 | NFSoRDMA multi-GPU scaling: 2- and 4-GPU concurrent · 64 workers · 1 MiB | — | 2-GPU: 40.2 GiB/s · 4-GPU: 43.4 GiB/s |
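The worker sweeps of Runs 5 and 6 amount to N threads issuing fixed-size cuFileRead() calls at disjoint offsets and timing the aggregate. A sketch under assumed parameters; the mount path, worker count, and I/O count are illustrative, not the report's exact gdsio invocation:

```c
/* Worker-sweep sketch: WORKERS threads, each issuing 1 MiB cuFileRead()
 * calls at disjoint file offsets into its own slice of one registered
 * GPU buffer; aggregate bandwidth from wall-clock time.
 * Build (typical paths): gcc sweep.c -o sweep -I/usr/local/cuda/include \
 *   -L/usr/local/cuda/lib64 -lcufile -lcudart -lpthread
 */
#define _GNU_SOURCE               /* for O_DIRECT */
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define WORKERS 64                /* sweep 16..256 as in Run 5 */
#define IO_SIZE (1 << 20)         /* 1 MiB block size */
#define IOS_PER_WORKER 1024       /* illustrative count */

static CUfileHandle_t fh;         /* cuFile calls are thread-safe on one handle */
static void *gpu_buf;

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < IOS_PER_WORKER; i++) {
        off_t off = ((id * IOS_PER_WORKER) + i) * (off_t)IO_SIZE;
        cuFileRead(fh, gpu_buf, IO_SIZE, off, id * (off_t)IO_SIZE);
    }
    return NULL;
}

int main(void) {
    const char *path = "/mnt/vast/testfile";   /* hypothetical NFSoRDMA mount */
    cuFileDriverOpen();
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t d;
    memset(&d, 0, sizeof(d));
    d.handle.fd = fd;
    d.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    cuFileHandleRegister(&fh, &d);

    cudaMalloc(&gpu_buf, (size_t)WORKERS * IO_SIZE);
    cuFileBufRegister(gpu_buf, (size_t)WORKERS * IO_SIZE, 0);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pthread_t tids[WORKERS];
    for (long i = 0; i < WORKERS; i++) pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < WORKERS; i++) pthread_join(tids[i], NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gib = (double)WORKERS * IOS_PER_WORKER * IO_SIZE / (1024.0 * 1024 * 1024);
    printf("%d workers: %.1f GiB in %.2f s = %.1f GiB/s\n", WORKERS, gib, secs, gib / secs);

    cuFileBufDeregister(gpu_buf);
    cudaFree(gpu_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```

Sweeping WORKERS (and, for the multi-GPU runs, pinning each process to one GPU with cudaSetDevice) reproduces the shape of the Run 5 and Run 6 measurements.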