GPUDirect Storage on chili10-101d

Overview
This report documents enabling GPUDirect Storage on chili10-101d (dual EPYC 9754, 4× L40S, 4× Solidigm D7-PS1010 NVMe, ConnectX-7 400 GbE) and benchmarking both the local NVMe path (raid0 XFS) and the remote path (NFSoRDMA to a VAST cluster), across six runs in total.
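Every GDS path measured below, whether driven by gdsio or nixlbench, ultimately goes through the cuFile API. The following is a minimal sketch of that read path, assuming a file on the raid0 mount; the path, sizes, and error handling are illustrative only:

```c
/* Minimal cuFile read: DMA file data straight into GPU memory.
 * Build (typical paths, adjust to your install):
 *   gcc gds_read.c -o gds_read -I/usr/local/cuda/include \
 *       -L/usr/local/cuda/lib64 -lcufile -lcudart
 */
#define _GNU_SOURCE               /* for O_DIRECT */
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t len = 8 << 20;                /* 8 MiB, the Run 2 block size */
    const char *path = "/mnt/raid0/testfile";  /* hypothetical mount point */

    cuFileDriverOpen();                        /* attaches to nvidia-fs */

    int fd = open(path, O_RDONLY | O_DIRECT);  /* O_DIRECT is required for GDS */
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    CUfileError_t st = cuFileHandleRegister(&fh, &descr);
    if (st.err != CU_FILE_SUCCESS) { fprintf(stderr, "handle register failed\n"); return 1; }

    void *gpu_buf;
    cudaMalloc(&gpu_buf, len);
    cuFileBufRegister(gpu_buf, len, 0);        /* pin the device buffer for DMA */

    /* storage -> GPU, no host bounce buffer */
    ssize_t n = cuFileRead(fh, gpu_buf, len, /*file_offset=*/0, /*devPtr_offset=*/0);
    printf("read %zd bytes into GPU memory\n", n);

    cuFileBufDeregister(gpu_buf);
    cudaFree(gpu_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```

The point of the cuFileRead() call is that data moves from storage into the registered device buffer by DMA; the host bounce buffer that a POSIX read plus cudaMemcpy path would need never exists.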
| Headline | Result | Configuration | Context |
|---|---|---|---|
| Local NVMe peak read | 53.0 GiB/s | 4 GPUs concurrent, 8 MiB block, raid0 over 4× Solidigm D7-PS1010 | 91% of 4-drive theoretical (58 GB/s) |
| NFSoRDMA peak read | 43.4 GiB/s | 4 GPUs concurrent, 64 workers each, 1 MiB block, VAST | 96% of 400 GbE line rate; single NIC saturated |
| Single-GPU NFSoRDMA | 26.4 GiB/s | 1 GPU, 128 workers, 1 MiB block | 94% of PCIe 4.0 x16 practical ceiling |
| Stack pinned | vastnfs-dkms 4.5.5 | MOFED 24.10 · nvidia-fs 2.28.4 · CUDA 12.9 | |
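A quick way to confirm that this pinned stack is the one actually in use is to query the driver from userspace. A minimal sketch, assuming the cufile.h shipped with the CUDA 12.9 toolkit (the nvfs version fields are as declared there; check your header if the toolkit differs):

```c
/* Sanity check: cuFileDriverOpen() fails unless the nvidia-fs kernel
 * module is loaded, and the driver properties report the nvidia-fs
 * version the userspace library sees. */
#include <cufile.h>
#include <stdio.h>

int main(void) {
    CUfileError_t st = cuFileDriverOpen();
    if (st.err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFileDriverOpen failed: is nvidia-fs loaded?\n");
        return 1;
    }
    CUfileDrvProps_t props;
    cuFileDriverGetProperties(&props);
    printf("nvidia-fs driver version: %u.%u\n",
           props.nvfs.major_version, props.nvfs.minor_version);
    cuFileDriverClose();
    return 0;
}
```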
Peak read bandwidth (all runs)

[Chart: peak read bandwidth per run. Series: Local NVMe (raid0 XFS), NFSoRDMA (VAST).]

Local NVMe (raid0 XFS) scales with drive count; NFSoRDMA on a single ConnectX-7 saturates at ~43 GiB/s against the 400 GbE line rate.
Run index
Each run is covered on a detail page with the full charts, configuration, and raw data.
| Run | Configuration | Peak write | Peak read |
|---|---|---|---|
| Run 1 | gdsio block-size sweep: 1 GPU · 4 threads · 64 KiB→16 MiB · raid0 | 24.2 GiB/s | 24.3 GiB/s |
| Run 2 | gdsio multi-GPU concurrent: 4 GPUs · 8 MiB · raid0 | 39.5 GiB/s | 53.0 GiB/s |
| Run 3 | nixlbench, GDS vs POSIX vs GDS_MT: 1 thread · batch 64 · 64 KiB→16 MiB | POSIX 20.2 / GDS 13.2 GiB/s | POSIX 21.6 / GDS 12.6 GiB/s |
| Run 4 | kvbench, Llama 70B/8B + DeepSeek R1: GDS vs POSIX · r=1,10 | — | DeepSeek R1: GDS 16.4 > POSIX 14.8 GiB/s |
| Run 5 | NFSoRDMA single-GPU thread sweep: 1 GPU · 16→256 workers · 1 MiB & 8 MiB | 23.9 GiB/s (64 workers, 8 MiB) | 26.4 GiB/s (128 workers, 1 MiB) |
| Run 6 | NFSoRDMA multi-GPU scaling: 2- and 4-GPU concurrent · 64 workers · 1 MiB | — | 2-GPU: 40.2 GiB/s · 4-GPU: 43.4 GiB/s |
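The worker sweeps of Runs 5 and 6 amount to N threads issuing fixed-size cuFileRead() calls at disjoint offsets and timing the aggregate. A sketch under assumed parameters; the mount path, worker count, and I/O count are illustrative, not the report's exact gdsio invocation:

```c
/* Worker-sweep sketch: WORKERS threads, each issuing 1 MiB cuFileRead()
 * calls at disjoint file offsets into its own slice of one registered
 * GPU buffer; aggregate bandwidth from wall-clock time.
 * Build (typical paths): gcc sweep.c -o sweep -I/usr/local/cuda/include \
 *   -L/usr/local/cuda/lib64 -lcufile -lcudart -lpthread
 */
#define _GNU_SOURCE               /* for O_DIRECT */
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define WORKERS 64                /* sweep 16..256 as in Run 5 */
#define IO_SIZE (1 << 20)         /* 1 MiB block size */
#define IOS_PER_WORKER 1024       /* illustrative count */

static CUfileHandle_t fh;         /* cuFile calls are thread-safe on one handle */
static void *gpu_buf;

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < IOS_PER_WORKER; i++) {
        off_t off = ((id * IOS_PER_WORKER) + i) * (off_t)IO_SIZE;
        cuFileRead(fh, gpu_buf, IO_SIZE, off, id * (off_t)IO_SIZE);
    }
    return NULL;
}

int main(void) {
    const char *path = "/mnt/vast/testfile";   /* hypothetical NFSoRDMA mount */
    cuFileDriverOpen();
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t d;
    memset(&d, 0, sizeof(d));
    d.handle.fd = fd;
    d.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    cuFileHandleRegister(&fh, &d);

    cudaMalloc(&gpu_buf, (size_t)WORKERS * IO_SIZE);
    cuFileBufRegister(gpu_buf, (size_t)WORKERS * IO_SIZE, 0);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pthread_t tids[WORKERS];
    for (long i = 0; i < WORKERS; i++) pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < WORKERS; i++) pthread_join(tids[i], NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gib = (double)WORKERS * IOS_PER_WORKER * IO_SIZE / (1024.0 * 1024 * 1024);
    printf("%d workers: %.1f GiB in %.2f s = %.1f GiB/s\n", WORKERS, gib, secs, gib / secs);

    cuFileBufDeregister(gpu_buf);
    cudaFree(gpu_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```

Sweeping WORKERS (and, for the multi-GPU runs, pinning each process to one GPU with cudaSetDevice) reproduces the shape of the Run 5 and Run 6 measurements.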