Overview

How GPUDirect Storage works — and why the ceilings are what they are

A short tour of the data paths that produced the numbers in this report.

Path 1
Local NVMe → GPU (direct)
cuFile routes the request through the nvidia-fs kernel driver, which lets the NVMe driver DMA peer-to-peer into GPU memory. Bytes go NVMe → CPU root complex → GPU BAR1, never touching DRAM.
[Diagram: NVMe drive (Solidigm D7-PS1010, PCIe Gen5 x4) → CPU root complex (EPYC 9554 IOD) → GPU BAR1 (L40S, 64 GiB BAR1) → VRAM tensor, via P2P DMA. DRAM is unused: no bounce, no copy. Measured: 24.3 GiB/s single-GPU · 53.0 GiB/s 4-GPU aggregate.]
  • Requires: nvidia-fs kernel module, GPU BAR1 exposed via ReBAR, IOMMU in passthrough or off.
  • Verified with: gdscheck -p and XferType: GPUD in the benchmark output.
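The link arithmetic behind this path is easy to check: PCIe Gen5 runs 32 GT/s per lane with 128b/130b encoding, so a Gen5 x4 drive uplink sits well below the GPU's Gen4 x16 uplink. A back-of-the-envelope sketch (Python; numbers are link-layer theory, not measurements from this report):

```python
# Back-of-the-envelope PCIe link math for the local-NVMe path.
# PCIe Gen5: 32 GT/s per lane; Gen4: 16 GT/s. Both use 128b/130b encoding.

def pcie_raw_gbps(gt_per_s: float, lanes: int) -> float:
    """Raw payload bandwidth in GB/s (decimal), before TLP/protocol overhead."""
    return gt_per_s * lanes * (128 / 130) / 8

gen5_x4 = pcie_raw_gbps(32, 4)    # one NVMe drive's uplink
gen4_x16 = pcie_raw_gbps(16, 16)  # the L40S GPU uplink (PCIe 4.0 x16)

print(f"Gen5 x4  raw: {gen5_x4:.2f} GB/s")   # 15.75 GB/s -> why a drive specs ~14.5 GB/s read
print(f"Gen4 x16 raw: {gen4_x16:.2f} GB/s")  # 31.51 GB/s -> the single-GPU ingest ceiling
```

TLP headers and endpoint flow-control credits eat into the raw figure, which is why the report's single-GPU ceiling lands near ~28 GiB/s rather than the full link rate.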
Path 2
NFSoRDMA → GPU (direct)
vastnfs-dkms plumbs RDMA reads/writes through libcufile_rdma directly into GPU BAR1. The NIC DMAs into VRAM the same way local NVMe does.
[Diagram: VAST CNode (5 CNodes, 5 VIPs) → SN5600 leaf switch → ConnectX-7 (400 GbE, mlx5) → CPU root complex (nvidia_peermem MR on BAR1) → GPU BAR1 (L40S, VRAM target). RoCE v2 with PFC priority 3 on the wire; peer-to-peer PCIe DMA into VRAM via libcufile_rdma. DRAM is bypassed. Measured: 26.4 GiB/s single-GPU · 43.4 GiB/s 4-GPU aggregate (96% of 400 GbE).]
  • Requires: nvidia_peermem, libcufile_rdma, vastnfs-dkms (not the in-tree rpcrdma), MOFED-built mlnx-nfsrdma-dkms.
  • Verified with: cat /proc/fs/nfsfs/cbstats showing RDMA xprts and XferType: GPUD in gdsio output.
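The 400 GbE figure can be sanity-checked the same way: 400 Gb/s of raw line rate converts to about 46.6 GiB/s, and Ethernet + RoCE v2 + NFS framing shaves that to roughly the ~45 GiB/s ceiling used in this report. A quick check (Python; measured/ceiling values taken from the report):

```python
# Convert 400 GbE line rate to GiB/s and compare against the measured NFSoRDMA peak.

LINE_RATE_BITS = 400e9              # 400 GbE: decimal bits per second
raw_gib_s = LINE_RATE_BITS / 8 / 2**30

measured = 43.4                     # GiB/s, 4-GPU aggregate from this report
ceiling = 45.0                      # GiB/s after Ethernet + RoCE v2 + NFS framing

print(f"raw line rate     : {raw_gib_s:.1f} GiB/s")    # 46.6 GiB/s
print(f"vs framing ceiling: {measured / ceiling:.0%}")  # 96%
```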
Why the ceilings are what they are
Three independent ceilings apply: the GPU's PCIe uplink, the aggregate drive bandwidth, and the NIC line rate. A different one becomes the binding constraint as the workload scales.
Ceiling                   Theoretical   Measured     %     Limiting factor
Single-GPU PCIe 4.0 x16   ~28 GiB/s     26.4 GiB/s   94%   TLP overhead + GPU endpoint credits. Run 5 peak.
4× drive raid0            ~58 GB/s      53.0 GB/s    91%   4× Solidigm D7-PS1010 spec ~14.5 GB/s read. Run 2 peak.
400 GbE NIC line rate     ~45 GiB/s     43.4 GiB/s   96%   Ethernet + RoCE v2 + NFS framing overhead. Run 6 peak.
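The percentage column follows directly from the other two; a few lines reproduce it:

```python
# Reproduce the "% of theoretical" column from the ceilings table above.

ceilings = {
    "Single-GPU PCIe 4.0 x16": (26.4, 28.0),  # measured, theoretical (GiB/s)
    "4x drive raid0":          (53.0, 58.0),  # (GB/s)
    "400 GbE NIC line rate":   (43.4, 45.0),  # (GiB/s)
}

for name, (measured, theoretical) in ceilings.items():
    print(f"{name}: {measured / theoretical:.0%}")  # 94%, 91%, 96%
```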
GDS vs POSIX compatibility mode
When libcufile can't use the direct path, it falls back to POSIX: a read into a DRAM bounce buffer followed by a cudaMemcpyAsync. Always works. Costs bandwidth.
[Diagram: GDS path: NVMe/NIC → GPU VRAM, one P2P DMA, zero copies. POSIX path: NVMe/NIC → DRAM bounce buffer (one DMA) → cudaMemcpy → GPU VRAM. Costs: 2× DRAM bandwidth pressure and two stacked DMA latencies. Kernel readahead can hide this on single-thread reads (see nixlbench Run 3) but falls apart under multi-worker load.]

On writes, compat mode costs nearly 2× bandwidth: the data is copied into DRAM before being pushed out. On reads the impact is smaller because kernel readahead can prefetch into the page cache; nixlbench Run 3 even shows POSIX winning at single thread thanks to readahead, but that advantage disappears once you fan out across workers.
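The "nearly 2×" figure falls out of a simple traffic model: in compat mode every payload byte crosses the memory bus twice (once DMA'd into the bounce buffer, once read back out by cudaMemcpy), while the direct path touches DRAM zero times. A toy model (Python; illustrative accounting, not a measurement):

```python
# Toy DRAM-traffic model for GDS vs POSIX compat mode (illustrative, not measured).

def dram_bytes_moved(payload_bytes: int, mode: str) -> int:
    """DRAM traffic generated by one transfer of payload_bytes.

    gds   : P2P DMA straight into GPU BAR1 -> DRAM untouched.
    posix : DMA into a DRAM bounce buffer, then cudaMemcpy reads it back out,
            so every payload byte crosses the memory bus twice.
    """
    if mode == "gds":
        return 0
    if mode == "posix":
        return 2 * payload_bytes
    raise ValueError(mode)

one_gib = 2**30
print(dram_bytes_moved(one_gib, "gds"))    # 0
print(dram_bytes_moved(one_gib, "posix"))  # 2147483648 -> the "2x DRAM BW pressure"
```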

Where GDS wins vs where POSIX wins
GDS wins when
  • Large batches + many submitter threads (multi-GPU, multi-worker)
  • Small KV blocks where per-I/O overhead dominates (DeepSeek R1)
  • DRAM bandwidth is contested (multi-tenant, high-core-count boxes)
  • Real inference/training workloads where the GPU pulls from VRAM anyway
POSIX holds up when
  • Single-threaded AIO is the shape of the workload
  • Large sequential reads where kernel readahead is effective
  • Multi-MB blocks at shallow queue depth (Llama 70B r=1)
  • DRAM is not contested — plenty of memory channels free
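The two lists above amount to a rule of thumb. Encoded as a predicate (hypothetical heuristic; the function and thresholds are mine, not a cuFile API):

```python
# Hypothetical rule-of-thumb from the lists above -- not a cuFile API, just the
# report's guidance encoded as a predicate. The 256 KiB small-block threshold
# is an assumed illustrative cutoff.

def prefer_gds(workers: int, block_bytes: int, queue_depth: int,
               dram_contended: bool) -> bool:
    if dram_contended:
        return True      # bounce buffers fight for contested memory bandwidth
    if workers > 1:
        return True      # readahead's advantage disappears under multi-worker fan-out
    if block_bytes <= 256 * 1024:
        return True      # small KV blocks: per-I/O overhead dominates
    if queue_depth <= 1:
        return False     # multi-MB blocks at shallow queue depth: POSIX holds up
    return True

# Single-threaded multi-MB sequential read at QD 1 -> POSIX is fine:
print(prefer_gds(workers=1, block_bytes=8 << 20, queue_depth=1, dram_contended=False))   # False
# Multi-worker small-block KV-cache load -> GDS:
print(prefer_gds(workers=8, block_bytes=64 << 10, queue_depth=32, dram_contended=False))  # True
```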