Overview
gdsio — local NVMe (raid0 + XFS)
Two runs on the local raid0 array (4× Solidigm D7-PS1010 striped with md raid0, XFS, mounted at /gds). Run 1 sweeps block size with a single GPU to find the single-GPU ceiling; Run 2 drives all 4 GPUs concurrently at the best block size to find the array aggregate.
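For context, an array of this shape is typically assembled along these lines. This is a sketch only: the device names, md node, and default chunk size below are placeholders, not the exact commands used for this array.

```shell
# Sketch: stripe four NVMe namespaces into md raid0, format XFS, mount
# at /gds. Device names are placeholders; verify yours before running.
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.xfs /dev/md0
mkdir -p /gds
mount /dev/md0 /gds
```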
Run 1 peak (1 GPU, 8 MiB)
24.3 GiB/s
At the practical ceiling of a single L40S's PCIe 4.0 x16 link (≈26 GB/s usable of ~32 GB/s raw).
Run 2 peak read (4 GPU, 8 MiB)
53.0 GiB/s
≈56.9 GB/s, i.e. ~98% of the 58 GB/s theoretical aggregate (4 × 14.5 GB/s per drive) of the raid0 array.
Run 2 peak write (4 GPU, 8 MiB)
39.5 GiB/s
Lower than read: sustained 8 MiB writes exhaust the drives' SLC caches, and NAND program bandwidth trails read bandwidth; raid0 striping can't hide the per-drive write ceiling.
Run 1
Block-size sweep — single GPU
gdsio · GPU 2 · 4 workers · 8 GiB xfer · /gds (raid0 XFS) · XferType GPUD
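A sweep like this maps onto gdsio's documented flags (`-d` GPU index, `-w` worker threads, `-s` per-file transfer size, `-i` IO/block size, `-x 0` for the GPU-direct transfer type, `-I` 0 = read / 1 = write; verify against `gdsio -h` on your build). The sketch below prints the commands as a dry run; the file path and the exact sweep points are assumptions, and the `echo` should be dropped to execute:

```shell
#!/bin/sh
# Dry-run sketch of the Run 1 sweep: GPU 2, 4 workers, 8 GiB per point,
# GPUD transfer type, target file on the raid0 mount. Prints one gdsio
# command per (block size, direction) pair instead of executing it.
for bs in 128K 256K 512K 1M 2M 4M 8M; do
  for io in 0 1; do                       # 0 = read, 1 = write
    echo gdsio -f /gds/gdsio.dat -d 2 -w 4 -s 8G -i "$bs" -x 0 -I "$io"
  done
done
```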
Run 2
4-GPU concurrent — raid0 aggregate
gdsio · all 4 L40S concurrent · 4 workers each · 8 MiB block · 16 GiB xfer · /gds (raid0 XFS) · XferType GPUD
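The concurrent run can be scripted with shell job control, one gdsio instance per GPU (same flag assumptions as in the Run 1 sketch; the per-GPU file names are placeholders). gdsio also has a batch/config-file mode for multi-job runs, per its docs, which avoids shell-level backgrounding. Dry-run sketch:

```shell
#!/bin/sh
# Dry-run sketch of Run 2: one gdsio instance per L40S, launched in
# parallel, 4 workers each, fixed 8 MiB block, 16 GiB per GPU, reads.
# Drop the echo to execute; "wait" joins all four background jobs.
for gpu in 0 1 2 3; do
  echo gdsio -f "/gds/gdsio-gpu${gpu}.dat" -d "$gpu" -w 4 -s 16G -i 8M -x 0 -I 0 &
done
wait
```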
Observations
- Read/write symmetric past 8 MiB. Below 1 MiB, write beats read (SLC/parallelism wins); past 2 MiB, reads catch up and plateau at ~24 GiB/s — the GPU PCIe link is the bottleneck, not the drives.
- Multi-GPU is drive-bound, not link-bound. 4 GPUs at 13.3 GiB/s each ≈ 53 GiB/s aggregate. Adding more GPUs wouldn't help — the raid0 array is the ceiling.
- Write efficiency drops at 4 GPUs. Per-GPU write falls from 24 → 9.8 GiB/s. Drive-level write buffers don't parallelize as cleanly as reads under sustained load.
- GPU2 picked as single-GPU representative. Chosen via `nvidia-smi topo -m`: GPU2 shares an IOD quadrant (0x80) with the NVMe drives, so PCIe traffic stays on-socket.
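The aggregate arithmetic above can be sanity-checked with the unit conversion made explicit (figures taken from the runs; 1 GiB = 1.073741824 GB):

```shell
#!/bin/sh
# Sanity-check Run 2's read aggregate against the raid0 theoretical ceiling.
awk 'BEGIN {
  per_gpu_gib = 13.3                    # Run 2 per-GPU read, GiB/s
  agg_gib     = 4 * per_gpu_gib         # 4 GPUs -> GiB/s aggregate
  agg_gb      = agg_gib * 1.073741824   # GiB/s -> GB/s
  ceiling_gb  = 4 * 14.5                # 4 drives x 14.5 GB/s spec
  printf "aggregate %.1f GiB/s = %.1f GB/s, %.0f%% of %.0f GB/s\n",
         agg_gib, agg_gb, 100 * agg_gb / ceiling_gb, ceiling_gb
}'
# -> aggregate 53.2 GiB/s = 57.1 GB/s, 98% of 58 GB/s
```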
Raw CSVs: 00-blocksize-sweep.csv · multi-gpu-results.csv