Overview
gdsio — local NVMe (raid0 + XFS)
Two runs on the local raid0 array (4× Solidigm D7-PS1010 striped with md raid0, XFS, mounted at /gds). Run 1 sweeps block size with a single GPU to find the single-GPU ceiling; Run 2 drives all 4 GPUs concurrently at the best block size to find the array aggregate.
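For context, an array of this shape is typically assembled along these lines. This is a sketch only: the device names, md node, and default chunk size below are placeholders, not the exact commands used for this array.

```shell
# Sketch: stripe four NVMe namespaces into md raid0, format XFS, mount
# at /gds. Device names are placeholders; verify yours before running.
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.xfs /dev/md0
mkdir -p /gds
mount /dev/md0 /gds
```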
Run 1 peak (1 GPU, 8 MiB)
24.3 GiB/s
At the practical ceiling of a single L40S's PCIe 4.0 x16 link (≈26 GB/s usable of ~32 GB/s raw).
Run 2 peak read (4 GPU, 8 MiB)
53.0 GiB/s
≈56.9 GB/s, i.e. ~98% of the 58 GB/s theoretical aggregate (4 × 14.5 GB/s per drive) of the raid0 array.
Run 2 peak write (4 GPU, 8 MiB)
39.5 GiB/s
Lower than read: sustained 8 MiB writes exhaust the drives' SLC caches, and NAND program bandwidth trails read bandwidth; raid0 striping can't hide the per-drive write ceiling.
Run 1
Block-size sweep — single GPU
gdsio · GPU 2 · 4 workers · 8 GiB xfer · /gds (raid0 XFS) · XferType GPUD
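A sweep like this maps onto gdsio's documented flags (`-d` GPU index, `-w` worker threads, `-s` per-file transfer size, `-i` IO/block size, `-x 0` for the GPU-direct transfer type, `-I` 0 = read / 1 = write; verify against `gdsio -h` on your build). The sketch below prints the commands as a dry run; the file path and the exact sweep points are assumptions, and the `echo` should be dropped to execute:

```shell
#!/bin/sh
# Dry-run sketch of the Run 1 sweep: GPU 2, 4 workers, 8 GiB per point,
# GPUD transfer type, target file on the raid0 mount. Prints one gdsio
# command per (block size, direction) pair instead of executing it.
for bs in 128K 256K 512K 1M 2M 4M 8M; do
  for io in 0 1; do                       # 0 = read, 1 = write
    echo gdsio -f /gds/gdsio.dat -d 2 -w 4 -s 8G -i "$bs" -x 0 -I "$io"
  done
done
```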
Run 2
4-GPU concurrent — raid0 aggregate
gdsio · all 4 L40S concurrent · 4 workers each · 8 MiB block · 16 GiB xfer · /gds (raid0 XFS) · XferType GPUD
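The concurrent run can be scripted with shell job control, one gdsio instance per GPU (same flag assumptions as in the Run 1 sketch; the per-GPU file names are placeholders). gdsio also has a batch/config-file mode for multi-job runs, per its docs, which avoids shell-level backgrounding. Dry-run sketch:

```shell
#!/bin/sh
# Dry-run sketch of Run 2: one gdsio instance per L40S, launched in
# parallel, 4 workers each, fixed 8 MiB block, 16 GiB per GPU, reads.
# Drop the echo to execute; "wait" joins all four background jobs.
for gpu in 0 1 2 3; do
  echo gdsio -f "/gds/gdsio-gpu${gpu}.dat" -d "$gpu" -w 4 -s 16G -i 8M -x 0 -I 0 &
done
wait
```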
Observations
- Read/write symmetric past 8 MiB. Below 1 MiB, write beats read (SLC/parallelism wins); past 2 MiB, reads catch up and plateau at ~24 GiB/s — the GPU PCIe link is the bottleneck, not the drives.
- Multi-GPU is drive-bound, not link-bound. 4 GPUs at 13.3 GiB/s each ≈ 53 GiB/s aggregate. Adding more GPUs wouldn't help — the raid0 array is the ceiling.
- Write efficiency drops at 4 GPUs. Per-GPU write falls from 24 → 9.8 GiB/s. Drive-level write buffers don't parallelize as cleanly as reads under sustained load.
- GPU2 picked as single-GPU representative. Chosen via `nvidia-smi topo -m`: GPU2 shares an IOD quadrant (0x80) with the NVMe drives, so PCIe traffic stays on-socket.
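The aggregate arithmetic above can be sanity-checked with the unit conversion made explicit (figures taken from the runs; 1 GiB = 1.073741824 GB):

```shell
#!/bin/sh
# Sanity-check Run 2's read aggregate against the raid0 theoretical ceiling.
awk 'BEGIN {
  per_gpu_gib = 13.3                    # Run 2 per-GPU read, GiB/s
  agg_gib     = 4 * per_gpu_gib         # 4 GPUs -> GiB/s aggregate
  agg_gb      = agg_gib * 1.073741824   # GiB/s -> GB/s
  ceiling_gb  = 4 * 14.5                # 4 drives x 14.5 GB/s spec
  printf "aggregate %.1f GiB/s = %.1f GB/s, %.0f%% of %.0f GB/s\n",
         agg_gib, agg_gb, 100 * agg_gb / ceiling_gb, ceiling_gb
}'
# -> aggregate 53.2 GiB/s = 57.1 GB/s, 98% of 58 GB/s
```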
Raw CSVs: 00-blocksize-sweep.csv · multi-gpu-results.csv