Overview
KV Bench — model-realistic KV-cache transfers
kvbench wraps nixlbench with model-appropriate block sizes and batch counts drawn from three real model configs. This is the closest single-GPU benchmark to actual inference-serving behavior.
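For orientation, here is a minimal sketch of how a block size like the one quoted below can fall out of a model config. The MLA KV width (512 + 64) and layer count (61) come from DeepSeek R1/V3's public config; the 8-token block granularity and fp16 element size are assumptions for this example, not values read from kvbench's config files.

```python
# Illustrative sketch, not kvbench's actual code: how a per-request KV block size
# falls out of a model config. The MLA KV width (512 + 64) and layer count (61)
# are taken from DeepSeek R1/V3's public config; the 8-token block granularity
# and fp16 element size are assumptions for illustration.
def kv_block_bytes(kv_dim_per_token: int, num_layers: int,
                   tokens_per_block: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache covered by one block, summed over all layers."""
    return kv_dim_per_token * num_layers * tokens_per_block * dtype_bytes

block = kv_block_bytes(kv_dim_per_token=512 + 64, num_layers=61, tokens_per_block=8)
print(block / 1024)  # 549.0 -> the ~549 KB block size quoted below
```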
At DeepSeek R1's 549 KB blocks, POSIX-AIO's per-I/O syscall and host-DRAM bounce overhead starts to dominate. GDS's in-kernel P2P path has lower per-op cost, so it pulls ahead even with only a single submitting thread.
Rule of thumb from this run: for sub-MB KV blocks, prefer GDS. For multi-MB blocks, POSIX is competitive until batch depth overwhelms the AIO pool.
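The crossover behind that rule of thumb can be seen with a toy latency model. The per-op overheads and peak bandwidths below are assumed constants picked to make the shape visible, not measurements from this run:

```python
# Toy model of effective single-stream bandwidth vs. block size. The per-op
# overheads (fixed submission cost) and peak bandwidths are assumed constants
# chosen for illustration, not measured values from this run.
def effective_gibps(block_bytes: int, per_op_overhead_s: float, peak_gibps: float) -> float:
    wire_s = block_bytes / (peak_gibps * 2**30)            # time actually moving bytes
    return block_bytes / (per_op_overhead_s + wire_s) / 2**30

for kib in (549, 4096, 16384):
    posix = effective_gibps(kib * 1024, per_op_overhead_s=150e-6, peak_gibps=24)
    gds = effective_gibps(kib * 1024, per_op_overhead_s=30e-6, peak_gibps=20)
    print(f"{kib:>6} KiB  POSIX ~{posix:4.1f} GiB/s   GDS ~{gds:4.1f} GiB/s")
```

Under this model, sub-MB blocks spend most of their time on the fixed submission cost, so the lower-overhead path wins; by the multi-MB range the wire time dominates and the two backends converge.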
With nvidia-fs 2.28.4 + CUDA 12.9, GDS hangs indefinitely when kvbench --num_requests scales the effective batch past ~256 in-flight submissions. POSIX r=10 completes normally. We did not pursue workarounds here; this is tracked for revisiting after a libcufile upgrade.
GDS at r=1 completed for all three models; those runs are the GDS numbers shown in the table below.
| Backend | Prep (μs) | Post (μs) | Tx (μs) | Avg BW (GiB/s) |
|---|---|---|---|---|
| GDS (r=1) | ~28 | ~8 | ~20000 | 12.5 |
| POSIX (r=1) | ~25 | ~14000 | ~1000 | 21.4 |
| POSIX (r=10) | ~25 | ~20000 | ~250 | 19.5 |
Figures are extracted from the nixlbench output in the kvbench logs. GDS's Tx time dominates because the entire transfer happens inside that phase; POSIX's Post time dominates because the host-DRAM bounce copy is accounted there.