GPUDirect Async Kernel-Initiated (GDAKI)
- GDAKI is a GPU-initiated architecture that enables direct RDMA operations, bypassing the CPU for low-latency communication.
- It leverages DOCA GPUNetIO within the NCCL framework to tightly integrate communication and computation, benefitting models like Mixture-of-Experts.
- Benchmark results show GDAKI reduces latency by up to 9% in multi-node setups while achieving near line-rate bandwidth performance.
GPUDirect Async Kernel-Initiated (GDAKI) is a backend architecture in NVIDIA’s NCCL GPU-Initiated Networking (GIN) stack that enables CUDA device-side code to directly initiate and manage network RDMA operations, fully bypassing the CPU during both communication and completion handling. The approach achieves low-latency, fine-grained GPU-to-GPU communication integrated seamlessly into NCCL’s production-grade collective infrastructure, leveraging DOCA GPUNetIO for direct GPU–NIC interaction. GDAKI is particularly relevant to modern AI workloads with tight computation-communication coupling requirements, such as Mixture-of-Experts (MoE) models, where traditional host-initiated paradigms incur prohibitive coordination overhead (Hamidouche et al., 19 Nov 2025).
1. Hardware-Software Architecture
GDAKI operates through a hardware-software pipeline involving the following core components:
- GPU: Provides streaming multiprocessors (SMs), CUDA cores, on-GPU memory for work-queue entries (WQEs), and PCIe/NVLink BAR mappings into the device address space.
- NIC: Typically an InfiniBand/RoCE adapter (e.g., ConnectX-6 Dx or newer), exposing doorbell registers, send/receive QPs, and CQs mapped into GPU-visible BARs.
- NIC DMA Engines: Fetch WQEs from GPU memory, drive one-sided RDMA transactions (READ/WRITE/ATOMICS), and post completion notifications back into GPU-visible memory.
- DOCA GPUNetIO: API and library providing kernel-side verbs for GPU→NIC communication with minimal overhead.
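As a rough mental model, the GPU-visible state these components share can be pictured as a C struct (field names here are hypothetical; the actual DOCA GPUNetIO context is opaque library state):

```c
#include <stdint.h>

/* Hypothetical device-visible communication context (illustrative only;
 * the real DOCA GPUNetIO / GDAKI structures are library-internal). */
typedef struct {
    volatile uint64_t *doorbell_bar;  /* NIC doorbell register, BAR-mapped into GPU address space */
    void              *wqe_ring;      /* circular WQE buffer in GPU memory, fetched by NIC DMA */
    uint32_t           wqe_ring_len;  /* number of WQE slots */
    volatile void     *cq_buf;        /* completion entries written by the NIC, polled by GPU threads */
    uint32_t           lkey, rkey;    /* local/remote memory registration keys */
} gdaki_ctx_t;
```

Each component in the list above maps onto one or two fields: the GPU owns `wqe_ring` and polls `cq_buf`, the NIC DMA engines read `wqe_ring` and write `cq_buf`, and the doorbell register is the GPU→NIC notification channel.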
The data path follows this sequence:
- Host-side Initialization: The host (via `ncclCommInitRank`) allocates collective symmetric memory windows, initializes the GDAKI context using DOCA GPUNetIO to set up NIC QPs, CQs, and BAR locations, exchanges lkey/rkey handles, and makes the context device-accessible.
- Device-side Issuance: CUDA kernels instantiate an `ncclGin` object referencing the GIN context. Application threads enqueue RDMA WQEs (encoding source, destination, length, rkey, and notification flags) into an on-GPU circular buffer.
- NIC Notification: Device-side code writes to the NIC's doorbell register (via a PCIe/NVLink BAR store), which triggers the NIC DMA engines to read WQEs and issue network operations.
- Completion Polling: Upon RDMA completion, the NIC posts completion entries to a GPU-visible CQ buffer. Threads poll this buffer using APIs such as `gin.readSignal` or `gin.waitSignal` to detect operation completion and ensure remote visibility.
This fully device-initiated flow provides “zero-CPU-overhead” communication in production settings, eliminating host thread interruptions and CPU context-switching.
2. Programming Model and Device API Integration
The device-facing interface for GDAKI is encapsulated in the ncclGin class of the NCCL Device API, which maps directly onto DOCA GPUNetIO device verbs. The API supports collective operations at the remote memory window granularity, including atomic updates and remote signaling for synchronization.
Example kernel-side sequence for a remote PUT with notification:
```cuda
__global__ void stageAndSend(ncclDevComm devComm, ncclWindow_t localWin,
                             size_t localOff, uint64_t remoteAddr,
                             int peer, int signalId) {
    // 1. Initialize device context
    ncclGin gin(devComm, /*contextIndex=*/0);

    // 2. Remote write with signaling
    gin.put(/*team=*/ncclTeamWorld(devComm), peer,
            /*dstWindow*/remoteAddr, /*dstOffset*/0,
            /*srcWindow*/localWin, /*srcOffset*/localOff,
            /*bytes*/N, ncclGin_SignalInc{signalId});

    // 3. Optional completion notification
    gin.waitSignal(ncclCoopCta(), signalId, /*expected=*/1);
    gin.resetSignal(signalId);
}
```
Internally, operations like `gin.put` become inlined GPUNetIO calls such as gpi_write() or gpi_atomic(), plus a BAR store to ring the NIC's doorbell. Completion APIs (readSignal, waitSignal, readCounter, waitCounter) poll GPU-visible memory. No host-side proxy or additional kernel launches are required for progress or completion notification.
3. Performance Characteristics and Modeling
Latency and bandwidth in GDAKI are governed by three principal terms in a first-order model:

$$T_{\text{msg}} \approx T_{\text{issue}} + T_{\text{net}} + T_{\text{comp}}, \qquad T_{\text{net}} = L_{\text{wire}} + \frac{S}{B}$$

Where:
- $T_{\text{issue}}$: Time to write the WQE and ring the doorbell (~hundreds of ns)
- $T_{\text{net}}$: On-wire network latency ($L_{\text{wire}} \approx 130$ ns for InfiniBand) plus serialization of an $S$-byte message at link bandwidth $B$ (e.g., 400 Gb/s)
- $T_{\text{comp}}$: Time to poll the completion queue entry (tens–hundreds of ns).

For round-trip operations, the network term is doubled and two issue–complete pairs are incurred. For steady-state $k$-message pipelines, bandwidth asymptotically approaches the link rate $B$, subject to NIC DMA fetch efficiency and GPU memory performance.
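As an illustrative worked example (representative values, not measurements from the paper): take ~200 ns to write the WQE and ring the doorbell, ~130 ns of wire latency, a 4 KiB message at 400 Gb/s, and ~100 ns to poll the completion entry. The one-way message time then comes out to roughly half a microsecond:

```latex
T_{\text{serialize}} = \frac{4096 \times 8\ \text{bit}}{400\ \text{Gb/s}} \approx 82\ \text{ns}
\qquad
T_{\text{msg}} \approx \underbrace{200}_{\text{issue}} \;+\; \underbrace{130 + 82}_{\text{network}} \;+\; \underbrace{100}_{\text{complete}} \approx 512\ \text{ns}
```

At this message size the fixed issue and completion costs dominate serialization, which is why moving issuance onto the GPU (hundreds of ns) rather than detouring through a host proxy matters most for small messages.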
4. Empirical Microbenchmark and Application Results
Extensive benchmarking on H100 GPUs and InfiniBand networks demonstrates:
| Mode | GDAKI (DOCA GPUNetIO) | NCCL Proxy | NVSHMEM IBGDA | NVSHMEM IBRC |
|---|---|---|---|---|
| Point-to-Point RTT (4–128 B, µs) | 16.7 | 18.0 | 24.3 | 16.0 |
| DeepEP Aggregate RDMA (GB/s) | ~84 (dispatch) | – | ~84 | – |
| 8-GPU E2E Latency (µs) | 40.6 | – | 41.4 | – |
| 2-Node E2E Latency (µs) | 142.5 | – | 157.0 | – |
Key findings:
- GDAKI's direct GPU→NIC path outperforms the CPU-assisted proxy by ~1–2 µs and host-initiated RDMA by ~7–8 µs for small messages.
- Bandwidth matches or exceeds the best-known figures at scale (~400 Gb/s per link), staying within 1–2% in all tested cases.
- DeepEP Mixture-of-Experts kernels on GDAKI see up to 9% lower latency in multi-node low-latency mode, with high-throughput mode matched exactly.
- CPU involvement is eliminated: zero CPU threads per communicator, freeing host cores for other use.
- Multi-node “combine” performance is within 1–3% of NVSHMEM.
5. Hardware and Software Prerequisites
The following requirements must be met for GDAKI operation:
- NIC: NVIDIA ConnectX-6 Dx or newer, or BlueField-2/3 DPU with GPUNetIO support.
- PCIe/NVLink Topology: GPU and NIC must share the same root complex, ensuring peer-to-peer BAR mappings and low-latency path.
- GPUs: Volta-class (sm_70+) or newer; CUDA toolkit 12.x or higher.
- Kernel Modules and Drivers: `nv_peer_mem` or `dmabuf` for GPUDirect RDMA, with an OFED/MOFED stack matching the NIC firmware (e.g., NVIDIA OFED 5.x).
- Software Stack: DOCA SDK with the appropriate `doca-gpunetio` library version aligned to the firmware.
This enables direct access and control of NIC hardware resources from CUDA devices.
6. Integration into NCCL and MoE Libraries
GDAKI is integrated into NCCL as a plugin backend:
- NCCL Core: At communicator initialization, the runtime detects GPUNetIO capability, loads the GDAKI plugin, allocates QPs/BARs, performs memory registration (for lkey/rkey and address handles), and distributes this context to all participating ranks.
- Memory Windows: Application code registers CUDA pointers with NCCL via `ncclCommWindowRegister`, yielding an `ncclWindow_t` handle backed by DOCA GPUNetIO.
- DeepEP Example: Replaces NVSHMEM one-sided routines with `ncclGin` methods, mapping pointer arithmetic to (window, offset) pairs and instantiating multiple communicators where required (e.g., 24 QPs per rank).
- CUDA Kernel Usage: After including `<nccl_device_api.h>`, device code calls `ncclGin` methods for put and signaling; host-side proxies or thread launches are not needed.
Integration is transparent for higher-level frameworks including PyTorch, JAX, and TensorRT-LLM, and typically requires only minor kernel changes.
7. Summary of Observed Benefits
GDAKI achieves the following experimentally validated benefits:
- Sub-17 µs round-trip latency for small (4–128 B) messages.
- Bandwidth scalability up to line-rate (≈400 Gb/s per link) across all tested collective modes.
- Elimination of host-side synchronization and context switching, resulting in zero CPU thread overhead for communication progress.
- Application-level performance (MoE on DeepEP) at or above that of competing approaches, and up to 9% latency reduction in multi-node scenarios with minimal code changes for adoption.
- Fully device-side controllable collective operations, enabling tight fusion of communication and computation within CUDA kernels (Hamidouche et al., 19 Nov 2025).
GDAKI, in conjunction with DOCA GPUNetIO and NCCL's collective runtime, represents a tightly integrated architecture for efficient one-sided RDMA communication in GPU-centric distributed AI workloads.