GPUDirect Async Kernel-Initiated (GDAKI)
- GDAKI is a GPU-initiated architecture that enables direct RDMA operations, bypassing the CPU for low-latency communication.
- It leverages DOCA GPUNetIO within the NCCL framework to tightly integrate communication and computation, benefitting models like Mixture-of-Experts.
- Benchmark results show GDAKI reduces latency by up to 9% in multi-node setups while achieving near line-rate bandwidth performance.
GPUDirect Async Kernel-Initiated (GDAKI) is a backend architecture in NVIDIA’s NCCL GPU-Initiated Networking (GIN) stack that enables CUDA device-side code to directly initiate and manage network RDMA operations, fully bypassing the CPU during both communication and completion handling. The approach achieves low-latency, fine-grained GPU-to-GPU communication integrated seamlessly into NCCL’s production-grade collective infrastructure, leveraging DOCA GPUNetIO for direct GPU–NIC interaction. GDAKI is particularly relevant to modern AI workloads with tight computation-communication coupling requirements, such as Mixture-of-Experts (MoE) models, where traditional host-initiated paradigms incur prohibitive coordination overhead (Hamidouche et al., 19 Nov 2025).
1. Hardware-Software Architecture
GDAKI operates through a hardware-software pipeline involving the following core components:
- GPU: Provides streaming multiprocessors (SMs), CUDA cores, on-GPU memory for work-queue entries (WQEs), and PCIe/NVLink BAR mappings into the device address space.
- NIC: Typically an InfiniBand/RoCE adapter (e.g., ConnectX-6 Dx or newer), exposing doorbell registers, send/receive QPs, and CQs mapped into GPU-visible BARs.
- NIC DMA Engines: Fetch WQEs from GPU memory, drive one-sided RDMA transactions (READ/WRITE/ATOMICS), and post completion notifications back into GPU-visible memory.
- DOCA GPUNetIO: API and library providing kernel-side verbs for GPU→NIC communication with minimal overhead.
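As a rough mental model, the GPU-visible state these components share can be pictured as a C struct (field names here are hypothetical; the actual DOCA GPUNetIO context is opaque library state):

```c
#include <stdint.h>

/* Hypothetical device-visible communication context (illustrative only;
 * the real DOCA GPUNetIO / GDAKI structures are library-internal). */
typedef struct {
    volatile uint64_t *doorbell_bar;  /* NIC doorbell register, BAR-mapped into GPU address space */
    void              *wqe_ring;      /* circular WQE buffer in GPU memory, fetched by NIC DMA */
    uint32_t           wqe_ring_len;  /* number of WQE slots */
    volatile void     *cq_buf;        /* completion entries written by the NIC, polled by GPU threads */
    uint32_t           lkey, rkey;    /* local/remote memory registration keys */
} gdaki_ctx_t;
```

Each component in the list above maps onto one or two fields: the GPU owns `wqe_ring` and polls `cq_buf`, the NIC DMA engines read `wqe_ring` and write `cq_buf`, and the doorbell register is the GPU→NIC notification channel.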
The data path follows this sequence:
- Host-side Initialization: The host (via `ncclCommInitRank`) allocates collective symmetric memory windows, initializes the GDAKI context using DOCA GPUNetIO to set up NIC QPs, CQs, and BAR locations, exchanges lkey/rkey handles, and makes the context device-accessible.
- Device-side Issuance: CUDA kernels instantiate an `ncclGin` object referencing the GIN context. Application threads enqueue RDMA WQEs (encoding source, destination, length, rkey, and notification flags) into an on-GPU circular buffer.
- NIC Notification: Device-side code writes to the NIC's doorbell register (via a PCIe/NVLink BAR store), which triggers the NIC DMA engines to read WQEs and issue network operations.
- Completion Polling: Upon RDMA completion, the NIC posts completion entries to a GPU-visible CQ buffer. Threads poll this buffer using APIs such as `gin.readSignal` or `gin.waitSignal` to detect operation completion and ensure remote visibility.
This fully device-initiated flow provides “zero-CPU-overhead” communication in production settings, eliminating host thread interruptions and CPU context-switching.
2. Programming Model and Device API Integration
The device-facing interface for GDAKI is encapsulated in the ncclGin class of the NCCL Device API, which maps directly onto DOCA GPUNetIO device verbs. The API supports collective operations at the remote memory window granularity, including atomic updates and remote signaling for synchronization.
Example kernel-side sequence for a remote PUT with notification:
```cuda
__global__ void stageAndSend(ncclDevComm devComm, ncclWindow_t localWin,
                             size_t localOff, uint64_t remoteAddr,
                             int peer, int signalId) {
    // 1. Initialize device context
    ncclGin gin(devComm, /*contextIndex=*/0);

    // 2. Remote write with signaling
    gin.put(/*team=*/ncclTeamWorld(devComm), peer,
            /*dstWindow*/remoteAddr, /*dstOffset*/0,
            /*srcWindow*/localWin, /*srcOffset*/localOff,
            /*bytes*/N, ncclGin_SignalInc{signalId});

    // 3. Optional completion notification
    gin.waitSignal(ncclCoopCta(), signalId, /*expected=*/1);
    gin.resetSignal(signalId);
}
```
Internally, operations like `gin.put` become inlined GPUNetIO calls such as gpi_write() or gpi_atomic(), plus a BAR store to ring the NIC's doorbell. Completion APIs (readSignal, waitSignal, readCounter, waitCounter) poll GPU-visible memory. No host-side proxy or additional kernel launches are required for progress or completion notification.
3. Performance Characteristics and Modeling
Latency and bandwidth in GDAKI are governed by three principal terms in a first-order model:

$$T_{\text{msg}} \approx T_{\text{issue}} + T_{\text{net}} + T_{\text{comp}}, \qquad T_{\text{net}} = L_{\text{wire}} + \frac{S}{B}$$

Where:
- $T_{\text{issue}}$: Time to write the WQE and ring the doorbell (~hundreds of ns)
- $T_{\text{net}}$: On-wire network latency ($L_{\text{wire}} \approx 130$ ns for InfiniBand) plus serialization of an $S$-byte message at link bandwidth $B$ (e.g., 400 Gb/s)
- $T_{\text{comp}}$: Time to poll the completion queue entry (tens–hundreds of ns).

For round-trip operations, the network term is doubled and two issue–complete pairs are incurred. For steady-state $k$-message pipelines, bandwidth asymptotically approaches the link rate $B$, subject to NIC DMA fetch efficiency and GPU memory performance.
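As an illustrative worked example (representative values, not measurements from the paper): take ~200 ns to write the WQE and ring the doorbell, ~130 ns of wire latency, a 4 KiB message at 400 Gb/s, and ~100 ns to poll the completion entry. The one-way message time then comes out to roughly half a microsecond:

```latex
T_{\text{serialize}} = \frac{4096 \times 8\ \text{bit}}{400\ \text{Gb/s}} \approx 82\ \text{ns}
\qquad
T_{\text{msg}} \approx \underbrace{200}_{\text{issue}} \;+\; \underbrace{130 + 82}_{\text{network}} \;+\; \underbrace{100}_{\text{complete}} \approx 512\ \text{ns}
```

At this message size the fixed issue and completion costs dominate serialization, which is why moving issuance onto the GPU (hundreds of ns) rather than detouring through a host proxy matters most for small messages.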
4. Empirical Microbenchmark and Application Results
Extensive benchmarking on H100 GPUs and InfiniBand networks demonstrates:
| Mode | GDAKI (DOCA GPUNetIO) | NCCL Proxy | NVSHMEM IBGDA | NVSHMEM IBRC |
|---|---|---|---|---|
| Point-to-Point RTT (4–128 B, µs) | 16.7 | 18.0 | 24.3 | 16.0 |
| DeepEP Aggregate RDMA (GB/s) | ~84 (dispatch) | – | ~84 | – |
| 8-GPU E2E Latency (µs) | 40.6 | – | 41.4 | – |
| 2-Node E2E Latency (µs) | 142.5 | – | 157.0 | – |
Key findings:
- GDAKI's direct GPU→NIC path outperforms the CPU-assisted proxy by ~1–2 µs and host-initiated RDMA by ~7–8 µs for small messages.
- Bandwidth matches or exceeds the best-known figures at scale (~400 Gb/s per link), staying within 1–2% in all tested cases.
- DeepEP Mixture-of-Experts kernels on GDAKI see up to 9% lower latency in multi-node low-latency mode, with high-throughput mode matched exactly.
- CPU involvement is eliminated: zero CPU threads per communicator, freeing host cores for other use.
- Multi-node “combine” performance is within 1–3% of NVSHMEM.
5. Hardware and Software Prerequisites
The following requirements must be met for GDAKI operation:
- NIC: NVIDIA ConnectX-6 Dx or newer, or BlueField-2/3 DPU with GPUNetIO support.
- PCIe/NVLink Topology: GPU and NIC must share the same root complex, ensuring peer-to-peer BAR mappings and low-latency path.
- GPUs: Volta-class (sm_70+) or newer; CUDA toolkit 12.x or higher.
- Kernel Modules and Drivers: `nv_peer_mem` or `dmabuf` for GPUDirect RDMA, with an OFED/MOFED stack matching the NIC firmware (e.g., NVIDIA OFED 5.x).
- Software Stack: DOCA SDK with the appropriate `doca-gpunetio` library version aligned to the firmware.
This enables direct access and control of NIC hardware resources from CUDA devices.
6. Integration into NCCL and MoE Libraries
GDAKI is integrated into NCCL as a plugin backend:
- NCCL Core: At communicator initialization, the runtime detects GPUNetIO capability, loads the GDAKI plugin, allocates QPs/BARs, performs memory registration (for lkey/rkey and address handles), and distributes this context to all participating ranks.
- Memory Windows: Application code registers CUDA pointers with NCCL via `ncclCommWindowRegister`, yielding an `ncclWindow_t` handle backed by DOCA GPUNetIO.
- DeepEP Example: Replaces NVSHMEM one-sided routines with `ncclGin` methods, mapping pointer arithmetic to (window, offset) pairs and instantiating multiple communicators where required (e.g., 24 QPs per rank).
- CUDA Kernel Usage: After including `<nccl_device_api.h>`, device code calls `ncclGin` methods for put and signaling; host-side proxies or thread launches are not needed.
Integration is transparent for higher-level frameworks including PyTorch, JAX, and TensorRT-LLM, and typically requires only minor kernel changes.
7. Summary of Observed Benefits
GDAKI achieves the following experimentally validated benefits:
- Sub-17 µs round-trip latency for small (4–128 B) messages.
- Bandwidth scalability up to line-rate (≈400 Gb/s per link) across all tested collective modes.
- Elimination of host-side synchronization and context switching, resulting in zero CPU thread overhead for communication progress.
- Application-level performance (MoE on DeepEP) at or above that of competing approaches, and up to 9% latency reduction in multi-node scenarios with minimal code changes for adoption.
- Fully device-side controllable collective operations, enabling tight fusion of communication and computation within CUDA kernels (Hamidouche et al., 19 Nov 2025).
GDAKI, in conjunction with DOCA GPUNetIO and NCCL's collective runtime, represents a tightly integrated architecture for efficient one-sided RDMA communication in GPU-centric distributed AI workloads.