DOCA GPUNetIO: Direct GPU-to-NIC Networking
- DOCA GPUNetIO is a device-side networking API that enables direct, low-latency GPU-to-NIC RDMA transfers via InfiniBand and RoCE, bypassing CPU mediation.
- It is integrated into NCCL’s GPU-Initiated Networking architecture, allowing CUDA kernels to execute one-sided RDMA operations through inlined device calls and mapped NIC registers.
- The approach achieves low latency (~16.7 µs RTT for small messages) and high throughput, requiring an NVIDIA ConnectX-6 Dx or newer NIC and current GPU drivers for optimal performance.
DOCA GPUNetIO is a device-side networking API providing direct GPU-to-NIC communication for low-latency, fine-grained GPU-initiated data transfers over InfiniBand and RoCE. Integrated into the NVIDIA Collective Communications Library (NCCL) through its GPUDirect Async Kernel-Initiated (GDAKI) backend, it enables CUDA kernels to initiate one-sided RDMA operations without CPU mediation, allowing unprecedented overlap of computation and communication for modern AI workloads, particularly those employing Mixture-of-Experts (MoE) architectures. DOCA GPUNetIO is a component of NVIDIA's Data Center on a Chip Architecture (DOCA) software suite; it exposes networking primitives natively to the device by mapping network interface controller (NIC) resources, such as doorbell and queue memory regions, into the GPU's address space (Hamidouche et al., 19 Nov 2025).
1. Architectural Integration within NCCL GIN
DOCA GPUNetIO is a foundational element of the three-layer GPU-Initiated Networking (GIN) architecture in NCCL 2.28. The architecture comprises:
- NCCL Core (host): Responsible for communicator creation, collective memory window registration (via `ncclCommWindowRegister`), and exchange of GPUNetIO-related context (queue-pair numbers, packet sequence numbers, remote keys) among participating ranks. The host dynamically loads the GDAKI network plugin if DOCA GPUNetIO is available; a Proxy backend provides the fallback.
- Device GIN API (GPU): Exposed as the `ncclGin` class, this API includes device-side methods such as `put()`, `signal()`, `flush()`, and `waitSignal()`. Under GDAKI, these compile to inlined calls to DOCA GPUNetIO device verbs, allowing kernels to initiate communication directly from the device.
- GIN Network Plugin: The GDAKI plugin implements both host- and device-side interfaces using DOCA GPUNetIO. On the host, it creates GPUNetIO contexts, queue pairs (QPs), completion queues (CQs), and registers memory regions. On the GPU side, it provides kernel-initiated verbs, granting CUDA kernels access to NIC control registers and queue structures via memory-mapped BAR2 regions.
Typical data flow for a device-initiated put operation is as follows:
| Stage | Host Side | Device Side |
|---|---|---|
| Communicator Setup | `doca_gpunetio_create_context()` | - |
| Resource Registration | Send/receive queues and CQs registered, QP/CQ allocated | - |
| Window/Peer Exchange | Gathers/publishes rkeys, QPNs, PSNs | - |
| Put Invocation | - | Slot reserved via `atomicInc`; WQE populated and made visible using `__threadfence_system`; doorbell written; NIC DMA engine streams data; optional remote signal deposited |
This architecture removes host-side bottlenecks and enables direct pipeline construction between compute and communication on the device.
2. DOCA GPUNetIO API and Device Semantics
The GPUNetIO interface provides distinct APIs on the host and device sides, enabling end-to-end GPU-initiated networking:
- Host API (as invoked by the GDAKI plugin):
  - `doca_gpunetio_create_context(dpu_index, &ctx)`: Allocates and memory-maps NIC resources into the GPU address space.
  - `doca_gpunetio_register_mr(ctx, gpu_ptr, size, &mr)`: Registers GPU memory for RDMA, obtaining an rkey.
  - `doca_gpunetio_create_qp(ctx, peer_gid, &qp)`: Establishes RDMA QPs, returning QPN/PSN and peer information.
  - `doca_gpunetio_create_cq(ctx, depth, &cq)`: Creates a completion queue for tracking work completion.
- Device API (inlined in `ncclGin` methods):
  - `DOCA_GPI_PostSend(ctx, qp, wr_slot, wr_size, args, wrid)`: Posts a send WQE directly from the device.
  - `DOCA_GPI_DoorbellRing(ctx, qp, wqe_count)`: Rings the NIC doorbell to prompt a new WQE fetch.
The design ensures all operations are asynchronous from the device perspective; completions can be observed through device-side polling on CQEs, and remote notification is supported via inline immediates.
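To make the host-side sequence concrete, the following C sketch walks through the setup steps in the order the GDAKI plugin performs them. The type definitions and function bodies are hypothetical stand-ins (the real DOCA GPUNetIO structures are opaque and the real calls touch NIC hardware); only the call names, parameter shapes, and ordering are taken from the listing above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opaque-handle stand-ins for real DOCA GPUNetIO types. */
typedef struct { int dpu_index; } gpunetio_ctx_t;
typedef struct { uint32_t rkey; } gpunetio_mr_t;
typedef struct { uint32_t qpn, psn; } gpunetio_qp_t;
typedef struct { int depth; } gpunetio_cq_t;

int doca_gpunetio_create_context(int dpu_index, gpunetio_ctx_t *ctx) {
    ctx->dpu_index = dpu_index;   /* real code memory-maps NIC BAR2 for the GPU */
    return 0;
}
int doca_gpunetio_register_mr(gpunetio_ctx_t *ctx, void *gpu_ptr,
                              size_t size, gpunetio_mr_t *mr) {
    (void)ctx; (void)gpu_ptr; (void)size;
    mr->rkey = 0x1234;            /* real code returns the NIC-issued rkey */
    return 0;
}
int doca_gpunetio_create_qp(gpunetio_ctx_t *ctx, uint64_t peer_gid,
                            gpunetio_qp_t *qp) {
    (void)ctx; (void)peer_gid;
    qp->qpn = 7; qp->psn = 0;     /* real code exchanges QPN/PSN with the peer */
    return 0;
}
int doca_gpunetio_create_cq(gpunetio_ctx_t *ctx, int depth, gpunetio_cq_t *cq) {
    (void)ctx;
    cq->depth = depth;            /* real code maps a CQ ring into BAR2 */
    return 0;
}

/* The plugin performs these steps in order during communicator init:
 * context, memory region, queue pair, completion queue. */
int gdaki_setup(void *gpu_buf, size_t size, gpunetio_ctx_t *ctx,
                gpunetio_mr_t *mr, gpunetio_qp_t *qp, gpunetio_cq_t *cq) {
    if (doca_gpunetio_create_context(0, ctx)) return -1;
    if (doca_gpunetio_register_mr(ctx, gpu_buf, size, mr)) return -1;
    if (doca_gpunetio_create_qp(ctx, /*peer_gid=*/42, qp)) return -1;
    if (doca_gpunetio_create_cq(ctx, /*depth=*/128, cq)) return -1;
    return 0;
}
```

The rkeys, QPNs, and PSNs produced here are what NCCL Core subsequently gathers and publishes to the other ranks during window/peer exchange.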
Example pseudo-code outlining the device-side operation flows:
```cuda
__device__ void ncclGin::put(…) {
  // Reserve a WQE slot; atomicInc wraps at the queue size
  uint32_t slot = atomicInc(&ctx->send_head, ctx->send_queue_size);
  void* wqe_ptr = ctx->bar2_base + slot * WQEBB_SIZE;
  WQE* wqe = (WQE*)wqe_ptr;
  wqe->opcode = DOCA_OP_RDMA_WRITE;
  wqe->src_address = (uint64_t)srcWindow.ptr + srcOffset;
  wqe->dst_address = remote_peers[peer].base + dstOffset;
  wqe->length = bytes;
  wqe->rkey = remote_peers[peer].rkey;
  if (signalAction) {
    wqe->flags |= DOCA_SEND_FLAG_INLINE_IMM;
    wqe->imm_data = signalId;
  }
  __threadfence_system();  // make the WQE visible to the NIC before ringing
  DOCA_GPI_PostSend(ctx, qp, wqe_ptr, WQEBB_SIZE, &wqe->args, &wrid);
  DOCA_GPI_DoorbellRing(ctx, qp, 1);
}
```
3. Memory, Work Queue, and Completion Management
Memory and queue management in DOCA GPUNetIO/NCCL GIN is governed by strict, device-oriented conventions:
- Memory Windows: Upon `ncclCommWindowRegister`, symmetric RDMA regions are allocated and registered via `doca_gpunetio_register_mr`, yielding local handles and rkeys. The (base address, size, rkey) tuple for every participating rank is distributed such that device kernels can address remote memory as $\text{base}_r + \text{offset}$ with the corresponding $\text{rkey}_r$.
- Work Queue Rings: Each device context contains a circular buffer of WQE slots in BAR2. The next slot is reserved using atomic increment operations on a 32-bit head pointer in GPU DRAM; wrapping is modulo the queue depth. Each slot is 64 B, enforcing strict disjoint slot ownership for lock-free concurrency.
- Completion Queues: CQ rings (32 B per CQE) are likewise mapped in BAR2. Device code polls on CQEs, matching `wrid` values to detect local completion; remote completion is observed via polling a remote signal buffer.
- Concurrency: Atomic slot reservation and `__threadfence_system` provide ordering and consistency; multiple device threads can safely post WQEs in parallel as slot usage is non-overlapping.
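As a host-side analogue of this slot-reservation scheme, the C snippet below reserves disjoint ring slots with an atomic counter. CUDA's `atomicInc` wraps in hardware; here a plain fetch-and-add plus a power-of-two modulus gives the same disjoint-ownership property. Names and the depth value are illustrative, not taken from NCCL.

```c
#include <stdatomic.h>
#include <stdint.h>

/* QDEPTH is a power of two so that the 32-bit counter's natural overflow
 * keeps the modulus consistent across wraparound. */
#define QDEPTH 256u

static _Atomic uint32_t send_head = 0;

/* Returns the index of a 64 B WQE slot owned exclusively by the caller. */
uint32_t reserve_slot(void) {
    uint32_t ticket = atomic_fetch_add_explicit(&send_head, 1,
                                                memory_order_relaxed);
    return ticket % QDEPTH;
}
```

Each concurrent caller obtains a distinct ticket, so slot ownership is disjoint without locks, mirroring how multiple GPU threads may post WQEs in parallel.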
4. Measured Performance and Theoretical Bounds
The latency of a device-initiated put is modeled as:

$$T_{\text{put}} = T_{\text{doorbell}} + T_{\text{fetch}} + \frac{S}{BW_{\text{link}}}$$

where:
- $T_{\text{doorbell}}$: PCIe write to NIC BAR
- $T_{\text{fetch}}$: NIC WQE fetch and RDMA setup
- $BW_{\text{link}}$: link bandwidth (e.g., 200 GB/s)
- $S$: message size in bytes
Measured ping-pong round-trip time (RTT) for small messages (4–128 B) is approximately 16.7 µs, in agreement with benchmark data [(Hamidouche et al., 19 Nov 2025), Fig. 12]. For multi-megabyte messages, bandwidth reaches within 5% of the raw NIC aggregate (approximately 400 GB/s per link).
Effective bandwidth for a message of size $S$ can be expressed as:

$$BW_{\text{eff}}(S) = \frac{S}{T_{\text{doorbell}} + T_{\text{fetch}} + S / BW_{\text{link}}}$$
Proxy fallback incurs an additional ~1–2 μs of latency, yielding RTTs of 18.0 μs for similar message sizes under the Proxy backend (Hamidouche et al., 19 Nov 2025).
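To make the model concrete, a small C sketch evaluates the put-latency and effective-bandwidth formulas. The overhead constants below are assumed placeholder values chosen for illustration; only the 200 GB/s link figure comes from the text.

```c
/* Placeholder overheads (assumed, not measured values from the paper). */
static const double T_DOORBELL_US = 1.0;   /* PCIe write to NIC BAR */
static const double T_FETCH_US    = 1.5;   /* NIC WQE fetch and RDMA setup */
static const double BW_LINK_GBPS  = 200.0; /* link bandwidth from the text */

/* One-way put latency in microseconds; 1 GB/s = 1e3 bytes per microsecond. */
double t_put_us(double bytes) {
    return T_DOORBELL_US + T_FETCH_US + bytes / (BW_LINK_GBPS * 1e3);
}

/* Effective bandwidth in bytes per microsecond (divide by 1e3 for GB/s). */
double bw_eff(double bytes) {
    return bytes / t_put_us(bytes);
}
```

With these assumed constants, a 64 MB message yields roughly 198 GB/s, i.e., within a few percent of the 200 GB/s link rate, matching the large-message saturation behavior described above, while tiny messages are dominated by the fixed doorbell and fetch overheads.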
5. Operational Context within the NCCL Runtime
DOCA GPUNetIO is engaged by the NCCL runtime during communicator initialization. The process is as follows:
- Capability Detection: NCCL probes for DOCA GPUNetIO using GDAKI plugin routines.
- Context Setup: Upon availability, the plugin constructs contexts and QPs, and registers device memory regions as described earlier.
- Window Registration: Execution of `ncclCommWindowRegister` results in RDMA buffers being registered.
- Handle Distribution: A device communicator handle is eventually returned to the application.
Within collective operations (e.g., device API-initiated ncclAllReduce), the expected sequence of put, flush, and signal calls by device kernels maps directly to asynchronous DOCA GPUNetIO verbs. This eliminates the need for host-initiated RDMA verbs or synchronization.
If DOCA GPUNetIO support is not available (due to unsupported hardware or user request with NCCL_GIN_BACKEND=Proxy), the runtime falls back to a host-proxy model. This enqueues communication descriptors on lock-free GPU→CPU queues, with host threads orchestrating standard verbs-based RDMA operations. While maintaining semantics, it incurs measurable latency penalties.
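The kind of lock-free GPU→CPU descriptor queue the Proxy fallback relies on can be sketched as a single-producer/single-consumer ring in C. Field names, the descriptor layout, and the depth are hypothetical; in the real backend the ring lives in host-visible memory written by the GPU and drained by a CPU proxy thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PROXY_QDEPTH 64u

/* Hypothetical put descriptor handed from GPU producer to CPU consumer. */
typedef struct {
    uint64_t remote_addr;
    uint32_t rkey;
    uint32_t bytes;
} put_desc_t;

typedef struct {
    put_desc_t slots[PROXY_QDEPTH];
    _Atomic uint32_t head;  /* advanced by the producer (GPU side) */
    _Atomic uint32_t tail;  /* advanced by the consumer (CPU proxy) */
} proxy_queue_t;

bool proxy_enqueue(proxy_queue_t *q, put_desc_t d) {
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == PROXY_QDEPTH) return false;      /* ring full */
    q->slots[h % PROXY_QDEPTH] = d;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

bool proxy_dequeue(proxy_queue_t *q, put_desc_t *out) {
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t == h) return false;                     /* ring empty */
    *out = q->slots[t % PROXY_QDEPTH];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}
```

The acquire/release pairing makes each descriptor's contents visible to the consumer before the head advance is observed, which is the property the proxy thread depends on when it converts drained descriptors into standard verbs-based RDMA operations.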
For multi-GPU nodes, NCCL automatically selects among GIN modes: intra-node traffic employs LSA (load/store) or Multimem backends over NVLink/PCIe, while inter-node traffic dispatches to GDAKI or Proxy as appropriate.
6. Hardware, Deployment, and Practicalities
Deployment of the GDAKI backend with DOCA GPUNetIO mandates an NVIDIA ConnectX-6 Dx or newer NIC, CUDA 12.2 or later, and a GPUDirect RDMA-compatible system configuration (including the correct kernel modules and PCIe topology). NIC and GPU must share a PCIe root complex for maximal efficiency and BAR2 mapping.
Key practical benefits include:
- Minimal Latency: Achieves small-message RTTs of approximately 16.7 μs, below the Proxy backend's ~18 μs.
- Zero CPU Involvement: All communication flows directly from device logic, ideal for MoE primitives, JAX/Triton fusions, and tightly coupled computation/communication loops.
- One-sided RDMA with Notification: Native support for inline immediate data attached to WQEs enables remote signaling patterns.
Limitations include the requirement for modern NICs and up-to-date driver stacks, constraints on WQE ring sizing (ensuring QDEPTH is not overcommitted), and the necessity for device-side tooling for debugging due to absence of a host-side proxy.
Recommended deployment is in production HPC and AI clusters equipped with modern ConnectX hardware; developmental or legacy systems can utilize the Proxy backend. Tuning of queue depth and context count per kernel stream is essential for maximizing throughput and concurrency.
In summary, DOCA GPUNetIO enables CUDA kernels to directly manage InfiniBand/RoCE transfers within NCCL GIN, supporting asynchronous, one-sided RDMA with remote notification and full integration into NCCL's topology- and collective-aware runtime. This underpins low-latency, device-driven communication essential for modern distributed AI workloads (Hamidouche et al., 19 Nov 2025).