
100 GbE RDMA: Protocols, Hardware & Performance

Updated 2 February 2026
  • 100 GbE RDMA is a network transport technology offering direct, one-sided memory transfers with sub-microsecond latencies and 100 Gbps data rates.
  • It employs diverse protocol variants such as RoCEv2, CXLoE, and EDM, integrating hardware offloads and custom datapaths to minimize processing delays.
  • Its system integration supports applications from data center memory disaggregation to ultrafast data acquisition, delivering measurable performance improvements.

100 Gigabit Ethernet Remote Direct Memory Access (100 GbE RDMA) is a high-throughput, low-latency network transport technology that enables direct, one-sided data movement between host memories over 100 Gbps Ethernet fabrics, bypassing operating system and CPU intervention. RDMA is pivotal for disaggregated memory architectures, high-performance data center fabrics, and ultrafast data acquisition systems. At 100 GbE speeds, RDMA implementations are distinguished by their protocol stack (e.g., RoCEv2, custom UDP/IP, CXL-over-Ethernet, PHY-level offload), hardware datapaths, and congestion and reliability mechanisms. The following sections provide a technical synthesis of 100 GbE RDMA across protocols, architecture, performance, system integration, challenges, and future research directions.

1. Protocol Architectures and Variants

100 GbE RDMA deploys several protocol variants, each with different layering, offload, and interoperability characteristics. The canonical standard is RoCEv2 (RDMA over Converged Ethernet v2), which encapsulates InfiniBand semantics in UDP/IP over Ethernet, leveraging "verbs" (work requests), queue pairs (QPs), and hardware offload for packetization, segmentation, and flow control (Hoefler et al., 2023). RoCE BALBOA and RASHPA-RDMA are open-source FPGA implementations tailored for SmartNICs and data acquisition, respectively; BALBOA demonstrates custom SmartNIC offloads, while RASHPA trims headers for leaner processing (Heer et al., 27 Jul 2025, Mansour et al., 2018).
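The verbs/QP abstraction described above can be sketched as a toy model. This is illustrative Python, not a real ibverbs binding; all class and field names here are hypothetical stand-ins for the concepts (work requests, send queue, completion queue):

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class WorkRequest:
    """One-sided RDMA_WRITE work request (illustrative fields only)."""
    opcode: str          # e.g. "RDMA_WRITE"
    local_addr: int      # source buffer address
    remote_addr: int     # destination address on the peer
    rkey: int            # remote memory key granting access
    length: int          # bytes to transfer

@dataclass
class QueuePair:
    """Toy model of a RoCEv2 queue pair: software posts work, hardware consumes it."""
    send_queue: deque = field(default_factory=deque)
    completion_queue: deque = field(default_factory=deque)

    def post_send(self, wr: WorkRequest) -> None:
        # Software side: enqueue only; the NIC would DMA-read the WQE and
        # segment it into UDP/IP packets without further CPU involvement.
        self.send_queue.append(wr)

    def poll_hw(self) -> None:
        # Stand-in for the NIC datapath: drain the SQ, signal completion.
        while self.send_queue:
            wr = self.send_queue.popleft()
            self.completion_queue.append(("OK", wr.opcode, wr.length))

qp = QueuePair()
qp.post_send(WorkRequest("RDMA_WRITE", 0x1000, 0x2000, rkey=0x42, length=64))
qp.poll_hw()
print(qp.completion_queue[0])  # ('OK', 'RDMA_WRITE', 64)
```

The one-sided nature of the transfer is what the model captures: the remote CPU never appears; only the remote address and `rkey` do.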

Emerging alternatives optimize for sub-microsecond latency and bypass conventional MAC layer processing. EDM moves memory-access protocol logic entirely into the Ethernet PHY (Physical Coding Sublayer, 66-bit blocks), eliminating the MAC, standard frames, and even Layer 2 switching for remote memory channels. This enables preemptive, fine-grained message onset and in-frame memory block insertion, drastically reducing serialization, gap, and scheduling overheads (Su et al., 2024).

CXL-over-Ethernet (CXLoE) encapsulates Compute Express Link native load/store flits inside custom Ethernet frames. This approach bypasses all RDMA user-visible verbs and exposes remote memory as native NUMA over Ethernet, achieving protocol-level transparency for applications (Wang et al., 2023). In ultrafast imaging, RoCEv2 in RC (reliable connection) mode is integrated directly with FPGA and MPSoC for wire-rate acquisition and streaming to host RAM (Villani et al., 26 Jan 2026).

2. Hardware and Datapath Implementation

At 100 GbE line rates, RDMA implementations impose high demands on hardware datapaths. Commercial NICs (e.g., Mellanox ConnectX series) deliver ASIC-level performance, but FPGA-based solutions are widely used for customization and SmartNIC integration (Heer et al., 27 Jul 2025, Mansour et al., 2018).

Typical hardware datapaths employ:

  • High-throughput MAC/PHY IP: 100 GbE CMACs clocked at ≥250 MHz and 512-bit AXI buses provide raw >100 Gbps link capacity (Heer et al., 27 Jul 2025, Mansour et al., 2018).
  • Custom or standard packet parsers: Full InfiniBand BTH decode, state-table lookup, and header manipulation in RoCE, or minimal headers in direct UDP/IP or CXLoE.
  • DMA engines: For PCIe/AXI interface to host or DRAM, with scatter/gather support and zero-copy semantics.
  • Credit and flow control: Distributed per-QP counters for work request credits, linked to ACK/NACK generation.
  • Application-specific blocks: For in-network encryption (AES), deep packet inspection (ML-DPI), or preprocessing (see Table below).
| Stack Variant | Hardware Blocks | Offloads | Datapath Latency |
|---------------|-----------------|----------|------------------|
| RoCEv2/BALBOA | CMAC, BTH/RETH/ICRC, DMA, QP tables | Credit, congestion, AES, ML-DPI | Logic ~120 ns, MAC ~200 ns, end-to-end ~2–3.5 µs for 64 B |
| EDM | PCS 66-bit block engine, PIM scheduler | Match/grant, virtual circuits | Entire host-to-host RTT ~299 ns |
| CXLoE | CXL.mem agent, AXI cache, address mapping | Congestion, retry/ack, cache | 415 ns (cache hit), 1.97 µs (remote) |
| RASHPA-RDMA | AXI4-DMA, UDP/IP, light header | None (software retransmit only) | 0.3–0.5 µs (FPGA–FPGA, excluding host) |

Custom protocol stacks leverage BRAM for per-QP or block buffer allocation (e.g., RoCE BALBOA: up to 500 QPs, 512 KB/QP in HBM (Heer et al., 27 Jul 2025)), and integrate offloads for encryption, ML, or direct GPU DMA with negligible extra latency.
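The aggregate HBM budget implied by the quoted BALBOA figures (500 QPs at 512 KB each) follows directly, as a sanity check:

```python
# Back-of-envelope HBM budget for per-QP buffering as reported for
# RoCE BALBOA: up to 500 QPs with 512 KiB of buffer each.
QPS = 500
BUF_PER_QP = 512 * 1024          # 512 KiB in bytes

total_bytes = QPS * BUF_PER_QP
print(f"{total_bytes / 2**20:.0f} MiB of HBM")  # 250 MiB
```

A quarter-gigabyte of on-package HBM for buffering alone illustrates why per-QP state is a first-order scaling concern (see Section 6).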

3. Performance Models and Quantitative Results

100 GbE RDMA is evaluated using metrics of round-trip latency (RTT), sustainable bandwidth, headroom requirements, and resource utilization. The following specific measurements are reported:

  • EDM (PHY-level RDMA): 64 B remote reads/writes at 299 ns RTT (host-to-host, FPGA testbed), with line-rate utilization for small (8 B) RREQs and 1 KB RRES (Su et al., 2024).
  • CXLoE: 64 B remote load at 1.97 µs RTT (uncached), reduced to 415 ns for on-FPGA cache hits. This is ~37% lower than industry RDMA baselines (~3.14 µs) and within 10% of bare one-sided RDMA latencies (Wang et al., 2023).
  • RoCE BALBOA (SmartNIC): 64 B RDMA write 1-way latency of 2.8 µs (FPGA–FPGA), 2.0 µs (ASIC–ASIC). Throughput saturates the wire at 100 Gb/s for 32 KiB messages (Heer et al., 27 Jul 2025).
  • RASHPA-RDMA: 90 Gbps (1 KB packets), 95 Gbps (≥32 KB), pipeline latency 250–400 ns, outperforming RoCEv2 by up to 70% for small packets (598 B) (Mansour et al., 2018).
  • Ultrafast Imaging Use Case: Sustained, continuous streaming at 95.6 Gb/s for 256-channel acquisition with RoCEv2, batch transfers (8 × 1 MiB in 671 µs), <±0.1 Gb/s throughput variation (Villani et al., 26 Jan 2026).
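A quick calculation shows why these RTTs are dominated by datapath and scheduling delay rather than serialization. This back-of-envelope sketch is ours, not taken from the papers:

```python
def serialization_ns(payload_bytes: int, rate_gbps: float = 100.0) -> float:
    """Time to clock `payload_bytes` onto the wire at `rate_gbps`."""
    return payload_bytes * 8 / rate_gbps  # bits / (Gbit/s) == ns

wire = serialization_ns(64)               # 5.12 ns for a 64 B message
print(f"64 B wire time: {wire:.2f} ns")
# Even EDM's 299 ns RTT is dominated by datapath and scheduling delay,
# not serialization: wire time is under 2% of the RTT.
print(f"share of EDM RTT: {wire / 299:.1%}")
```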

Analytical models decompose end-to-end RTT into host, network, and FPGA path delays; e.g.,

$T_{\text{RTT}} = T_{\text{host}} + T_{\text{nw}} + T_{\text{fpga}}$

where each term models serialization, pipe delays, cache lookup, and DRAM cycle times (Wang et al., 2023, Su et al., 2024).
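The decomposition can be expressed directly. The component values below are a hypothetical split that happens to sum to EDM's reported 299 ns total; the papers report the total, not this breakdown:

```python
def t_rtt(t_host_ns: float, t_nw_ns: float, t_fpga_ns: float) -> float:
    """End-to-end RTT as the sum of host, network, and FPGA path delays."""
    return t_host_ns + t_nw_ns + t_fpga_ns

# Illustrative split only (assumed numbers, not measured figures):
print(t_rtt(t_host_ns=120, t_nw_ns=60, t_fpga_ns=119))  # 299
```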

4. Flow Control, Congestion Management, and Reliability

Traditional RoCEv2 employs per-QP credit flow control, queue-pair state machines (RESET–INIT–RTR–RTS), and PFC (802.1Qbb) for lossless priorities. Credits limit in-flight data to avoid buffer overflow, while PAUSE frames (upon threshold) force upstream quenching (Hoefler et al., 2023).

Go-back-N retransmission is triggered by NACKs on Packet Sequence Number (PSN) gaps, which can severely penalize incast and shallow-buffer topologies. ECC and FEC models show increasing vulnerability as link BERs rise with higher bit rates (PAM4, 50 G/lane) (Hoefler et al., 2023). Synchronous per-QP ACK-clocking and DCQCN/TIMELY congestion control are offloaded into hardware. Selective retransmit and IRN-style designs eliminate PFC by employing large replay windows and SACK logic (Hoefler et al., 2023).
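Go-back-N's penalty can be illustrated with a toy sender model. This is a simplified sketch in which frames are identified by their PSN; real NICs track PSNs, ACK coalescing, and timers in hardware:

```python
def go_back_n(frames, lost_psns, window=4):
    """Toy go-back-N sender: on a PSN gap (NACK), rewind to the lost
    frame and resend everything from that point, as RoCEv2 hardware does.
    `frames` is a list of PSNs equal to their indices."""
    sent, i, lost = [], 0, set(lost_psns)
    while i < len(frames):
        burst = frames[i:i + window]
        sent.extend(burst)                  # transmit the window
        dropped = [p for p in burst if p in lost]
        if dropped:
            i = dropped[0]                  # NACK: rewind to first gap
            lost -= set(dropped)            # assume the retransmit succeeds
        else:
            i += len(burst)                 # cumulative ACK advances

    return sent

# Losing PSN 2 rewinds the sender: PSN 3 is retransmitted even though
# it already arrived, wasting bandwidth.
print(go_back_n(list(range(6)), lost_psns=[2]))  # [0, 1, 2, 3, 2, 3, 4, 5]
```

This redundant retransmission of delivered frames is exactly the inefficiency that selective-retransmit and IRN-style designs avoid.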

EDM eliminates L2/MAC queueing by implementing a PIM-based centralized scheduler in the switch PHY, configured for maximal matching per scheduling round, with per-port, per-destination logical queues. Scheduling latency is bounded by $T_{\text{sched}} = 3\log N / R_{\text{clk}}$ (e.g., $N = 512$, $R_{\text{clk}} = 3$ GHz gives a 9 ns scheduling decision) (Su et al., 2024). No PFC, TCP, or DCQCN is needed.
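The scheduler bound evaluates as follows (the factor of 3 is taken from the formula as quoted; one phase per clock cycle is assumed):

```python
import math

def pim_sched_ns(n_ports: int, clk_ghz: float) -> float:
    """T_sched = 3 * log2(N) / R_clk: log2(N) PIM rounds of three
    phases each, at one phase per clock (cycles / GHz == ns)."""
    return 3 * math.log2(n_ports) / clk_ghz

print(pim_sched_ns(512, 3.0))  # 9.0 ns, matching the EDM figure
```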

Loss detection in custom stacks may use sequence-numbered registers (e.g., RASHPA-RDMA's 1024-bit register; Figure 1 in (Mansour et al., 2018)): on gap detection, retransmission is handled by software rather than hardware.
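A minimal sketch of bitmap-based gap detection follows. It is illustrative only; the actual register semantics and wraparound handling in RASHPA may differ:

```python
def detect_gaps(received_psns, window_bits=1024):
    """Mark arrivals in a fixed-size bitmap (RASHPA-style 1024-bit
    register) and report missing sequence numbers below the highest
    PSN seen, leaving retransmission to software."""
    bitmap = 0
    for psn in received_psns:
        bitmap |= 1 << (psn % window_bits)
    highest = max(received_psns)
    return [p for p in range(highest) if not (bitmap >> p) & 1]

print(detect_gaps([0, 1, 2, 4, 5, 7]))  # [3, 6]
```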

5. System Integration and Use Cases

100 GbE RDMA's integration model spans host CPUs, PCIe, FPGAs, and SmartNICs. RoCEv2 is natively supported on commodity NICs and Linux/RDMA stacks. CXL-over-Ethernet and EDM require custom FPGA/NIC logic but maintain transparent memory expansion—e.g., in CXLoE, the host system detects disaggregated memory as a NUMA node, with no software or application changes (Wang et al., 2023).

Use cases extend beyond data center memory disaggregation to data acquisition and streaming:

  • Medical Imaging: FPGA–host systems bypass local buffering, employing direct RDMA writes for sustained, unbuffered streaming, supporting >256 ADC channels at wire rate (Villani et al., 26 Jan 2026).
  • Accelerated ML/Data Preprocessing: 100 GbE SmartNICs with embedded ML-DPI and GPU DMA enable on-the-fly data filtering, transformation, and direct-to-GPU transfer, reducing host-side pipeline latency by 20–135 µs per batch (Heer et al., 27 Jul 2025).
  • Memory Pooling and Disaggregation: CXLoE and EDM enable rack-scale or cross-rack expansion, leveraging pure Ethernet switching and eliminating the need for application refactoring (Wang et al., 2023, Su et al., 2024).

6. Scalability and Engineering Trade-offs

Scaling 100 GbE RDMA faces state, buffer, and header-overhead limitations. The RoCEv2 header (~66 B) wastes substantial bandwidth on small payloads; EDM's 66-bit block protocol and frame interleaving mitigate this by supporting sub-64 B granularity and intra-frame preemption (Su et al., 2024). RASHPA-RDMA uses a lean UDP encapsulation with an 8 B header to minimize processing overhead (Mansour et al., 2018).
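The header-overhead penalty is easy to quantify using the ~66 B figure above (Ethernet preamble and inter-frame gap are ignored for simplicity):

```python
def goodput_efficiency(payload_bytes: int, header_bytes: int = 66) -> float:
    """Fraction of wire bytes carrying payload, using the ~66 B
    RoCEv2 header figure (preamble/IFG ignored)."""
    return payload_bytes / (payload_bytes + header_bytes)

for size in (64, 1024, 32 * 1024):
    print(f"{size:>6} B payload: {goodput_efficiency(size):.1%}")
# 64 B payloads waste roughly half the wire; 32 KiB payloads are ~99.8% efficient.
```

This is why the 64 B microbenchmarks above report latency while the throughput results all use 32 KiB-class messages.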

QP context storage, flow control tables, buffer management, and retransmit logic must scale to thousands of flows in hyperscale settings (Hoefler et al., 2023, Heer et al., 27 Jul 2025). For instance, per-QP HBM usage in BALBOA is proportional to outstanding WQE/buffer size. In high-radix deployments, headroom per switch priority is sizable: $H = B \times RTT + MTU$ yields ≈0.46 MB per priority for 100 GbE in a 3-tier fat tree (Hoefler et al., 2023).
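The headroom formula can be evaluated directly. The ~37 µs round-trip time below is inferred so as to reproduce the quoted 0.46 MB figure; it is an assumption, not a number stated in the text:

```python
def headroom_bytes(rate_gbps: float, rtt_us: float, mtu: int = 1500) -> float:
    """H = B * RTT + MTU: buffer needed per switch priority to absorb
    in-flight data after a PAUSE frame is sent."""
    return rate_gbps / 8 * 1e3 * rtt_us + mtu  # Gb/s -> bytes/us

# ~37 us worst-case RTT is assumed here to match the quoted 0.46 MB:
print(f"{headroom_bytes(100, 37) / 1e6:.2f} MB")  # 0.46 MB
```

Multiplied across ports and priorities, this headroom quickly dominates shallow switch buffers, which is the core PFC scaling concern.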

Congestion control extensions—TIMELY, HPCC, per-flow backpressure, header compression—address tail-latency and incast pathologies (Hoefler et al., 2023). Security and isolation levels in standard RoCE require per-QP IPsec or sRDMA-type enhancements; CXLoE and EDM currently rely on physical security, with future research needed for per-NUMA or address-range isolation.

7. Limitations, Open Challenges, and Future Directions

RoCE on 100 GbE exposes scaling bottlenecks for buffering, congestion, header overhead, and multi-tenancy. PFC-induced head-of-line blocking, go-back-N's inefficiency, and lack of in-NIC programmable logic prompt the development of more advanced or streamlined RDMA variants (Hoefler et al., 2023). The EDM and CXL-over-Ethernet architectures demonstrate feasibility for >10× lower latency (≈300 ns) by eliminating or repurposing lower network layers and deploying advanced scheduling logic (Su et al., 2024, Wang et al., 2023).

Adoption of PHY-level integration (EDM), native load/store encapsulation (CXLoE), and dedicated FPGA SmartNICs (BALBOA, RASHPA) will depend on standardization, cross-vendor interoperability, and deployment cost. Current research fronts involve:

  • Programmable in-network processing and SmartNIC APIs for line-rate security, compression, or ML operators (Heer et al., 27 Jul 2025).
  • Fine-grained congestion and fairness solutions without PFC, leveraging per-flow backpressure and hardware ARQ+FEC (Hoefler et al., 2023).
  • Bufferless, zero-queue RDMA architectures at scale, including PIM-based crossbar switch schedulers (Su et al., 2024).
  • Integration with new memory technologies (CXL, CXL.mem) and direct NUMA-like exposure (Wang et al., 2023).
  • Wire-rate, ultra-low-latency streaming for scientific and medical acquisition systems (Villani et al., 26 Jan 2026).

A plausible implication is that the evolution of 100 GbE RDMA will increasingly decouple protocol semantics from legacy transport stacks, with a strong shift toward programmable, hardware-level logic for congestion, scheduling, and virtualization. This trajectory is supported by empirical demonstrations of FPGA/ASIC deployment with 100+ port scalability, near-line-rate throughput, and stable low tail-latency under all-to-all load (Su et al., 2024).
