RecoNIC Platform: FPGA-based RDMA Accelerator
- RecoNIC is an FPGA-based SmartNIC that integrates a RoCEv2-compliant RDMA engine with programmable compute blocks to accelerate network-attached computation.
- It employs a hybrid hardware/software architecture, featuring dual 100 GbE MACs, DMA infrastructure, and crossbar interconnects to minimize CPU overhead and data copies.
- RecoNIC supports multiple programming models including RTL, HLS, and P4, enabling diverse applications such as ML inference, graph analytics, and in-flight packet processing.
RecoNIC is an FPGA-based SmartNIC platform developed for high-throughput, low-latency compute acceleration in data center environments. It integrates a full RoCEv2-compliant RDMA offload engine with programmable compute blocks, enabling data processing near the network endpoint while minimizing CPU involvement and data copy overheads. The platform provides tightly coupled hardware and software components, supports direct access to both host and device memory, and enables the implementation of network-attached accelerators using RTL, HLS, or P4 programming abstractions. RecoNIC is open-sourced to foster experimentation with RDMA-centric compute models across a range of distributed systems workloads (Zhong et al., 2023).
1. Hardware and Software Architecture
RecoNIC’s hardware is centered around an FPGA hosting two 100 GbE MACs, an ERNIC IP RDMA engine, DMA infrastructure (QDMA), programmable compute blocks, and 16 GB of on-board DDR4. The architecture routes network packets through packet classification logic, directing traffic to the RDMA engine or DMA paths.
Programmable compute blocks are realized as:
- Lookaside Compute (LC): Provides AXI4-Lite for control and AXI4 for memory-mapped data access.
- Streaming Compute (SC): Connects via AXI4-Stream for high-throughput data processing.
All masters (RDMA, DMA, compute blocks) connect through crossbars (mem_crossbar, sys_crossbar) to both the on-board DDR4 and the PCIe host interface (QDMA). Host software comprises an onic-driver (non-RDMA), reconic-mm (device memory), and libreconic APIs for direct and RDMA-based interactions.
2. RDMA Offload Engine Design and Operation
The embedded RDMA engine, based on AMD’s ERNIC IP, implements RoCEv2 and fully offloads transport-layer RDMA verbs. It interfaces directly with host CPUs (over PCIe) and with on-FPGA compute blocks, supporting the following verbs:
- RDMA Read/Write
- Send
- Write with Immediate Data
- Send with Immediate Data
- Send with Invalidate
Queue Pairs (QPs): Each QP comprises a Send Queue (SQ), Receive Queue (RQ), and Completion Queue (CQ). Work Queue Elements (WQEs) reside in either host or device memory. Communication involves writing to doorbell registers (to initiate TX) and polling CQ consumer indices for completion.
Typical Data Path for RDMA Read:
- A Read WQE is posted to SQ and the corresponding doorbell is triggered.
- ERNIC fetches WQE, constructs and sends the RoCEv2 Read Request packet.
- Read Response is received, and the ERNIC writes payloads to local memory.
- Completion is posted to CQ for notification.
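The host-side control sequence above can be sketched in Python with mock structures; the register names, opcode value, and helper functions are illustrative stand-ins, not the actual ERNIC/libreconic interface:

```python
# Hypothetical sketch of the host-side RDMA Read flow described above.
# Register offsets, the WQE layout, and the opcode value are illustrative.

OPCODE_READ = 0x04  # illustrative opcode, not the real ERNIC encoding

def post_rdma_read(sq, doorbell, cq, remote_addr, local_addr, length):
    """Post a Read WQE, ring the doorbell, then poll the CQ for completion."""
    wqe = {"opcode": OPCODE_READ,
           "remote_addr": remote_addr,
           "local_addr": local_addr,
           "length": length}
    sq.append(wqe)                       # 1. Read WQE posted to the SQ
    doorbell["sq_producer"] = len(sq)    # 2. doorbell write triggers the fetch
    # 3-4. Hardware sends the RoCEv2 Read Request, writes the response payload
    #      to local memory, and posts a completion; the host polls the CQ.
    while doorbell["cq_consumer"] >= len(cq):
        mock_hardware_step(cq, doorbell)  # stand-in for the NIC making progress
    return cq[doorbell["cq_consumer"]]

def mock_hardware_step(cq, doorbell):
    """Mock of ERNIC consuming outstanding WQEs and posting completions."""
    if len(cq) < doorbell["sq_producer"]:
        cq.append({"status": "OK", "wqe_index": len(cq)})
```

The real flow differs in that the doorbell is a memory-mapped register write over PCIe and completion polling reads a consumer index the NIC advances, but the ordering of the four steps is the same.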
Throughput optimization includes batching WQEs—amortizing PCIe transactions—and deep pipelining. Initial WQE fetch incurs roughly 680 ns latency, with subsequent WQEs served every ∼40 ns. Throughput for batch-read requests at 32 KB reaches near line-rate at ~92 Gb/s, while small message latency for reads is approximately 400 ns.
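Using the figures above (≈680 ns for the initial WQE fetch, ≈40 ns per pipelined follow-up), the amortized fetch cost of a batch can be estimated with a one-line model:

```python
def amortized_wqe_latency_ns(batch_size, first_ns=680, subsequent_ns=40):
    """Average WQE fetch latency when the first fetch costs ~680 ns and each
    pipelined follow-up costs ~40 ns (figures quoted in the text)."""
    if batch_size < 1:
        raise ValueError("batch size must be >= 1")
    return (first_ns + (batch_size - 1) * subsequent_ns) / batch_size

# A single WQE pays the full initial latency; a batch of 50 approaches 40 ns.
print(amortized_wqe_latency_ns(1))   # 680.0
print(amortized_wqe_latency_ns(50))  # 52.8
```

This is why batching amortizes the PCIe round trip: the 680 ns setup cost is paid once per batch rather than once per work request.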
3. Memory Access Model
RecoNIC supports fine-grained isolation and redirection of memory operations:
- Address Translation & Protection: A 12-bit MSB mask distinguishes device addresses from host addresses. The crossbar arbitrates memory transactions from all masters, routing AXI4-MM requests appropriately.
- Queue Pair (QP), buffer, and register placement: These can be instantiated in either host or device memory, regulated by address mapping logic.
- Compute Block Access Patterns:
- LC kernels issue AXI4-MM transactions; data that the RDMA engine fetches from (or writes to) remote memory is staged in local host or device memory, which the kernel then accesses through the crossbar.
- SC kernels process streaming packet payloads in-line, directly on the AXI4-Stream path between the network and the compute pipeline.
This architecture enables zero-copy data transfer and flexible placement of control/data structures across memory hierarchies.
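The MSB-based steering described above can be illustrated with a small sketch; the 64-bit address width and the particular bit pattern used to mark device addresses are assumptions, since the source specifies only that a 12-bit MSB mask distinguishes the two address spaces:

```python
# Illustrative sketch of MSB-based address steering between on-board DDR4 and
# host memory. The 64-bit width and the DEVICE_MSB value are assumptions.

ADDR_BITS = 64
MASK_BITS = 12
DEVICE_MSB = 0xFFF  # hypothetical pattern marking on-board DDR4 addresses

def route(addr):
    """Return which memory an AXI4-MM transaction should be steered to."""
    msb = addr >> (ADDR_BITS - MASK_BITS)  # top 12 bits select the target
    return "device_ddr4" if msb == DEVICE_MSB else "host_via_qdma"

print(route(0xFFF0_0000_0000_1000))  # device_ddr4
print(route(0x0000_0000_DEAD_BEEF))  # host_via_qdma
```

In hardware this comparison happens in the crossbar's address decode, so every master (RDMA engine, QDMA, compute blocks) sees one unified address space.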
4. Supported Programming Models
RecoNIC enables accelerator development using the following abstractions:
- Register-Transfer Level (RTL): Custom Verilog/VHDL can be mapped onto LC or SC blocks, enabling maximal performance and architectural control.
- High-Level Synthesis (HLS): C-based HLS kernels interface through AXI4 and AXI4-Lite ports for memory-mapped data access and control. The reference design includes a systolic-array matrix-multiplication kernel that receives operating parameters via a control FIFO, processes input matrices, and writes results to output buffers, using pragma-directed dataflow for pipeline parallelism.
- Vitis Networking P4: Streaming compute blocks may be designed with network protocol parsers written in P4, which is compiled to RTL and instantiated into FPGA data paths. This model enables custom packet classifiers by parsing headers (Ethernet, IP, UDP, RoCEv2) and setting match-action metadata for downstream modules.
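The parse-and-classify behavior such a P4 program implements can be mirrored in software for illustration; the header offsets follow the standard Ethernet/IPv4/UDP layouts, and the metadata keys are hypothetical:

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def classify(frame: bytes) -> dict:
    """Parse Ethernet/IPv4/UDP headers and emit match-action-style metadata,
    mirroring what a Vitis Networking P4 parser would produce in hardware.
    The metadata keys are illustrative."""
    meta = {"is_rocev2": False, "is_udp": False}
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != 0x0800:            # not IPv4
        return meta
    ihl = (frame[14] & 0x0F) * 4       # IPv4 header length in bytes
    proto = frame[14 + 9]
    if proto != 17:                    # not UDP
        return meta
    meta["is_udp"] = True
    dport = struct.unpack_from("!H", frame, 14 + ihl + 2)[0]
    meta["is_rocev2"] = (dport == ROCEV2_UDP_PORT)
    return meta

# Minimal synthetic frame: Ethernet + IPv4 (proto=UDP) + UDP to port 4791.
eth = b"\x00" * 12 + struct.pack("!H", 0x0800)
ip = bytes([0x45, 0, 0, 0, 0, 0, 0, 0, 64, 17, 0, 0]) + b"\x00" * 8
udp = struct.pack("!HH", 12345, ROCEV2_UDP_PORT) + b"\x00" * 4
print(classify(eth + ip + udp))  # {'is_rocev2': True, 'is_udp': True}
```

In the actual design this logic is compiled from P4 to RTL and runs at line rate in the streaming path, with the resulting metadata consumed by downstream match-action modules.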
This tri-modal programming support is fundamental for rapid prototyping as well as production deployment of custom accelerators within the data plane (Zhong et al., 2023).
5. Performance Characterization
Benchmarks were performed using pairs of AMD Alveo U250 FPGAs (PCIe 3.0 ×16, 100 Gb/s Ethernet links) and Linux 5.4/Vivado 2021.2.
DMA Host↔Device Memory:
- Host→NIC DDR4: 13.07 GB/s (82.5% of PCIe peak throughput).
- NIC→Host DDR4: observed latencies in the range 600–964 ns for ≤2 KB transfers.
RDMA Read/Write:
- Batched requests (N=50) reach line-rate (≈90 Gb/s), with small-message read latency ≈0.4 μs.
- Single-request throughput is lower, with higher per-operation latency.
| Scenario | Throughput | Latency (small messages) |
|---|---|---|
| RDMA Read (single) | ≈18 Gb/s | ≈4 μs |
| RDMA Read (batch) | ≈89 Gb/s | ≈0.4 μs |
| RDMA Write (single) | ≈20 Gb/s | ≈3.5 μs |
| RDMA Write (batch) | ≈90 Gb/s | ≈0.45 μs |
Throughput is computed as the total payload delivered divided by the elapsed transfer time: throughput = (N × message_size × 8) / T, where N is the number of requests in a batch, message_size is the payload size in bytes, and T is the measured completion time.
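As a worked example with illustrative numbers (not measurements from the paper), a batch of 50 reads of 32 KB completing in about 146 μs corresponds to roughly 90 Gb/s:

```python
def throughput_gbps(num_messages, message_bytes, elapsed_s):
    """Total payload bits delivered divided by elapsed time, in Gb/s."""
    return num_messages * message_bytes * 8 / elapsed_s / 1e9

# Illustrative: 50 batched 32 KB reads finishing in ~146 microseconds.
print(round(throughput_gbps(50, 32 * 1024, 1.46e-4)))  # 90
```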
6. Representative Use Cases and Open-Source Distribution
The open-source release (https://github.com/Xilinx/RecoNIC) includes:
- DMA tests for host-device memory transfers.
- RDMA functional tests covering Read/Write/Send and batching.
- Networked systolic-array matrix multiply between peers using offloaded computation.
- Packet classification with streaming compute blocks.
Sample applications leveraging RDMA–FPGA integration include:
- Network-attached ML inference: model weights are fetched over RDMA, and convolutional neural networks execute directly on the NIC.
- Distributed graph analytics: Enables remote vertex fetching and on-NIC graph traversal for partitioned graphs.
- In-flight packet telemetry/filtering: Facilitates low-latency packet classification, sampling, and forwarding.
RecoNIC’s design—a combination of deeply programmable datapaths, flexible RDMA offload, and support for a broad set of programming models—enables the exploration of zero-copy network-attached computation and serves as a foundational tool for next-generation data center accelerator research (Zhong et al., 2023).