JEDI Architecture: Real-Time GNN for Particle Physics
- JEDI Architecture is a family of graph neural networks engineered for real-time jet tagging in particle physics, balancing accuracy with low-latency hardware constraints.
- It implements algebraic linearization to reduce pairwise interaction computations by aggregating global context, significantly lowering computational complexity.
- The design leverages fine-grained quantization and multiplier-free inference, enabling efficient FPGA deployment with sub-100 ns latency.
JEDI (“Jet Identification via Interaction Networks” and its linear variant, JEDI-linear) refers to a family of graph neural network (GNN) architectures specifically designed for resource-constrained, low-latency deployment in real-time particle physics data environments, with principal application to hardware trigger systems for jet tagging in the CERN High-Luminosity Large Hadron Collider (HL-LHC) (Que et al., 21 Aug 2025).
1. Motivation and Evolution: From JEDI-net to JEDI-linear
JEDI-net implements per-jet classification by modeling each jet as a fully connected, directed graph of its constituent particles. For each particle pair $(i, j)$, an edge embedding $e_{ij} = f_R(x_i, x_j)$ is computed, where $x_i$ denotes the $i$-th particle's features and $f_R$ is a learnable function (typically a multi-layer perceptron). For node $i$, the interaction-aware embedding is then $h_i = \sum_{j \neq i} e_{ij}$.
However, this classical interaction network yields $\mathcal{O}(N_O^2)$ complexity, with $N_O$ the number of particles and the per-pair cost set by $f_R$. For realistic jet sizes up to $128$ particles, this all-to-all pattern poses severe throughput and memory bottlenecks in FPGA-based pipelines, directly impinging on the sub-100 ns latency requirements for trigger applications at the HL-LHC.
JEDI-linear addresses these hardware bottlenecks by restricting $f_R$ to a single affine layer. This admits an exact algebraic factorization, replacing all explicit pairwise computation with shared, global transforms, thus reducing the complexity to $\mathcal{O}(N_O)$ without sacrificing the network's global information propagation capabilities.
2. Algebraic Linearization and Global-Context Aggregation
Let $f_R$ be affine: $f_R(x_i, x_j) = W_1 x_i + W_2 x_j + b$, with trainable parameters $W_1$, $W_2$, $b$. Then:

$$\sum_{j \neq i} f_R(x_i, x_j) = (N_O - 1)\,(W_1 x_i + b) + W_2 \sum_{j \neq i} x_j.$$

Rescaling by $1/(N_O - 1)$ and neglecting $\mathcal{O}(1/N_O)$ corrections for larger $N_O$, the update simplifies to:

$$h_i \approx W_1 x_i + W_2\,\bar{x} + b, \qquad \bar{x} = \frac{1}{N_O} \sum_{j=1}^{N_O} x_j.$$
Practical implementation proceeds as:
- Compute the per-node projection $x'_i = W_1 x_i + b$ for each node.
- Compute the global context $g = W_2\,\bar{x}$, with $\bar{x}$ the mean of the node features.
- Broadcast $g$ to all nodes, adding it to each node's local term.
- The updated per-node embeddings serve as the network’s new representations for downstream operations.
This approach maintains the capacity to capture both local and global structure, as all node features contribute to each output via the global pooled term, but with single-pass, massively parallelizable operations.
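The factorization can be checked numerically. The sketch below (illustrative only; layer sizes and weights are arbitrary) confirms that the $\mathcal{O}(N_O)$ global-context update reproduces the $\mathcal{O}(N_O^2)$ pairwise sum exactly when $f_R$ is affine:

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 32, 16  # example particle count and feature size
W1 = rng.normal(size=(F, F))
W2 = rng.normal(size=(F, F))
b = rng.normal(size=F)
X = rng.normal(size=(N, F))  # per-particle features

# O(N^2) reference: sum the affine edge function over all ordered pairs j != i
pairwise = np.stack([
    sum(W1 @ X[i] + W2 @ X[j] + b for j in range(N) if j != i)
    for i in range(N)
])

# O(N) linearized form: shared transforms plus a single global sum
linear = (N - 1) * (X @ W1.T + b) + (X.sum(axis=0) - X) @ W2.T

assert np.allclose(pairwise, linear)
```

Since the factorization is exact (no approximation is made until the $1/(N_O - 1)$ rescaling step), the two forms agree to floating-point precision.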
3. Linear Complexity Implementation and Hardware Mapping
JEDI-linear’s computational upper bound is $\mathcal{O}(N_O \cdot F \cdot D_E)$ per layer, where $F$ is the input feature size and $D_E$ the embedding dimension. All steps — feature projection, mean-pooling, global context computation, and combination — can be realized as fully pipelined, statically timed modules in high-level synthesis (HLS) or hand-tuned RTL/Verilog.
Typical dataflow stages include:
- Input projection (per-node)
- Global mean-pool (one pass per graph)
- Dense layer for global context (one pass per graph)
- Broadcast and elementwise addition (all nodes)
- Output head (e.g., average pool, MLP, logits)
Each stage is implemented as a dedicated hardware module, with pipeline registers balanced for single-cycle initiation interval (II). This fully unrolled architecture ensures deterministic latency, facilitating system-level resource planning and real-time constraints.
4. Fine-Grained Quantization and Per-Parameter Mixed-Precision Bitwidths
To minimize logic usage and interconnect, JEDI-linear adopts High-Granularity Quantization (HGQ) — a scheme assigning dynamic bitwidths to every trainable parameter. The training loss is augmented with an “Effective Bit Operations” (EBOP) estimator of the form:

$$\mathrm{EBOP} \approx \sum_{\text{mult}} b_w \, b_a,$$

summing, over the network’s multiply operations, the product of each weight bitwidth $b_w$ and activation bitwidth $b_a$.
Differentiable surrogates for bitcount induce sparsity: less-critical weights are aggressively quantized (or zeroed, for pruning), while critical weights retain higher precision. The resulting parameter distribution concentrates mass in the $1$–$3$ bit regime, with many weights dropped altogether. Each trained instance lies along the accuracy–resource Pareto front, allowing designers to tune models for specific FPGA footprint constraints.
In hardware, this per-parameter quantization allows custom-precision multiplier mapping at the logic level, eliminating waste and permitting single-multiplier fabric reuse, or elimination where weights are pruned.
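The per-parameter mechanism can be sketched as follows. This is an illustrative simplification, not HGQ's actual training-time implementation: `fake_quantize`, the example bitwidths, and the activation width are all hypothetical, and the EBOP proxy is the bitwidth-product sum described above.

```python
import numpy as np

def fake_quantize(w, bits, max_abs=1.0):
    """Uniform symmetric quantization of one weight to `bits` bits.
    bits == 0 means the weight is pruned to zero (sketch only)."""
    if bits == 0:
        return 0.0
    step = max_abs / (2 ** (bits - 1))
    code = np.clip(np.round(w / step), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return float(code * step)

weights = np.array([0.71, -0.12, 0.02, -0.55])
bitwidths = np.array([3, 2, 0, 3])  # one learned bitwidth per parameter
quantized = np.array([fake_quantize(w, b) for w, b in zip(weights, bitwidths)])

# EBOP-style cost proxy: weight bitwidth x activation bitwidth, summed over mults
act_bits = 8
ebop = int(np.sum(bitwidths * act_bits))
```

Note that the zero-bit parameter vanishes entirely, contributing neither logic nor EBOP cost — this is exactly the pruning-by-quantization behavior exploited in hardware.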
5. Multiplier-Free Inference with Distributed Arithmetic
All constant-matrix–vector multiplies (CMVMs) in JEDI-linear are transformed via distributed arithmetic (DA), using the da4ml framework. DA recasts CMVMs as linear combinations of precomputed lookup tables (LUTs) and shift-add networks:
- The weights are symbolically expanded into bit-slice patterns.
- Common subexpressions are factored to maximize LUT sharing across neurons.
- All computation maps to adders, subtractors, and shifters exclusively; digital signal processor (DSP) blocks are not needed.
This yields:
- Zero DSP block usage (critical on resource-constrained FPGAs).
- High maximum clock frequency.
- True one-cycle II, with regular, predictable routing and minimal control logic.
6. Hardware Performance and Resource Utilization
The architecture has been synthesized and tested on AMD VU13P FPGAs (one Super Logic Region) with stringent latency and resource constraints:
| Jet Size / Features | Latency (ns) | LUTs (×10³) | FFs (×10³) | DSPs |
|---|---|---|---|---|
| 8 / 16 | 67 | — | — | 0 |
| 32 / 16 | 79 | 147 | 71 | 0 |
| 64 / 16 | 93 | — | — | 0 |
| 128 / 16 | 110 | — | — | 0 |
Relative to state-of-the-art alternatives (LL-GNN, Ultrafast JEDI-net), JEDI-linear achieves:
- at least $3.7\times$ lower latency,
- substantially fewer LUTs,
- full elimination of all DSP blocks (versus up to $8700$ DSPs in competing designs),
- a markedly lower initiation interval (prior work required an II of $150$ clock cycles), while yielding higher classification accuracy due to improved interaction modeling and rigorously regularized quantization (Que et al., 21 Aug 2025).
7. Dataflow, Pseudocode, and Practical Considerations
The core operator (“global information gathering”) is implemented as:
```
// Inputs: I[N_O × P]   // per-particle feature matrix
// 1) Input projection
for i in 1..N_O:
    X[i] = Dense₁(I[i])
// 2) Global context extraction
mean_X = (1/N_O) * sum_{i=1}^{N_O} X[i]
G = Dense₂(mean_X)      // length-D_E vector
// 3) Broadcast and merge
for i in 1..N_O:
    E[i] = X[i] + G     // element-wise addition
// Downstream: average-pool E over i, then MLP head → logits
```
Each of these stages is realized by a dedicated module in the hardware pipeline, using parameter bitwidths defined at synthesis time (according to learned quantization profiles).
The entire accelerator is fully unrolled, exploits partitioned memory access, and avoids any non-determinism or variable-path logic. No special hardware (DSPs, carry chains) is required, and resource allocation is statically schedulable.
8. Impact and Adoption
JEDI-linear demonstrates that a fully algebraic linearization of interaction networks, combined with aggressive per-parameter quantization and distributed arithmetic, enables GNN inference at scale (up to $N_O = 128$ particles) with sub-100 ns latency and order-of-magnitude resource reductions over legacy designs. Its template-based, open-source release is intended to facilitate integration into future trigger systems and resource-constrained scientific applications where real-time decision-making on graph data is essential.
All claims and technical details are substantiated in the originating publication (Que et al., 21 Aug 2025).