
JEDI Architecture: Real-Time GNN for Particle Physics

Updated 8 February 2026
  • JEDI Architecture is a family of graph neural networks engineered for real-time jet tagging in particle physics, balancing accuracy with low-latency hardware constraints.
  • It implements algebraic linearization to reduce pairwise interaction computations by aggregating global context, significantly lowering computational complexity.
  • The design leverages fine-grained quantization and multiplier-free inference, enabling efficient FPGA deployment with sub-100 ns latency.

JEDI (“Jet Identification via Interaction Networks” and its linear variant, JEDI-linear) refers to a family of graph neural network (GNN) architectures specifically designed for resource-constrained, low-latency deployment in real-time particle physics data environments, with principal application to hardware trigger systems for jet tagging in the CERN High-Luminosity Large Hadron Collider (HL-LHC) (Que et al., 21 Aug 2025).

1. Motivation and Evolution: From JEDI-net to JEDI-linear

JEDI-net implements per-jet classification by modeling each jet as a fully connected, directed graph of its constituent particles. For each particle pair $(i, j)$, an edge embedding $E_{ij} = f_R(I_i \Vert I_j)$ is computed, where $I_i$ denotes the $i$-th particle's features and $f_R$ is a learnable function (typically a multi-layer perceptron). For node $i$, the interaction-aware embedding is then $\bar{E}_i = \sum_{j \neq i} E_{ij}$.

However, this classical interaction network yields $O(N_O^2 \cdot C_{f_R})$ complexity, with $N_O$ the number of particles and $C_{f_R}$ the per-pair cost of $f_R$. For realistic $N_O = 30$–$128$, this all-to-all pattern poses severe throughput and memory bottlenecks in FPGA-based pipelines, directly impinging on the sub-100 ns latency requirements for trigger applications at the HL-LHC.

JEDI-linear addresses these hardware bottlenecks by restricting $f_R$ to a single affine layer. This admits an exact algebraic factorization, replacing all $O(N_O^2)$ explicit pairwise computation with shared, global transforms, thus reducing the complexity to $O(N_O)$ without sacrificing the network's global information propagation capabilities.
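A minimal NumPy check of this factorization (with arbitrary random weights standing in for the trained $W_1$, $W_2$, $C$) confirms that the $O(N_O)$ form reproduces the explicit pairwise sum exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, D = 16, 4, 8                       # particles, input features, embedding dim
I = rng.normal(size=(N, P))
W1, W2 = rng.normal(size=(D, P)), rng.normal(size=(D, P))
C = rng.normal(size=D)

# O(N^2): explicit pairwise edge embeddings, summed per node
E_pair = np.zeros((N, D))
for i in range(N):
    for j in range(N):
        if j != i:
            E_pair[i] += W1 @ I[i] + W2 @ I[j] + C

# O(N): algebraic factorization of the same affine f_R
S = I.sum(axis=0)                        # global feature sum, computed once
E_lin = (N - 1) * (I @ W1.T + C) + (S - I) @ W2.T

assert np.allclose(E_pair, E_lin)        # exact, not approximate
```

Note that before the $1/N_O$ rescaling the factorization is exact; only the subsequent mean-pool simplification drops $O(1/N_O)$ terms.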

2. Algebraic Linearization and Global-Context Aggregation

Let $f_R$ be affine: $f_R(I_i \Vert I_j) = W_1 I_i + W_2 I_j + C$, with trainable parameters $W_1, W_2, C$. Then:

$$\bar{E}_i = \sum_{j \neq i} (W_1 I_i + W_2 I_j + C) = (N_O - 1)(W_1 I_i + C) + W_2 \sum_{j \neq i} I_j$$

Rescaling by $1/N_O$ and neglecting $O(1/N_O)$ corrections for larger $N_O$, the update simplifies to:

$$\bar{E}_i' \approx W_2 \left( \frac{1}{N_O} \sum_{j} I_j \right) + W_1 I_i + C$$

Practical implementation proceeds as:

  • Compute $X = \mathrm{Dense}_1(I)$ for each node.
  • Compute the global context $G = \mathrm{Dense}_2(\mathrm{MeanPool}(X))$.
  • Broadcast $G$ to all nodes, adding it to the local $X$ plus a bias.
  • The updated per-node embeddings $\{\bar{E}_i'\}$ serve as the network's new representations for downstream operations.

This approach maintains the capacity to capture both local and global structure, as all node features contribute to each output via the global pooled term, but with single-pass, massively parallelizable operations.
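The steps above can be sketched in NumPy; $\mathrm{Dense}_1$ and $\mathrm{Dense}_2$ stand in for trained affine layers, here with illustrative random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, D = 32, 16, 24                 # particles, input features, embedding dim
I = rng.normal(size=(N, P))

# 1) Input projection: X = Dense_1(I), applied independently per node
W1, b1 = rng.normal(size=(P, D)), rng.normal(size=D)
X = I @ W1 + b1

# 2) Global context: G = Dense_2(MeanPool(X)), one vector per graph
W2, b2 = rng.normal(size=(D, D)), rng.normal(size=D)
G = X.mean(axis=0) @ W2 + b2

# 3) Broadcast G and add it to each node's local features
E = X + G                            # updated per-node embeddings

assert E.shape == (N, D)
```

Every node sees the same pooled context $G$, which is how global structure propagates in a single pass.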

3. Linear Complexity Implementation and Hardware Mapping

JEDI-linear's computational upper bound is $O(N_O \cdot (2 P \cdot D_E + P))$ per layer, where $P$ is the input feature size and $D_E$ the embedding dimension. All steps (feature projection, mean-pooling, global context computation, and combination) can be realized as fully pipelined, statically timed modules in high-level synthesis (HLS) or hand-tuned RTL/Verilog.
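As a rough sanity check on the stated bound, a hypothetical MAC-count comparison against the all-to-all baseline (the embedding width $D_E = 24$ is an assumed value, not taken from the paper):

```python
def jedi_linear_macs(n_o: int, p: int, d_e: int) -> int:
    """Stated per-layer upper bound: N_O * (2*P*D_E + P) operations."""
    return n_o * (2 * p * d_e + p)

def pairwise_macs(n_o: int, p: int, d_e: int) -> int:
    """All-to-all baseline: one affine edge map per ordered particle pair."""
    return n_o * (n_o - 1) * (2 * p * d_e)

# Illustrative sizes: P = 16 features per particle, assumed D_E = 24
for n_o in (8, 32, 128):
    lin, quad = jedi_linear_macs(n_o, 16, 24), pairwise_macs(n_o, 16, 24)
    print(f"N_O={n_o:4d}  linear={lin:9d}  pairwise={quad:9d}  ratio={quad / lin:.1f}x")
```

The gap between the two counts grows linearly in $N_O$, which is what makes the 128-particle configurations tractable.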

Typical dataflow stages include:

  • Input projection (per-node)
  • Global mean-pool (one pass per graph)
  • Dense layer for global context (one pass per graph)
  • Broadcast and elementwise addition (all nodes)
  • Output head (e.g., average pool, MLP, logits)

Each stage is implemented as a dedicated hardware module, with pipeline registers balanced for single-cycle initiation interval (II). This fully unrolled architecture ensures deterministic latency, facilitating system-level resource planning and real-time constraints.

4. Fine-Grained Quantization and Per-Parameter Mixed-Precision Bitwidths

To minimize logic usage and interconnect, JEDI-linear adopts High-Granularity Quantization (HGQ) — a scheme assigning dynamic bitwidths to every trainable parameter. The training loss is augmented with an “Effective Bit Operations” (EBOP) estimator:

$$L_\mathrm{total} = L_\mathrm{pred} + \lambda \cdot \mathrm{EBOPs}(\{\mathrm{bit}_i\})$$

Differentiable surrogates for the bitcount induce sparsity: less-critical weights are aggressively quantized (or zeroed, effectively pruned), while critical weights retain higher precision. The resulting parameter distribution concentrates in the $1$–$3$ bit regime, with many weights dropped altogether. Each trained instance lies along the accuracy–resource Pareto front, allowing designers to tune models specifically for FPGA footprint constraints.
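A toy illustration of this loss structure (not the actual HGQ implementation; the surrogate here simply counts bits only for weights that survive rounding at their assigned precision):

```python
import numpy as np

def ebop_surrogate(weights, bitwidths):
    """Toy EBOP-style cost: sum of per-parameter bitwidths, counting only
    weights that remain nonzero after rounding at their assigned precision
    (a weight quantized to zero is effectively pruned and costs nothing)."""
    alive = (np.round(weights * (2.0 ** bitwidths)) != 0).astype(float)
    return float((alive * bitwidths).sum())

def total_loss(pred_loss, weights, bitwidths, lam=1e-4):
    """L_total = L_pred + lambda * EBOPs({bit_i}), as in the text."""
    return pred_loss + lam * ebop_surrogate(weights, bitwidths)

w = np.array([0.5, 0.003, -0.25, 0.0001])   # illustrative trained weights
bits = np.array([3.0, 1.0, 2.0, 1.0])       # per-parameter fractional bits
loss = total_loss(pred_loss=0.42, weights=w, bitwidths=bits)
# the two tiny weights round to zero at 1 bit, so only 3 + 2 = 5 bits are paid for
```

In the real scheme the surrogate is differentiable in the bitwidths, so gradient descent trades bits against prediction loss directly.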

In hardware, this per-parameter quantization allows custom-precision multiplier mapping at the logic level, eliminating waste and permitting single-multiplier fabric reuse, or elimination where weights are pruned.

5. Multiplier-Free Inference with Distributed Arithmetic

All constant-matrix–vector multiplies (CMVMs) in JEDI-linear are transformed via distributed arithmetic (DA), using the da4ml framework. DA recasts CMVMs as linear combinations of precomputed lookup tables (LUTs) and shift-add networks:

  • The weights are symbolically expanded into bit-slice patterns.
  • Common subexpressions are factored to maximize LUT sharing across neurons.
  • All computation maps to adders, subtractors, and shifters exclusively; digital signal processor (DSP) blocks are not needed.

This yields:

  • Zero DSP block usage (critical on resource-constrained FPGAs).
  • High maximum clock frequency ($F_\mathrm{max} \approx 300$ MHz).
  • True one-cycle II, with regular, predictable routing and minimal control logic.
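The bit-slice idea behind DA can be illustrated with a toy constant multiplier built purely from shifts and adds (da4ml's actual transforms additionally factor common subexpressions across neurons; this sketch assumes a nonnegative integer coefficient):

```python
def const_mult_shift_add(x: int, coeff: int) -> int:
    """Multiply x by a compile-time constant using only shifts and adds:
    one add per set bit of the coefficient, no hardware multiplier."""
    acc = 0
    shift = 0
    c = coeff
    while c:
        if c & 1:
            acc += x << shift     # contribution of this bit slice
        c >>= 1
        shift += 1
    return acc

# e.g. 13 = 0b1101, so 13*x = x + (x << 2) + (x << 3): three adders, zero DSPs
assert const_mult_shift_add(7, 13) == 91
```

Canonical signed-digit recodings reduce the adder count further (e.g. $13 = 16 - 4 + 1$ also needs subtractors, which are equally cheap in fabric).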

6. Hardware Performance and Resource Utilization

The architecture has been synthesized and tested on AMD VU13P FPGAs (one Super Logic Region) with stringent latency and resource constraints:

| Jet size / features | Latency (ns) | LUTs (×10³) | FFs (×10³) | DSPs |
|---|---|---|---|---|
| 8 / 16 | 67 | – | – | 0 |
| 32 / 16 | 79 | 147 | 71 | 0 |
| 64 / 16 | 93 | – | – | 0 |
| 128 / 16 | 110 | – | – | 0 |

(–: value not available in this summary.)

Relative to state-of-the-art alternatives (LL-GNN, Ultrafast JEDI-net), JEDI-linear achieves:

  • $3.7$–$11.5\times$ lower latency,
  • up to $6.2\times$ fewer LUTs,
  • full elimination of all DSP blocks (contrast: $>5000$–$8700$ DSPs in competing designs),
  • up to $150\times$ lower II (contrasted with II $\geq 150$ cycles in prior work), while yielding higher classification accuracy due to improved interaction modeling and rigorously regularized quantization (Que et al., 21 Aug 2025).

7. Dataflow, Pseudocode, and Practical Considerations

The core operator (“global information gathering”) is implemented as:

// Inputs: I[N_O×P]    // per-particle feature matrix
// 1) Input projection
X[i] = Dense₁(I[i])      for i in 1..N_O
// 2) Global context extraction
mean_X = (1/N_O) * sum_{i=1}^{N_O} X[i]
G = Dense₂(mean_X)       // length-D_E vector
// 3) Broadcast and merge
for i in 1..N_O:
  E[i] = X[i] + G        // element-wise addition
// downstream: Average pool E over i, then MLP head → logits

Each of these stages is realized by a dedicated module in the hardware pipeline, using parameter bitwidths defined at synthesis time (according to learned quantization profiles).

The entire accelerator is fully unrolled, exploits partitioned memory access, and avoids any non-determinism or variable-path logic. No special hardware (DSPs, carry chains) is required, and resource allocation is statically schedulable.

8. Impact and Adoption

JEDI-linear demonstrates that a fully algebraic linearization of interaction networks, combined with aggressive per-parameter quantization and distributed arithmetic, enables GNN inference at scale ($N_O$ up to $128$ particles) with sub-100 ns latency and order-of-magnitude resource reductions over legacy designs. Its template-based, open-source release is intended to facilitate integration into future trigger systems and resource-constrained scientific applications where real-time decision-making on graph data is essential.

All claims and technical details are substantiated in the originating publication (Que et al., 21 Aug 2025).
