- The paper demonstrates that compact MLP-Mixer models with High-Granularity Quantization and Distributed Arithmetic outperform state-of-the-art designs by significantly reducing resource usage and latency.
- The methodology leverages non-permutation-invariant architectures and top-N particle ordering to enable precise bitwidth tuning and FPGA-friendly inference with sub-100 ns latencies.
- Empirical results reveal that the MLP-Mixer achieves comparable or better accuracy than baselines while using up to 97% fewer LUTs and dramatically lowering computational demands.
Fast Jet Tagging with MLP-Mixers on FPGAs: An In-Depth Analysis
Motivation and Problem Statement
The Level-1 trigger (L1T) systems of LHC experiments require real-time decision-making under O(1μs) latency constraints, demanding powerful yet highly resource-efficient classifiers deployable on FPGAs. Jet tagging—classifying jets by origin (e.g., gluons, quarks, W, Z, top)—is critical for downstream physics, but standard architectures such as GNNs or Transformers are computationally prohibitive at L1. Previous hardware-aware models provide partial solutions; however, achieving a superior latency/accuracy/resource Pareto front while exploiting hardware characteristics remains an open task.
Dataset and Model Design
An LHC-like five-class jet dataset serves as the benchmark, providing up to 150 particles per jet and 16 features per particle. The distribution of the number of particles per jet, and the effect of a pT threshold, are shown for realism (Figure 1).

Figure 1: Distribution of the number of particles per jet in the jet tagging dataset, with and without a pT threshold.
The paper introduces compact, trigger-oriented MLP-Mixer models that ingest sequences of particle features, diverging from standard permutation-invariant designs. Inputs are the top-N particles ordered by pT, reflecting practical firmware constraints and maximizing the benefit of hardware-aware quantization.
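As a concrete illustration of this preprocessing step, the sketch below selects and zero-pads the top-N particles of a jet by pT; the column index holding pT and the use of NumPy are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def select_top_n(particles: np.ndarray, n: int = 64, pt_index: int = 0) -> np.ndarray:
    """Keep the n highest-pT particles of one jet, ordered by descending pT.

    particles: (num_particles, num_features) array for a single jet.
    pt_index:  column holding pT (an assumption for this illustration).
    Jets with fewer than n particles are zero-padded so the firmware
    always sees a fixed input shape.
    """
    order = np.argsort(-particles[:, pt_index])            # descending pT
    top = particles[order[:n]]
    if top.shape[0] < n:                                    # zero-pad short jets
        pad = np.zeros((n - top.shape[0], particles.shape[1]), dtype=particles.dtype)
        top = np.vstack([top, pad])
    return top

# Toy example: a jet with 150 particles and 16 features, truncated to the top 64
jet = np.random.rand(150, 16).astype(np.float32)
print(select_top_n(jet, n=64).shape)                        # (64, 16)
```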
The architecture follows the MLP-Mixer principle of alternating feature-wise and particle-wise linear transformations, but uses only two mixer stages and reduced channel sizes to remain feasible on FPGAs.
Figure 2: Compact MLP-Mixer architecture: four MLP blocks, a skip connection, and a fusion of dense and BN layers tailored for FPGA deployment.
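To make the architecture concrete, here is a float-precision sketch of such a compact mixer in Keras; the layer widths, activation choices, and the exact placement of the skip connections and batch normalization are illustrative assumptions rather than the paper's final configuration.

```python
import keras
from keras import layers

def compact_mixer(n_particles=32, n_features=16, d_channel=16, n_classes=5):
    """Float-precision sketch of a two-stage MLP-Mixer jet tagger.

    Layer widths and the placement of skip connections / batch norm are
    illustrative assumptions, not the paper's final configuration.
    """
    inp = layers.Input(shape=(n_particles, n_features))
    x = layers.Dense(d_channel, activation="relu")(inp)      # per-particle embedding

    for _ in range(2):                                        # two mixer stages
        # Particle mixing: transpose so the Dense layer acts across particles
        y = layers.Permute((2, 1))(x)
        y = layers.Dense(n_particles, activation="relu")(y)
        y = layers.Permute((2, 1))(y)
        x = layers.Add()([x, y])                              # skip connection
        # Feature (channel) mixing: Dense acts across per-particle features
        y = layers.Dense(d_channel, activation="relu")(x)
        x = layers.Add()([x, y])
        x = layers.BatchNormalization()(x)                    # fused with Dense at deployment

    x = layers.Flatten()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inp, out)

model = compact_mixer()
model.summary()
```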
Efficient Hardware Implementation: Quantization, Optimization, and Deployment
The models employ High-Granularity Quantization (HGQ), which uniquely enables per-parameter bitwidth tuning and unstructured pruning via surrogate gradient optimization, far surpassing layer-wise QAT in resource reduction and compression effectiveness. Distributed Arithmetic (DA) further transforms linear layers into hardware-friendly adder/shift networks, eliminating resource-heavy DSP utilization.
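The core idea that makes DA attractive on FPGAs is that multiplication by a fixed, quantized weight can be decomposed into a handful of shifts and additions. The toy sketch below shows this for a single integer weight via a canonical signed-digit decomposition; it illustrates the principle only and is not the DA implementation used for the firmware.

```python
def csd(w: int):
    """Canonical signed-digit (non-adjacent form) decomposition of an integer weight.

    Returns (shift, sign) pairs such that w == sum(sign * 2**shift), with no two
    adjacent non-zero digits, which keeps the number of adders small.
    """
    terms, shift = [], 0
    while w != 0:
        if w % 2:
            d = 2 - (w % 4)          # +1 or -1
            w -= d
            terms.append((shift, d))
        w //= 2
        shift += 1
    return terms

def multiply_by_constant(x: int, w: int) -> int:
    """Multiplier-free constant multiplication using only shifts and adds/subtracts."""
    return sum(d * (x << s) for s, d in csd(w))

print(csd(57))                              # [(0, 1), (3, -1), (6, 1)]  ->  57 = 1 - 8 + 64
print(multiply_by_constant(5, 57), 5 * 57)  # 285 285
```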
The deployment workflow integrates quantization, pruning, and DA into hls4ml/Vitis HLS for bit-accurate inference on FPGA.
Figure 3: End-to-end workflow for MLP-Mixer training, high-granularity quantization, firmware generation, and final FPGA synthesis.
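A minimal sketch of the generic hls4ml conversion step is shown below, assuming a trained Keras model (`model`, as in the earlier sketch) and a held-out test set `X_test`; the backend, FPGA part, and clock period are placeholders, and the exact path for HGQ-trained models (e.g., an intermediate proxy-model step) may differ from this generic flow.

```python
import hls4ml

# Derive a baseline HLS configuration from the trained Keras model; HGQ-trained
# models carry their own per-parameter bitwidths, so little manual tuning is needed.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    backend='Vitis',                  # Vitis HLS backend
    part='xcvu13p-flga2577-2-e',      # placeholder FPGA part
    clock_period=5,                   # ns; illustrative 200 MHz target
    output_dir='mixer_hls_prj',
)

# Bit-accurate C simulation of the generated firmware against held-out jets
hls_model.compile()
y_hls = hls_model.predict(X_test)     # X_test: preprocessed top-N inputs (hypothetical)

# Run HLS synthesis to obtain latency and resource estimates
hls_model.build(csim=False, synth=True)
```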
Empirical Results
Accuracy and Model Efficiency
Full-precision MLP-Mixer models outperform the state-of-the-art JEDI-net baseline in AUC across all classes despite having fewer parameters and lower FLOPs; the Np = 64 variant yields the highest scores.
Significant results include:
- MLP-Mixers achieve comparable or better accuracy than JEDI-net while requiring only O(10³) parameters and under O(10⁵) FLOPs, compared to JEDI-net's O(10⁴) parameters and O(10⁸) FLOPs.
- Increasing the number of input particles improves performance up to saturation at Np = 64, highlighting efficient utilization of the available input information.
FPGA Resource Utilization and Latency
Quantized, DA-optimized MLP-Mixers set new benchmarks for accuracy/resource/latency trade-offs, fully exploiting LUT-based inference with zero DSP usage, and consistently achieving sub-100 ns latencies and throughputs above 10⁷ jets/s.

Figure 4: Accuracy-resource-latency Pareto front for quantized MLP-Mixer and JEDI-net models; MLP-Mixer is strictly superior.
Feature Selection and Interpretability
Analysis of per-channel and per-particle bitwidths reveals that HGQ enables the model to prioritize high-salience features and discard noise adaptively. For example, more bits are dedicated to kinematic variables and high-pT particles, in alignment with domain knowledge and physical constraints.
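A reader wishing to reproduce this kind of interpretability analysis could average the learned bitwidths over particles and over features, as in the sketch below; the file name and the assumption that the trained HGQ model's input bitwidths have been exported as a (particles x features) array are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical export: a (n_particles, n_features) integer array of learned input
# bitwidths extracted from the trained HGQ model's input quantizer.
bitwidths = np.load('input_bitwidths.npy')

bits_per_feature = bitwidths.mean(axis=0)    # average bits spent on each feature
bits_per_particle = bitwidths.mean(axis=1)   # average bits vs. pT-ordered particle rank

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(9, 3))
ax0.bar(range(len(bits_per_feature)), bits_per_feature)
ax0.set_xlabel('feature index')
ax0.set_ylabel('mean bits')
ax1.plot(bits_per_particle)
ax1.set_xlabel('particle rank (descending pT)')
ax1.set_ylabel('mean bits')
fig.tight_layout()
plt.show()
```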
Figure 6: Layerwise average input bitwidths showcase data-driven selective quantization and dynamic feature prioritization capabilities of the HGQ-trained MLP-Mixer.
Architectural and Algorithmic Implications
- Breaking permutation invariance is critical for the hardware Pareto improvement: exploiting the pT ordering already guaranteed at the low-level trigger enables aggressive quantization and selective attention to the most informative particles.
- The direct mapping of mixer computations to DA further unlocks resource and latency reductions across all major FPGA families, a substantial advance over previous QAT and pruning approaches.
- Comparative ablations demonstrate that even MLP baselines can outperform legacy FPGA-targeted Interaction Network and Deep Sets models when equipped with DA and HGQ.
Broader Significance and Future Perspectives
The deployment of MLP-Mixers with HGQ and DA establishes a new reference for edge AI in collider triggers, simultaneously maximizing inference throughput and minimizing power and hardware overheads without substantial accuracy compromise. These approaches generalize beyond HEP: selective quantization and non-permutation-invariant designs are broadly applicable to event-sequence processing at the hardware edge (e.g., for rapid vision, audio, or anomaly detection at sensor frontends).
The results also inform architectural co-design: ensuring firmware-level data orderings, integrating adaptive quantization and DA in tooling, and customizing training for physical hardware cost functions. Additional gains are anticipated via more targeted regularization, co-optimized DA logic mapping, and fine-grained per-particle kernel differentiability.
Conclusion
MLP-Mixers, when co-designed with high-granularity quantization and distributed arithmetic, achieve a superior accuracy-latency-resource trade-off to all previous FPGA-deployable models for online jet tagging. The results demonstrate that exploiting architectural non-invariance, adaptive bitwidth allocation, and optimized arithmetic are critical for sustaining performance under extreme resource and real-time constraints.
This methodology has direct practical implications for trigger upgrades and provides a reference for efficient, interpretable, and truly hardware-aware neural inference in other low-latency scientific and industrial applications.