NeuraLUT: Embedded LUT-Based Neural Networks

Updated 5 February 2026
  • Network-in-LUT (NeuraLUT) is a co-design paradigm that embeds entire multi-layer neural sub-networks within LUTs, replacing runtime arithmetic with precomputed lookups.
  • It achieves dramatic improvements in latency, energy, and area efficiency on FPGAs and custom CMOS through techniques like table compression, assembly trees, and structured pruning.
  • Key trade-offs involve managing exponential LUT growth with input bitwidth and fan-in while maintaining the accuracy and ultra-low latency required for edge inference.

Network-in-LUT (NeuraLUT) refers to a hardware-software co-design paradigm in which entire multi-layer neural network substructures are embedded within lookup tables (LUTs) for inference, replacing runtime arithmetic (multiplication, addition, MAC) with precomputed memory lookups and minimal additional logic. This methodology yields significant reductions in latency, area, and energy for edge accelerators—primarily FPGAs and custom CMOS hardware—by leveraging the parallel random-access capability and fine-grained control inherent to LUT-based architectures, at the cost of exponential table growth with the input bitwidth and fan-in. NeuraLUT models have been shown to achieve orders-of-magnitude improvements in latency and hardware utilization per accuracy point compared to traditional DSP or even highly optimized binary/ternary neural nets, provided network structure and partitioning are carefully co-designed to accommodate hardware constraints (Andronic et al., 2024, Guo, 9 Jun 2025, Lou et al., 14 Jan 2026).

1. Fundamental Concepts and Mathematical Model

In NeuraLUT designs, each “logical neuron” is not a simple threshold or weighted sum but an entire dense sub-network, often an MLP with residual connections, whose quantized input–output relation is exhaustively enumerated and stored as a truth table. Formally, if each sub-network $\mathcal N$ receives $F$ signals quantized to $\beta$ bits ($x\in\{0,1\}^{\beta F}$), then its function is

$$f_{\rm LUT}: \{0,1\}^{\beta F} \to \mathbb Q_{\beta_{\rm out}}$$

where $\mathbb Q_{\beta_{\rm out}}$ denotes the quantized output value space. The actual computation of $f_{\rm LUT}$ is realized by tabulating all $2^{\beta F}$ input patterns during synthesis, evaluating the internal sub-network, and emitting the quantized result per row. This absorbs the non-linearity, quantization, batch-norm, and even skip connections within a memory lookup, such that at inference the physical wiring between layers is simply a sparse mesh of $\beta$-bit buses (Andronic et al., 2024, Guo, 9 Jun 2025).

The overall design constraint is that each LUT's truth table, which has $2^{\beta F}$ entries, must stay small enough to implement; consequently, the inter-LUT fan-in $F$ and per-signal bit-width $\beta$ are tightly restricted.
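The enumeration step can be illustrated with a minimal sketch. The sub-network below (a tiny MLP with a skip path and random weights), the sigmoid-based output quantizer, and all names and sizes are assumptions chosen for illustration; they are not taken from the cited works.

```python
import itertools
import math
import random

BETA = 2       # bits per input signal (assumed)
FAN_IN = 3     # number of input signals F (assumed)
BETA_OUT = 2   # bits of the quantized output (assumed)
HIDDEN = 8     # hidden width of the illustrative sub-network

_rng = random.Random(0)
W1 = [[_rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(FAN_IN)]
W2 = [_rng.gauss(0, 1) for _ in range(HIDDEN)]
W_SKIP = [_rng.gauss(0, 1) for _ in range(FAN_IN)]


def sub_network(x):
    """Full-precision forward pass: one ReLU layer plus a linear skip path."""
    hidden = [
        max(0.0, sum(x[i] * W1[i][h] for i in range(FAN_IN)))
        for h in range(HIDDEN)
    ]
    main = sum(hidden[h] * W2[h] for h in range(HIDDEN))
    skip = sum(x[i] * W_SKIP[i] for i in range(FAN_IN))
    return main + skip


def quantize(y, bits):
    """Map a real value to an unsigned code in [0, 2**bits - 1]."""
    levels = 2**bits - 1
    squashed = 1.0 / (1.0 + math.exp(-y))  # sigmoid squashing (assumed)
    return round(squashed * levels)


def address(codes):
    """Pack F beta-bit codes into one table address, first code most significant."""
    addr = 0
    for code in codes:
        addr = (addr << BETA) | code
    return addr


def build_truth_table():
    """Evaluate the sub-network once for each of the 2**(BETA*FAN_IN) inputs."""
    table = []
    # itertools.product varies the last code fastest, matching address().
    for codes in itertools.product(range(2**BETA), repeat=FAN_IN):
        x = [code / (2**BETA - 1) for code in codes]  # dequantize to [0, 1]
        table.append(quantize(sub_network(x), BETA_OUT))
    return table


TABLE = build_truth_table()


def lut_infer(codes):
    """Inference is a single indexed read; no arithmetic on the weights remains."""
    return TABLE[address(codes)]
```

After `build_truth_table()` runs, the weights are no longer needed at inference time: `lut_infer((1, 2, 3))` reads one of the 64 precomputed rows.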

2. Architectural Implementations and Hardware Realization

NeuraLUT may be mapped to hardware via various strategies:

  • FPGA Soft Logic: Each logical LUT (L-LUT) of up to $K$ input bits maps directly to one or more physical $K$-LUTs (P-LUTs) on the fabric. For larger $F$ or $\beta$, LUT partitioning and multiplexing are used, with small adder or multiplexer trees to “join” subtables (Guo, 9 Jun 2025, Andronic et al., 2024, Andronic et al., 1 Apr 2025).
  • Custom CMOS (LUT-NA): The LUT-based neural accelerator (LUT-NA) employs small SRAM blocks to precompute MAC subproducts (using a divide-and-conquer splitting of $n$-bit activations/weights into $k$-bit subwords), which are then added with minimal logic. The only active logic per clock is a pair of $k$-to-1 multiplexers, a barrel shifter, and a small adder. This design achieves up to $29.54\times$ area and $3.34\times$ energy reduction versus naïve LUT schemes (Sen et al., 2024).
  • Hierarchical Trees (NeuraLUT-Assemble): Assembly trees combine multiple small-fan-in L-LUTs into virtual “super-neurons,” with mixed-precision quantization at intermediate nodes and skip-connections across levels to stabilize training while minimizing the memory footprint (Andronic et al., 1 Apr 2025).
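The divide-and-conquer idea behind subword lookups can be sketched as follows. This is a generic illustration of splitting an $n$-bit multiply into $k$-bit table lookups followed by shifts and adds, with assumed widths $n=8$ and $k=4$ and unsigned operands; it is not a model of the LUT-NA circuit itself.

```python
N_BITS = 8   # operand width n (assumed)
K_BITS = 4   # subword width k (assumed)

# Precomputed table of every k-bit x k-bit product: 2**(2k) = 256 entries,
# versus 2**(2n) = 65536 entries for a monolithic n-bit product table.
PRODUCT_LUT = [[a * b for b in range(2**K_BITS)] for a in range(2**K_BITS)]


def split_subwords(value):
    """Split an unsigned n-bit value into k-bit subwords, least significant first."""
    mask = (1 << K_BITS) - 1
    return [(value >> shift) & mask for shift in range(0, N_BITS, K_BITS)]


def lut_multiply(activation, weight):
    """Multiply two unsigned n-bit values using only lookups, shifts, and adds."""
    total = 0
    for i, a_sub in enumerate(split_subwords(activation)):
        for j, w_sub in enumerate(split_subwords(weight)):
            partial = PRODUCT_LUT[a_sub][w_sub]     # memory lookup
            total += partial << (K_BITS * (i + j))  # align by subword position, then add
    return total
```

The small table is reused for every subword pair, which is why the memory footprint shrinks relative to a single table indexed by both full operands.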

The practical hardware flow is: (1) full-precision training with enforced quantization/partitioning, (2) truth-table enumeration and compression, (3) Verilog or RTL emission targeting the relevant hardware, and (4) aggressive pipelining for sub-10ns latency.
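Step (3) of this flow can be pictured with a small generator that renders a truth table as a combinational Verilog `case` statement. The function name `emit_verilog_lut`, the module layout, and the XOR example table are hypothetical; real toolflows may emit different RTL.

```python
def emit_verilog_lut(name, table, addr_bits, out_bits):
    """Render a truth table as a combinational Verilog case statement."""
    if len(table) != 2**addr_bits:
        raise ValueError("table must have 2**addr_bits rows")
    lines = [
        f"module {name} (",
        f"    input  wire [{addr_bits - 1}:0] addr,",
        f"    output reg  [{out_bits - 1}:0] data",
        ");",
        "    always @(*) begin",
        "        case (addr)",
    ]
    for row, value in enumerate(table):
        if not 0 <= value < 2**out_bits:
            raise ValueError(f"row {row} does not fit in {out_bits} bits")
        lines.append(
            f"            {addr_bits}'d{row}: data = {out_bits}'d{value};"
        )
    lines += [
        "        endcase",
        "    end",
        "endmodule",
    ]
    return "\n".join(lines)


# Hypothetical table with two 1-bit inputs (an XOR), for illustration only.
XOR_TABLE = [0, 1, 1, 0]
VERILOG = emit_verilog_lut("llut_xor", XOR_TABLE, addr_bits=2, out_bits=1)
```

Because every address is listed explicitly, the `case` statement is complete and synthesis tools are free to map it onto physical LUTs.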

3. Resource, Latency, and Energy Scaling

The central trade-off is between per-LUT memory size (exponential in $\beta F$) and the expressive power per inference cycle:

  • Resource Usage: For fixed $\beta$ and $F$, the logical LUT size is $2^{\beta F}$ entries; for $N$ total LUTs, the hardware cost scales as $\sim N \cdot 2^{\beta F}$. Partitioning, assembly, and table compression (e.g., ReducedLUT) can suppress this to some extent by decomposing truth tables, exploiting redundancy, or introducing “don’t care” entries where certain address patterns never occur during training or inference (Cassidy et al., 2024, Guo, 9 Jun 2025).
  • Latency: Fully unrolled NeuraLUT designs realize inference in as little as $2$–$12$ ns total, limited only by LUT readout and interconnect delays, not arithmetic critical paths (Guo, 9 Jun 2025, Andronic et al., 2024, Andronic et al., 1 Apr 2025).
  • Energy Efficiency: By eschewing multipliers/DSPs and using dense memory lookups, dynamic power is reduced by up to $6.7\times$ compared to highly optimized binary nets, and by more compared to floating-point or even int8 MAC-based accelerators (Wang et al., 2019, Andronic et al., 2024).
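The exponential scaling is easy to see numerically. The helper names below are illustrative, and the cost model counts raw truth-table entries and bits only; it ignores compression and the details of mapping onto physical LUTs.

```python
def lut_entries(beta, fan_in):
    """Rows in one logical LUT: 2**(beta * F)."""
    return 2 ** (beta * fan_in)


def network_table_bits(num_luts, beta, fan_in, beta_out):
    """Raw storage for N logical LUTs, each row holding a beta_out-bit output."""
    return num_luts * lut_entries(beta, fan_in) * beta_out


# Each extra input signal multiplies the table size by 2**beta.
for beta in (1, 2, 3):
    for fan_in in (2, 4, 6):
        print(f"beta={beta} F={fan_in} entries={lut_entries(beta, fan_in)}")
```

For instance, $\beta=2$ and $F=6$ give $2^{12}=4096$ rows per LUT, while raising $F$ to 7 gives $2^{14}=16384$.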

4. Methodological Enhancements: Pruning, Assembly, and Compression

Several methodological developments enhance baseline NeuraLUT architectures:

  • Structured Pruning & Logic Shrinkage: Techniques such as logic shrinkage learn to prune LUT inputs per netlist location, resulting in a final accelerator with variable input sizes and improved packing (up to $2.7\times$ area and $1.3\times$ energy reduction over random K-LUT assignments) (Wang et al., 2021).
  • Assembly Trees: NeuraLUT-Assemble builds large virtual neurons from trees of small-fan-in LUTs with skip-connections and layerwise mixed-precision. This circumvents the exponential table growth, enabling high expressivity while keeping hardware requirements tractable; area-delay product reductions of $14$–$62\times$ compared to earlier NeuraLUTs have been demonstrated (Andronic et al., 1 Apr 2025).
  • CompressedLUT/ReducedLUT: Hierarchical decomposition of LUTs, together with the introduction of “don’t care” entries for input patterns never seen in the training set, allows up to $1.63\times$ further reduction in physical LUT count at negligible accuracy loss (≤0.01 percentage point) (Cassidy et al., 2024).
  • Connectivity Optimization (SparseLUT): Instead of random fixed masks, SparseLUT dynamically grows and prunes the input selection of each fixed-fan-in neuron, yielding up to $2.13$ percentage points higher accuracy without increasing hardware cost or latency (Lou et al., 14 Jan 2026).
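To see why assembly trees sidestep exponential growth, the following sketch compares the raw entry count of one monolithic LUT against a complete tree of small LUTs covering the same number of inputs. It assumes a uniform bit-width $\beta$ at every tree node, whereas the source describes mixed precision, and the function names are illustrative. A tree cannot represent every function of its inputs, so this is a size comparison, not an equivalence.

```python
def monolithic_entries(beta, fan_in):
    """Rows in a single LUT that sees all inputs at once."""
    return 2 ** (beta * fan_in)


def tree_inputs(node_fan_in, depth):
    """Number of primary inputs covered by a complete tree of the given depth."""
    return node_fan_in**depth


def tree_entries(beta, node_fan_in, depth):
    """Total rows across all LUTs in a complete tree (uniform beta assumed)."""
    rows_per_lut = 2 ** (beta * node_fan_in)
    total = 0
    for level in range(depth):
        luts_at_level = node_fan_in ** (depth - 1 - level)  # leaves first, root last
        total += luts_at_level * rows_per_lut
    return total


covered = tree_inputs(node_fan_in=4, depth=2)           # 16 primary inputs
flat = monolithic_entries(beta=2, fan_in=covered)       # 2**32 rows
tree = tree_entries(beta=2, node_fan_in=4, depth=2)     # 5 LUTs x 256 rows
```

With $\beta=2$, a two-level tree of fan-in-4 LUTs covers 16 inputs using 1,280 rows in total, compared with $2^{32}$ rows for a single table over the same inputs.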

5. Empirical Results and Comparative Metrics

Quantitative evaluation demonstrates:

  • MNIST, HDR-5L NeuraLUT: $96\%$ accuracy at 54,798 LUTs, 12 ns latency (Andronic et al., 2024, Guo, 9 Jun 2025).
  • Jet Substructure (JSC-2L): $72\%$ at 4,684 LUTs, 3 ns (Andronic et al., 2024, Guo, 9 Jun 2025).
  • Area-delay product: $6.58\times10^5$ for HDR-5L, with NeuraLUT-Assemble reducing this by up to $62\times$ (5,076 LUTs, 2.1 ns) for $97.9\%$ accuracy (Andronic et al., 1 Apr 2025).
  • LUT-NA (Digital CMOS): $1.23\times$ area and $1.8\times$ energy reduction vs. Wallace-Tree MAC; up to $50.95\times$ area and $6.25\times$ energy reduction vs. naïve LUT-based designs at $<1\%$ accuracy loss (VGG, ResNet, GoogleNet) (Sen et al., 2024).
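The area-delay figures above can be cross-checked with a few lines of arithmetic, assuming the area-delay product is simply the LUT count multiplied by the latency in nanoseconds (an assumption that reproduces the quoted numbers).

```python
def area_delay_product(luts, latency_ns):
    """Area-delay product as LUT count times latency in ns (assumed definition)."""
    return luts * latency_ns


HDR_5L = area_delay_product(54_798, 12)      # NeuraLUT HDR-5L figures from the list above
ASSEMBLE = area_delay_product(5_076, 2.1)    # NeuraLUT-Assemble figures from the list above
RATIO = HDR_5L / ASSEMBLE

print(f"HDR-5L ADP:   {HDR_5L:.3g}")
print(f"Assemble ADP: {ASSEMBLE:.3g}")
print(f"Reduction:    {RATIO:.1f}x")
```

Under that definition, the HDR-5L product is 657,576, or about $6.58\times10^5$, and the ratio to the NeuraLUT-Assemble design rounds to $62\times$.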

Across these comparisons, NeuraLUT designs are reported to use substantially fewer resources and to reach lower latency than traditional DSP-based designs, fixed-point MAC accelerators, pruned BNNs, and polynomial LUT networks at comparable accuracy.

6. Scalability, Limitations, and Application Domains

The principal bottleneck in NeuraLUT is LUT size scaling: $2^{\beta F}$ entries per neuron restricts the allowed $F$ and $\beta$. For large fan-in or high precision, assembly trees, hierarchical decomposition, or hybrid architectures combining DSP-based and LUT-based blocks can be employed (Andronic et al., 1 Apr 2025, Lou et al., 14 Jan 2026). Dynamic reconfiguration (retraining LUT truth tables on the fly) is not feasible.

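NeuraLUT approaches are most effective for fixed-function, ultra-low-latency inference on edge accelerators such as FPGAs and custom CMOS hardware. They are less suited to settings that require runtime weight updates or continuous adaptation, because the input-output mapping is precomputed and fixed.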

7. Outlook and Research Directions

Open directions include:

  • Neural Architecture Search (NAS): Automatic tailoring of $(F,\beta,L,S)$ to optimize resource, latency, and accuracy envelopes (Guo, 9 Jun 2025).
  • Advanced Compression: Cross-LUT merging, adaptive quantization, and further exploitation of don’t-care-based redundancy (Cassidy et al., 2024).
  • Hybrid and Hierarchical Designs: Integrating NeuraLUT with other forms (DSP, XNOR-BNN, PolyLUT, KAN) for layer- or subgraph-specific optimization (Andronic et al., 1 Apr 2025, Hoang et al., 14 Dec 2025).
  • Extension to Non-Perceptron Models: Exploring feasibility in convolutional, attention, or graph-based blocks.
  • Scaling and Multi-FPGA/ASIC Distribution: Partitioning NeuraLUT workloads for extremely large models (e.g., LLM sub-blocks) (Guo, 9 Jun 2025).
  • Adaptivity to Data: Marking infrequently used table entries as don’t-cares for further compression, without exceeding a prescribed accuracy loss (Cassidy et al., 2024).

The NeuraLUT paradigm remains a foundation for hardware-software co-design in ultra-low latency, parallelizable neural inference, with current frontiers in table compression, assembly methodologies, and adaptive connectivity optimization (Guo, 9 Jun 2025, Lou et al., 14 Jan 2026, Sen et al., 2024, Andronic et al., 1 Apr 2025).
