
TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design

Published 13 Feb 2026 in cs.AR and cs.AI | (2602.12962v1)

Abstract: Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based LLMs have become dominant, with rapidly increasing model sizes but low degree of parameter reuse compared to conventional CNNs, making end-to-end execution on resource-limited devices extremely challenging. To address these challenges, we propose TriGen, a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design. Firstly, TriGen adopts low-precision computation using microscaling (MX) to enable additional optimization opportunities while preserving accuracy, and resolves the issues that arise by employing such precision. Secondly, to jointly optimize both nonlinear and linear operations, TriGen eliminates the need for specialized hardware for essential nonlinear operations by using fast and accurate LUT, thereby maximizing performance gains and reducing hardware-cost in on-device environments, and finally, by taking practical hardware constraints into account, further employs scheduling techniques to maximize computational utilization even under limited on-chip memory capacity. We evaluate the performance of TriGen on various LLMs and show that TriGen achieves an average 2.73x performance speedup and 52% less memory transfer over the baseline NPU design with negligible accuracy loss.

Summary

  • The paper presents a co-designed NPU architecture using MXINT8 and LUT-based nonlinear acceleration, achieving a 2.73× speedup and a 52% reduction in memory transfers.
  • It employs microscaling datatypes and a resource-aware scheduling strategy that minimizes DRAM traffic and optimizes compute utilization for LLM inference.
  • The study validates that precision-preserving quantization and software-hardware co-design enable effective on-device LLM acceleration without sacrificing language modeling performance.

TriGen: End-to-End LLM Acceleration via Software-Hardware Co-Design for Resource-Constrained NPUs

Introduction

As transformer-based LLMs increasingly dominate NLP workloads, deploying these models in resource-constrained, on-device environments poses significant architectural and algorithmic challenges. Notably, LLMs exhibit exponentially growing model sizes and minimal parameter reuse per inference, resulting in acute pressure on memory bandwidth and compute efficiency. Traditional neural processing unit (NPU) architectures, designed primarily for dense compute throughput, struggle when such dataflow and activation/weight storage requirements far outstrip available on-chip resources.

TriGen introduces a novel NPU architecture fundamentally co-designed across software and hardware, specifically addressing end-to-end LLM acceleration in on-device scenarios. TriGen’s main contributions are the adoption of microscaling (MX) low-precision datatypes with robust hardware support for both activations and weights, LUT-based acceleration of nonlinear operators (with negligible accuracy loss), and resource-aware scheduling/tiling strategies that maximize utilization within memory constraints. Experimental results show an average 2.73× speedup and a reduction of memory transfer by 52% versus baseline NPU approaches, while incurring no meaningful loss in language modeling performance (2602.12962).

Architectural Innovations: TriGen NPU Design

TriGen’s architecture is shaped by three interlocking components: native support for MXINT8 low-precision formats, a LUT-based pipeline for nonlinear operators, and a resource-aware, software-driven scheduling system.

The high-level architecture of TriGen comprises a central control processor (CP), multiple DLA (deep learning accelerator) cores, a tensor manipulation unit (TMU), a 1 MiB global on-chip buffer, and tight ISA integration (Figure 2).

Figure 2: Overview of the complete TriGen architecture, illustrating interplay of CP, DLA cores, TMU, buffer, and primary dataflow.

MXINT8 and FI32: Efficient Intermediate Data Representation

With paramount emphasis on minimizing DRAM transfer and storage, TriGen employs the MXINT8 microscaling integer format for activations and a 4-bit quantized representation for weights. To enable high-accuracy accumulation and intermediate computations, the architecture introduces a FI32 (Floating Integer 32-bit) format, consisting of an 8-bit exponent and a 24-bit integer fraction, which serves as the intermediate data type between compute modules (Figure 4).

Figure 4: FI32 data format utilized by TriGen for intermediate accumulation and transfer, balancing integer arithmetic efficiency with dynamic range.

This representation delivers a significantly reduced memory footprint (~50%), enabling larger tiles and parameter shards to be cached in limited on-chip buffers, directly improving compute utilization and easing data-movement bottlenecks.
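The grouped-scale idea behind microscaling can be sketched in a few lines. This is an illustrative sketch, not the paper's exact MXINT8 specification: the group size (32), the rounding mode, and the power-of-two scale choice are all assumptions for demonstration.

```python
import numpy as np

def mx_quantize(x, group_size=32):
    """Toy microscaling quantizer: each group of `group_size` values shares
    one power-of-two scale (the 'shared exponent') and stores an 8-bit
    integer mantissa per element."""
    groups = x.reshape(-1, group_size)
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    max_abs = np.where(max_abs == 0, 1.0, max_abs)
    # smallest power of two such that the group's max |value| fits in int8
    exp = np.ceil(np.log2(max_abs / 127.0))
    scale = np.exp2(exp)
    mant = np.clip(np.round(groups / scale), -127, 127).astype(np.int8)
    return mant, exp

def mx_dequantize(mant, exp):
    # broadcast each group's shared scale back over its 8-bit mantissas
    return mant.astype(np.float64) * np.exp2(exp)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
mant, exp = mx_quantize(x)
x_hat = mx_dequantize(mant, exp).reshape(x.shape)
# per-element error is bounded by half of the group's shared scale
```

Because each group carries its own exponent, a group of small activations is not crushed to zero by one large outlier elsewhere in the tensor, which is the failure mode of naive per-tensor INT8 discussed later.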

Latency and Bottleneck Analysis

A major architectural insight is the quadratic growth of nonlinear operator latency (e.g., softmax, SiLU, normalization) as input sequence length increases, unlike linear layers which scale less aggressively. Detailed profiling demonstrates that nonlinear operations can account for up to 20.5% of end-to-end latency at 4k tokens, and their share grows rapidly with larger context windows (Figure 5).

Figure 5: End-to-end LLM inference latency breakdown versus sequence length, highlighting nonlinear operation bottlenecks at longer contexts.

This exposes a critical shortfall in conventional NPUs, which often accelerate only linear modules, leaving nonlinear functions to under-resourced special function units (SFUs) or DSPs.
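The scaling asymmetry can be checked with back-of-envelope arithmetic. The dimensions below (hidden size 4096, 32 heads) are toy assumptions, not the paper's profiled models: a d×d projection over n tokens costs O(n·d²) MACs, while per-head attention scores require O(n²) nonlinear (softmax) evaluations, so the nonlinear share grows roughly linearly in n.

```python
# Toy dimensions (assumptions, not the paper's profile)
D, HEADS = 4096, 32

def linear_macs(n, d=D):
    return n * d * d            # one d x d projection layer: linear in n

def softmax_elems(n, heads=HEADS):
    return heads * n * n        # n x n score matrix per head: quadratic in n

for n in (1024, 4096, 16384):
    ratio = softmax_elems(n) / linear_macs(n)
    print(f"n={n:6d}  nonlinear/linear work ratio = {ratio:.4f}")
```

The printed ratio quadruples each time n quadruples, which is why nonlinear operators that are negligible at short contexts become a first-order bottleneck at 4k tokens and beyond.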

Acceleration of Nonlinear Operations via Hardware LUTs

TriGen’s architecture eliminates the need for dedicated SFU/DSP blocks for nonlinearities. Instead, it embeds a hardware LUT module, carefully co-designed with software to cover the operational domain of all key nonlinearities (reciprocal, inverse sqrt, exponential, SiLU) and supporting linear interpolation as well as error correction via a dual-table scheme. Empirical results confirm sub-0.1% MAPE and negligible MSE for all targeted nonlinear functions, supporting end-to-end accuracy preservation (Figure 1).

Figure 1: High-level view of the LUT module responsible for nonlinear operations, incorporating preprocessing, interpolation, and error correction.
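A minimal sketch conveys how far a small table plus linear interpolation gets you. This is not the paper's dual-table design: the table size (256 entries), the domain, and the choice of exp() (the softmax kernel, applied to non-positive inputs after max subtraction) are assumptions for illustration.

```python
import numpy as np

# Precomputed 256-entry table for exp(x) on [LO, HI] (held in hardware ROM/SRAM)
LO, HI, ENTRIES = -8.0, 0.0, 256
GRID = np.linspace(LO, HI, ENTRIES)
TABLE = np.exp(GRID)

def lut_exp(x):
    """Piecewise-linear LUT approximation of exp(x) on [LO, HI]."""
    pos = (np.clip(x, LO, HI) - LO) / (HI - LO) * (ENTRIES - 1)
    i = np.minimum(pos.astype(np.int64), ENTRIES - 2)  # left table index
    frac = pos - i                                     # interpolation weight
    return TABLE[i] * (1.0 - frac) + TABLE[i + 1] * frac

xs = np.linspace(LO, HI, 10_001)
mape = np.mean(np.abs(lut_exp(xs) - np.exp(xs)) / np.exp(xs)) * 100
print(f"MAPE of 256-entry LUT over [{LO}, {HI}]: {mape:.4f}%")
```

Even this naive single-table version lands well under 0.1% MAPE on a bounded domain; a residual error-correction table, as in TriGen's dual-table scheme, tightens it further without growing the primary table.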

Resource-Constrained Dataflow and Scheduling

Memory bandwidth and buffer constraints are dominant in on-device LLM acceleration. TriGen proposes a software-optimized dataflow comprising stationary selection (IN0/IN1) and tiling, informed by analytical modeling of on-chip/off-chip transfer minimization, arithmetic intensity, and parallel workload distribution (Figure 3).

Figure 3: Normalized latency and DRAM traffic comparisons for stationary dataflow selection, demonstrating up to 58% latency and 24% bandwidth benefit with optimal stationary choice for LLM projections.


Figure 9: Normalized latency across tiling configurations, showing 12.5% reduction with optimal software-selected tile size for LLM layers.

A deterministic compile-time optimizer searches the joint space of stationary policies and tile dimensions, integrating hardware-level soft constraints (DLA core count, SRAM size) and model-graph statistics to yield highly efficient and predictable mapping.
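The joint search can be sketched with a simplified cost model. Everything below is a hypothetical illustration, not the paper's optimizer: the traffic formulas, the tile-size menu, and the 1 MiB buffer budget are assumptions (in this toy model the K-tile only affects whether the tiles fit in SRAM, not the traffic itself).

```python
import itertools

def dram_traffic(M, K, N, tm, tn, stationary):
    """Modeled DRAM traffic (elements) for C = A(MxK) @ B(KxN)."""
    if stationary == "IN0":   # A resides on chip; B re-streamed per row tile
        return M * K + (M // tm) * K * N + M * N
    else:                     # B resides on chip; A re-streamed per column tile
        return K * N + (N // tn) * M * K + M * N

def fits(tm, tk, tn, budget, elem_bytes=1):   # MXINT8: ~1 byte/element
    return (tm * tk + tk * tn + tm * tn) * elem_bytes <= budget

def best_schedule(M, K, N, budget):
    """Exhaustive compile-time search over tiles x stationary policies."""
    best = None
    for tm, tk, tn in itertools.product((32, 64, 128, 256), repeat=3):
        if not fits(tm, tk, tn, budget):
            continue
        for st in ("IN0", "IN1"):
            cand = (dram_traffic(M, K, N, tm, tn, st), tm, tk, tn, st)
            if best is None or cand < best:
                best = cand
    return best

traffic, tm, tk, tn, st = best_schedule(4096, 4096, 4096, 1 << 20)  # 1 MiB buffer
print(f"stationary={st} tiles=({tm},{tk},{tn}) modeled DRAM traffic={traffic}")
```

Because the search space is small and the model deterministic, the mapping is fully predictable at compile time, which matches the document's emphasis on deterministic scheduling for on-device deployment.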

Software-Hardware Co-Design Optimizations

TriGen’s software stack encompasses multiple fusion and operator reordering strategies, including:

  • Operator fusion via mathematical equivalence, e.g., eliminating explicit transpose by modifying matmul ordering and rescale logic.
  • QKV projection batching, which groups queries, keys, and values projections to reuse loaded parameters, dramatically reducing redundant DRAM bandwidth.
  • Coalesced masking in attention via fused bias addition rather than elementwise MULs, exploiting hardware LUT support to mask outputs in a single stage.

These techniques, incorporated successively, deliver the observed cumulative speedup (Figure 11).

Figure 11: Cumulative effect of each optimization strategy on LLM inference latency. MXINT8 delivers 1.65×, with LUT and QKV batching driving further improvement to reach 2.73× speedup.
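One of the fusion strategies above, QKV projection batching, can be sketched directly. The shapes are illustrative assumptions; the point is that concatenating the three weight matrices lets the input activations stream from memory once instead of three times, while producing bitwise-identical results.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 16, 64
X  = rng.standard_normal((n_tokens, d))
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

# Unbatched: three matmuls, X streamed from memory three times
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Batched: one matmul against the concatenated weight, X streamed once
QKV = X @ np.concatenate([Wq, Wk, Wv], axis=1)
Qb, Kb, Vb = np.split(QKV, 3, axis=1)

assert np.allclose(Q, Qb) and np.allclose(K, Kb) and np.allclose(V, Vb)
```

On hardware the saving is in DRAM reads of X (and better reuse of whatever weight slices fit in SRAM), since the arithmetic is unchanged.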

Precision, Accuracy, and Memory Transfer

Detailed ablation reveals that naive 8-bit quantization of activations induces catastrophic perplexity degradation for most LLMs, even if weights are quantized to 4 bits. Only the adoption of MXINT8 or equivalent multi-exponent schemes, precisely supported in both hardware and software, achieves quantization without meaningful loss of language modeling performance (Figure 6).

Figure 6: Effect of MXINT8 adoption on DRAM traffic, resulting in a 50% reduction compared to legacy 16-bit integer activations for LLM workloads.

This validates TriGen’s central design rationale of supporting microscaling datatypes rather than conventional INT8/FP8.

Scalability: Multi-NPU Systems and Variable Sequence Lengths

TriGen’s low-overhead synchronization primitives and systematic workload partitioning enable near-linear scaling up to 4 NPUs, with diminishing returns at higher NPU counts due to DRAM bandwidth limits. Performance also remains robust across variable sequence lengths; as context grows, the advantage of hardware-accelerated LUTs for nonlinearities becomes even more pronounced.
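The shape of this scaling curve follows from a simple model. The numbers below are assumptions, not the paper's measurements: compute time divides across n NPUs, but time spent on the shared DRAM interface does not, so speedup flattens as n grows.

```python
def speedup(n, compute_time=1.0, dram_time=0.25):
    """Toy fixed-bandwidth scaling model: DRAM time is shared, not divided."""
    return (compute_time + dram_time) / (compute_time / n + dram_time)

for n in (1, 2, 4, 8):
    print(f"{n} NPU(s): {speedup(n):.2f}x")
```

With these toy parameters, scaling is near-linear through 4 NPUs and then bends toward the bandwidth ceiling, mirroring the diminishing returns described above.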

Theoretical and Practical Implications

The TriGen architecture demonstrates that end-to-end LLM acceleration on resource-constrained NPUs is possible only by innovating simultaneously across the software-hardware boundary. The work provides:

  • The first practical MXINT8 NPU implementation for on-device LLM inference.
  • Evidence that hardware LUTs can supplant SFU/DSP pipelines for all key transformer nonlinearities without sacrificing either throughput or precision.
  • A template for automated, resource-aware dataflow mapping via integrated scheduling and tile selection.
  • Empirical affirmation that naive INT8 activation quantization remains unsuitable for accuracy-sensitive LLM deployments without MX-like support.

Future Directions

The successful software-hardware co-design of TriGen opens several further research avenues: dynamic or adaptive precision scaling, extension to multi-model workloads on a single NPU, mixed-datatype scheduling for heterogeneous transformer variants, or even further reduction in DRAM requirements via in-place KV cache compression. Adapting the approach to vision-language and multi-modal foundation models is also viable, given their similar bottlenecks in memory bandwidth and nonlinear operator composition.

Conclusion

TriGen sets a new benchmark in end-to-end LLM acceleration for on-device environments, achieving superior latency and memory efficiency through MXINT8 activations, LUT-based nonlinearity support, and resource-optimized scheduling. The platform substantiates the necessity of co-designing both neural network precision strategies and hardware datapaths to unlock high-throughput, memory-efficient inference. These results suggest a path forward for deploying progressively larger models under mobile and edge constraints, with no material loss in model accuracy (2602.12962).


Explain it Like I'm 14

What is this paper about?

This paper introduces TriGen, a new kind of computer chip design (an NPU, or Neural Processing Unit) plus matching software that works together to run LLMs fast on small devices like phones and embedded systems. These devices have limited memory and power, which makes today’s big transformer models hard to run end-to-end. TriGen’s idea is to carefully co-design the software and the hardware so the model runs fully on-device, with less memory movement, lower precision numbers that still keep accuracy, and faster handling of tricky “nonlinear” math steps.

What questions are the authors trying to answer?

In simple terms, the paper tackles three big questions:

  • How can we use smaller numbers (like 8-bit instead of 16-bit) for both model weights and activations without breaking accuracy?
  • How can we speed up nonlinear functions (like softmax, normalization, and SiLU) that often become bottlenecks as sequence length grows?
  • How can we smartly move and reuse data so the chip does more useful work and wastes less time waiting on memory?

How did they do it?

The approach combines three main ideas, each with an everyday analogy to make it clearer.

  1. Smaller numbers that still act “big enough” (Microscaling, MX)
  • Analogy: Imagine you and your friends share one magnifying glass to read tiny text. Each person holds a small card (an 8-bit integer), and the group shares one magnifier (a scale) to see the real value clearly. This lets you store and move smaller cards but still read accurately when needed.
  • What it is: TriGen uses MXINT8, an 8-bit format with a shared scale (exponent) per group. It also adds a simple, precise internal format called FI32 to keep intermediate results accurate while staying efficient.
  • Why it helps: 8-bit data halves memory and compute cost compared to 16-bit, and it shrinks the KV cache too. That’s huge on devices with tiny on-chip memory.
  2. Fast nonlinear math using lookup tables (LUTs)
  • Analogy: Instead of recalculating a hard math problem every time, you look it up in a smart, small “cheat sheet” and, if needed, do a tiny correction.
  • What it is: TriGen builds a fast, accurate LUT unit co-designed with the software. It approximates functions like exponential, inverse square root, softmax parts, and SiLU by:
    • Preprocessing the number (split into an exponent and a fraction),
    • Looking up values in compact tables,
    • Using simple interpolation and a tiny “error fix” table to get very close to the true value.
  • Why it helps: It removes the need for special heavy math hardware and avoids converting data to bigger formats. That cuts latency and saves power.
  3. Smart scheduling and tiling of dataflow
  • Analogy: If your suitcase is small, you pack clothes in the right order, keep the items you’ll reuse on top, and avoid repacking. You also coordinate with travel buddies so no one blocks the aisle.
  • What it is: TriGen’s compiler chooses:
    • Which part of the math “stays put” on chip (stationary),
    • How to cut big matrices into tiles that fit on limited SRAM,
    • How to balance work across multiple compute cores or multiple NPUs,
    • How to minimize trips to DRAM (off-chip memory), which are slow and power-hungry.
  • Why it helps: Even a very fast chip can be slowed down by memory. The scheduler co-optimizes compute and data movement so the chip stays busy and efficient.
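The "pack the suitcase in tiles" idea above can be shown as a toy tiled matmul: each output tile stays put ("stationary") while the matching SRAM-sized strips of the inputs are streamed past it. The tile size here is an arbitrary assumption.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Compute C = A @ B one SRAM-sized tile at a time."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):              # each C tile stays resident
        for j in range(0, N, tile):
            for k in range(0, K, tile):      # stream matching A/B strips through
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 96)), rng.standard_normal((96, 128))
assert np.allclose(tiled_matmul(A, B), A @ B)
```

The answer is identical to the untiled product; what changes is that only three small tiles need to be on chip at any moment, which is the whole game on a device with about 1 MiB of buffer.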

Extra software tricks that matter:

  • Fuse away costly transposes: By rearranging the math for attention (especially the V projection), TriGen avoids explicitly transposing MX-format tensors, which would be expensive and inaccurate.
  • Masking without extra work: During attention, TriGen uses the LUT pipeline and partial sums to “zero out” masked positions without a separate multiply step.
  • Batch Q/K/V projections: It loads just the needed weights for heads that fit in on-chip memory and processes them efficiently to minimize repeated memory reads.

Hardware at a glance:

  • A controller (RISC-V), multiple deep-learning cores, a tensor unit, and shared on-chip memory (about 1 MiB).
  • A 32×32 multiply-accumulate array with mixed-precision support and a post-processing pipeline that handles bias, rescaling, nonlinear LUTs, and output formatting.
  • Support for multi-NPU synchronization so several NPUs can cooperate without stalling.

What did they find, and why is it important?

Main results:

  • Speed: On several LLMs, TriGen is on average 2.73× faster than a baseline NPU.
  • Memory movement: It reduces memory transfers by 52%.
  • Accuracy: It keeps accuracy nearly the same as higher-precision baselines, despite using 8-bit activations and low-bit weights.
  • Nonlinear ops: As sequences grow (e.g., 2K–4K tokens), nonlinear functions take a bigger slice of time; TriGen’s LUT approach keeps those from becoming a bottleneck.

Why that matters:

  • On-device systems have very limited on-chip memory and constrained bandwidth to DRAM. If you don’t reduce precision for activations too—and if you don’t schedule data well—the chip spends more time moving data than computing.
  • Prior work often used 16-bit activations to keep accuracy, which increases latency and power. TriGen shows a practical path to 8-bit activations with minimal accuracy loss.
  • Eliminating specialized nonlinear hardware and still being accurate simplifies design and saves area and power.

What’s the impact?

In plain terms, this research helps bring truly capable LLMs to small devices:

  • Faster responses and longer battery life on phones, wearables, and cars because the chip moves less data and does fewer heavy calculations.
  • Better privacy and lower cost, since more can be done on-device without sending data to the cloud.
  • Support for longer inputs (like longer documents) without nonlinear functions becoming a time sink.
  • A blueprint for future edge-AI chips: co-design the model, number formats, hardware, and compiler together instead of treating them separately.

Big picture: TriGen shows that careful software-hardware co-design—using smart number formats, LUT-based nonlinear math, and resource-aware scheduling—can make end-to-end LLM inference practical on memory-limited devices, with strong speedups and almost no loss in accuracy.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of what remains missing, uncertain, or unexplored in the paper, formulated to be actionable for future research.

  • End-to-end accuracy with MXINT8: Provide comprehensive accuracy results for TriGen using MXINT8 activations and low-bit weights across multiple LLMs, datasets, and tasks (beyond perplexity), including long-sequence (≥4k, ≥8k) and downstream task performance (e.g., QA, summarization, instruction-following).
  • MX configuration specification: Define the exact microscaling scheme (group size, sharing axis, per-tensor vs. per-channel vs. per-group scales, update policy) and quantify its impact on accuracy, memory, and latency; include calibration/training procedures and runtime scale update rules.
  • FI32 numerical semantics: Clarify FI32 signedness, overflow/underflow behavior, rounding modes, normalization, and conversion costs; provide formal error bounds for ACC exponent alignment and FRAC aggregation, and demonstrate stability across transformer blocks.
  • Softmax implementation details: Specify the full softmax pipeline under low precision (exp, sum, division), whether log-sum-exp is used, and quantify numerical stability/accuracy for high-variance attention scores and long contexts in MXINT8/FI32.
  • RoPE support: Describe how rotary position embeddings are implemented (e.g., LUT-based sin/cos), their precision and memory footprint, and measure end-to-end accuracy/latency impacts under MXINT8.
  • LUT coverage and impact: Extend LUT design beyond reciprocal/ISQR/exp/SiLU to GELU, tanh, sigmoid, and other LLM-relevant functions; provide end-to-end accuracy impact, latency/area of LUT access and interpolation, and formal error bounds across operational ranges.
  • LUT error propagation: Quantify how LUT approximation errors affect attention distributions, normalization, and activation outputs (e.g., bias in softmax weights), with per-layer sensitivity and tight error budgets.
  • Elimination of SFU vs. LUT trade-offs: Provide synthesized area, frequency, power, and latency comparisons of LUT-based nonlinear processing versus conventional SFUs/VPUs on a specific process node; include TOPS/W and TOPS/mm².
  • Baseline clarity and reproducibility: Fully specify the baseline NPU (datatypes, MAC array size, SRAM/DRAM bandwidth, frequency, compiler/tile policies) used for the 2.73× speedup and 52% DRAM reduction to enable fair, replicable comparisons.
  • Energy and thermal evaluation: Report energy per token, power draw, and thermal behavior under realistic on-device conditions; compare to FP16 activation baselines and analyze DVFS effects.
  • Memory system characterization: Detail on-chip buffer sizes (IBUF/WBUF/OBUF), banking, interconnect, DMA behavior, off-chip bandwidth assumptions (bus width, frequency), bank conflicts, and arbitration; provide sensitivity analyses.
  • KV cache management: Describe how K/V tensors are stored in MXINT8 (or other formats), compression policies, precision impacts over long contexts, sliding-window attention, and memory mapping strategies to minimize DRAM traffic.
  • Dataflow optimizer formalism: Present the optimizer’s cost model (DRAM traffic, compute overlap, SRAM constraints), search complexity, guarantees (e.g., near-optimality bounds), and generalization beyond matmul (e.g., elementwise, reductions, transposes).
  • Shape mismatch and underutilization: Analyze MAC array underutilization for non-multiple-of-32 dimensions, padding strategies, and their effects on throughput and energy; propose adaptive tiling to mitigate utilization loss.
  • Multi-NPU scaling: Specify partitioning strategies (tensor parallel, pipeline parallel), interconnect requirements, synchronization overhead, load balancing, and evaluate scalability and energy/performance trade-offs across 2–N NPUs.
  • Compiler/toolchain availability: Document the ISA, code generation, operator fusion passes, quantization pipelines, calibration tooling, and integration with common ML frameworks; provide artifacts for reproducibility.
  • Wider mixed-precision exploration: Evaluate INT3/INT2 weights, FP8 variants, or hybrid formats with MXINT8 activations; provide accuracy–performance trade-off curves per layer/operator and model scale.
  • Outlier handling under MX: Propose and evaluate algorithms to detect/mitigate activation outliers within MX (without falling back to 16-bit), including dynamic clipping/scaling strategies and their accuracy impacts.
  • Quantization scope: Assess quantization of embeddings, normalization/gating parameters, and logits; specify rounding modes and zero-point choices, and measure their contribution to end-to-end accuracy and latency.
  • Mask fusion correctness: Provide formal correctness guarantees and heuristics for “skip masked sub-matrix” detection; quantify worst-case overhead, accuracy impact under diverse masks (causal, padding, attention bias), and generalization to bidirectional attention.
  • Long-context robustness: Evaluate numerical stability and accuracy for very long contexts (≥8k, ≥32k) and streaming/continuous batching modes, including accumulated rounding errors across autoregressive timesteps.
  • Reliability and fault tolerance: Study resilience to bit-flips/ECC events in SRAM/LUT tables, error detection/recovery mechanisms, and their impact on accuracy/performance in on-device conditions.
  • PVT and DVFS sensitivity: Characterize performance/accuracy across process, voltage, and temperature corners, and under DVFS; provide guidance for robust operating points on mobile SoCs.
  • Coverage of diverse LLM architectures: Test and adapt TriGen for models with different nonlinearities and structures (e.g., GELU, layer norm variants, MoE, sparse attention), and detail any required hardware/software changes.
  • FI32 storage overhead: Quantify the SRAM footprint and bandwidth cost of FI32 intermediates (PSUM, CWQ, bias) in end-to-end schedules, and evaluate trade-offs versus pure integer accumulation strategies.

Practical Applications

Immediate Applications

Below are actionable, real-world use cases that can be deployed now or piloted with modest integration, linked to sectors and practical workflows/tools, with assumptions and dependencies noted.

  • On-device LLM assistants for smartphones and tablets (consumer electronics, software)
    • Use cases: offline summarization, note-taking, translation, email drafting, voice command understanding, and privacy-preserving chat.
    • Tools/products/workflows: MXINT8 + INT4 quantization pipeline for Llama/OPT-family models; TriGen-style compiler passes (operator fusion, resource-aware tiling); runtime using TMATMUL and LUT backends; KV-cache in 8-bit.
    • Assumptions/dependencies: Hardware must support integer MAC arrays and a LUT-backed nonlinear pipeline or equivalent; models calibrated for MXINT8 activations; 1–4 MiB SRAM available; negligible accuracy loss holds for target tasks.
  • Automotive infotainment voice assistants (automotive, software, energy)
    • Use cases: offline natural language control for navigation, media, and settings; low-latency responses under tight thermal/energy budgets.
    • Tools/products/workflows: TriGen-like NPU IP blocks in SoCs; LUT-based softmax/SiLU; PSUM masking fusion to avoid costly element-wise ops; compiler scheduling tuned for in-vehicle memory bandwidth.
    • Assumptions/dependencies: SoCs expose sufficient on-chip SRAM and deterministic scheduling; qualification for automotive-grade reliability.
  • Wearables and edge devices (consumer electronics, energy)
    • Use cases: smartwatch dictation, quick replies, on-device intent detection with multi-NPU concurrency (e.g., LLM + audio model).
    • Tools/products/workflows: Multi-NPU Sync-ID firmware; batched QKV projections to keep parameters resident in SRAM; resource-aware tiling for sub-megabyte memories.
    • Assumptions/dependencies: Presence of multiple small NPUs or heterogeneous accelerators; firmware access to synchronization registers.
  • Enterprise privacy-preserving workflows (finance, healthcare, policy)
    • Use cases: local summarization of emails/meetings, document redaction suggestions, basic triage assistants running fully on-device to avoid data egress.
    • Tools/products/workflows: MXINT8 activation quantization to fit KV cache; FI32 PSUM accumulation for accuracy; compiler-level fusion of transpose and masking to minimize DRAM traffic.
    • Assumptions/dependencies: Acceptable task accuracy with low-precision; device management policies allow local AI; compliance review.
  • Reduced energy consumption in existing NPUs via software-side scheduling (energy, software, academia)
    • Use cases: lower DRAM traffic (up to ~52% reduction reported) and improved utilization on memory-bound LLM inference by adopting TriGen’s dataflow mapping ideas.
    • Tools/products/workflows: Integrate resource-aware stationary/tile selection as a pass in TVM/Triton/XLA; expose per-operator tile knobs; deterministic runtime scheduling.
    • Assumptions/dependencies: Access to compiler stack and low-level ISA controls; hardware that supports transposed matmul or equivalent; benefits depend on SRAM capacity and bandwidth.
  • LUT-backed nonlinear operations on existing accelerators (software, hardware)
    • Use cases: faster softmax, SiLU, normalization in long-context attention where nonlinears can account for ~20% latency at 4k sequence length.
    • Tools/products/workflows: Compact+residual LUTs with interpolation; firmware/DSP microcode replacing SFU calls; precomputation of error tables; pipelined with PSUM.
    • Assumptions/dependencies: Hardware path to invoke LUTs with exponent-aware scaling; accuracy validated for operational ranges; memory footprint for tables (~kB scale).
  • Academic benchmarking and curriculum (academia)
    • Use cases: course labs and research projects on SW-HW co-design; reproducing latency breakdowns; studying microscaling and LUT approximations.
    • Tools/products/workflows: Open-source benchmarks for attention latency vs. sequence length; quantization calibration scripts; evaluation harness for LUT MAPE/MSE.
    • Assumptions/dependencies: Availability of model weights and calibration data; access to a programmable accelerator or simulator.
  • Policy guidance toward on-device AI (policy)
    • Use cases: RFPs and procurement criteria that favor on-device AI for privacy, latency, and energy; guidance for limiting cloud inference for sensitive contexts.
    • Tools/products/workflows: Technical briefs showing 2.73× speedup and memory savings; energy-per-inference metrics; privacy impact assessments.
    • Assumptions/dependencies: Stakeholder buy-in; standardized reporting of energy and accuracy; device capability disclosure.
  • Developer workflows for LLM deployment on constrained hardware (software)
    • Use cases: repeatable pipeline from FP16 to MXINT8/INT4; operator fusion (transpose elimination via TMATMUL equivalence, coalesced masking); head-splitting + batched QKV to fit SRAM.
    • Tools/products/workflows: Quantization toolkits (PyTorch/TensorRT plugins) that emit MXINT8; compiler passes for TriGen-style ISA; LUT generator with accuracy reports.
    • Assumptions/dependencies: Model families compatible with microscaling (limited outliers); calibration datasets; ability to adjust stationary policy per operator.

Long-Term Applications

These opportunities require further research, scaling, standardization, or hardware productization before broad deployment.

  • Mass-market TriGen-class NPU IP in mobile/edge SoCs (consumer electronics, automotive, robotics)
    • Use cases: ubiquitous low-power LLM capabilities across phones, AR glasses, appliances, and in-cabin systems.
    • Tools/products/workflows: MPA with exponent-aware mixed-precision; PPA with LUT engine; FI32 accumulators; multi-NPU synchronization fabric; ISA and compiler ecosystem.
    • Assumptions/dependencies: Silicon design and verification cycles; EDA tooling; vendor ecosystem and SDK support; sustained model accuracy under MXINT8.
  • Standardization of MXINT8 and FI32 across ML stacks (software, academia, industry consortia)
    • Use cases: cross-vendor portability of models and kernels; consistent calibration pipelines and ONNX/PyTorch dtype support.
    • Tools/products/workflows: Quantization specs; ONNX opset extensions; training-time aware quantization and post-training calibration for microscaling.
    • Assumptions/dependencies: Community agreement on formats; compiler/runtime support; robust handling of activation outliers.
  • Long-context on-device LLMs (software, consumer electronics, enterprise)
    • Use cases: 8k–32k token context for meeting assistants, e-readers, code assistants, and document summarizers without cloud dependency.
    • Tools/products/workflows: KV cache compression with MXINT8; SRAM-aware tiling; sequence-aware scheduling; LUT-accelerated nonlinears to cap quadratic latency.
    • Assumptions/dependencies: Increased on-chip memory or hierarchical caching; careful accuracy validation at long contexts; thermal design margins.
  • Multimodal on-device AI (healthcare, education, robotics)
    • Use cases: vision-language assistants (e.g., scene description, tutoring) and speech-grounded LLMs in robots and assistive devices.
    • Tools/products/workflows: Extend LUT library to GELU/other nonlinears; integrate vision transformer backbones; co-designed dataflow across modalities.
    • Assumptions/dependencies: Hardware support for mixed workloads (vision + language); memory partitioning strategies; model retraining for microscaling.
  • Edge AI orchestration with multi-accelerator QoS (software, systems)
    • Use cases: concurrent execution of multiple models (ASR, LLM, personalization) with deadlock-free synchronization and latency guarantees.
    • Tools/products/workflows: Sync-ID instruction semantics extended to cross-accelerator fabrics; runtime schedulers with deterministic dataflow; thermal-aware task allocation.
    • Assumptions/dependencies: Heterogeneous accelerator presence; OS/runtime integration; priority and preemption policies.
  • Sector-specific, validated on-device assistants (healthcare, finance, public sector)
    • Use cases: clinical note summarization on secure tablets; compliance Q&A; offline form-filling and triage.
    • Tools/products/workflows: Domain fine-tuning with microscaling-aware quantization; rigorous validation pipelines; audit trails; model cards documenting accuracy trade-offs.
    • Assumptions/dependencies: Regulatory approvals; risk management for quantization-induced errors; secure hardware enclaves.
  • Energy policy and labeling for AI features (policy, energy)
    • Use cases: standardized energy-per-inference reporting; incentives for local inference; eco-labels for AI features on devices.
    • Tools/products/workflows: Benchmarks capturing DRAM traffic reductions; lifecycle energy models; certification programs.
    • Assumptions/dependencies: Industry collaboration; test protocols; consumer communication standards.
  • Developer ecosystem and toolchain maturation (software, academia)
    • Use cases: robust compiler backends targeting TriGen ISA; autotuners for tile/stationary policy; model zoo of MXINT8-ready LLMs.
    • Tools/products/workflows: TVM/Triton/XLA extensions; autotuning datasets; LUT accuracy validation suites; reference implementations of operator fusions.
    • Assumptions/dependencies: Funding and community participation; documentation and training; interoperability with existing ML stacks.
  • Robotics and embedded control with local language understanding (robotics)
    • Use cases: voice-to-action pipelines on mobile robots/drones; task planning with limited connectivity.
    • Tools/products/workflows: Tight integration of LLM with control loops; deterministic scheduling under power constraints; safety validation.
    • Assumptions/dependencies: Real-time guarantees; robust fallbacks for misinterpretations; hardware safety margins.
  • Secure personal data processing on consumer devices (daily life, privacy)
    • Use cases: local summarization of private chats, journals, photos (text annotations) without cloud upload.
    • Tools/products/workflows: MXINT8 pipelines that fit memory budgets; extended context for personal histories; opt-in privacy settings.
    • Assumptions/dependencies: Usability and trust; device storage and compute budgets; clear privacy messaging.

Each long-term application benefits directly from TriGen’s SW-HW co-design principles—microscaling for activations, LUT-accelerated nonlinears, and resource-aware scheduling—but will require broader ecosystem support, hardware productization, and rigorous accuracy/robustness validation to be realized at scale.
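The microscaling idea underpinning several of these applications, a shared power-of-two exponent over a small block of integer mantissas, can be sketched in a few lines. The block size of 32 matches the MX convention, but the mantissa width and exponent-selection rule below are illustrative assumptions, not TriGen's exact MXINT8 encoding.

```python
import numpy as np

def mx_quantize(block, bits=8):
    """Quantize a block of values to shared-exponent integers (MX-style).

    One power-of-two scale is shared by the whole block, so each element
    stores only a small signed integer mantissa.
    """
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block, dtype=np.int32), 0
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit mantissas
    # Pick the shared exponent so the largest magnitude fits in [-qmax, qmax].
    shared_exp = int(np.floor(np.log2(max_abs))) - (bits - 2)
    scale = 2.0 ** shared_exp
    ints = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int32)
    return ints, shared_exp

def mx_dequantize(ints, shared_exp):
    """Recover approximate values from mantissas and the shared exponent."""
    return ints.astype(np.float64) * (2.0 ** shared_exp)

rng = np.random.default_rng(0)
block = rng.normal(size=32)             # one MX block of 32 elements
ints, e = mx_quantize(block)
recon = mx_dequantize(ints, e)
print(np.max(np.abs(block - recon)))    # small per-block quantization error
```

Because the scale is constrained to a power of two, dequantization reduces to a shift of the exponent rather than a full multiply, which is part of what makes the format cheap in hardware.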

Glossary

  • Accumulator (ACC): A register or unit that accumulates partial sums from MAC operations during matrix computations. "The output result of MPA is accumulated in ACC."
  • Arithmetic intensity: The ratio of computation to data movement; higher values generally mean better performance under memory constraints. "to maximize arithmetic intensity and minimize data movement during on-device execution."
  • Channel-wise quantization (CWQ): Quantization scheme applying distinct scales per output channel to preserve accuracy. "such as bias (BIAS), partial sum (PSUM), and channelwise rescale (CWQ), where maintaining higher precision is essential."
  • Control processor (CP): A lightweight controller (here RISC-V) that orchestrates the NPU’s functional units and instruction flow. "CP is a lightweight RISC-V processor that controls and manages other components."
  • Deep learning accelerator (DLA): A specialized compute core optimized for neural network workloads. "An NPU contains CP, four DLA cores, TMU, and 1MiB global buffer (on-chip memory, SRAM)."
  • Dequantization: Converting low-precision integers back to higher-precision values for computation. "Frequent dequantization of weight into higher data precision (e.g., FP4 to FP16) incurs severe latency overhead."
  • Digital signal processor (DSP): A programmable vector processor often used for signal and numeric operations; here avoided by on-accelerator LUTs. "seamless acceleration of nonlinear function without SFU or companion vector processor such as DSP"
  • Feed-forward network (FFN): The MLP sub-block in transformers that applies learned projections and nonlinearities. "A decoder layer consists of attention block and FFN."
  • FI32 (Floating Integer 32-bit): An intermediate number format with integer fraction and exponent fields, enabling high-precision accumulation without full floating point. "Furthermore, TriGen incorporates FI32 as an intermediate data type within the architecture's data flow."
  • Fully-connected (FC) layer: A dense linear layer performing learned projections. "output projection layer, which performs projection using FC layers."
  • General matrix multiplication (GEMM): The standard batched matrix-matrix multiply kernel central to DNNs. "Therefore, total cost (i.e., TOPS/mm2, or TOPS/W) of GEMM operation is significantly reduced."
  • Head-splitting: Partitioning attention into multiple heads before projection to enable per-head computation and memory locality. "The head-splitting is performed before projection, enabling tailored attention computation with only the necessary data."
  • Instruction Set Architecture (ISA): The set of operations the accelerator supports and their encodings. "TriGen ISA includes the operations below to accelerate whole process of RMSNorm."
  • KV cache: The stored keys and values from past tokens used to speed up autoregressive attention. "and it offers the advantage of reducing the total amount of data used in the KV cache."
  • LLM: A transformer-based model with billions of parameters used for generative language tasks. "In recent years, numerous research efforts have been conducted to develop accelerators for LLMs."
  • Lookup table (LUT): Precomputed table enabling fast approximations of nonlinear functions (e.g., exp, inverse sqrt, SiLU). "By leveraging a novel, fast, and accurate LUT co-designed with software and hardware, TriGen enable the efficient processing of complex nonlinear operations such as softmax, SiLU, and normalization"
  • MAC (multiply and accumulate): The fundamental operation in linear algebra kernels, multiplying pairs and accumulating their sum. "Each array performs MAC operations between two input vectors with length of 32 and therefore one MPA performs 32×32 MAC operations at once."
  • MAPE (Mean Absolute Percentage Error): An error metric reporting average absolute percentage deviation between approximations and true values. "with a MAPE consistently below 0.1% and a MSE remaining under"
  • matmul (matrix multiplication): Core linear operation multiplying two matrices; the backbone of transformer layers. "linear operations such as matmul account for the majority of computations"
  • MPA (MAC processing array): A systolic-like array of MAC units used for high-throughput dot products and matrix multiplications. "The MPA leverages a (32 × 32) MAC array to perform single-cycle dot product calculations"
  • MSE (Mean Squared Error): An error metric reporting the mean of squared differences between approximations and true values. "with a MAPE consistently below 0.1% and a MSE remaining under"
  • MX (microscaling): A number system using shared exponents for groups of integers to retain dynamic range at low precision. "TriGen adopts low-precision computation using microscaling (MX) to enable additional optimization opportunities while preserving accuracy"
  • MXINT8: An 8-bit microscaled integer data type enabling low-precision compute with shared exponents. "low-precision LLM inference with native MXINT8 format support"
  • Neural processing unit (NPU): A hardware accelerator specialized for neural network inference. "We propose NPU architecture named TriGen that enables: i) low-precision LLM inference with native MXINT8 format support"
  • Outlier (in quantization): Activation values with unusually large magnitude that impair uniform quantization accuracy. "the existence of outliers whose values are extremely large compared to the others hinders the activation quantization."
  • Partial summation (PSUM): Intermediate accumulation value within a fused operator pipeline. "PSUM serves to aggregate values into TMATMUL's output, enabling direct manipulation of fused LUT inputs."
  • Perplexity (PPL): A standard language modeling metric measuring predictive uncertainty (lower is better). "as measured by PPLs."
  • Post-processing array (PPA): Hardware pipeline for biasing, de/quantization, LUT-based nonlinearities, and output formatting. "PPA processes activation function such as ReLU, SiLU by referring to LUT."
  • Processing element (PE): The basic compute unit inside accelerator arrays used for parallel arithmetic. "maximize PE utilization while reducing DRAM access with software-defined dataflow and tiling strategy."
  • RISC-V: An open instruction set architecture used for the NPU’s control processor. "CP is a lightweight RISC-V processor that controls and manages other components."
  • RMSNorm (root mean square normalization): Normalization dividing activations by their RMS, commonly used in modern LLMs. "Llama model can be split into two major parts; attention module and FFN and, at the front of them, RMSNorm layers exist."
  • Scaled Dot-Product Attention (SDPA): The core attention mechanism computing softmax of scaled query-key dot products applied to values. "In the SDPA process of a softmax layer requiring exponentiation, TriGen strategically manipulates PSUM"
  • SiLU (sigmoid linear unit): An activation function defined as x·sigmoid(x), used in transformer MLPs. "nonlinear operations such as softmax, SiLU, and normalization"
  • Special function unit (SFU): Dedicated hardware for complex math functions (e.g., exp, sqrt); TriGen avoids needing it via LUTs. "seamless acceleration of nonlinear function without SFU or companion vector processor such as DSP"
  • Stationary policy: A dataflow choice that keeps one operand resident on-chip to increase reuse and reduce bandwidth. "such as sub-optimal tile sizes or mis-chosen stationary policies"
  • Tensor manipulation unit (TMU): Hardware block for reshaping, transpose, split, and concatenate of tensors. "An NPU contains CP, four DLA cores, TMU, and 1MiB global buffer (on-chip memory, SRAM)."
  • Tera operations per second (TOPS): A throughput metric denoting trillions of operations per second. "While existing NPUs often pursue peak TOPS, their effective performance on memory-limited devices is largely constrained"
  • TMATMUL: A TriGen instruction performing transposed matrix multiplication with on-array tiling and accumulation. "The TMATMUL, which is a representative operation conducted in MPA, has two input matrices IN0 and IN1"
  • Vector processing unit (VPU): A companion vector engine; TriGen’s LUT pipeline removes the need for it. "without additional hardware such as vector processing unit and without any loss of accuracy."
  • Weight-only quantization: Quantizing weights to very low precision while keeping activations at higher precision. "Most of those research focused on weight-only quantization, keeping the precision of activation unchanged and preserving accuracy"
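The "Arithmetic intensity" entry above can be made concrete with a back-of-envelope GEMM calculation; the matrix shapes and the one-byte element width below are illustrative, and each operand is assumed to be transferred exactly once.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem):
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2 * m * n * k                               # one multiply + one add per MAC
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

# A decode-step GEMV (a single token against a 4096x4096 weight matrix)
# has very low intensity: every weight byte is used for ~2 FLOPs.
print(gemm_arithmetic_intensity(1, 4096, 4096, 1))
# A prefill GEMM over 512 tokens reuses each weight many times.
print(gemm_arithmetic_intensity(512, 4096, 4096, 1))
```

This is why LLM decoding is memory-bound on on-device NPUs: below the hardware's compute-to-bandwidth ratio, reducing data movement (as MXINT8 does) matters more than adding MACs.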
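The "Lookup table (LUT)" and "SiLU" entries can be illustrated with a simple table-lookup approximation. The table size, input range, and linear-interpolation scheme here are assumptions for illustration, not TriGen's actual LUT design.

```python
import numpy as np

# Hypothetical LUT parameters; TriGen's real table layout differs.
LUT_MIN, LUT_MAX, LUT_SIZE = -8.0, 8.0, 1024
_xs = np.linspace(LUT_MIN, LUT_MAX, LUT_SIZE)
_silu_table = _xs / (1.0 + np.exp(-_xs))      # SiLU(x) = x * sigmoid(x)

def silu_lut(x):
    """Approximate SiLU by linear interpolation into a precomputed table."""
    x = np.clip(x, LUT_MIN, LUT_MAX)
    # Fractional table index, then blend the two neighbouring entries.
    pos = (x - LUT_MIN) / (LUT_MAX - LUT_MIN) * (LUT_SIZE - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, LUT_SIZE - 1)
    frac = pos - lo
    return (1 - frac) * _silu_table[lo] + frac * _silu_table[hi]

x = np.linspace(-6.0, 6.0, 101)
exact = x / (1.0 + np.exp(-x))
print(np.max(np.abs(silu_lut(x) - exact)))    # interpolation error stays tiny
```

The same pattern applies to the exponentiation in softmax and the inverse square root in RMSNorm, which is how a single LUT pipeline can replace a dedicated SFU or vector processor.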
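The "Perplexity (PPL)" entry corresponds to a one-line formula, the exponential of the average negative log-likelihood per token; a minimal sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every token has perplexity 4:
print(perplexity([math.log(0.25)] * 10))
```

Lower is better; quantization schemes such as MXINT8 are judged "negligible accuracy loss" when the PPL of the quantized model stays close to the full-precision baseline.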
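The "RMSNorm" entry can likewise be written out directly. TriGen evaluates the inverse square root through its LUT pipeline; the sketch below computes it in floating point for clarity, with an assumed epsilon value.

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square over the last axis,
    then apply a learned per-channel gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.default_rng(1).normal(size=(4, 64))
y = rmsnorm(x, np.ones(64))
print(np.sqrt(np.mean(y * y, axis=-1)))   # each row normalized to RMS ~= 1
```

Unlike LayerNorm, RMSNorm needs no mean subtraction, which shortens the reduction chain the accelerator must fuse ahead of the attention and FFN blocks.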
