
Hybrid Precision-Scalable MAC Design

Updated 12 February 2026
  • The paper presents innovative designs that integrate reconfigurable multiplier arrays with multi-stage reduction trees to deliver high throughput and energy efficiency.
  • Hybrid precision-scalable MAC designs adapt to a wide range of data formats by combining early integer-style accumulation with floating-point normalization for robust neural processing.
  • They employ dynamic partitioning, block exponent sharing, and controlled accuracy relaxation techniques to optimize area, power, and performance in state-of-the-art NPUs.

Hybrid precision-scalable reduction tree multiply-accumulate (MAC) designs are integral to modern neural processing unit (NPU) architectures, enabling flexible and efficient computation across a wide range of data formats, from narrow integer or floating-point precisions for inference to wide-dynamic-range floating point for training. These architectures leverage reconfigurable datapaths, hybrid arithmetic pipelines, and optimized adder tree topologies to minimize area, power, and quantization loss while maintaining high throughput and adaptability to emerging mixed-precision standards such as Microscaling (MX) and block-based floating point.

1. Architectural Principles of Precision-Scalable Reduction Tree MACs

Hybrid precision-scalable reduction tree MACs share several core architectural features to enable runtime flexibility in supported precisions and efficient arithmetic across data representations. The fundamental building blocks are a bank of small bit-slice multipliers (typically 2×2 or 4×4), grouped and orchestrated to support formats ranging from INT8 to FP8/6/4 (for example, E5M2/E4M3, E3M2/E2M3, E2M1), and a multi-stage reduction tree into which these multipliers feed their partial products. The tree may consist of multiple levels (often L1/L2/accumulate) performing pairwise or groupwise addition, exponent alignment (for floating point), and eventual merging with a high-precision (usually FP32) accumulator.
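As a toy illustration of how a bank of small multipliers composes wider products, the following sketch builds an unsigned 8×8 multiply from four 4×4 sub-products (a generic radix-16 decomposition, not any specific paper's array organization or gating scheme):

```python
def mul8_from_mul4(a, b):
    """Compose an unsigned 8x8 multiply from four 4x4 sub-multipliers.

    With a = aH*16 + aL and b = bH*16 + bL:
        a*b = (aH*bH << 8) + ((aH*bL + aL*bH) << 4) + aL*bL
    """
    aL, aH = a & 0xF, a >> 4
    bL, bH = b & 0xF, b >> 4
    # In a narrower precision mode, the same four 4x4 units could
    # instead deliver four independent INT4 products per cycle.
    return (aH * bH << 8) + ((aH * bL + aL * bH) << 4) + (aL * bL)
```

Precision-scalable arrays exploit exactly this structure: the sub-multipliers are shared hardware, and the shift-and-add recombination is enabled or bypassed per mode.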

Block-level exponent sharing and dynamic partitioning are employed, especially in MX and similar block-based formats, to maximize spatial and temporal hardware reuse. Multiplexed datapaths and control logic are designed to permit seamless switching between supported MX types without significant hardware duplication or overhead (Cuyckens et al., 9 Nov 2025, Noh et al., 7 Jul 2025, Ibrahim et al., 2021).

2. Hybrid Reduction-Tree Schematic and Arithmetic

The canonical reduction tree in these MACs is hybrid in that it combines integer-style early accumulation (direct integer addition of aligned significands) with eventual floating-point (FP32) normalization and storage, thus reducing the bitwidth and complexity of the adder while mitigating losses from quantization or overflows. For example, after partial product generation, the tree performs the following sequence:

  • L1: Pairwise addition of four 10-bit products, producing intermediate 11/12-bit sums.
  • L2: Alignment of the significands to the maximal local exponent, yielding a 28-bit "product sum" (with guard bits).
  • Accumulate: Merging of the 28-bit sum with a 24-bit FP32 mantissa, extension to 53 bits for addition, and normalization of the result back to the FP32 format.

Mathematically, this flow can be described as:

$$e_{\max} = \max_i(e_i), \quad pp_i = \mathrm{mant}_i \ll (e_{\max} - e_i), \quad S = \sum_{i=1}^{4} pp_i,$$

where $S$ is combined with the partial sum and normalized:

$$A = S \mathbin{\|}_{24} M_{\mathrm{FP32}}, \qquad R = \mathsf{Norm}(A).$$

This early integer-style accumulation dramatically reduces the register and adder widths, facilitating high-speed, area- and energy-efficient implementations (Cuyckens et al., 9 Nov 2025).
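The align-sum-normalize flow can be sketched behaviorally (a simplified model with unsigned mantissas, truncating right-shift alignment, and no sign or rounding handling; the `guard` width and 24-bit mantissa limit are illustrative, not the paper's exact RTL):

```python
def hybrid_reduce(prods, acc, guard=4):
    """Behavioral sketch of a hybrid reduction tree step.

    prods: list of (mantissa, exponent) product terms (unsigned ints)
    acc:   (mantissa, exponent) running accumulator term
    guard: extra low-order bits retained during alignment
    """
    # Align every term to the largest local exponent, keeping guard bits
    e_max = max(e for _, e in prods + [acc])
    aligned = [(m << guard) >> (e_max - e) for m, e in prods]
    # Integer-style early accumulation of the aligned significands
    s = sum(aligned) + ((acc[0] << guard) >> (e_max - acc[1]))
    # Normalize back so the mantissa fits a 24-bit (FP32-like) field
    e_res = e_max
    while s >= (1 << (24 + guard)):
        s >>= 1
        e_res += 1
    return (s >> guard, e_res)
```

Because all products are summed as plain integers before a single normalization, the wide floating-point adder is invoked once per group rather than once per product.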

3. Hybrid Accumulation and Controlled Accuracy Relaxation

A defining characteristic of these architectures is the two-level accumulation: first, a local integer-like accumulation of aligned products; second, merging with a reduced-precision FP32 accumulator. To ensure efficiency without excessive over-provisioning, the mantissa bits stored for partial results are truncated beyond a threshold established by analytical error analysis (e.g., down to 16 bits for MXFP8 E4M3), such that the addition error remains below the quantization error inherent to the MX format:

$$\epsilon_{\mathrm{add}}(M) \leq \epsilon_{\mathrm{quant}},$$

where $\epsilon_{\mathrm{add}}(M)$ is the error for $M$ mantissa bits, and $\epsilon_{\mathrm{quant}}$ is the inherent quantization error. This permits a judicious trade-off between power/area and numerical fidelity, a process referred to as controlled accuracy relaxation (Cuyckens et al., 9 Nov 2025).
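The relaxation criterion can be illustrated with a toy search for the smallest viable accumulator mantissa width. The error bounds used here (half-ULP input quantization, per-add truncation error of $2^{-M}$) are generic worst-case estimates for illustration, not the paper's exact analysis:

```python
def min_accumulator_mantissa(src_mant_bits, acc_depth):
    """Smallest accumulator mantissa width M such that accumulated
    truncation error stays below the source format's quantization
    error (illustrative worst-case bounds, hypothetical model)."""
    eps_quant = 2.0 ** -(src_mant_bits + 1)   # half-ULP input rounding error
    for M in range(src_mant_bits, 64):
        eps_add = acc_depth * 2.0 ** -M       # truncation error grows with depth
        if eps_add <= eps_quant:
            return M
    raise ValueError("no feasible width under 64 bits")
```

Under this model, deeper accumulations demand wider partial-sum mantissas, which is exactly the trade-off the analytical threshold formalizes.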

4. Integration into NPUs and System-Level Optimizations

These MACs are typically incorporated into NPUs as tiled arrays (e.g., an 8×8 "MX Tensor Core"), each tile orchestrated by a lightweight FSM for mode selection, accumulation depth, and tile size. The array aggregates partial results into a downstream SIMD quantizer, which performs block-exponent calculation and casting. Data movement and control are managed through RISC-V CSRs and custom data-streaming units supporting dynamic channel gating; only the requisite number of memory ports are active per precision mode, minimizing both DRAM contention and energy (Cuyckens et al., 9 Nov 2025).

Within the Jack unit implementation, 2D sub-word parallelism is exploited by configuring a cluster of precision-scalable carry-save multipliers (CSMs). This enables parallel computation of several narrow-precision products per cycle, with power gating and resource reallocation governed by a control FSM for maximal efficiency across modes (INT, FP, MX) (Noh et al., 7 Jul 2025).
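Sub-word parallelism itself can be demonstrated in a few lines: packing two narrow operands with guard spacing lets a single wide multiplier return two independent products (a generic technique sketch, not the Jack unit's actual CSM datapath):

```python
def swp_dual_mul4(a0, a1, b):
    """Compute a0*b and a1*b (all unsigned 4-bit) with one wide multiply.

    Placing a0 and a1 eight bits apart leaves room for each 8-bit
    product, so the two results never overlap in the wide output.
    """
    assert all(0 <= x < 16 for x in (a0, a1, b))
    packed = a0 | (a1 << 8)      # two sub-words with 8-bit spacing
    wide = packed * b            # one wide multiplication
    return wide & 0xFF, (wide >> 8) & 0xFF
```

Hardware CSM clusters generalize this idea in two dimensions, splitting both operand axes so one physical multiplier array serves many narrow products or one wide product, selected at runtime.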

5. Mathematical Models, Benchmarking, and Design Trade-offs

Design-space analysis and benchmarking are enabled via parameterized templates (e.g., PSMA), allowing empirical exploration of area, energy, and latency. The analytical expressions for MAC latency, energy, and area, with variables $b$ (bitwidth), $G$ (bit-group size), and $d$ (tree depth), provide designers with predictive control:

  • Latency: $L(b,d) = T_{\mathrm{mult}}(b) + d \cdot T_{\mathrm{add}}$
  • Energy: $E(b,d) = c_1 b^2 + d\, c_2 b$
  • Area: $A(b,d) = a_1 b^2 + d(a_2 b + a_3 f)$, where $f$ is the fan-in

Optimal fan-in ($f = 4$), bit-grouping (typically $G = 2$), and pipeline-register insertion are shown to be crucial for balancing throughput, energy, and area. Empirically, energies as low as 0.35–0.42 fJ/MAC at 200 MHz and below 1.2 fJ/MAC at 1 GHz have been demonstrated, with array areas under 0.9 mm² in 28 nm technology (Ibrahim et al., 2021).
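These closed-form cost models are straightforward to script for design-space sweeps. All coefficients below are placeholder technology constants chosen for illustration, not values from the cited papers:

```python
def mac_cost(b, d, f=4,
             t_mult_per_bit=1.0, t_add=0.5,    # placeholder timing constants
             c1=1.0, c2=0.2,                   # placeholder energy coefficients
             a1=1.0, a2=0.3, a3=0.1):          # placeholder area coefficients
    """Evaluate the latency/energy/area model for one MAC configuration."""
    latency = t_mult_per_bit * b + d * t_add   # L(b,d) = T_mult(b) + d*T_add
    energy = c1 * b**2 + d * c2 * b            # E(b,d) = c1*b^2 + d*c2*b
    area = a1 * b**2 + d * (a2 * b + a3 * f)   # A(b,d) = a1*b^2 + d*(a2*b + a3*f)
    return latency, energy, area
```

Sweeping `b` over {4, 8, 16} and `d` over tree depths makes the quadratic bitwidth terms visible immediately, which is why narrow-precision modes dominate energy efficiency.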

6. Performance, Efficiency, and Comparative Metrics

These hybrid MAC architectures achieve the following system-level results in leading implementations:

  • Throughput and energy efficiency (e.g., SNAX NPU, GF22FDX, 0.5 GHz):

| Mode | Throughput (GOPS) | Energy Efficiency (GOPS/W) |
|---|---|---|
| MXINT8 | 64 | 657 |
| MXFP8/6 | 256 | 1438–1675 |
| MXFP4 | 512 | 4065 |

Performance comparisons demonstrate up to 3.2× higher GOPS/W in FP8/6, 1.6× in INT8, and 1.13× in FP4 versus FP32-add-tree baselines. Adder area and power are reduced by up to 12% and 8%, respectively, compared to long-integer adder designs at 1GHz, while supporting all six MX-relevant formats. MAC array utilization remains above 94% across both inference and training workloads (e.g., ResNet-18/ViT, batch 32), indicating the avoidance of system-level bottlenecks (Cuyckens et al., 9 Nov 2025).

Array-level comparisons with alternative hybrid architectures demonstrate area reductions of 1.60×, compute density improvements of 1.80×, and energy efficiency gains ranging from 1.32× to 7.13×, depending on the data format and benchmark (Noh et al., 7 Jul 2025).

7. Taxonomy, Generalization, and Design Guidelines

The taxonomy presented in (Ibrahim et al., 2021) unifies the design space of hybrid PSMA arrays, providing guidelines such as:

  • L2 output-sharing is mandatory for optimal energy and area.
  • Fan-in of 4 and bit-granularity of 2 yield balanced trade-offs between scalability and logic overhead.
  • BG unrolling at the appropriate level (L3 for low-frequency, bit-serial for high-frequency) optimizes pipeline cost and throughput.
  • Fixed vs. variable I/O selections depend on target workload precision profiles.

These principles, when applied to hybrid reduction tree MACs, enable robust, scalable, and generalizable hardware supporting the evolving needs of diverse AI workloads.


In summary, hybrid precision-scalable reduction tree MAC designs achieve area- and energy-efficient operation across a wide precision spectrum, seamlessly supporting both low-precision inference and mixed/high-precision training. Through architectural innovations in multiplier organization, reduction tree topology, hybrid accumulation, and system-level integration, they establish high-density, high-efficiency compute substrates foundational to state-of-the-art NPU platforms (Cuyckens et al., 9 Nov 2025, Noh et al., 7 Jul 2025, Ibrahim et al., 2021).
