
Sequential MAC-Based Architecture

Updated 20 November 2025
  • Sequential MAC-Based Architecture is a design approach that pipelines multiply-accumulate operations to enhance numerical accuracy and energy efficiency.
  • It employs exponent-indexed accumulation and variable data formats to optimize hardware performance across FPGA and ASIC platforms.
  • The architecture underpins high-throughput neural network accelerators and low-power analog systems by balancing precision, area, and power trade-offs.

Sequential Multiply-Accumulate (MAC)-Based Architecture refers to circuit and system designs in which multiply-accumulate operations are orchestrated in a stepwise or pipelined sequence, as opposed to massively parallel or fully combinational arrangements. These architectures support a range of computational precision and data formats, including floating-point, integer, posit, and logarithmic encodings, with particular attention to hardware efficiency, numerical accuracy, and energy profile. The following sections provide a detailed technical survey of sequential MAC designs, focusing on algorithmic workflows, architectural innovations, hardware mapping, format unification, and empirical performance metrics.

1. Algorithmic Structure and Operational Workflow

Sequential MAC architectures are fundamentally characterized by the disaggregation of two core phases: accumulation of partial products and final reconstruction of the sum. The exponent-indexed accumulator (EIA) workflow exemplifies this:

  • Inputs are represented as $x_i = m_i \times 2^{e_i}$, decomposing each number into mantissa $m_i$ and exponent $e_i$ (Liguori, 2024).
  • Accumulation Phase: For each incoming $x_i$, its mantissa $m_i$ is accumulated into a signed integer register $A[e_i]$ indexed by $e_i$. For $n_e$ exponent bits, $2^{n_e}$ bins are maintained, each wide enough to prevent overflow over $N$-length sums: $w = n_m + 1 + n_v$, with $n_v = \lceil \log_2 N \rceil$.
  • Reconstruction Phase: The final sum,

$$S = \sum_{e=0}^{2^{n_e}-1} A[e] \, 2^e,$$

is emitted sequentially, typically using shift-and-add bit-serial logic. Truncation strategies may be applied to trade accuracy for latency.
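A minimal software model of the two phases (`eia_mac` is a hypothetical name; bin sizing and truncation are simplified relative to the paper):

```python
def eia_mac(values, n_e):
    """Exponent-indexed accumulation of x_i = m_i * 2**e_i pairs.

    values: iterable of (m_i, e_i) with signed integer mantissas and
    exponents in [0, 2**n_e). Python ints never overflow; hardware instead
    sizes each bin to w = n_m + 1 + ceil(log2(N)) bits."""
    bins = [0] * (2 ** n_e)      # A[e]: one signed accumulator per exponent
    for m, e in values:          # accumulation phase: one add, no alignment shift
        bins[e] += m
    # reconstruction phase: shift-and-add, S = sum_e A[e] * 2**e
    return sum(a << e for e, a in enumerate(bins))
```

Because accumulation is a plain integer add into the bin selected by the exponent, no per-input alignment shifter is needed; all shifting is deferred to the single reconstruction pass.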

Designs such as Single-DSP–Multiple-Multiplication (SDMM) (Kalali et al., 2021) and encoding-based MACs (Liu et al., 2024) further exploit parameter manipulation, gate-level encodings, and parallel LUT-based post-processing to support multiple concurrent MACs per hardware block and to reduce critical datapath length.
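The parameter-packing trick that SDMM builds on can be sketched for two unsigned 8-bit multiplications sharing one wide multiplier (a simplified two-way illustration; the actual design packs $k = 3$ multiplies into a DSP48 and resolves sign-correction terms in LUT logic, and `packed_mul` is a hypothetical name):

```python
def packed_mul(a0, a1, b):
    """Two 8-bit x 8-bit unsigned multiplies via one wide multiplication.

    a0*b fits in 16 bits (255*255 < 2**16), so placing a1 sixteen bits
    above a0 keeps the two partial products from overlapping."""
    assert 0 <= a0 < 256 and 0 <= a1 < 256 and 0 <= b < 256
    p = ((a1 << 16) | a0) * b        # one multiplier evaluates both products
    return p & 0xFFFF, p >> 16       # unpack (a0*b, a1*b) from the result
```

The guard spacing is what bounds how many multiplies fit in one hardware multiplier: each additional operand costs another product-width slice of the DSP's input word.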

2. Architectural Innovations and Data Format Support

Recent advances align sequential MAC units to support diverse data types:

  • Exponent Grouping: The EIA design enables exponent bins to be grouped (parameter $k$), reducing register count and shifter width. The grouping parameter $0 \leq k \leq n_e$ introduces area and power trade-offs, where intermediate $k$ balances practical concerns (Liguori, 2024).
  • Precision-Scalable CSMs: The Jack unit generalizes FP, INT, and "microscaling" formats by reconstructing the carry-save multiplier (CSM) to accept variable mantissa widths and data encoding paradigms. In-CSM significand adjustment (using barrel shifting for exponent differences) allows immediate alignment at the multiplication stage, bypassing post-adder alignment (Noh et al., 7 Jul 2025).
  • Logarithmic and Posit Extensions: Decoding of posits into $(m_i, e_i)$ for EIA and log-number mappings with LUT-based fractional exponent decoders allow near-unified accumulation and summation platforms (Liguori, 2024).
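One plausible software reading of exponent grouping: $2^k$ adjacent exponents share a single accumulator, with the low $k$ exponent bits absorbed as a small pre-shift of the mantissa before the add (a hypothetical sketch; register sizing and the reconstruction shifter are simplified):

```python
def grouped_eia(values, n_e, k):
    """EIA with 2**(n_e - k) bins instead of 2**n_e: the low k exponent
    bits select a pre-shift, the high bits select the bin."""
    bins = [0] * (2 ** (n_e - k))
    low_mask = (1 << k) - 1
    for m, e in values:
        bins[e >> k] += m << (e & low_mask)   # k-bit barrel shift before the add
    # bin g holds a value scaled by 2**(g * 2**k)
    return sum(a << (g << k) for g, a in enumerate(bins))
```

At $k = 0$ this degenerates to one bin per exponent (maximum registers, no pre-shift); at $k = n_e$ it collapses to a single wide Kulisch-style accumulator (one register, full-width shifter), with intermediate $k$ trading between the two.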

The EncodingNet (Liu et al., 2024) replaces tree multipliers with sets of logic gates and bit-plane popcount accumulators, using weighted accumulation vectors $s_j$ solved jointly from the multiplier's truth table.
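The bit-plane popcount structure can be sketched with ordinary binary weights $s_j = 2^j$ standing in for EncodingNet's jointly solved accumulation vectors (purely illustrative; the learned gate-level encodings are omitted and `popcount_mac` is a hypothetical name):

```python
def popcount_mac(pairs, width=16):
    """Sum of products via per-bit-plane popcounts: column j of the
    product bit-planes contributes weight 2**j times its ones count."""
    products = [a * b for a, b in pairs]
    total = 0
    for j in range(width):
        ones = sum((p >> j) & 1 for p in products)  # popcount of bit plane j
        total += ones << j                          # weighted accumulation, s_j = 2**j
    return total
```

The point of the structure is that each bit plane needs only a popcount plus one weighted add, replacing the full adder tree of a conventional multiply-accumulate array.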

3. Hardware Implementation Strategies

Sequential MAC blocks are mapped onto diverse hardware fabrics with topology-specific optimizations:

FPGA Mapping:

  • EIA uses distributed RAM (FPGA LUTs as small RAMs) for partial-sum storage. For example, a dense bfloat16 MAC core operates at >700 MHz, realizing 64 MACs per cycle using ~6400 LUTs and 64 DSPs per tensor core (Liguori, 2024).
  • SDMM packs $k = 3$ 8-bit multiplies into a single Xilinx DSP48 block; post-processing and partial-sum unpacking are handled in parallel LUT logic and BRAM (Kalali et al., 2021).
  • BRAMAC extends block RAM with compute-in-memory features, using hybrid bit-serial and bit-parallel flows for 2- to 8-bit signed MACs, achieving up to 2.6× speedup at a tolerable area increase (Chen et al., 2023).
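The bit-serial side of such a hybrid flow can be illustrated generically for signed two's-complement weights (a textbook bit-serial MAC sketch, not the BRAMAC micro-architecture; `bit_serial_mac` is a hypothetical name):

```python
def bit_serial_mac(acc, a, w, n):
    """acc += a * w for an n-bit two's-complement weight w, consuming one
    weight bit per step: bit t adds a << t, except the sign bit, whose
    two's-complement weight is -2**(n-1)."""
    w &= (1 << n) - 1                 # raw n-bit pattern of the weight
    for t in range(n):
        if (w >> t) & 1:
            term = a << t
            acc += -term if t == n - 1 else term
    return acc
```

One cycle per weight bit is the throughput cost that BRAMAC's bit-parallel portion offsets; narrower weight precisions (2-4 bits) finish proportionally sooner.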

ASIC Design:

  • The Jack unit's area and dynamic power scale quadratically with mantissa width $m$: $A_\text{total}(m, L) = L[a_1 m^2 + a_2 m] + a_0$; similarly, $P_\text{total}(m, L)$ follows a corresponding scaling law (Noh et al., 7 Jul 2025).
  • Critical-path management is optimized by grouping shared shifters and adder trees. In EIA structures, NAND-gate counts drop from 34k (full bin) to 5k (grouped, $k = 3$) to 1k (Kulisch) in bfloat16 configurations (Liguori, 2024).
  • Analog implementations (VCO-based, EKGNet) utilize current-controlled oscillators and subthreshold transistor circuits to replace digital MAC blocks for low-energy, high-accuracy signal processing (Banerjee et al., 2016, Haghi et al., 2023).

4. Integration and System-Level Dataflow

Sequential MAC cores are deeply embedded in accelerator pipelines, supporting systolic arrays, weight-stationary, and tiling-based neural network inference engines:

  • Systolic Arrays: SDMM and BRAMAC architectures are instantiated in 12×12 PE arrays with broadcasted activations/weights, where pipelined MAC packing multiplies the effective compute throughput (Kalali et al., 2021, Chen et al., 2023).
  • Interface Standardization: EncodingNet and Jack units expose standard streaming interfaces for activations, weight loading, and partial sum emission, allowing direct RTL-level swaps in modern DNN accelerators without needing changes to buffer logic or dataflow routines (Liu et al., 2024, Noh et al., 7 Jul 2025).
  • Analog MAC Dataflow: In EKGNet, analog multiply-accumulate blocks are time-multiplexed and coordinated via non-overlapping clocks. Full feature extraction and classification pipelines are realized in analog, with conversion only at the final output stage (Haghi et al., 2023).

5. Performance Metrics and Empirical Results

Comprehensive benchmarks detail throughput, area, energy, and quality trade-offs:

| Design | Area Reduction | Power Reduction | Throughput | Accuracy Loss | Energy per MAC |
|---|---|---|---|---|---|
| EIA (Liguori, 2024) | up to 80% | up to 70% | 1 MAC/cycle (700+ MHz FPGA) | 0 | Not stated |
| SDMM (Kalali et al., 2021) | 66.6–83.3% (DSPs) | 36–64% (dynamic) | up to $k\times$ DSP baseline | <0.4% | Not stated |
| BRAMAC (Chen et al., 2023) | 6.8%/3.4% (area increase) | — | up to 2.6× | None | Not stated |
| EncodingNet (Liu et al., 2024) | 28–80% | 10–70% | — | <0.7% | Not stated |
| Jack unit (Noh et al., 7 Jul 2025) | 1.17–2.01× | 1.05–1.84× | — | None reported | 1.3–5.4× gain |
| VCO-MAC (Banerjee et al., 2016) | — | 400× lower (vs. digital MAC) | — | <½ LSB | 2 fJ/MAC |
| EKGNet MAC (Haghi et al., 2023) | — | ultra low (<11 μW) | — | negligible | — |
  • Accuracy Preservation: Quantization and encoding strategies (e.g., weighted popcount accumulations, MW_A grid approximations) allow nearly lossless DNN inference (<0.7% Top-1 deviation) despite radical area/power savings (Kalali et al., 2021, Liu et al., 2024).
  • Energy-Efficiency: The Jack unit yields up to 5.4× system-level improvement in MAC energy/cycle vs. prior art, due to optimized CSM and in-exponent alignment (Noh et al., 7 Jul 2025).
  • Analog Significance: Time-domain and subthreshold analog MACs achieve fJ-level energy per MAC while remaining resilient to PVT variations, significantly broadening the applicability to edge and biomedical scenarios (Banerjee et al., 2016, Haghi et al., 2023).

6. Extensions, Limitations, and Applicability

Sequential MAC architectures generalize to a broad range of formats (FP, INT, posit, logarithmic, microscaling), with adaptable pipeline stages and parameterizations:

  • Extensions: Formats such as posit and logarithmic are supported via appropriate decoding and conversion of fractional exponents or regime bits into $(m, e)$ form, leveraging the same accumulation logic (Liguori, 2024).
  • Limitations: Sequential architectures may encounter throughput bottlenecks for extremely high-bandwidth applications, though pipelined, multi-lane arrangements mitigate this in both digital and analog arrays. Analog MACs, while ultra-efficient, are limited by process and temperature drift, requiring circuit-level PVT hardening (Haghi et al., 2023).
  • Application Domains: Sequential MAC units are established in DNN inference accelerators, image and biomedical signal processors, compute-in-memory fabrics, and ultra-low-energy analog front-ends for edge classification and in-sensor computing (Noh et al., 7 Jul 2025, Haghi et al., 2023, Banerjee et al., 2016).

7. Historical and Contemporary Significance

Sequential MAC architecture, particularly embodied in the exponent-indexed accumulator paradigm (Liguori, 2024), has evolved to accommodate precision scalability, diverse numeric formats, and aggressive area/power demands for edge and AI-centric computing. Its formalization and optimization underpin state-of-the-art neural network accelerators, compute-in-memory modules, and analog inference engines, consolidating hardware innovation around flexible, low-power multiply-accumulate semantics while preserving algorithmic fidelity and deployment versatility across heterogeneous platforms.
