Reconfigurable Parallel Multiplier Array
- Reconfigurable parallel multiplier arrays are hardware structures that dynamically adjust precision, operand sizes, and dataflow patterns to meet specific performance, power, and area requirements.
- They utilize architectures like FPGA overlays, in-memory computing, and neuromorphic fabrics, implementing pipelined and tiled processing elements with runtime reconfiguration and precision scaling.
- Performance metrics show enhanced throughput, energy efficiency, and flexible scalability in applications such as deep learning, signal processing, and scientific computing.
A reconfigurable parallel multiplier array is a hardware structure consisting of numerous multiplier processing elements (PEs) organized for concurrent arithmetic operations, where the array can be dynamically re-programmed for different precisions, operand sizes, dataflow patterns, and computational modalities. Designs span FPGA overlays, in-memory compute fabrics, neuromorphic/analog hardware, and emerging reconfigurable accelerators. Reconfigurability enables targeting application-specific power/performance/area trade-offs in modern signal processing, ML, and scientific workloads.
1. Architectural Foundations
Reconfigurable multiplier arrays share core architectural paradigms: pipelined and/or tiled arrays of PEs, reconfigurable datapaths for variable precision, hierarchical block structures for spatiotemporal flexibility, and runtime control mechanisms for mode switching.
- Bit-serial FPGA overlays: BISMO overlays (Umuroglu et al., 2018, Umuroglu et al., 2019) arrange dot-product units (DPUs) in arrays, each DPU performing AND-popcount on bit-serial inputs and supporting precision scaling via instruction-programmable decomposition of w-bit operands into binary matrix multiplies. Urdhva Tiryagbhyam and Karatsuba techniques are employed for run-time reconfigurable floating-point multipliers (S et al., 2019).
- In-memory computing (IMC): SRAM-based arrays (Lee et al., 2020) implement bit-parallel 6T cell macros with near-memory logic for full-adder-based carry propagation and "add-and-shift" multiplication, offering hardware-controlled bitwidth slicing and energy-efficient bitline (BL)-boosted compute.
- Neuromorphic/resistive fabrics: FPCA architectures (Zidan et al., 2016) exploit monolithic RRAM crossbars, using adaptive row/column address masking and on-chip ADC/DAC interface logic to allocate tiles for storage, digital, or analog multiplication, supporting both digital tree-reduction and analog dot-product computation.
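The AND-popcount scheme underlying BISMO-style DPUs can be sketched in a few lines of Python. This is an illustrative behavioral model, not the actual RTL; the function names are ours:

```python
def bit_planes(vec, w):
    """Decompose an unsigned integer vector into w binary bit planes:
    plane i holds bit i of every element."""
    return [[(x >> i) & 1 for x in vec] for i in range(w)]

def bitserial_dot(a_planes, b_planes, wa, wb):
    """Dot product of two unsigned vectors, bit-serial style: every pair
    of bit planes is combined with AND + popcount, scaled by 2^(i+j)."""
    acc = 0
    for i in range(wa):            # bit position of operand A
        for j in range(wb):        # bit position of operand B
            # binary dot product = popcount of the element-wise AND
            pc = sum(a & b for a, b in zip(a_planes[i], b_planes[j]))
            acc += pc << (i + j)
    return acc

# 3*2 + 5*7 = 41, reconstructed from 3x3 = 9 binary dot products
result = bitserial_dot(bit_planes([3, 5], 3), bit_planes([2, 7], 3), 3, 3)
```

Precision scaling then amounts to changing `wa` and `wb` in the instruction stream: higher precision simply issues more binary passes over the same hardware.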
2. Reconfiguration Mechanisms
Reconfigurability is achieved by dynamic switching of precision, mode, or dataflow at runtime via control registers, mode-select inputs, or programmable instruction streams.
- Precision Scaling: The bit-serial overlay decomposes a multiplication of w_a-bit by w_b-bit matrices as A·B = Σ_{i=0}^{w_a−1} Σ_{j=0}^{w_b−1} 2^(i+j) A_i·B_j, where A_i and B_j are binary bit planes, yielding w_a·w_b independent binary GEMMs; hardware supports any precision via software-generated instruction sequences (Umuroglu et al., 2018, Umuroglu et al., 2019). Floating-point multiplier cores use a mode-select register to gate dedicated mantissa multiplier slices and reroute outputs with sub-cycle latency (S et al., 2019). In-memory arrays exploit per-column carry-chain multiplexing and grouping for 2b/4b/8b/16b operation (Lee et al., 2020).
- Operational Modes: MARCA's array (Li et al., 2024) employs a 64-bit RCU_CONFIG word to broadcast mode changes (matrix multiply, element-wise, EXP, SiLU) to all PEs, with pipeline flush cost of 2–3 cycles. FPCA uses address and voltage reprogramming for tile assignment or mode-changing (digital, analog, storage) (Zidan et al., 2016).
- Tiling and Arithmetic Partitioning: FPGA overlays and SRAM IMC architectures tile matrices into subblocks, dynamically mapping parallel block multiplies (e.g., 16×16 in MARCA, arbitrary N in IMC), and can re-partition tile dimensions at runtime without full resynthesis (Li et al., 2024, Lee et al., 2020).
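The runtime tile-partitioning idea can be modeled in software: the full product is assembled from tile-sized block multiplies whose size is an ordinary runtime parameter. This is a behavioral sketch, not any specific accelerator's mapping:

```python
def tiled_matmul(A, B, tile):
    """Block-wise matrix multiply: the product is built from tile-sized
    sub-multiplies, mirroring how a PE array is re-partitioned at runtime
    (the tile size is a parameter, not a synthesis-time constant)."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # one tile-level block multiply, as dispatched to the array
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        for kk in range(k0, min(k0 + tile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C
```

Changing `tile` alters only the dispatch order of partial sums, so results are identical across partitionings; on hardware the choice trades buffer footprint against parallelism.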
3. Dataflow and Parallelism
Distinct dataflow organizations enable high arithmetic throughput and memory locality.
- Matrix Multiplication Tiling: MARCA (Li et al., 2024) arranges a 16×16 RPE grid per RCU, feeding tile pairs and reducing per-column. BISMO overlays perform fetch-execute-result pipelines across DPU arrays, using buffer banks for input-output decoupling (Umuroglu et al., 2018, Umuroglu et al., 2019). Strassen-based hierarchical blocks optimize PE mapping (S et al., 2019).
- Digit-Level Pipelining: Online most-significant-digit-first (MSDF) multiplier arrays (Usman et al., 2023) unfold N+δ stages, each computing one product digit per cycle; tile-wise aggregation supports M parallel dot-products.
- BL-Compute in IMC: SRAM bit-parallel architectures (Lee et al., 2020) employ simultaneous BL evaluation in all columns, combining with near-memory FA logic for full parallel add-and-shift multiplication, iterating partial sums through dummy array registers.
| Architecture | Dataflow Structure | Tile/Block Size |
|---|---|---|
| MARCA | 16×16 RPE grid + reduction tree | 16×16 |
| BISMO | DPU array, pipelined | Configurable |
| FPCA | M-Core crossbar tiles | 32×32, 64×64 |
| IMC SRAM | Bit-parallel columns | 128×128 |
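The "add-and-shift" multiplication iterated by the IMC macro reduces to a conditional add of the shifted multiplicand for each multiplier bit, one bit per cycle (behavioral sketch; hardware performs the adds in near-memory full-adder logic):

```python
def add_and_shift_multiply(a, b, w):
    """Shift-and-add multiplication of two w-bit unsigned integers:
    each cycle, one bit of the multiplier b decides whether the
    shifted multiplicand a is added into the running partial sum."""
    partial = 0
    for i in range(w):              # one multiplier bit per cycle
        if (b >> i) & 1:
            partial += a << i       # conditional add of shifted multiplicand
    return partial                  # 2w-bit product
```

Latency therefore scales linearly with precision `w`, matching the per-bitwidth latency behavior reported for the SRAM IMC macro.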
4. Performance, Energy, and Scalability
Performance metrics include arithmetic throughput, energy per operation, scalability, and precision/area trade-offs.
- BISMO: Achieves 6.5–15.4 binary TOPS on mid-range FPGAs (PYNQ-Z1, Ultra96), scaling linearly with DPU count and precision (Umuroglu et al., 2018, Umuroglu et al., 2019). Resource usage attains LUT efficiency near 1.3× that of fixed-precision arrays for large DPU widths.
- MARCA: Peak throughput of 42.7 GMAC/s per RCU at 1 GHz; energy per MAC measured at 7.6 pJ, ~3×–10× more efficient than fixed architectures, reaching 1.37 TMAC/s on 32 RCUs (Li et al., 2024).
- FPCA: 3.39 T double-precision ops/s for arithmetic, 6.55 T ops/s analog BCNN mode; energy-efficient tile-level "count-ones" at sub-1.1 mW (Zidan et al., 2016).
- SRAM IMC: At 2.25 GHz, 8.09 TOPS/W for multiplication; area overhead only 5.2% above standard SRAM, scaling latency linearly with precision (Lee et al., 2020).
Precision reconfiguration at run time optimizes the power-delay product for variable application requirements; e.g., reducing the mantissa size yields a reported 25% improvement while cutting per-PE energy by up to 70% (S et al., 2019).
5. Reconfigurable Array Control and Buffer Management
Efficient runtime control and data buffering schemes are essential for dynamic workload adaptation.
- Instruction-Based Overlay Engines: BISMO and Strassen IP-cores utilize microcoded FSMs and compact programmable instructions (RunFetch, RunExecute, etc.) for decoupling computation from fetch/writeback, exposing on-the-fly adaptation of precision and tiling (Umuroglu et al., 2018, Umuroglu et al., 2019, S et al., 2019).
- Mode-Control Broadcasting: MARCA's per-RCU configuration enables atomic mode and reduction-tree switching (Li et al., 2024).
- Buffer Management: Intra-op buffer reuse (linear ops) and inter-op buffer partition (element-wise ops, SSM chains) maximize on-chip data reuse in deep ML model execution (Li et al., 2024).
- Precision/Size Gating: Online multiplier arrays (Usman et al., 2023) support clock-gating or slice tie-off to dynamically downscale precision and array size in response to runtime needs.
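A minimal behavioral model of such an instruction-driven engine illustrates how buffered fetch, execute, and writeback stages decouple (the instruction names follow the text; the class and its buffer handling are hypothetical):

```python
class OverlayEngine:
    """Toy model of a microcoded overlay controller: RunFetch stages
    operands into on-chip buffers, RunExecute runs a buffered dot
    product, RunResult writes the result back to host memory."""

    def __init__(self):
        self.buffers = {}   # on-chip buffer banks: name -> vector
        self.results = {}   # result registers: name -> scalar

    def run(self, program, memory):
        for op, *args in program:
            if op == "RunFetch":        # memory -> on-chip buffer
                dst, src = args
                self.buffers[dst] = memory[src]
            elif op == "RunExecute":    # dot product on buffered operands
                dst, a, b = args
                self.results[dst] = sum(
                    x * y for x, y in zip(self.buffers[a], self.buffers[b]))
            elif op == "RunResult":     # result register -> memory
                (src,) = args
                memory[src] = self.results[src]
        return memory

memory = {"A": [1, 2, 3], "B": [4, 5, 6]}
program = [("RunFetch", "a", "A"), ("RunFetch", "b", "B"),
           ("RunExecute", "r", "a", "b"), ("RunResult", "r")]
OverlayEngine().run(program, memory)   # memory["r"] holds the dot product
```

Because precision and tiling live in the instruction stream rather than the datapath, adaptation requires only emitting a different program, not reconfiguring the fabric.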
6. Technology-Specific Implementations
Reconfigurable parallel multiplier arrays have been realized in diverse substrates.
- FPGA overlays: Employ lookup-table and carry-save architectures (LUT6, DSP48E1), supporting integer/floating-point, multi-precision, and matrix-multiplication workloads (Umuroglu et al., 2018, Umuroglu et al., 2019, S et al., 2019).
- SRAM-based IMC: 28nm CMOS 6T cell macros with column-local sensing, BL boosting, and near-memory logic for in-memory multiply/add/compression (Lee et al., 2020).
- Resistive crossbars: RRAM-based FPCA with hierarchical M-Core/M-Processor partitioning for analog/digital signal processing (Zidan et al., 2016).
- Optical hardware: Coherent free-space setups using spatial light modulators (SLMs) and digital micromirror devices (DMDs) for vector-matrix/simultaneous inner-product operations at ~3,000 parallel computations per frame (Spall et al., 2020).
- Quantum/reversible logic: Incorporating reversible CIFM multiplier blocks with small 4×4-bit modules for quantum/nano/optical computing platforms [0610090].
7. Application Domains and Prospective Extensions
Reconfigurable multiplier array architectures target:
- Deep neural network inference and training, benefiting from variable-precision arithmetic and parallel block-wise contraction (Li et al., 2024, Lee et al., 2020).
- Signal/image processing with in-memory and crossbar structures using BL compute, analog dot-product, and fast pipelined reduction (Zidan et al., 2016, Lee et al., 2020, Usman et al., 2023).
- Scientific computing needing Strassen or Karatsuba-optimized matrix multiplication at adjustable precision (S et al., 2019).
- Special-purpose accelerators: Optical neural networks, Ising machines, neuromorphic arrays, and quantum processors (Spall et al., 2020) [0610090].
Scaling strategies include expanding block/tile sizes, exploiting higher-speed modulator technology (optical, RRAM), and incorporating software-managed instruction streams for adaptive work partitioning across heterogeneous array fabrics.
In summary, reconfigurable parallel multiplier arrays provide a unified hardware approach to high-throughput, architecture-adaptive multiplication across a range of substrates, supporting runtime trade-offs in precision, power, latency, and array size, and underpinning efficient implementation of modern computational kernels in diverse domains [0610090].