Reconfigurable Parallel Multiplier Array

Updated 30 January 2026
  • Reconfigurable parallel multiplier arrays are hardware structures that dynamically adjust precision, operand sizes, and dataflow patterns to meet specific performance, power, and area requirements.
  • They utilize architectures like FPGA overlays, in-memory computing, and neuromorphic fabrics, implementing pipelined and tiled processing elements with runtime reconfiguration and precision scaling.
  • Reported results show improved throughput, energy efficiency, and scalability in applications such as deep learning, signal processing, and scientific computing.

A reconfigurable parallel multiplier array is a hardware structure consisting of numerous multiplier processing elements (PEs) organized for concurrent arithmetic operations, where the array can be dynamically re-programmed for different precisions, operand sizes, dataflow patterns, and computational modalities. Designs span FPGA overlay architectures, in-memory compute fabrics, neuromorphic/analog hardware, and dedicated reconfigurable accelerators. Reconfigurability enables application-specific power/performance/area trade-offs in modern signal-processing, machine-learning, and scientific workloads.

1. Architectural Foundations

Reconfigurable multiplier arrays share core architectural paradigms: pipelined and/or tiled arrays of PEs, reconfigurable datapaths for variable precision, hierarchical block structures for spatiotemporal flexibility, and runtime control mechanisms for mode switching.

  • Bit-serial FPGA overlays: BISMO overlays (Umuroglu et al., 2018, Umuroglu et al., 2019) arrange DPUs in $D_m \times D_n$ arrays, each DPU performing AND-popcount on bit-serial inputs and supporting precision scaling via instruction-programmable decomposition of w-bit operands into binary matrix multiplies. Urdhva Tiryagbhyam and Karatsuba techniques are employed for run-time reconfigurable floating-point multipliers (S et al., 2019).
  • In-memory computing (IMC): SRAM-based arrays (Lee et al., 2020) implement bit-parallel 6T cell macros with near-memory logic for full-adder-based carry propagation and "add-and-shift" multiplication, offering hardware-controlled bitwidth slicing and energy-efficient bitline (BL)-boosted compute.
  • Neuromorphic/resistive fabrics: FPCA architectures (Zidan et al., 2016) exploit monolithic RRAM crossbars, using adaptive row/column address masking and on-chip ADC/DAC interface logic to allocate tiles for storage, digital, or analog multiplication, supporting both digital tree-reduction and analog dot-product computation.

2. Reconfiguration Mechanisms

Reconfigurability is achieved by dynamic switching of precision, mode, or dataflow at runtime via control registers, mode-select inputs, or programmable instruction streams.

  • Precision Scaling: The bit-serial overlay decomposes multiplications as $P = \sum_{i=0}^{w-1}\sum_{j=0}^{p-1} 2^{i+j}\,(A^{[i]} \times B^{[j]})$, with $w \times p$ independent binary GEMMs; hardware supports any precision via software-generated instruction sequences (Umuroglu et al., 2018, Umuroglu et al., 2019). Floating-point multiplier cores use a mode-select register to gate dedicated mantissa multiplier slices and reroute outputs instantaneously (sub-cycle latency) (S et al., 2019). In-memory arrays exploit per-column carry-chain multiplexing and grouping for 2b/4b/8b/16b operation (Lee et al., 2020).
  • Operational Modes: MARCA's array (Li et al., 2024) employs a 64-bit RCU_CONFIG word to broadcast mode changes (matrix multiply, element-wise, EXP, SiLU) to all PEs, with a pipeline-flush cost of 2–3 cycles. FPCA uses address and voltage reprogramming for tile assignment and mode changes (digital, analog, storage) (Zidan et al., 2016).
  • Tiling and Arithmetic Partitioning: FPGA overlays and SRAM IMC architectures tile matrices into subblocks, dynamically mapping parallel block multiplies (e.g., 16×16 in MARCA, arbitrary N in IMC), and can re-partition tile dimensions at runtime without full resynthesis (Li et al., 2024, Lee et al., 2020).
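The precision-scaling decomposition can be checked numerically. A minimal sketch, assuming unsigned integer operands and using NumPy matrix products to stand in for the AND-popcount binary GEMMs:

```python
import numpy as np

def bit_serial_matmul(A: np.ndarray, B: np.ndarray, w: int, p: int) -> np.ndarray:
    """Compute A @ B for unsigned-integer matrices by decomposing into
    w*p binary matrix multiplies over the operands' bit-planes, then
    recombining the partial results with weights 2^(i+j)."""
    P = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(w):
        Ai = (A >> i) & 1                    # bit-plane i of A (a 0/1 matrix)
        for j in range(p):
            Bj = (B >> j) & 1                # bit-plane j of B
            P += (1 << (i + j)) * (Ai @ Bj)  # one binary GEMM = AND + popcount
    return P
```

Each `Ai @ Bj` is exactly the AND-popcount operation a bit-serial DPU performs, which is why arbitrary precisions can be supported purely by generating more instruction sequences rather than changing hardware.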

3. Dataflow and Parallelism

Distinct dataflow organizations enable high arithmetic throughput and memory locality.

  • Matrix Multiplication Tiling: MARCA (Li et al., 2024) arranges a 16×16 RPE grid per RCU, feeding tile pairs $A_{\text{sub}}, B_{\text{sub}}$ and reducing per-column. BISMO overlays perform fetch-execute-result pipelines across DPU arrays, using buffer banks for input-output decoupling (Umuroglu et al., 2018, Umuroglu et al., 2019). Strassen-based hierarchical blocks optimize PE mapping (S et al., 2019).
  • Digit-Level Pipelining: Online, most-significant-digit-first (MSDF) multiplier arrays (Usman et al., 2023) unfold N+δ stages, each computing one product digit per cycle; tile-wise aggregation supports M parallel dot-products.
  • BL-Compute in IMC: SRAM bit-parallel architectures (Lee et al., 2020) employ simultaneous BL evaluation in all columns, combining with near-memory FA logic for full parallel add-and-shift multiplication, iterating partial sums through dummy array registers.
| Architecture | Dataflow Structure              | Tile/Block Size |
|--------------|---------------------------------|-----------------|
| MARCA        | 16×16 RPE grid + reduction tree | 16×16           |
| BISMO        | Pipelined DPU array             | Configurable    |
| FPCA         | M-Core crossbar tiles           | 32×32, 64×64    |
| IMC SRAM     | Bit-parallel columns            | 128×128         |
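The tiled dataflows summarized above reduce to a blocked matrix multiply. A minimal sketch, where the tile size `T` stands in for the 16×16 PE grid and dimensions are assumed to be multiples of `T` for brevity:

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, T: int = 16) -> np.ndarray:
    """Block matrix multiply: partition A and B into T x T tiles and
    accumulate tile products, following the dataflow a fixed PE grid
    would execute. Assumes all dimensions are multiples of T."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for m in range(0, M, T):
        for n in range(0, N, T):
            for k in range(0, K, T):
                # each tile pair maps onto the PE array; partial products
                # are reduced into the corresponding output tile
                C[m:m+T, n:n+T] += A[m:m+T, k:k+T] @ B[k:k+T, n:n+T]
    return C
```

Re-partitioning tile dimensions at runtime corresponds to changing `T` (and the loop bounds) without touching the inner compute, which is what lets these architectures avoid full resynthesis.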

4. Performance, Energy, and Scalability

Performance metrics include arithmetic throughput, energy per operation, scalability, and precision/area trade-offs.

  • BISMO: Achieves 6.5–15.4 binary TOPS on mid-range FPGAs (PYNQ-Z1, Ultra96), scaling linearly with DPU count and precision (Umuroglu et al., 2018, Umuroglu et al., 2019). LUT efficiency is roughly 1.3× that of fixed-precision arrays at large DPU widths.
  • MARCA: Peak throughput of 42.7 GMAC/s per RCU at 1 GHz; energy per MAC measured at 7.6 pJ, ~3×–10× more efficient than fixed architectures, reaching 1.37 TMAC/s on 32 RCUs (Li et al., 2024).
  • FPCA: 3.39 T double-precision ops/s for arithmetic, 6.55 T ops/s analog BCNN mode; energy-efficient tile-level "count-ones" at sub-1.1 mW (Zidan et al., 2016).
  • SRAM IMC: At 2.25 GHz, 8.09 TOPS/W for multiplication; area overhead only 5.2% above standard SRAM, scaling latency linearly with precision (Lee et al., 2020).

Precision reconfiguration at run time optimizes the power-delay product for varying application requirements; e.g., reducing the mantissa width increases $f_{\max}$ by 25% and reduces per-PE energy by up to 70% (S et al., 2019).

5. Reconfigurable Array Control and Buffer Management

Efficient runtime control and data buffering schemes are essential for dynamic workload adaptation.

  • Instruction-Based Overlay Engines: BISMO and Strassen IP-cores utilize microcoded FSMs and compact programmable instructions (RunFetch, RunExecute, etc.) for decoupling computation from fetch/writeback, exposing on-the-fly adaptation of precision and tiling (Umuroglu et al., 2018, Umuroglu et al., 2019, S et al., 2019).
  • Mode-Control Broadcasting: MARCA's per-RCU configuration enables atomic mode and reduction-tree switching (Li et al., 2024).
  • Buffer Management: Intra-op buffer reuse (linear ops) and inter-op buffer partition (element-wise ops, SSM chains) maximize on-chip data reuse in deep ML model execution (Li et al., 2024).
  • Precision/Size Gating: Online multiplier arrays (Usman et al., 2023) support clock-gating or slice tie-off to dynamically downscale precision and array size in response to runtime needs.
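Broadcast-style mode control of the kind MARCA's RCU_CONFIG word provides can be illustrated with a small decoder. The field layout below is hypothetical, invented purely for illustration; the actual RCU_CONFIG encoding is not specified in the material above:

```python
from enum import IntEnum

class Mode(IntEnum):
    """Operational modes broadcast to all PEs (names from the text above)."""
    MATMUL = 0
    ELEMENTWISE = 1
    EXP = 2
    SILU = 3

def decode_config(word: int) -> dict:
    """Decode a hypothetical 64-bit configuration word into control fields.

    Illustrative layout only: bits [1:0] mode, [5:2] precision code,
    [13:6] tile size, remaining bits reserved.
    """
    return {
        "mode": Mode(word & 0b11),
        "precision": (word >> 2) & 0xF,
        "tile": (word >> 6) & 0xFF,
    }
```

Packing all mode and shape state into one word is what makes the broadcast atomic: every PE latches the same configuration in the same cycle, so only the short pipeline flush is needed when switching.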

6. Technology-Specific Implementations

Reconfigurable parallel multiplier arrays have been realized in diverse substrates.

  • FPGA overlays: Employ lookup-table and carry-save architectures (LUT6, DSP48E1), supporting integer/floating-point, multi-precision, and matrix-multiplication workloads (Umuroglu et al., 2018, Umuroglu et al., 2019, S et al., 2019).
  • SRAM-based IMC: 28nm CMOS 6T cell macros with column-local sensing, BL boosting, and near-memory logic for in-memory multiply/add/compression (Lee et al., 2020).
  • Resistive crossbars: RRAM-based FPCA with hierarchical M-Core/M-Processor partitioning for analog/digital signal processing (Zidan et al., 2016).
  • Optical hardware: Coherent free-space setups using SLMs and DMDs for vector-matrix/simultaneous inner-product operations at ~3,000 parallel computations per frame (Spall et al., 2020).
  • Quantum/reversible logic: Incorporating reversible CIFM multiplier blocks with small 4×4 bit modules for quantum/Nano/optical computing platforms [0610090].

7. Application Domains and Prospective Extensions

Reconfigurable multiplier array architectures target:

  • Deep neural network inference and training, benefiting from variable-precision arithmetic and parallel block-wise contraction (Li et al., 2024, Lee et al., 2020).
  • Signal/image processing with in-memory and crossbar structures using BL compute, analog dot-product, and fast pipelined reduction (Zidan et al., 2016, Lee et al., 2020, Usman et al., 2023).
  • Scientific computing needing Strassen or Karatsuba-optimized matrix multiplication at adjustable precision (S et al., 2019).
  • Special-purpose accelerators: Optical neural networks, Ising machines, neuromorphic arrays, and quantum processors (Spall et al., 2020) [0610090].
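The Karatsuba technique used by the reconfigurable floating-point multipliers can be sketched in software. A minimal illustrative version, assuming unsigned operands and a power-of-two nominal width (this models the recurrence, not any particular hardware mapping):

```python
def karatsuba(x: int, y: int, n: int = 32) -> int:
    """Karatsuba multiplication of nonnegative integers.

    Split each operand at n/2 bits and form the product from three
    half-width multiplies instead of four:
        x*y = z2*2^n + z1*2^(n/2) + z0
    where z2 = hi*hi, z0 = lo*lo, and z1 is recovered from (hi+lo) products.
    n is assumed to be a power of two; widths <= 8 multiply directly.
    """
    if n <= 8:
        return x * y                                   # base case
    h = n // 2
    mask = (1 << h) - 1
    x_hi, x_lo = x >> h, x & mask
    y_hi, y_lo = y >> h, y & mask
    z2 = karatsuba(x_hi, y_hi, h)                      # high * high
    z0 = karatsuba(x_lo, y_lo, h)                      # low * low
    z1 = karatsuba(x_hi + x_lo, y_hi + y_lo, h) - z2 - z0  # cross terms
    return (z2 << n) + (z1 << h) + z0
```

The three-instead-of-four multiply count is what makes the recursion attractive for mantissa multipliers: each level trades one hardware multiplier for a few adders and shifts.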

Scaling strategies include expanding block/tile sizes, exploiting higher-speed modulator technology (optical, RRAM), and incorporating software-managed instruction streams for adaptive work partitioning across heterogeneous array fabrics.

In summary, reconfigurable parallel multiplier arrays provide a unified hardware approach to high-throughput, architecture-adaptive multiplication across a range of substrates, supporting runtime trade-offs in precision, power, latency, and array size, and underpinning efficient implementation of modern computational kernels in diverse domains [0610090].
