Reconfigurable Parallel Multiplier Array
- Reconfigurable parallel multiplier arrays are hardware structures that dynamically adjust precision, operand sizes, and dataflow patterns to meet specific performance, power, and area requirements.
- They utilize architectures like FPGA overlays, in-memory computing, and neuromorphic fabrics, implementing pipelined and tiled processing elements with runtime reconfiguration and precision scaling.
- Performance metrics show enhanced throughput, energy efficiency, and flexible scalability in applications such as deep learning, signal processing, and scientific computing.
A reconfigurable parallel multiplier array is a hardware structure consisting of numerous multiplier processing elements (PEs) organized for concurrent arithmetic operations, where the array can be dynamically re-programmed for different precisions, operand sizes, dataflow patterns, and computational modalities. Designs span FPGA overlays, in-memory compute fabrics, neuromorphic/analog hardware, and emerging reconfigurable accelerators. Reconfigurability enables targeting application-specific power/performance/area trade-offs in modern signal processing, ML, and scientific workloads.
1. Architectural Foundations
Reconfigurable multiplier arrays share core architectural paradigms: pipelined and/or tiled arrays of PEs, reconfigurable datapaths for variable precision, hierarchical block structures for spatiotemporal flexibility, and runtime control mechanisms for mode switching.
- Bit-serial FPGA overlays: BISMO overlays (Umuroglu et al., 2018, Umuroglu et al., 2019) arrange dot-product units (DPUs) in arrays, each DPU performing AND-popcount on bit-serial inputs and supporting precision scaling via instruction-programmable decomposition of w-bit operands into binary matrix multiplies. Urdhva Tiryagbhyam and Karatsuba techniques are employed for run-time reconfigurable floating-point multipliers (S et al., 2019).
- In-memory computing (IMC): SRAM-based arrays (Lee et al., 2020) implement bit-parallel 6T cell macros with near-memory logic for full-adder-based carry propagation and "add-and-shift" multiplication, offering hardware-controlled bitwidth slicing and energy-efficient bitline (BL)-boosted compute.
- Neuromorphic/resistive fabrics: FPCA architectures (Zidan et al., 2016) exploit monolithic RRAM crossbars, using adaptive row/column address masking and on-chip ADC/DAC interface logic to allocate tiles for storage, digital, or analog multiplication, supporting both digital tree-reduction and analog dot-product computation.
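The AND-popcount scheme underlying BISMO-style DPUs can be sketched in a few lines of Python. This is an illustrative behavioral model, not the actual RTL; the function names are ours:

```python
def bit_planes(vec, w):
    """Decompose an unsigned integer vector into w binary bit planes:
    plane i holds bit i of every element."""
    return [[(x >> i) & 1 for x in vec] for i in range(w)]

def bitserial_dot(a_planes, b_planes, wa, wb):
    """Dot product of two unsigned vectors, bit-serial style: every pair
    of bit planes is combined with AND + popcount, scaled by 2^(i+j)."""
    acc = 0
    for i in range(wa):            # bit position of operand A
        for j in range(wb):        # bit position of operand B
            # binary dot product = popcount of the element-wise AND
            pc = sum(a & b for a, b in zip(a_planes[i], b_planes[j]))
            acc += pc << (i + j)
    return acc

# 3*2 + 5*7 = 41, reconstructed from 3x3 = 9 binary dot products
result = bitserial_dot(bit_planes([3, 5], 3), bit_planes([2, 7], 3), 3, 3)
```

Precision scaling then amounts to changing `wa` and `wb` in the instruction stream: higher precision simply issues more binary passes over the same hardware.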
2. Reconfiguration Mechanisms
Reconfigurability is achieved by dynamic switching of precision, mode, or dataflow at runtime via control registers, mode-select inputs, or programmable instruction streams.
- Precision Scaling: The bit-serial overlay decomposes a multiplication of w_a-bit by w_b-bit matrices as A·B = Σ_{i=0}^{w_a−1} Σ_{j=0}^{w_b−1} 2^(i+j) A_i·B_j, where A_i and B_j are binary bit planes, yielding w_a·w_b independent binary GEMMs; hardware supports any precision via software-generated instruction sequences (Umuroglu et al., 2018, Umuroglu et al., 2019). Floating-point multiplier cores use a mode-select register to gate dedicated mantissa multiplier slices and reroute outputs with sub-cycle latency (S et al., 2019). In-memory arrays exploit per-column carry-chain multiplexing and grouping for 2b/4b/8b/16b operation (Lee et al., 2020).
- Operational Modes: MARCA's array (Li et al., 2024) employs a 64-bit RCU_CONFIG word to broadcast mode changes (matrix multiply, element-wise, EXP, SiLU) to all PEs, with pipeline flush cost of 2–3 cycles. FPCA uses address and voltage reprogramming for tile assignment or mode-changing (digital, analog, storage) (Zidan et al., 2016).
- Tiling and Arithmetic Partitioning: FPGA overlays and SRAM IMC architectures tile matrices into subblocks, dynamically mapping parallel block multiplies (e.g., 16×16 in MARCA, arbitrary N in IMC), and can re-partition tile dimensions at runtime without full resynthesis (Li et al., 2024, Lee et al., 2020).
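The runtime tile-partitioning idea can be modeled in software: the full product is assembled from tile-sized block multiplies whose size is an ordinary runtime parameter. This is a behavioral sketch, not any specific accelerator's mapping:

```python
def tiled_matmul(A, B, tile):
    """Block-wise matrix multiply: the product is built from tile-sized
    sub-multiplies, mirroring how a PE array is re-partitioned at runtime
    (the tile size is a parameter, not a synthesis-time constant)."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # one tile-level block multiply, as dispatched to the array
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        for kk in range(k0, min(k0 + tile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C
```

Changing `tile` alters only the dispatch order of partial sums, so results are identical across partitionings; on hardware the choice trades buffer footprint against parallelism.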
3. Dataflow and Parallelism
Distinct dataflow organizations enable high arithmetic throughput and memory locality.
- Matrix Multiplication Tiling: MARCA (Li et al., 2024) arranges a 16×16 RPE grid per RCU, feeding tile pairs and reducing per-column. BISMO overlays perform fetch-execute-result pipelines across DPU arrays, using buffer banks for input-output decoupling (Umuroglu et al., 2018, Umuroglu et al., 2019). Strassen-based hierarchical blocks optimize PE mapping (S et al., 2019).
- Digit-Level Pipelining: Online most-significant-digit-first (MSDF) multiplier arrays (Usman et al., 2023) unfold N+δ stages, each computing one product digit per cycle; tile-wise aggregation supports M parallel dot-products.
- BL-Compute in IMC: SRAM bit-parallel architectures (Lee et al., 2020) employ simultaneous BL evaluation in all columns, combining with near-memory FA logic for full parallel add-and-shift multiplication, iterating partial sums through dummy array registers.
| Architecture | Dataflow Structure | Tile/Block Size |
|---|---|---|
| MARCA | 16×16 RPE grid + reduction tree | 16×16 |
| BISMO | DPU array, pipelined | Configurable |
| FPCA | M-Core crossbar tiles | 32×32, 64×64 |
| IMC SRAM | Bit-parallel columns | 128×128 |
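The "add-and-shift" multiplication iterated by the IMC macro reduces to a conditional add of the shifted multiplicand for each multiplier bit, one bit per cycle (behavioral sketch; hardware performs the adds in near-memory full-adder logic):

```python
def add_and_shift_multiply(a, b, w):
    """Shift-and-add multiplication of two w-bit unsigned integers:
    each cycle, one bit of the multiplier b decides whether the
    shifted multiplicand a is added into the running partial sum."""
    partial = 0
    for i in range(w):              # one multiplier bit per cycle
        if (b >> i) & 1:
            partial += a << i       # conditional add of shifted multiplicand
    return partial                  # 2w-bit product
```

Latency therefore scales linearly with precision `w`, matching the per-bitwidth latency behavior reported for the SRAM IMC macro.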
4. Performance, Energy, and Scalability
Performance metrics include arithmetic throughput, energy per operation, scalability, and precision/area trade-offs.
- BISMO: Achieves 6.5–15.4 binary TOPS on mid-range FPGAs (PYNQ-Z1, Ultra96), scaling linearly with DPU count and precision (Umuroglu et al., 2018, Umuroglu et al., 2019). Resource usage attains LUT efficiency near 1.3× that of fixed-precision arrays for large DPU widths.
- MARCA: Peak throughput of 42.7 GMAC/s per RCU at 1 GHz; energy per MAC measured at 7.6 pJ, ~3×–10× more efficient than fixed architectures, reaching 1.37 TMAC/s on 32 RCUs (Li et al., 2024).
- FPCA: 3.39 T double-precision ops/s for arithmetic, 6.55 T ops/s analog BCNN mode; energy-efficient tile-level "count-ones" at sub-1.1 mW (Zidan et al., 2016).
- SRAM IMC: At 2.25 GHz, 8.09 TOPS/W for multiplication; area overhead only 5.2% above standard SRAM, scaling latency linearly with precision (Lee et al., 2020).
Precision reconfiguration at run time optimizes the power-delay product for variable application requirements; e.g., reducing the mantissa size yields a reported 25% improvement while cutting per-PE energy by up to 70% (S et al., 2019).
5. Reconfigurable Array Control and Buffer Management
Efficient runtime control and data buffering schemes are essential for dynamic workload adaptation.
- Instruction-Based Overlay Engines: BISMO and Strassen IP-cores utilize microcoded FSMs and compact programmable instructions (RunFetch, RunExecute, etc.) for decoupling computation from fetch/writeback, exposing on-the-fly adaptation of precision and tiling (Umuroglu et al., 2018, Umuroglu et al., 2019, S et al., 2019).
- Mode-Control Broadcasting: MARCA's per-RCU configuration enables atomic mode and reduction-tree switching (Li et al., 2024).
- Buffer Management: Intra-op buffer reuse (linear ops) and inter-op buffer partition (element-wise ops, SSM chains) maximize on-chip data reuse in deep ML model execution (Li et al., 2024).
- Precision/Size Gating: Online multiplier arrays (Usman et al., 2023) support clock-gating or slice tie-off to dynamically downscale precision and array size in response to runtime needs.
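A minimal behavioral model of such an instruction-driven engine illustrates how buffered fetch, execute, and writeback stages decouple (the instruction names follow the text; the class and its buffer handling are hypothetical):

```python
class OverlayEngine:
    """Toy model of a microcoded overlay controller: RunFetch stages
    operands into on-chip buffers, RunExecute runs a buffered dot
    product, RunResult writes the result back to host memory."""

    def __init__(self):
        self.buffers = {}   # on-chip buffer banks: name -> vector
        self.results = {}   # result registers: name -> scalar

    def run(self, program, memory):
        for op, *args in program:
            if op == "RunFetch":        # memory -> on-chip buffer
                dst, src = args
                self.buffers[dst] = memory[src]
            elif op == "RunExecute":    # dot product on buffered operands
                dst, a, b = args
                self.results[dst] = sum(
                    x * y for x, y in zip(self.buffers[a], self.buffers[b]))
            elif op == "RunResult":     # result register -> memory
                (src,) = args
                memory[src] = self.results[src]
        return memory

memory = {"A": [1, 2, 3], "B": [4, 5, 6]}
program = [("RunFetch", "a", "A"), ("RunFetch", "b", "B"),
           ("RunExecute", "r", "a", "b"), ("RunResult", "r")]
OverlayEngine().run(program, memory)   # memory["r"] holds the dot product
```

Because precision and tiling live in the instruction stream rather than the datapath, adaptation requires only emitting a different program, not reconfiguring the fabric.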
6. Technology-Specific Implementations
Reconfigurable parallel multiplier arrays have been realized in diverse substrates.
- FPGA overlays: Employ lookup-table and carry-save architectures (LUT6, DSP48E1), supporting integer/floating-point, multi-precision, and matrix-multiplication workloads (Umuroglu et al., 2018, Umuroglu et al., 2019, S et al., 2019).
- SRAM-based IMC: 28nm CMOS 6T cell macros with column-local sensing, BL boosting, and near-memory logic for in-memory multiply/add/compression (Lee et al., 2020).
- Resistive crossbars: RRAM-based FPCA with hierarchical M-Core/M-Processor partitioning for analog/digital signal processing (Zidan et al., 2016).
- Optical hardware: Coherent free-space setups using spatial light modulators (SLMs) and digital micromirror devices (DMDs) for vector-matrix/simultaneous inner-product operations at ~3,000 parallel computations per frame (Spall et al., 2020).
- Quantum/reversible logic: Incorporating reversible CIFM multiplier blocks with small 4×4-bit modules for quantum/nano/optical computing platforms [0610090].
7. Application Domains and Prospective Extensions
Reconfigurable multiplier array architectures target:
- Deep neural network inference and training, benefiting from variable-precision arithmetic and parallel block-wise contraction (Li et al., 2024, Lee et al., 2020).
- Signal/image processing with in-memory and crossbar structures using BL compute, analog dot-product, and fast pipelined reduction (Zidan et al., 2016, Lee et al., 2020, Usman et al., 2023).
- Scientific computing needing Strassen or Karatsuba-optimized matrix multiplication at adjustable precision (S et al., 2019).
- Special-purpose accelerators: Optical neural networks, Ising machines, neuromorphic arrays, and quantum processors (Spall et al., 2020) [0610090].
Scaling strategies include expanding block/tile sizes, exploiting higher-speed modulator technology (optical, RRAM), and incorporating software-managed instruction streams for adaptive work partitioning across heterogeneous array fabrics.
In summary, reconfigurable parallel multiplier arrays provide a unified hardware approach to high-throughput, architecture-adaptive multiplication across a range of substrates, supporting runtime trade-offs in precision, power, latency, and array size, and underpinning efficient implementation of modern computational kernels in diverse domains [0610090].