LoReB Hybrid Pipeline for FFT Computation
- The paper presents a unified architecture integrating pipelined and memory-based methods to reduce FFT stage count by up to 40%.
- LoReB Hybrid Pipeline is defined by its modular reconfigurability using high-radix MDC units and advanced permutation schemes for conflict-free memory access.
- Empirical benchmarks on FPGA show superior throughput and area efficiency, with hardware utilization rates of 60-80% for varying degrees of parallelism.
The LoReB Hybrid Pipeline denotes a class of architectures that integrates pipelined and memory-based strategies for fast Fourier transform (FFT) computation. It relies on high-radix multi-path delay commutator (MDC) modules and bit-permutation schemes to provide adaptability, high utilization, and conflict-free memory access on large-scale hardware. It is characterized by modular reconfigurability between continuous-flow (pipelined) and in-place (memory-based) operation across signal sizes and degrees of parallelism, as detailed in "Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix-$2^k$ FFT in Large Size Processing" (Zhao et al., 2 Jan 2025).
1. Architectural Overview
The pipeline combines front-end pipelined MDC units (enabling high throughput) with a back-end in-place memory-processing subsystem (optimized for area efficiency during large-size signal handling). Its block-level structure includes:
- Data-Reordering Module: 2P single-port memory banks for real/imaginary samples, with address-generation and bit-dimension permutation units (denoted as σ₁, σ₂, σ₃).
- FFT Core Processor: P parallel MDC units, each instantiated for radix-$2^k$ butterfly structures.
Operational modes:
- Pipeline Mode: Each stage’s MDCs operate across two independent bank sets; output flows horizontally, stage by stage, maximizing data throughput.
- Memory-Based Mode: A single set of banks is reused; data is reordered and streamed in-place over multiple iterations, conserving hardware area.
2. Radix-$2^k$ MDC Unit Design
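The pipeline implements the generalized DFT computation $X[m] = \sum_{n=0}^{N-1} x[n]\, W_N^{nm}$, with $W_N = e^{-j 2\pi / N}$, using multi-dimensional index mappings for radix-$2^k$ factorization. Stage $s$ applies radix-$2^{k_s}$ across $S$ stages, so for a power-of-two $N$ the exponents satisfy $\sum_{s=1}^{S} k_s = \log_2 N$. Twiddle factors are decomposed into constant (C), trivial (T), and non-trivial (NT) rotators, positioned between delay lines and butterfly chains.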
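As a minimal sketch of what such an index mapping does, the following Python snippet splits an N-point DFT into two smaller stages joined by a twiddle multiplication and checks the result against a direct DFT. The function names (`dft`, `dft_two_stage`) and the 4 × 8 split are illustrative assumptions, not taken from the paper; the hardware applies the same idea with radix-2 butterfly chains rather than generic sub-DFTs.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * m / n) for t in range(n))
        for m in range(n)
    ]


def dft_two_stage(x, n1, n2):
    """N-point DFT (N = n1 * n2) via a two-dimensional index mapping.

    Input index:  t = n2 * a + b   (0 <= a < n1, 0 <= b < n2)
    Output index: m = c + n1 * d   (0 <= c < n1, 0 <= d < n2)
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # Stage 1: for each b, an n1-point DFT over a. inner[b][c]
    inner = [dft([x[n2 * a + b] for a in range(n1)]) for b in range(n2)]

    # Twiddle multiplication by W_N^(b*c) between the two stages. rotated[c][b]
    rotated = [
        [inner[b][c] * cmath.exp(-2j * cmath.pi * b * c / n) for b in range(n2)]
        for c in range(n1)
    ]

    # Stage 2: for each c, an n2-point DFT over b, written to m = c + n1 * d.
    out = [0j] * n
    for c in range(n1):
        column = dft(rotated[c])
        for d in range(n2):
            out[c + n1 * d] = column[d]
    return out


samples = [complex(i % 5, -(i % 3)) for i in range(32)]
max_error = max(
    abs(u - v) for u, v in zip(dft(samples), dft_two_stage(samples, 4, 8))
)
```

Applying the same split recursively to each sub-DFT is what produces a multi-stage factorization; `max_error` should sit at floating-point rounding level.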
Each MDC block comprises:
- A chain of radix-2 butterflies that processes the unit's inputs.
- Rotator circuits realizing constant (C), trivial (T), and non-trivial (NT) twiddle factor multiplications.
- Multiplexers for selectively bypassing hardware under lower-radix operation.
The reference graph instantiation for radix-$2^5$ (32-point) comprises five stages coded as C→T→C→T→NT, where C, T, and NT denote constant, trivial, and non-trivial rotators.
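The effect of a higher maximum radix on the number of stages can be illustrated with a small calculation. The sketch below greedily splits $\log_2 N$ into stage exponents no larger than a chosen maximum; `plan_stages` and `stage_reduction` are hypothetical helpers, and the greedy split is an assumption for illustration rather than the paper's scheduling rule.

```python
def plan_stages(log2_n, max_k):
    """Greedily split log2(N) into stage exponents k_s, each at most max_k."""
    if log2_n < 1 or max_k < 1:
        raise ValueError("log2_n and max_k must be positive")
    stages = []
    remaining = log2_n
    while remaining > 0:
        k = min(max_k, remaining)
        stages.append(k)
        remaining -= k
    return stages


def stage_reduction(log2_n, old_max_k, new_max_k):
    """Fractional drop in stage count when the maximum radix exponent grows."""
    old = len(plan_stages(log2_n, old_max_k))
    new = len(plan_stages(log2_n, new_max_k))
    return (old - new) / old


# N = 2^15: radix up to 2^3 needs 5 stages, radix up to 2^5 needs 3.
radix8_plan = plan_stages(15, 3)
radix32_plan = plan_stages(15, 5)
reduction = stage_reduction(15, 3, 5)
```

For $N = 2^{15}$ this gives five stages versus three, a 40% reduction. That is one configuration consistent with the "up to 40%" figure; the source does not state which sizes the figure refers to.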
3. Conflict-Free Memory Access
To preserve continuous dataflow and atomic read/write transactions:
- Pipeline Mode: Circular counters generate raw addresses partitioned into bank and offset fields. Bit permutations (σ₁ for reads, σ_Ri/σ_Wi for read/write interleaving) alternate natural and reversed bit patterns, avoiding cross-bank collisions.
- Memory-Based Mode: Dual complementary permutations (σ_{N,1} and an inverse-prefix permutation) are composed to compute the memory access paths, ensuring that the address spaces are disjoint at each iteration.
In both modes, bank mapping resolves the MSBs of permuted addresses, with LSBs used for intra-bank offsetting.
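To see why a bit permutation matters for bank conflicts, consider the sketch below. It splits an address into a bank index (MSBs) and an offset (LSBs), then compares which banks a group of consecutive addresses touches with and without a full bit reversal. The 6-bit address width, the four-bank layout, and the helper names are assumptions chosen for illustration; the paper's actual permutations are stage-dependent.

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def split_address(addr, width, bank_bits):
    """Return (bank, offset): bank from the MSBs, offset from the LSBs."""
    offset_bits = width - bank_bits
    return addr >> offset_bits, addr & ((1 << offset_bits) - 1)


def banks_hit(addresses, width, bank_bits, permute):
    """Bank index touched by each address after applying `permute`."""
    return [
        split_address(permute(a, width), width, bank_bits)[0] for a in addresses
    ]


WIDTH, BANK_BITS = 6, 2       # 64 addresses spread over 4 banks (assumed)
group = [8, 9, 10, 11]        # four samples needed in the same cycle

natural = banks_hit(group, WIDTH, BANK_BITS, lambda a, w: a)
permuted = banks_hit(group, WIDTH, BANK_BITS, bit_reverse)
# natural  -> every access lands in bank 0 (conflict)
# permuted -> the accesses land in four different banks
```

With natural addressing, the four accesses differ only in their LSBs and therefore collide in one bank. After the reversal those LSBs become the bank-select MSBs, so each access reaches a different bank.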
4. Bit-Dimension Permutation Strategy
The end-to-end permutation from one stage's output to the next stage's input is formalized as:
- σ₁ (Serial-Bit Reversal): Inverts the $w$ LSBs, where $w$ is context-sensitive to radix, bank count, and stage index.
- σ₂ (Parallel-Branch Reversal): Reverses MSBs across parallel banks.
- σ₃ (Reshuffle): Swaps bits between offsets and branches as required by the MDC interconnect, potentially cascading up to six stages in the largest configurations.
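The three permutation types above can be sketched as operations on the bits of an address. The functions below are simplified stand-ins, assuming σ₁ reverses a low-order field, σ₂ reverses a high-order field, and σ₃ swaps a single pair of bit positions; the names, parameters, and the single-swap form of σ₃ are assumptions, not the paper's definitions.

```python
def reverse_field(value, lo, length):
    """Reverse `length` bits of `value` starting at bit `lo`; keep the rest."""
    mask = ((1 << length) - 1) << lo
    field = (value & mask) >> lo
    reversed_field = 0
    for _ in range(length):
        reversed_field = (reversed_field << 1) | (field & 1)
        field >>= 1
    return (value & ~mask) | (reversed_field << lo)


def sigma1(addr, w):
    """Serial-bit reversal: reverse the w least significant bits."""
    return reverse_field(addr, 0, w)


def sigma2(addr, width, m):
    """Parallel-branch reversal: reverse the m most significant bits."""
    return reverse_field(addr, width - m, m)


def sigma3(addr, i, j):
    """Reshuffle (simplified): swap bit positions i and j."""
    if ((addr >> i) & 1) != ((addr >> j) & 1):
        addr ^= (1 << i) | (1 << j)
    return addr


WIDTH = 6
# Compose the three steps over the whole 6-bit address space.
mapped = [sigma3(sigma2(sigma1(a, 3), WIDTH, 2), 0, 5) for a in range(1 << WIDTH)]
is_bijection = sorted(mapped) == list(range(1 << WIDTH))
```

Because each step only rearranges bit positions, the composition is a bijection on the address space: every memory location is produced exactly once, which is a precondition for in-place reordering without overwriting live data.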
5. Data Flow and Reordering Algorithm
Stage-wise data handling is summarized in the following pseudo-code, taken directly from Zhao et al. (2 Jan 2025):
```
for stage s=1 to S:
    compute radix k_s
    compute block-size B=2^k_s
    compute w = w_N^{s,k,P}
    for circular_counter c in 0 … (N/2P)-1:
        raw_addr = (c_{n-1}…c_0)
        read_addr = σ₁(raw_addr; w)   # serial-bit reversal
        bank_id = top p bits of read_addr
        offset = lower n-p bits of read_addr
        data[bank_id][offset] → fetch P×B samples
        permuted_branches = σ₂( data )
        for each required σ₃ stage:
            permuted_branches = reshuffle(permuted_branches; h, l)
        result_blocks = MDC_radix(permuted_branches, twiddles)
        write_addr = matching write-permutation(read_addr)
        for p=0 to P-1:
            bank_id' = top p bits of write_addr
            offset' = lower bits of write_addr
            memory[bank_id'][offset'] ← result_blocks[p]
```
6. Benchmark Results and Performance Analysis
The LoReB design yields improved iteration counts, computation times, and hardware utilization versus prior art:
| Design | Iterations | Cycle Time |
|---|---|---|
| Tsai’11 (radix-2…2³) | | |
| Kaya’23 (radix-2) | | |
| Wang’20 (radix-2…2³) | | |
| LoReB (radix-2…2⁵) | | |
FPGA implementation (Xilinx VCU118, xcvu9p) delivers:
- DSP48E2 slices:
- LUTs:
- Flip-Flops:
- BlockRAM: $444$ × 36 Kb
- UltraRAM: $768$ × 288 Kb
- Maximum Clock: $196.8$ MHz
- Throughput for , : approximately
Hardware utilization data indicate that LoReB maintains utilization in the 60-80% range across the degrees of parallelism tested (see the comparison with Garrido’13 [radix-2⁵]).
7. Contextual Significance and Implications
LoReB’s key contributions are:
- Unified architecture for both pipelined and memory-based FFT operations.
- Radix-$2^k$ MDC implementation enabling up to a 40% reduction in stage count versus radix-2/2³ designs.
- General permutation strategies compatible with arbitrary signal length, radix, and parallelism.
- Demonstrated superior hardware resource efficiency and throughput for large-scale applications.
The conflict-free permutation logic and MDC-based computation position LoReB as a scalable solution for configurable, high-performance FFT processors in contemporary digital signal processing pipelines (Zhao et al., 2 Jan 2025). A plausible implication is applicability in domains requiring reconfigurable throughput-to-area trade-offs without manual redesign. The formalization of data-dependent address permutations also suggests potential cross-applications in more general high-radix signal transformation circuits where data contention is a bottleneck.