FFT Accumulation Method for SCD Estimation
- FAM is a technique that estimates spectral correlation density in cyclostationary signals using efficient FFT operations and optimized algorithmic stages.
- It decomposes the processing into framing, demodulation, and FFT2 stages, enabling real-time signal analysis and hardware acceleration.
- Recent implementations on AMD Versal AiE demonstrate notable speed, energy efficiency, and scalability compared to traditional CPU and GPU approaches.
The FFT Accumulation Method (FAM) is a practical approach for estimating the spectral correlation density (SCD) in cyclostationary signal analysis, using efficient implementations based on the fast Fourier transform (FFT). FAM is extensively applied in real-time signal processing scenarios that require cyclostationary analysis, such as the characterization and detection of human-made signals. Due to the high computational complexity of SCD estimation—even with FFT-based techniques—hardware mapping and algorithmic optimization are essential for feasible real-time processing. Recent work details a high-speed, parallel hardware realization of FAM running on the AMD Versal AI Engine (AiE) array, providing comprehensive insights into algorithmic structure, resource utilization, and comparative efficiency relative to CPUs and GPUs (Li et al., 22 Jun 2025).
1. Mathematical Formulation of FAM
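FAM targets the estimation of the spectral correlation density, written here in standard notation as the time-smoothed cyclic cross periodogram:

$$S_{xy_T}^{\alpha}(n, f)_{\Delta t} = \frac{1}{T}\left\langle X_T\!\left(n, f + \tfrac{\alpha}{2}\right) Y_T^{*}\!\left(n, f - \tfrac{\alpha}{2}\right) \right\rangle_{\Delta t},$$

where $X_T(n, f)$ is the short-time Fourier transform (STFT) or "complex demodulate" of the discrete input signal $x(n)$ at block $n$ (and $Y_T$ likewise for $y(n)$). In FAM, construction proceeds by generating the complex demodulates through decimation, windowing, and an $N_P$-point FFT per block:

$$X_T(kL, f_m) = \sum_{r=0}^{N_P-1} a(r)\, x(kL + r)\, e^{-j 2\pi m (kL + r)/N_P}, \qquad f_m = \frac{m f_s}{N_P},$$

for $k = 0, \ldots, P-1$, where the frame stride is $L$, $a(r)$ denotes the analysis window, and $P$ is the total number of frames.
By substituting the complex demodulates into the SCD definition and specializing to the autocorrelation case ($y = x$), the FAM estimate is expressed as:

$$S_{x}^{\alpha_i + q\Delta\alpha}(f_j) = \frac{1}{P}\sum_{k=0}^{P-1} X_T(kL, f_m)\, X_T^{*}(kL, f_n)\, e^{-j 2\pi k q / P},$$

where $\alpha_i = f_m - f_n$, $f_j = (f_m + f_n)/2$, and $\Delta\alpha = f_s/(PL)$. Convergence to the true SCD is guaranteed as the observation time grows and the spectral resolution is refined ($\Delta t \to \infty$, then $\Delta f \to 0$) (Li et al., 22 Jun 2025).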
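To make the index-to-frequency mapping concrete, the sketch below computes the resolutions implied by a FAM configuration. It assumes the textbook FAM relations: channelizer bin spacing $f_s/N_P$ and cycle-frequency spacing $f_s/(PL)$, where $N_P$ is the first FFT size, $L$ the frame stride, and $P$ the number of frames. The function name and the example parameter values are hypothetical and are not taken from Li et al.

```python
def fam_resolutions(fs, n_p, stride, n_frames):
    """Return (delta_f, delta_alpha, tf_product) for a FAM configuration."""
    delta_f = fs / n_p                       # spacing of the N_P-point channelizer bins
    delta_alpha = fs / (n_frames * stride)   # second FFT spans n_frames * stride samples
    tf_product = delta_f / delta_alpha       # equals n_frames * stride / n_p
    return delta_f, delta_alpha, tf_product


# Hypothetical configuration, for illustration only
delta_f, delta_alpha, tf_product = fam_resolutions(
    fs=1.0, n_p=256, stride=64, n_frames=32
)
```

A larger `tf_product` means more averaging per SCD cell, which is the usual lever for trading estimator variance against observation length.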
2. Algorithmic Decomposition and Computational Stages
FAM implementation decomposes into three primary stages:
- Framing (Data Segmentation):
  - The normalized input sequence of length $N$ is partitioned into $P$ overlapping blocks, each of size $N_P$, with stride $L$.
  - Output: Decimated matrix $X$ of size $N_P \times P$.
- Demodulate (Windowing, Down-conversion, FFT):
  - Each block is windowed (e.g., with a Chebyshev window), phase-shifted for down-conversion, and transformed via an $N_P$-point FFT to generate the demodulate output $Y$.
- FFT2 (Cross-Multiplication and Accumulation):
  - All center-frequency bin pairs $(m, n)$ are enumerated.
  - Each pair's sample-wise products across frames form a sequence of length $P$.
  - A $P$-point FFT is performed on this sequence; squared magnitudes populate the SCD matrix indexed by cycle frequency $\alpha$ and spectral frequency $f$.
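The Framing stage can be sketched in NumPy as follows. This is a minimal illustration, not the AiE kernel: the function name is hypothetical, only complete frames are kept, and the normalization step is omitted.

```python
import numpy as np


def frame_signal(x, n_p, stride):
    """Split x into overlapping columns of length n_p, advancing by stride samples."""
    x = np.asarray(x)
    n_frames = (len(x) - n_p) // stride + 1            # P: complete frames only
    starts = stride * np.arange(n_frames)              # first sample of each frame
    idx = starts[None, :] + np.arange(n_p)[:, None]    # (n_p, P) grid of sample indices
    return x[idx]


# 20 samples, frames of 8 with stride 4: four frames starting at 0, 4, 8, 12
X = frame_signal(np.arange(20), n_p=8, stride=4)
```

Each column of `X` is one frame, so consecutive columns overlap by `n_p - stride` samples.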
3. FAM Pipeline Pseudocode for AiE
A representative pseudocode abstraction (Algorithm 1) for a Versal AiE implementation is:
    Function Framing(data_in):
        data_norm ← Normalise(data_in)
        for k = 0 … P–1 do                               // one column per frame
            X(:,k) ← data_norm[k·L : k·L + N_P – 1]
        return X                                         // N_P × P

    Function Demodulate(X):
        for k = 0 … P–1 do
            win ← X(:,k) ∘ chebwin(N_P)                  // element-wise window
            fftout ← FFT_NP(win)
            for m = 0 … N_P–1 do
                Y(m,k) ← fftout(m) × e^{–j2πmkL/N_P}     // down-convert bin m of frame k
        return Y                                         // N_P × P

    Function FFT2(Y):
        for m = 0 … N_P–1 do
            for n = 0 … N_P–1 do
                z ← Y(m,:) ∘ conj(Y(n,:))                // length-P product across frames
                Zfft ← FFT_P(z)
                store |Zfft|^2 into output window
This sequence matches the Framing → Demodulate → FFT2 pipeline reported for the Versal AiE, in which tiles are statically partitioned to each function (Li et al., 22 Jun 2025).
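For readers who want to experiment off-target, the three stages can be approximated in a few lines of NumPy. This is a sketch under stated assumptions, not the paper's implementation: `fam_scd` is a hypothetical name, a Hamming window is swapped in for the Chebyshev window to avoid a SciPy dependency, normalization is omitted, and the full channel-pair tensor is held in memory, which only suits small sizes.

```python
import numpy as np


def fam_scd(x, n_p, stride):
    """Toy FAM: squared magnitude of the P-point FFT for every channel pair."""
    x = np.asarray(x, dtype=complex)
    n_frames = (len(x) - n_p) // stride + 1
    starts = stride * np.arange(n_frames)
    X = x[starts[None, :] + np.arange(n_p)[:, None]]     # framing: (n_p, P)

    window = np.hamming(n_p)[:, None]                     # stand-in for chebwin
    Y = np.fft.fft(X * window, axis=0)                    # N_P-point FFT per frame
    bins = np.arange(n_p)[:, None]
    Y = Y * np.exp(-2j * np.pi * bins * starts[None, :] / n_p)   # down-conversion

    # FFT2: multiply every channel pair along the frame axis, then a P-point FFT
    Z = Y[:, None, :] * np.conj(Y[None, :, :])            # (n_p, n_p, P)
    return np.abs(np.fft.fft(Z, axis=2)) ** 2


# Usage: a complex tone centred on bin 3 should peak at channel pair (3, 3), q = 0
n_p, stride, k0 = 16, 4, 3
x = np.exp(2j * np.pi * k0 * np.arange(44) / n_p)
S = fam_scd(x, n_p, stride)
peak = np.unravel_index(np.argmax(S), S.shape)
```

In the usual FAM indexing, entry `S[m, n, q]` corresponds to spectral frequency $(f_m + f_n)/2$ and cycle frequency $f_m - f_n + q\,\Delta\alpha$, so the tone's peak sits at zero cycle frequency, as expected for a stationary component.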
4. Dataflow Mapping and Hardware Resource Utilization
FAM’s mapping to the AMD Versal VCK5000 AiE array leverages statically partitioned processing tiles for each algorithmic stage. For the parameter configuration evaluated by Li et al., the allocation is:
- Framing: 1 normalization tile + 4 channel-router tiles
- Demodulate: 4 Fam_stage1 tiles + 2 Conv_stage1 tiles
- FFT2: 128 Fam_stage2 tiles (handling two frequency channels each)
Total tiles: 137 (34.25% of the AiE’s 400 tiles). On-chip memory per tile is 32 KB DMEM (16 KB per input buffer), mandating careful ping-pong buffering to overlap computation and streaming. AiE↔PL interfaces support up to 234 streams, with FFT2 heavily utilizing 128.
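The buffering constraint can be illustrated with a small host-side model. The sketch below is not AiE kernel code: it assumes 8-byte complex samples (a hypothetical choice, since the sample format is not stated here) and uses hypothetical names to show how two 16 KB buffers alternate between being filled and being processed.

```python
BUFFER_BYTES = 16 * 1024   # one input buffer, per the 16 KB figure above
SAMPLE_BYTES = 8           # assumption: complex sample stored as two 32-bit floats


def samples_per_buffer(buffer_bytes=BUFFER_BYTES, sample_bytes=SAMPLE_BYTES):
    """Number of whole samples that fit in one buffer."""
    return buffer_bytes // sample_bytes


def ping_pong_schedule(n_blocks):
    """Yield (step, buffer_being_filled, buffer_being_processed) for n_blocks blocks."""
    for step in range(n_blocks + 1):
        fill = step % 2 if step < n_blocks else None       # incoming block, if any
        process = (step - 1) % 2 if step > 0 else None     # block loaded in the previous step
        yield step, fill, process


capacity = samples_per_buffer()
schedule = list(ping_pong_schedule(3))
```

Except for the first and last steps, one buffer is always being filled while the other is being processed, which is what lets streaming overlap with computation.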
Resource allocation is summarized as:
| Resource | Total Avail. | FAM Usage |
|---|---|---|
| PL Registers | 1,739,432 | 113,686 (6.6%) |
| PL LUTs | 860,336 | 107,601 (12.7%) |
| LUT-as-MEM | 446,367 | 960 (0.2%) |
| BRAM | 933 | 37 (4.0%) |
| URAM | 463 | 0 (0.0%) |
| AiE Tiles | 400 | 137 (34.3%) |
| AiE↔PL IO Streams | 234 | 130 |
The Framing and Demodulate stages are primarily memory-bound due to windowing and repeated FFT_NP usage, while FFT2 is compute-bound (owing to conjugate multiplication and P-point FFT instantiations). The AiE array operates at 1 GHz; programmable logic (PL) engines at 312.5 MHz (Li et al., 22 Jun 2025).
5. Performance Metrics and Comparative Analysis
Execution times for FAM and for the strip spectral correlation analyzer (SSCA), measured on a CPU (Xeon), a GPU (RTX 3090), and the Versal VCK5000, are as follows (speedups are relative to the CPU):
| Platform | FAM Time | FAM Speedup | SSCA Time | SSCA Speedup |
|---|---|---|---|---|
| CPU (Xeon) | 0.194 s | 1× | 11.3 s | 1× |
| GPU (3090) | 2.791 ms | 69.5× | 217 ms | 52.1× |
| VCK5000 | 0.630 ms | 307.9× | 114 ms | 99.1× |
FAM throughput is 3.25 MS/s on the Versal AiE, with a sustained performance of 189 GFLOP/s (8.6% of theoretical peak), primarily limited by memory movement in Framing and Demodulate stages. Dynamic power usage on VCK5000 is 17 W (idle 23 W, totaling 40 W); for the RTX 3090, dynamic is 117 W (idle 33 W, totaling 150 W). Resultant energy efficiency for FAM execution is 30.5× higher on the Versal AiE compared to the RTX 3090 for equivalent accuracy (Li et al., 22 Jun 2025).
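The 30.5× figure can be reproduced from the numbers quoted in this section if energy per run is taken as execution time multiplied by dynamic power. That interpretation is inferred from the arithmetic rather than stated in the source; the variable names are illustrative.

```python
# Figures quoted in this section
gpu_time_s, gpu_dynamic_w = 2.791e-3, 117.0   # RTX 3090
aie_time_s, aie_dynamic_w = 0.630e-3, 17.0    # Versal VCK5000

speedup = gpu_time_s / aie_time_s              # AiE versus GPU, about 4.43
gpu_energy_j = gpu_time_s * gpu_dynamic_w      # dynamic energy per FAM run
aie_energy_j = aie_time_s * aie_dynamic_w
energy_ratio = gpu_energy_j / aie_energy_j     # about 30.5
```

Using total power (150 W and 40 W) instead of dynamic power would give a ratio of roughly 16.6, so the reported advantage depends on which power figure is counted.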
6. Portability and Design Lessons
Several methodological insights enable effective porting of FAM to other hardware platforms:
- Pure AiE implementation eliminates PL⇔AiE data-bus bottlenecks, but mandates rigorous 16 KB DMEM buffer and ping-pong scheme management for effective streaming and compute overlap.
- Tile count is given by a closed-form expression in the FAM parameters, so the design can be rescaled as those parameters change.
- Dedicated mapping of conjugate multiplication and FFT processing pairs to separate tiles in FFT2 exploits AiE's parallel multiply-accumulate (MAC) architecture and local DMEM.
- The programmable logic (PL) is reserved for bulk DDR transactions, double-buffered data transposes, and intermediate storage when the data exceeds on-chip capacity.
- This design methodology—data segmentation, statically partitioned kernels, and hierarchical buffering—is generalizable to other FPGA or AI-Engine-class platforms by adapting buffer and stream parameters.
A plausible implication is that as buffer and stream bandwidths improve, FAM realizations may scale to even higher throughputs and larger SCD problem sizes (Li et al., 22 Jun 2025).
7. Context and Significance
The real-time estimation of SCD via FAM, as realized on the Versal AiE, is a substantial practical improvement for cyclostationary analysis of complex signals. With a speedup of roughly 4.4× (0.630 ms versus 2.791 ms) and about 30× greater energy efficiency than the RTX 3090 GPU implementation, this hardware-optimized methodology both demonstrates the scaling advantages of AI-engine arrays and establishes a methodological template for efficient SCD estimators on emerging FPGA and signal processing platforms (Li et al., 22 Jun 2025).