FFT Accumulation Method for SCD Estimation

Updated 14 February 2026

FAM is a technique that estimates spectral correlation density in cyclostationary signals using efficient FFT operations and optimized algorithmic stages.
It decomposes the processing into framing, demodulation, and FFT2 stages, enabling real-time signal analysis and hardware acceleration.
Recent implementations on the AMD Versal AiE report a 307.9× speedup over a Xeon CPU and roughly 4.4× over an RTX 3090 GPU, with 30.5× better energy efficiency than the GPU.

The FFT Accumulation Method (FAM) is a practical approach for estimating the spectral correlation density (SCD) in cyclostationary signal analysis, using efficient implementations based on the fast Fourier transform (FFT). FAM is extensively applied in real-time signal processing scenarios that require cyclostationary analysis, such as the characterization and detection of human-made signals. Due to the high computational complexity of SCD estimation—even with FFT-based techniques—hardware mapping and algorithmic optimization are essential for feasible real-time processing. Recent work details a high-speed, parallel hardware realization of FAM running on the AMD Versal AI Engine (AiE) array, providing comprehensive insights into algorithmic structure, resource utilization, and comparative efficiency relative to CPUs and GPUs (Li et al., 22 Jun 2025).

1. Mathematical Formulation of FAM

FAM targets the estimation of the spectral correlation density: $S_x^{\alpha}(f) = \lim_{M\to\infty}\;\frac{1}{M}\sum_{m=0}^{M-1} X(f+\tfrac{\alpha}{2},m)\,X^*(f-\tfrac{\alpha}{2},m)$ where $X(f,m)$ is the short-time Fourier transform (STFT), or “complex demodulate”, of the discrete input signal $x(n)$ at block $m$. FAM generates the complex demodulates through decimation, windowing, and an $N_P$-point FFT per block: $X_T(pL, f_m) = \sum_{k=0}^{N_P-1} a(d-k)x(pL-d+k) e^{-j2\pi km/N_P} e^{-j2\pi m pL/N_P}$ for $p = 0, \dots, P-1$, where $m$ now indexes the frequency bin $f_m$ rather than the block, the frame stride is $L = N_P/4$, $a(\cdot)$ denotes the analysis window, and $P = 4N/N_P$ is the total number of frames.
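As a small illustration of the framing parameters above, the following Python sketch computes $L$ and $P$ from $N$ and $N_P$, and checks one consequence of the stride choice: with $L = N_P/4$, the down-conversion factor $e^{-j2\pi m pL/N_P}$ reduces to $(-j)^{mp}$ and so takes only four values. The function names are hypothetical and do not come from Li et al.

```python
import cmath


def frame_params(n_samples, n_p):
    """Return (stride L, frame count P) using L = N_P/4 and P = 4N/N_P."""
    if n_p % 4 or (4 * n_samples) % n_p:
        raise ValueError("expects N_P divisible by 4 and 4N divisible by N_P")
    return n_p // 4, 4 * n_samples // n_p


def downconversion_phase(m, p, n_p):
    """Phase term e^{-j 2 pi m p L / N_P} for bin m and frame p, with L = N_P/4."""
    stride = n_p // 4
    return cmath.exp(-2j * cmath.pi * m * p * stride / n_p)


stride, frames = frame_params(2048, 256)
print(stride, frames)  # 64 32

# With L = N_P/4 the phase collapses to (-j)**(m*p): only 1, -j, -1, j occur.
quarter_turns = [1, -1j, -1, 1j]
only_quarter_turns = all(
    abs(downconversion_phase(m, p, 256) - quarter_turns[(m * p) % 4]) < 1e-9
    for m in range(256)
    for p in range(frames)
)
print(only_quarter_turns)  # True
```

In principle this means the down-conversion step needs no general complex multiplier, only sign changes and real/imaginary swaps; whether the cited implementation exploits this is not stated in the source.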

By substituting the complex demodulates into the SCD definition and specializing to the autocorrelation case, the FAM estimate is expressed as: $S_x^{\alpha_{kl} + q\Delta\alpha}(pL, f_{kl})_{\Delta t} = \sum_{r=0}^{P-1} X_T(rL, f_k) X_T^*(rL, f_l) g_d(p-r) e^{-j2\pi rq/P}$ where $\alpha_{kl} = f_k - f_l$, $f_{kl} = (f_k + f_l)/2$, $\Delta\alpha = f_s/P$, and $q = \Delta f/\Delta\alpha$. As in the defining limit, the estimate converges to the true SCD as the number of averaged blocks $M \to \infty$ (Li et al., 22 Jun 2025).
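The sum over $r$ in the FAM estimate is a $P$-point DFT of the channel-pair product sequence, which is why the last stage can be computed with an FFT. The NumPy sketch below checks this numerically for one pair $(k, l)$. It assumes a rectangular $g_d$ (all ones) and uses random stand-in data for the demodulates; the function names are illustrative only.

```python
import numpy as np


def fam_pair_direct(x_k, x_l):
    """Evaluate sum_r X_k(r) conj(X_l(r)) e^{-j 2 pi r q / P} for q = 0..P-1."""
    num_frames = len(x_k)
    r = np.arange(num_frames)
    product = x_k * np.conj(x_l)
    return np.array(
        [np.sum(product * np.exp(-2j * np.pi * r * q / num_frames))
         for q in range(num_frames)]
    )


def fam_pair_fft(x_k, x_l):
    """Same quantity via a single P-point FFT of the product sequence."""
    return np.fft.fft(x_k * np.conj(x_l))


# Stand-in demodulate sequences for two channels over P = 32 frames.
rng = np.random.default_rng(0)
x_k = rng.standard_normal(32) + 1j * rng.standard_normal(32)
x_l = rng.standard_normal(32) + 1j * rng.standard_normal(32)

same = np.allclose(fam_pair_direct(x_k, x_l), fam_pair_fft(x_k, x_l))
print(same)
```

A non-rectangular $g_d$ would simply multiply the product sequence element-wise before the FFT.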

2. Algorithmic Decomposition and Computational Stages

FAM implementation decomposes into three primary stages:

Framing (Data Segmentation):
- The normalized input sequence $x(n)$ of length $N$ is partitioned into $P$ overlapping blocks, each of size $N_P$, with stride $L = N_P/4$.
- Output: Decimated matrix $X_{\rm De} \in \mathbb{R}^{N_P \times P}$.
Demodulate (Windowing, Down-conversion, FFT):
- Each block is windowed (e.g., with a Chebyshev window), phase-shifted for down-conversion, and transformed via an $N_P$-point FFT to generate the demodulate output $X_T(f_m)$.
FFT2 (Cross-Multiplication and Accumulation):
- All center-frequency bin pairs $(k, l)$ are enumerated.
- Each pair's sample-wise products $X_T(:, f_k) X_T^*(:, f_l)$ form a sequence of length $P$.
- A $P$-point FFT is performed on this sequence; squared magnitudes populate the SCD matrix indexed by cycle frequency $\alpha$ and spectral frequency $f$.

3. FAM Pipeline Pseudocode for AiE

A representative pseudocode abstraction (Algorithm 1) for a Versal AiE implementation is:

Function Framing(data_in):
    data_norm ← Normalise(data_in)
    for k=0…P–1 do
        X(:,k) ← data_norm(k·L : k·L+N_P–1)
    return X                                  // N_P × P, one frame per column

Function Demodulate(X):
    for k=0…P–1 do
        win ← X(:,k) ∘ chebwin(N_P)
        fftout ← FFT_NP(win)
        for m=0…N_P–1 do
            Y(m,k) ← fftout(m) × e^{–j2πmkL/N_P}   // phase depends on bin m and frame k
    return Y                                  // N_P × P, one channel per row

Function FFT2(Y):
    for m=0…N_P–1 do
        for n=0…N_P–1 do
            z ← Y(m,:) ∘ conj(Y(n,:))         // length-P product for channel pair (m, n)
            Zfft ← FFT_P(z)
            store |Zfft|^2 into output window

This sequence matches the Framing → Demodulate → FFT2 pipeline reported for the Versal AiE, in which tiles are statically partitioned to each function (Li et al., 22 Jun 2025).
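For readers who want to experiment off-target, the three functions can be sketched in NumPy as below. This is a reference-style approximation, not the AiE kernel code: peak normalisation, zero-padding of the final frames, and the use of `np.hamming` in place of the Chebyshev window are all assumptions made to keep the example dependency-free, and the whole $N_P \times N_P \times P$ output is held in memory at once, so it is only practical for small sizes.

```python
import numpy as np


def framing(x, n_p):
    """Split x into P = 4N/N_P frames of length N_P with stride L = N_P/4."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    stride, num_frames = n_p // 4, 4 * n // n_p
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak  # assumption: "Normalise" means peak normalisation
    # Assumption: zero-pad so the last frames are full length.
    pad = max(0, (num_frames - 1) * stride + n_p - n)
    padded = np.concatenate([x, np.zeros(pad, dtype=complex)])
    return np.stack(
        [padded[k * stride:k * stride + n_p] for k in range(num_frames)], axis=1
    )  # shape (N_P, P)


def demodulate(frames):
    """Window each frame, take an N_P-point FFT, and down-convert each bin."""
    n_p, num_frames = frames.shape
    stride = n_p // 4
    window = np.hamming(n_p)  # stand-in for the Chebyshev window
    spectra = np.fft.fft(frames * window[:, None], axis=0)
    m = np.arange(n_p)[:, None]         # frequency-bin index
    k = np.arange(num_frames)[None, :]  # frame index
    return spectra * np.exp(-2j * np.pi * m * k * stride / n_p)


def fft2(demod):
    """P-point FFT of every channel-pair product; returns |.|^2, shape (N_P, N_P, P)."""
    products = demod[:, None, :] * np.conj(demod[None, :, :])
    return np.abs(np.fft.fft(products, axis=2)) ** 2


# Small demo: a complex tone centred on bin 5 of a 32-point channeliser.
n, n_p, tone_bin = 256, 32, 5
x = np.exp(2j * np.pi * tone_bin * np.arange(n) / n_p)
scd = fft2(demodulate(framing(x, n_p)))
peak_index = tuple(int(i) for i in np.unravel_index(np.argmax(scd), scd.shape))
print(scd.shape, peak_index)  # peak expected at (5, 5, 0) for an on-bin tone
```

The first two indices of the output are the channel pair $(m, n)$ and the third is the $P$-point FFT bin $q$; mapping these to $(\alpha, f)$ coordinates follows the definitions of $f_{kl}$ and $\Delta\alpha$ given earlier and is left out of the sketch.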

4. Dataflow Mapping and Hardware Resource Utilization

FAM’s mapping to the AMD Versal VCK5000 AiE array leverages statically partitioned processing tiles for each algorithmic stage. For $N=2048$, $N_P=256$, and $F=2048$:

Framing: 1 normalization tile + 2 channel-router tiles
Demodulate: 4 Fam_stage1 tiles + 2 Conv_stage1 tiles
FFT2: 128 Fam_stage2 tiles (handling two frequency channels each)

Total tiles: 137 (34.25% of the AiE’s 400 tiles). On-chip memory per tile is 32 KB DMEM (16 KB per input buffer), mandating careful ping-pong buffering to overlap computation and streaming. AiE↔PL interfaces support up to 234 streams, with FFT2 heavily utilizing 128.

Resource allocation is summarized as:

Resource	Total Avail.	FAM Usage
PL Registers	1,739,432	113,686 (6.6%)
PL LUTs	860,336	107,601 (12.7%)
LUT-as-MEM	446,367	960 (0.2%)
BRAM	933	37 (4.0%)
URAM	463	0 (0.0%)
AiE Tiles	400	137 (34.3%)
AiE↔PL IO Streams	234	130 (55.6%)

The Framing and Demodulate stages are primarily memory-bound due to windowing and repeated FFT_NP usage, while FFT2 is compute-bound (owing to conjugate multiplication and P-point FFT instantiations). The AiE array operates at 1 GHz and the programmable logic (PL) at 312.5 MHz (Li et al., 22 Jun 2025).

5. Performance Metrics and Comparative Analysis

Execution time and efficiency measured across CPU (Xeon), GPU (RTX 3090), and Versal VCK5000 are as follows:

Platform	FAM Time	FAM Speedup	SSCA Time	SSCA Speedup
CPU (Xeon)	0.194 s	1×	11.3 s	1×
GPU (3090)	2.791 ms	69.5×	217 ms	52.1×
VCK5000	0.630 ms	307.9×	114 ms	99.1×

FAM throughput is 3.25 MS/s on the Versal AiE, with a sustained performance of 189 GFLOP/s (8.6% of theoretical peak), primarily limited by memory movement in the Framing and Demodulate stages. Dynamic power on the VCK5000 is 17 W (23 W idle, 40 W total); on the RTX 3090 it is 117 W (33 W idle, 150 W total). Combining dynamic power with execution time, FAM energy efficiency is 30.5× higher on the Versal AiE than on the RTX 3090 for equivalent accuracy (Li et al., 22 Jun 2025).
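The derived figures can be reproduced from the reported times and dynamic power with a few lines of arithmetic. The Python sketch below does so; it assumes the throughput figure refers to $N = 2048$ input samples per FAM run, an inference that happens to reproduce the quoted 3.25 MS/s but is not stated explicitly in the source.

```python
# Reported FAM execution times (seconds) and dynamic power (watts).
fam_time_s = {"cpu": 0.194, "gpu": 2.791e-3, "vck5000": 0.630e-3}
dynamic_power_w = {"gpu": 117.0, "vck5000": 17.0}
N_SAMPLES = 2048  # assumption: samples processed per FAM run


def speedup(platform):
    """Speedup of a platform relative to the CPU baseline."""
    return fam_time_s["cpu"] / fam_time_s[platform]


def dynamic_energy_j(platform):
    """Dynamic energy per FAM run: power multiplied by execution time."""
    return dynamic_power_w[platform] * fam_time_s[platform]


gpu_speedup = speedup("gpu")
vck_speedup = speedup("vck5000")
vck_over_gpu = fam_time_s["gpu"] / fam_time_s["vck5000"]
throughput_msps = N_SAMPLES / fam_time_s["vck5000"] / 1e6
energy_gain = dynamic_energy_j("gpu") / dynamic_energy_j("vck5000")

print(f"GPU speedup over CPU:      {gpu_speedup:.1f}x")
print(f"VCK5000 speedup over CPU:  {vck_speedup:.1f}x")
print(f"VCK5000 speedup over GPU:  {vck_over_gpu:.2f}x")
print(f"VCK5000 throughput:        {throughput_msps:.2f} MS/s")
print(f"Energy gain over GPU:      {energy_gain:.1f}x")
```

The energy figure uses dynamic power only; including idle power (40 W versus 150 W total) would give a different ratio.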

6. Portability and Design Lessons

Several methodological insights enable effective porting of FAM to other hardware platforms:

- A pure-AiE implementation eliminates PL⇔AiE data-bus bottlenecks, but requires careful management of the 16 KB DMEM buffers and the ping-pong scheme so that streaming and compute overlap.
- Tile count is parameterized by a closed-form expression, $\mathbb{A}_{FAM} = 1 + \lceil 4N/(2F)\rceil + (\lceil 4N/(2F) \rceil + \lceil 4N/F \rceil) + \min(N_P, 128)$, enabling scalable design as $N$, $N_P$, and $F$ change.
- Dedicated mapping of conjugate multiplication and FFT processing pairs to separate tiles in FFT2 exploits the AiE's parallel multiply-accumulate (MAC) architecture and local DMEM.
- The programmable logic (PL) is reserved for bulk DDR transactions, double-buffered data transposes, and intermediate storage when $N \times N_P$ exceeds on-chip capacity.
- This design methodology (data segmentation, statically partitioned kernels, and hierarchical buffering) is generalizable to other FPGA or AI-Engine-class platforms by adapting buffer and stream parameters.
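The closed-form tile count can be evaluated directly. The Python sketch below returns the four terms of the expression in the order they appear, grouped as the parentheses suggest; which stage each term corresponds to is not spelled out here, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def fam_tile_terms(n, n_p, f):
    """Terms of A_FAM = 1 + ceil(4N/2F) + (ceil(4N/2F) + ceil(4N/F)) + min(N_P, 128)."""
    half = ceil_div(4 * n, 2 * f)
    full = ceil_div(4 * n, f)
    return (1, half, half + full, min(n_p, 128))


def fam_tile_count(n, n_p, f):
    """Total AiE tiles predicted by the closed-form expression."""
    return sum(fam_tile_terms(n, n_p, f))


terms = fam_tile_terms(2048, 256, 2048)
total = fam_tile_count(2048, 256, 2048)
print(terms, total, f"{100 * total / 400:.2f}%")  # (1, 2, 6, 128) 137 34.25%
```

For the reported configuration this reproduces the 137 tiles (34.25% of 400) quoted for the VCK5000 mapping.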

A plausible implication is that as buffer and stream bandwidths improve, FAM realizations may scale to even higher throughputs and larger SCD problem sizes (Li et al., 22 Jun 2025).

7. Context and Significance

The effective real-time estimation of SCD via FAM, as realized on the Versal AiE, provides a substantial improvement in practical cyclostationary analysis for complex signals. By achieving over 4× speedup and 30× greater energy efficiency compared to state-of-the-art GPU implementations, this hardware-optimized methodology both demonstrates scaling advantages of AI-engine arrays and establishes a methodological template for efficient SCD estimators in emerging FPGA and signal processing platforms (Li et al., 22 Jun 2025).

References (1)

Li et al. (2025). AMD Versal Implementations of FAM and SSCA Estimators.
