FFT Accumulation Method for SCD Estimation

Updated 14 February 2026
  • FAM is a technique that estimates spectral correlation density in cyclostationary signals using efficient FFT operations and optimized algorithmic stages.
  • It decomposes the processing into framing, demodulation, and FFT2 stages, enabling real-time signal analysis and hardware acceleration.
  • Recent implementations on AMD Versal AiE demonstrate notable speed, energy efficiency, and scalability compared to traditional CPU and GPU approaches.

The FFT Accumulation Method (FAM) is a practical approach for estimating the spectral correlation density (SCD) in cyclostationary signal analysis, using efficient implementations based on the fast Fourier transform (FFT). FAM is extensively applied in real-time signal processing scenarios that require cyclostationary analysis, such as the characterization and detection of human-made signals. Due to the high computational complexity of SCD estimation—even with FFT-based techniques—hardware mapping and algorithmic optimization are essential for feasible real-time processing. Recent work details a high-speed, parallel hardware realization of FAM running on the AMD Versal AI Engine (AiE) array, providing comprehensive insights into algorithmic structure, resource utilization, and comparative efficiency relative to CPUs and GPUs (Li et al., 22 Jun 2025).

1. Mathematical Formulation of FAM

FAM targets the estimation of the spectral correlation density:

$$S_x^{\alpha}(f) = \lim_{M\to\infty}\;\frac{1}{M}\sum_{m=0}^{M-1} X(f+\tfrac{\alpha}{2},m)\,X^*(f-\tfrac{\alpha}{2},m)$$

where $X(f,m)$ is the short-time Fourier transform (STFT) or “complex demodulate” of the discrete input signal $x(n)$ at block $m$. In FAM, the complex demodulates are generated through decimation, windowing, and an $N_P$-point FFT per block:

$$X_T(pL, f_m) = \sum_{k=0}^{N_P-1} a(d-k)\,x(pL-d+k)\, e^{-j2\pi km/N_P}\, e^{-j2\pi m pL/N_P}$$

for $p = 0, \dots, P-1$, where the frame stride is $L = N_P/4$, $a(\cdot)$ denotes the analysis window, and $P = 4N/N_P$ is the total number of frames.
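The defining average can be approximated with a finite number of blocks $M$. The following is a minimal NumPy sketch of that idea, not the FAM algorithm itself. It assumes non-overlapping rectangular blocks, a cycle frequency that is an even number of FFT bins (`alpha = 2 * shift * fs / n_fft`), circular bin wrap-around, and no amplitude scaling. The function name `cyclic_periodogram` and its parameters are illustrative only.

```python
import numpy as np

def cyclic_periodogram(x, n_fft, shift):
    """Finite-M average of X(f + alpha/2, m) * conj(X(f - alpha/2, m)).

    `shift` is half the cycle frequency expressed in FFT bins.
    Assumptions: rectangular, non-overlapping blocks; circular bin indexing.
    """
    x = np.asarray(x, dtype=complex)
    m_blocks = len(x) // n_fft                      # M
    blocks = x[: m_blocks * n_fft].reshape(m_blocks, n_fft)
    spec = np.fft.fft(blocks, axis=1)               # X(f, m), one row per block
    upper = np.roll(spec, -shift, axis=1)           # X(f + alpha/2, m)
    lower = np.roll(spec, shift, axis=1)            # X(f - alpha/2, m)
    return np.mean(upper * np.conj(lower), axis=0)  # (1/M) * sum over m

rng = np.random.default_rng(0)
noise = rng.standard_normal(1024)
s_zero = cyclic_periodogram(noise, n_fft=64, shift=0)
print(s_zero.shape)
```

With `shift=0` the product reduces to $|X(f,m)|^2$, so the result is a real, non-negative averaged periodogram, which is a convenient sanity check before moving to non-zero cycle frequencies.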

By substituting the complex demodulates into the SCD definition and specializing to the autocorrelation case, the FAM estimate is expressed as:

$$S_x^{\alpha_{kl} + q\Delta\alpha}(pL, f_{kl})_{\Delta t} = \sum_{r=0}^{P-1} X_T(rL, f_k)\, X_T^*(rL, f_l)\, g_d(p-r)\, e^{-j2\pi rq/P}$$

where $f_{kl} = (f_k + f_l)/2$ is the spectral frequency of the channel pair, $\alpha_{kl} = f_k - f_l$ is its cycle frequency, $\Delta\alpha = f_s/P$, and $q = \Delta f/\Delta\alpha$. The estimate converges to the true SCD in the limit $M \to \infty$ of the defining average (Li et al., 22 Jun 2025).

2. Algorithmic Decomposition and Computational Stages

FAM implementation decomposes into three primary stages:

  1. Framing (Data Segmentation):
    • The normalized input sequence $x(n)$ of length $N$ is partitioned into $P$ overlapping blocks, each of size $N_P$, with stride $L = N_P/4$.
    • Output: Decimated matrix $X_{\rm De} \in \mathbb{R}^{N_P \times P}$.
  2. Demodulate (Windowing, Down-conversion, FFT):
    • Each block is windowed (e.g., with a Chebyshev window), phase-shifted for down-conversion, and transformed via an $N_P$-point FFT to generate the demodulate output $X_T(f_m)$.
  3. FFT2 (Cross-Multiplication and Accumulation):
    • All center-frequency bin pairs $(k, l)$ are enumerated.
    • Each pair's sample-wise products $X_T(:, f_k)\, X_T^*(:, f_l)$ form a sequence of length $P$.
    • A $P$-point FFT is performed on this sequence; squared magnitudes populate the SCD matrix indexed by cycle frequency $\alpha$ and spectral frequency $f$.
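To give a feel for why the third stage dominates the arithmetic, the sketch below counts its operations for the configuration used later in the text ($N = 2048$, $N_P = 256$, hence $P = 4N/N_P = 32$). It assumes that all $N_P \times N_P$ channel pairs are processed and uses the textbook radix-2 estimate of $(P/2)\log_2 P$ butterflies per FFT. The helper `fft2_workload` is illustrative, and real kernels may count differently.

```python
import math

def fft2_workload(n_p, p):
    """Rough operation counts for the FFT2 stage over all channel pairs."""
    pairs = n_p * n_p                      # every (k, l) combination
    products = pairs * p                   # one complex multiply per frame
    # Textbook radix-2 estimate; assumes P is a power of two.
    butterflies = pairs * (p // 2) * int(math.log2(p))
    outputs = pairs * p                    # squared magnitudes written out
    return {"pairs": pairs, "products": products,
            "butterflies": butterflies, "outputs": outputs}

workload = fft2_workload(n_p=256, p=32)
print(workload)
# {'pairs': 65536, 'products': 2097152, 'butterflies': 5242880, 'outputs': 2097152}
```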

3. FAM Pipeline Pseudocode for AiE

A representative pseudocode abstraction (Algorithm 1) for a Versal AiE implementation is:

Function Framing(data_in):
    data_norm ← Normalise(data_in)
    for k = 0 … P–1 do
        X(:,k) ← data_norm[k·L : k·L+N_P–1]      // frame k, stride L = N_P/4
    return X

Function Demodulate(X):
    for k = 0 … P–1 do
        win ← X(:,k) ∘ chebwin(N_P)              // element-wise windowing
        fftout ← FFT_NP(win)
        for m = 0 … N_P–1 do
            Y(m,k) ← fftout(m) × e^{–j2πmkL/N_P}  // down-conversion of bin m
    return Y

Function FFT2(Y):
    for m = 0 … N_P–1 do
        for n = 0 … N_P–1 do
            z ← Y(m,:) ∘ conj(Y(n,:))            // length-P product across frames
            Zfft ← FFT_P(z)
            store |Zfft|^2 into output window

This sequence matches the Framing → Demodulate → FFT2 pipeline reported for the Versal AiE, in which tiles are statically partitioned to each function (Li et al., 22 Jun 2025).
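The three functions can be prototyped on a host CPU before any hardware mapping. The following NumPy sketch follows the structure of the pseudocode under several assumptions that are not taken from the source: peak normalisation, zero-padding of the signal tail so that all $P = 4N/N_P$ frames are full length, and a Hann window as a stand-in for the Chebyshev window. Function names are illustrative and no AiE-specific tiling is modelled.

```python
import numpy as np

def framing(x, n_p):
    """Return an (N_P, P) matrix of overlapping frames with stride N_P/4."""
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak                          # assumed normalisation: peak scaling
    stride = n_p // 4
    p = 4 * len(x) // n_p
    # Assumption: zero-pad the tail so that the last frames are full length.
    pad = (p - 1) * stride + n_p - len(x)
    x = np.concatenate([x, np.zeros(pad)])
    return np.stack([x[k * stride: k * stride + n_p] for k in range(p)], axis=1)

def demodulate(frames, window=None):
    """Window each frame, take an N_P-point FFT, then down-convert each bin."""
    n_p, p = frames.shape
    stride = n_p // 4
    win = np.hanning(n_p) if window is None else window   # stand-in for chebwin
    spec = np.fft.fft(frames * win[:, None], axis=0)
    m = np.arange(n_p)[:, None]               # frequency-bin index
    k = np.arange(p)[None, :]                 # frame index
    return spec * np.exp(-2j * np.pi * m * k * stride / n_p)

def fft2_stage(y):
    """Squared magnitude of the P-point FFT of every channel-pair product."""
    n_p, p = y.shape
    scd = np.empty((n_p, n_p, p))
    for m in range(n_p):
        prod = y[m][None, :] * np.conj(y)     # row m times conj of every row n
        scd[m] = np.abs(np.fft.fft(prod, axis=1)) ** 2
    return scd

rng = np.random.default_rng(0)
scd = fft2_stage(demodulate(framing(rng.standard_normal(256), n_p=32)))
print(scd.shape)   # (32, 32, 32)
```

The output is indexed as `scd[m, n, q]`, one $P$-point spectrum per channel pair. Mapping these indices to physical $(f, \alpha)$ coordinates is left out of the sketch.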

4. Dataflow Mapping and Hardware Resource Utilization

FAM’s mapping to the AMD Versal VCK5000 AiE array leverages statically partitioned processing tiles for each algorithmic stage. For $N=2048$, $N_P=256$, and $F=2048$:

  • Framing: 1 normalization tile + 4 channel-router tiles
  • Demodulate: 4 Fam_stage1 tiles + 2 Conv_stage1 tiles
  • FFT2: 128 Fam_stage2 tiles (handling two frequency channels each)

Total tiles: 137 (34.25% of the AiE’s 400 tiles). On-chip memory per tile is 32 KB DMEM (16 KB per input buffer), mandating careful ping-pong buffering to overlap computation and streaming. AiE↔PL interfaces support up to 234 streams, with FFT2 heavily utilizing 128.
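The buffer budget above can be turned into a quick capacity estimate. The sketch below is simple arithmetic on the figures quoted in the text. The sample widths (8 bytes for a complex pair of 32-bit floats, 4 bytes for a complex pair of 16-bit integers) are assumptions for illustration, since the source does not state the data type used.

```python
def samples_per_buffer(buffer_bytes=16 * 1024, bytes_per_sample=8):
    """Complex samples that fit in one ping-pong input buffer."""
    return buffer_bytes // bytes_per_sample

tile_share = 137 / 400
print(f"{tile_share:.2%}")                       # 34.25%
print(samples_per_buffer())                      # 2048 (assumed 8-byte samples)
print(samples_per_buffer(bytes_per_sample=4))    # 4096 (assumed 4-byte samples)
```

Under the 8-byte assumption a single 16 KB buffer holds 2048 complex samples, so larger working sets must be streamed through the ping-pong pair in pieces.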

Resource allocation is summarized as:

| Resource | Total available | FAM usage |
|---|---|---|
| PL Registers | 1,739,432 | 113,686 (6.6%) |
| PL LUTs | 860,336 | 107,601 (12.7%) |
| LUT-as-MEM | 446,367 | 960 (0.2%) |
| BRAM | 933 | 37 (4.0%) |
| URAM | 463 | 0 (0.0%) |
| AiE Tiles | 400 | 137 (34.3%) |
| AiE↔PL IO Streams | 234 | 130 |

The Framing and Demodulate stages are primarily memory-bound due to windowing and repeated FFT_NP usage, while FFT2 is compute-bound (owing to conjugate multiplication and P-point FFT instantiations). The AiE array operates at 1 GHz; programmable logic (PL) engines at 312.5 MHz (Li et al., 22 Jun 2025).

5. Performance Metrics and Comparative Analysis

Execution time and efficiency measured across CPU (Xeon), GPU (RTX 3090), and Versal VCK5000 are as follows:

| Platform | FAM time | FAM speedup | SSCA time | SSCA speedup |
|---|---|---|---|---|
| CPU (Xeon) | 0.194 s | baseline | 11.3 s | baseline |
| GPU (3090) | 2.791 ms | 69.5× | 217 ms | 52.1× |
| VCK5000 | 0.630 ms | 307.9× | 114 ms | 99.1× |

FAM throughput is 3.25 MS/s on the Versal AiE, with a sustained performance of 189 GFLOP/s (8.6% of theoretical peak), primarily limited by memory movement in the Framing and Demodulate stages. Dynamic power on the VCK5000 is 17 W (idle 23 W, totaling 40 W); for the RTX 3090, dynamic power is 117 W (idle 33 W, totaling 150 W). The resulting energy efficiency for FAM execution is 30.5× higher on the Versal AiE than on the RTX 3090 for equivalent accuracy, a figure that matches the ratio of dynamic power multiplied by execution time on the two platforms (Li et al., 22 Jun 2025).
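The headline ratios can be re-derived from the reported times and power figures. The sketch below does this arithmetic. It assumes that one FAM run processes $N = 2048$ input samples (the configuration given in Section 4) and that energy is computed from dynamic power only, which is an inference from the numbers rather than something the source states.

```python
def throughput_msps(n_samples, seconds):
    """Input samples processed per second, in millions."""
    return n_samples / seconds / 1e6

def energy_mj(power_w, seconds):
    """Energy in millijoules for a run of the given duration."""
    return power_w * seconds * 1e3

t_cpu, t_gpu, t_aie = 0.194, 2.791e-3, 0.630e-3      # FAM times in seconds

print(round(t_cpu / t_gpu, 1), round(t_cpu / t_aie, 1))   # 69.5 307.9
print(round(t_gpu / t_aie, 1))                            # 4.4
print(round(throughput_msps(2048, t_aie), 2))             # 3.25

dynamic_ratio = energy_mj(117, t_gpu) / energy_mj(17, t_aie)
total_ratio = energy_mj(150, t_gpu) / energy_mj(40, t_aie)
print(round(dynamic_ratio, 1), round(total_ratio, 1))     # 30.5 16.6
```

The 30.5× figure is reproduced with dynamic power. Including idle power gives a smaller ratio of about 16.6×, so comparisons should state which power figure is used.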

6. Portability and Design Lessons

Several methodological insights enable effective porting of FAM to other hardware platforms:

  • A pure AiE implementation eliminates PL↔AiE data-bus bottlenecks, but requires careful management of the 16 KB DMEM buffers and the ping-pong scheme so that streaming and computation overlap.
  • Tile count is parameterized by a closed-form expression, $\mathbb{A}_{FAM} = 1 + \lceil 4N/(2F)\rceil + (\lceil 4N/(2F) \rceil + \lceil 4N/F \rceil) + \min(N_P, 128)$, enabling scalable design as $N$ and $N_P$ change.
  • Dedicated mapping of conjugate multiplication and FFT processing pairs to separate tiles in FFT2 exploits the AiE's parallel multiply-accumulate (MAC) architecture and local DMEM.
  • The programmable logic (PL) is reserved for bulk DDR transactions, double-buffered data transposes, and intermediate storage when $N \times N_P$ exceeds on-chip capacity.
  • This design methodology (data segmentation, statically partitioned kernels, and hierarchical buffering) is generalizable to other FPGA or AI-Engine-class platforms by adapting buffer and stream parameters.
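The closed-form tile count can be evaluated directly. The sketch below is a plain transcription of the expression into Python. The function name `fam_tile_count` is illustrative, and no claim is made about which term corresponds to which kernel type.

```python
import math

def fam_tile_count(n, f, n_p):
    """Evaluate 1 + ceil(4N/2F) + (ceil(4N/2F) + ceil(4N/F)) + min(N_P, 128)."""
    half_term = math.ceil(4 * n / (2 * f))
    full_term = math.ceil(4 * n / f)
    return 1 + half_term + (half_term + full_term) + min(n_p, 128)

print(fam_tile_count(n=2048, f=2048, n_p=256))   # 137
print(fam_tile_count(n=4096, f=2048, n_p=256))   # 145
```

For $N = 2048$, $N_P = 256$, and $F = 2048$ the expression gives 137 tiles, in agreement with the reported total. Doubling $N$ adds only eight tiles, because the $\min(N_P, 128)$ term is already saturated.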

A plausible implication is that as buffer and stream bandwidths improve, FAM realizations may scale to even higher throughputs and larger SCD problem sizes (Li et al., 22 Jun 2025).

7. Context and Significance

The effective real-time estimation of SCD via FAM, as realized on the Versal AiE, provides a substantial improvement in practical cyclostationary analysis for complex signals. By achieving over 4× speedup and 30× greater energy efficiency compared to state-of-the-art GPU implementations, this hardware-optimized methodology both demonstrates scaling advantages of AI-engine arrays and establishes a methodological template for efficient SCD estimators in emerging FPGA and signal processing platforms (Li et al., 22 Jun 2025).
