Griffin-Lim Phase Estimation
- The Griffin-Lim Phase Estimation Algorithm is an iterative method that reconstructs time-domain signals from magnitude-only STFT representations using alternating projections.
- Accelerated variants such as FGLA and AGLA reduce the required iteration count by factors of up to 3.5 while preserving or improving signal quality in audio applications.
- Real-time implementations and deep learning integrations have extended its use in speech synthesis, audio restoration, and music enhancement.
The Griffin-Lim Phase Estimation Algorithm is a foundational iterative scheme for reconstructing time-domain signals from magnitude-only short-time Fourier transform (STFT) representations. It has underpinned developments across speech synthesis, audio restoration, music enhancement, and phase retrieval, with a range of improvements and adaptations utilizing its alternating projection principle.
1. Mathematical Formulation of Phase Retrieval and the Griffin-Lim Algorithm
The core phase-retrieval problem is: given a magnitude spectrogram $S \in \mathbb{R}_{\ge 0}^{M \times N}$, find a time-domain signal $x$ such that $|\mathcal{G}x| = S$, where $\mathcal{G}$ denotes the STFT operator. This is typically cast as a nonconvex least-squares minimization:

$$\min_{x} \big\| \, |\mathcal{G}x| - S \, \big\|_F^2,$$

or equivalently in STFT space, by seeking a matrix $X$ that is consistent with some time-domain signal and has the prescribed magnitudes.
Griffin-Lim proceeds by alternately projecting iterates onto two constraint sets:
- Magnitude set $\mathcal{C}_1$: all STFT matrices $X$ with $|X| = S$.
- Consistency set $\mathcal{C}_2$: STFTs of real signals, i.e., $X = \mathcal{G}x$ for some $x$.

The projection operators are

$$P_{\mathcal{C}_1}(X) = S \odot \frac{X}{|X|}, \qquad P_{\mathcal{C}_2}(X) = \mathcal{G}\,\mathcal{G}^{\dagger} X,$$

where $\mathcal{G}^{\dagger}$ denotes the (pseudo)inverse STFT and $\odot$ elementwise multiplication. The standard iteration is

$$X^{(n+1)} = P_{\mathcal{C}_2}\big(P_{\mathcal{C}_1}(X^{(n)})\big),$$

with arbitrary initial phase, and the time-domain estimate obtained via $x^{(n)} = \mathcal{G}^{\dagger} X^{(n)}$ (Sharma et al., 2020, Nenov et al., 2023a, 2023b, Liu et al., 2024).
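The alternating projections above can be sketched in a few lines. The sketch below uses SciPy's `stft`/`istft` as the analysis and synthesis operators; the function name, window settings, and iteration default are illustrative choices, not taken from the cited papers:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(S, n_iter=60, nperseg=256, noverlap=192, seed=0):
    """Reconstruct a time-domain signal from a magnitude spectrogram S.

    S is assumed to come from scipy.signal.stft with the same
    nperseg/noverlap settings, so that array shapes stay consistent.
    """
    rng = np.random.default_rng(seed)
    # Arbitrary (here: random) initial phase.
    X = S * np.exp(2j * np.pi * rng.random(S.shape))
    for _ in range(n_iter):
        # Magnitude projection: keep the current phase, impose |X| = S.
        X = S * np.exp(1j * np.angle(X))
        # Consistency projection: round-trip through the signal domain.
        _, x = istft(X, nperseg=nperseg, noverlap=noverlap)
        _, _, X = stft(x, nperseg=nperseg, noverlap=noverlap)
    _, x = istft(S * np.exp(1j * np.angle(X)), nperseg=nperseg, noverlap=noverlap)
    return x
```

Using `np.angle` rather than dividing by `|X|` avoids division by zero in empty time-frequency bins.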
2. Algorithm Variants: Acceleration and Modern Extensions
Multiple variants improve convergence speed and quality over classical GLA:
- Fast Griffin–Lim (FGLA) introduces inertial momentum:

$$X^{(n+1)} = P_{\mathcal{C}_2}\big(P_{\mathcal{C}_1}(Y^{(n)})\big), \qquad Y^{(n+1)} = X^{(n+1)} + \alpha\big(X^{(n+1)} - X^{(n)}\big),$$

with momentum parameter $\alpha \in [0,1)$. Empirically, FGLA achieves the same fidelity in roughly half the iterations of GLA (Sharma et al., 2020, Nenov et al., 2023a, 2023b).
- Accelerated Griffin–Lim (AGLA) generalizes FGLA by adding secondary inertial sequences and a relaxation parameter $\lambda$ for the projection step; the demonstrated parameter choices yield 3–4× speedups and lower final error (Nenov et al., 2023a, 2023b).
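The FGLA update with inertial momentum can be sketched as follows, reusing SciPy-based projections; the function names and the default $\alpha = 0.99$ are illustrative choices:

```python
import numpy as np
from scipy.signal import stft, istft

def proj_consistency(X, nperseg=256, noverlap=192):
    """Consistency projection: round-trip iSTFT -> STFT."""
    _, x = istft(X, nperseg=nperseg, noverlap=noverlap)
    _, _, Xc = stft(x, nperseg=nperseg, noverlap=noverlap)
    return Xc

def fast_griffin_lim(S, n_iter=30, alpha=0.99, seed=0):
    """FGLA: Griffin-Lim with an inertial (momentum) extrapolation step."""
    rng = np.random.default_rng(seed)
    X = proj_consistency(S * np.exp(2j * np.pi * rng.random(S.shape)))
    Y = X.copy()
    for _ in range(n_iter):
        # Standard GLA step, applied to the extrapolated point Y.
        X_new = proj_consistency(S * np.exp(1j * np.angle(Y)))
        # Momentum: extrapolate past the new iterate.
        Y = X_new + alpha * (X_new - X)
        X = X_new
    _, x = istft(S * np.exp(1j * np.angle(X)), nperseg=256, noverlap=192)
    return x
```

Unlike plain GLA, the extrapolated iterate $Y$ need not lie in either constraint set; only the projected iterate $X$ is used for the final reconstruction.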
Other advanced algorithms using the same projection primitives include:
- Relaxed Averaged Alternating Reflections (RAAR)
- Difference Map (DM)

These methods, borrowed from optics, escape poor nonconvex local minima more reliably than plain alternating projections and are often used in early iterations before switching to FGLA for rapid final convergence (Peer et al., 2022).
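For reference, one common parameterization of the RAAR update, writing $P_{\mathcal{C}_1}$, $P_{\mathcal{C}_2}$ for the magnitude and consistency projections (this follows the standard optics formulation and is an assumption, not quoted from Peer et al., 2022):

$$X^{(n+1)} = \frac{\beta}{2}\Big(R_{\mathcal{C}_2} R_{\mathcal{C}_1} + I\Big) X^{(n)} + (1-\beta)\, P_{\mathcal{C}_1} X^{(n)}, \qquad R_{\mathcal{C}_i} = 2P_{\mathcal{C}_i} - I,$$

where $\beta \in (0,1]$ interpolates between averaged alternating reflections ($\beta = 1$) and the plain magnitude projection.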
Recent architectures such as Deep Griffin-Lim Iteration (DeGLI) interleave GLA layers with a residual DNN, producing high-quality phase estimates with an order of magnitude fewer iterations (Masuyama et al., 2019).
3. Online, Real-Time, and Application-Specific Griffin-Lim Variants
Originally defined as an offline (entire spectrogram) method, GLA and its variants have been adapted for real-time applications:
- RTISI (Real-Time Iterative Spectrogram Inversion) computes projections frame by frame, enforcing amplitude and STFT-consistency constraints over buffers of still-modifiable ("fluid") and committed ("frozen") frames (Peer et al., 2023).
- Flexible online frameworks permit FGLA, AGLA, RAAR, DM and similar strategies to be deployed in streaming speech synthesis and enhancement.
For text-to-speech, FGLA has demonstrated strong reductions in synthesis delay and improved MOS scores relative to GLA and GAN-based vocoders (Sharma et al., 2020). GLA also serves as a core component in neural audio diffusion pipelines, with GLA-Grad correcting intermediate samples to maintain spectral consistency with the mel-spectrogram conditioning (Liu et al., 2024).
4. Error Analysis, Convergence, and Empirical Performance
GLA monotonically decreases the magnitude-error objective and converges to a fixed point, but guarantees only local optimality. Both FGLA and AGLA exhibit empirical speedups: FGLA typically delivers a 2× acceleration, while AGLA attains up to a 3.5× speedup and a higher final SSNR (Nenov et al., 2023a, 2023b).
Per iteration, all projection-based schemes have similar computational complexity, dominated by the STFT and iSTFT ($\mathcal{O}(MN \log N)$ for $M$ frames of length $N$). Advanced variants (RAAR, DM) incur slightly higher per-iteration cost due to multiple projections.
Empirical comparisons:
| Algorithm | Iterations to 20 dB SSNR | Final SSNR (1000 iterations) |
|---|---|---|
| GLA | ~300 | ~22 dB |
| FGLA | ~150 | ~24 dB |
| AGLA | ~80 | 27–30 dB |
Further, hybrid DM→FGLA approaches have been shown to produce the highest speech quality (PESQ ≈ 3.73) in fewer iterations (Peer et al., 2022).
5. Generalizations: Partial Phase Knowledge, Multiple Sources, and Alternative Objectives
GLA readily handles partial phase knowledge via a phase-inpainting scheme: at each projection, known phases are enforced while unknowns are iteratively updated (Krémé et al., 2018).
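The phase-inpainting step can be sketched as a masked variant of the magnitude projection; the function and argument names below are ours, and the exact update in (Krémé et al., 2018) may differ in detail:

```python
import numpy as np

def project_partial_phase(X, S, known_phase, known_mask):
    """Magnitude projection with phase inpainting (illustrative sketch):
    impose |X| = S everywhere; where the phase is known (known_mask is
    True) enforce it, elsewhere keep the current iterate's phase.
    """
    phase = np.where(known_mask, known_phase, np.angle(X))
    return S * np.exp(1j * phase)
```

Substituting this operator for $P_{\mathcal{C}_1}$ in any of the projection-based iterations yields the phase-inpainting variant.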
More generally, the multi-source Griffin–Lim algorithm (MSGLA) incorporates geometric constraints (laws of sines/cosines) for speech enhancement under noise. By alternating between GLA-style consistency enforcement and geometric phase updates (using noise magnitude or phase), MSGLA resolves ambiguity and achieves improved noise suppression and SI-SNR over classical GLA and DNN predictors (Ho et al., 2025).
Convex relaxation of the phase retrieval problem—e.g., STliFT—casts magnitude consistency as a trace-minimization SDP. This approach achieves higher exact-recovery rates in the noiseless regime, but is computationally demanding for long signals; segmentation and sparse structure exploitation are recommended (Sun et al., 2012).
6. Integration with Neural and GAN-Based Systems
Phase-aware music super-resolution exploits modified GLA to jointly employ GAN-predicted high-frequency magnitudes and preserved low-frequency magnitude/phase information. The iterative projection constrains the reconstructed waveform to be spectrally consistent, outperforming naive phase-flip and yielding sharper transients in audio (Hu et al., 2020).
GLA and its acceleration schemes have been integrated into deep learning models (DeGLI, GLA-Grad, hybrid schemes), offering a tradeoff between iteration count and computational cost, with quality superior to equivalent classical iterations (Masuyama et al., 2019, Liu et al., 2024).
7. Practical Guidelines, Limitations, and Tuning
Recommended hyperparameters:
- Typical iteration counts: GLA ≈ 60 and FGLA ≈ 30 for ~20 kHz audio; fewer for lower-dimensional inputs.
- Momentum/relaxation coefficients: an FGLA momentum of $\alpha \approx 0.9$–$0.99$ for maximum acceleration; the AGLA parameter schedules of Nenov et al. perform best empirically.
- Initialization: zero-phase, random, or noisy phase; PGHI warm-starts can improve convergence.
- Stopping: a fixed iteration budget, or a threshold on the relative change of the magnitude error.
- Limitations: GLA/variants can stall in poor minima, especially with large hop sizes or low redundancy. In optics/speech, DM/RAAR or SDP-based methods sometimes outperform GLA in these cases.
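The stopping rule from the guidelines above can be sketched as a generic driver loop; the function name and tolerance default are illustrative:

```python
import numpy as np

def run_with_stopping(step, X0, S, max_iter=200, tol=1e-4):
    """Iterate `step` (any projection-based update: GLA, FGLA, ...) until
    the relative change of the normalized magnitude error falls below
    `tol`, or the iteration budget is exhausted.
    Returns the final iterate and the number of iterations performed.
    """
    X, err_prev = X0, np.inf
    for n in range(max_iter):
        X = step(X)
        # Normalized magnitude error against the target spectrogram S.
        err = np.linalg.norm(np.abs(X) - S) / np.linalg.norm(S)
        if abs(err_prev - err) < tol * max(err, 1e-12):
            return X, n
        err_prev = err
    return X, max_iter
```

A fixed budget alone suffices for real-time use; the relative-change test mainly saves iterations in offline reconstruction.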
A flexible online framework permits real-time implementation of any projection-based variant with similar per-frame complexity (Peer et al., 2023).
The Griffin-Lim algorithm and its extensions remain central in phase retrieval, serving both as reliable baselines and as the projection component in state-of-the-art neural architectures and real-time vocoders for audio, speech, and music signal reconstruction (Sharma et al., 2020, Peer et al., 2022, Nenov et al., 2023a, 2023b, Ho et al., 2025, Masuyama et al., 2019, Hu et al., 2020, Liu et al., 2024, Peer et al., 2023, Sun et al., 2012, Krémé et al., 2018).