Griffin-Lim Phase Estimation
- The Griffin-Lim Phase Estimation Algorithm is an iterative method that reconstructs time-domain signals from magnitude-only STFT representations using alternating projections.
- Accelerated variants such as FGLA and AGLA reduce the required iteration count by factors of up to 3.5 while preserving or improving signal quality in audio applications.
- Real-time implementations and deep learning integrations have extended its use in speech synthesis, audio restoration, and music enhancement.
The Griffin-Lim Phase Estimation Algorithm is a foundational iterative scheme for reconstructing time-domain signals from magnitude-only short-time Fourier transform (STFT) representations. It has underpinned developments across speech synthesis, audio restoration, music enhancement, and phase retrieval, with a range of improvements and adaptations utilizing its alternating projection principle.
1. Mathematical Formulation of Phase Retrieval and the Griffin-Lim Algorithm
The core phase-retrieval problem is: given a magnitude spectrogram $S \in \mathbb{R}_{\ge 0}^{M \times N}$, find a time-domain signal $x$ such that $|\mathcal{G}x| = S$, where $\mathcal{G}$ denotes the STFT operator. This is typically cast as a nonconvex least-squares minimization:

$$\min_{x} \big\| \, |\mathcal{G}x| - S \, \big\|_F^2,$$

or equivalently in STFT space, by seeking a matrix $X$ that is consistent with some time-domain signal and has the prescribed magnitudes.
Griffin-Lim proceeds by alternately projecting iterates onto two constraint sets:
- Magnitude set $\mathcal{C}_1$: all STFT matrices $X$ with $|X| = S$.
- Consistency set $\mathcal{C}_2$: STFTs of real signals, i.e., $X = \mathcal{G}x$ for some $x$.

The projection operators are

$$P_{\mathcal{C}_1}(X) = S \odot \frac{X}{|X|}, \qquad P_{\mathcal{C}_2}(X) = \mathcal{G}\,\mathcal{G}^{\dagger} X,$$

where $\mathcal{G}^{\dagger}$ denotes the (pseudo)inverse STFT and $\odot$ elementwise multiplication. The standard iteration is

$$X^{(n+1)} = P_{\mathcal{C}_2}\big(P_{\mathcal{C}_1}(X^{(n)})\big),$$

with arbitrary initial phase, and the time-domain estimate obtained via $x^{(n)} = \mathcal{G}^{\dagger} X^{(n)}$ (Sharma et al., 2020, Nenov et al., 2023a, 2023b, Liu et al., 2024).
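The alternating projections above can be sketched in a few lines. The sketch below uses SciPy's `stft`/`istft` as the analysis and synthesis operators; the function name, window settings, and iteration default are illustrative choices, not taken from the cited papers:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(S, n_iter=60, nperseg=256, noverlap=192, seed=0):
    """Reconstruct a time-domain signal from a magnitude spectrogram S.

    S is assumed to come from scipy.signal.stft with the same
    nperseg/noverlap settings, so that array shapes stay consistent.
    """
    rng = np.random.default_rng(seed)
    # Arbitrary (here: random) initial phase.
    X = S * np.exp(2j * np.pi * rng.random(S.shape))
    for _ in range(n_iter):
        # Magnitude projection: keep the current phase, impose |X| = S.
        X = S * np.exp(1j * np.angle(X))
        # Consistency projection: round-trip through the signal domain.
        _, x = istft(X, nperseg=nperseg, noverlap=noverlap)
        _, _, X = stft(x, nperseg=nperseg, noverlap=noverlap)
    _, x = istft(S * np.exp(1j * np.angle(X)), nperseg=nperseg, noverlap=noverlap)
    return x
```

Using `np.angle` rather than dividing by `|X|` avoids division by zero in empty time-frequency bins.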
2. Algorithm Variants: Acceleration and Modern Extensions
Multiple variants improve convergence speed and quality over classical GLA:
- Fast Griffin–Lim (FGLA) introduces inertial momentum:

$$X^{(n+1)} = P_{\mathcal{C}_2}\big(P_{\mathcal{C}_1}(Y^{(n)})\big), \qquad Y^{(n+1)} = X^{(n+1)} + \alpha\big(X^{(n+1)} - X^{(n)}\big),$$

with momentum parameter $\alpha \in [0,1)$. Empirically, FGLA achieves the same fidelity in roughly half the iterations of GLA (Sharma et al., 2020, Nenov et al., 2023a, 2023b).
- Accelerated Griffin–Lim (AGLA) generalizes FGLA by adding secondary inertial sequences and a relaxation parameter $\lambda$ for the projection step; the demonstrated parameter choices yield 3–4× speedups and lower final error (Nenov et al., 2023a, 2023b).
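The FGLA update with inertial momentum can be sketched as follows, reusing SciPy-based projections; the function names and the default $\alpha = 0.99$ are illustrative choices:

```python
import numpy as np
from scipy.signal import stft, istft

def proj_consistency(X, nperseg=256, noverlap=192):
    """Consistency projection: round-trip iSTFT -> STFT."""
    _, x = istft(X, nperseg=nperseg, noverlap=noverlap)
    _, _, Xc = stft(x, nperseg=nperseg, noverlap=noverlap)
    return Xc

def fast_griffin_lim(S, n_iter=30, alpha=0.99, seed=0):
    """FGLA: Griffin-Lim with an inertial (momentum) extrapolation step."""
    rng = np.random.default_rng(seed)
    X = proj_consistency(S * np.exp(2j * np.pi * rng.random(S.shape)))
    Y = X.copy()
    for _ in range(n_iter):
        # Standard GLA step, applied to the extrapolated point Y.
        X_new = proj_consistency(S * np.exp(1j * np.angle(Y)))
        # Momentum: extrapolate past the new iterate.
        Y = X_new + alpha * (X_new - X)
        X = X_new
    _, x = istft(S * np.exp(1j * np.angle(X)), nperseg=256, noverlap=192)
    return x
```

Unlike plain GLA, the extrapolated iterate $Y$ need not lie in either constraint set; only the projected iterate $X$ is used for the final reconstruction.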
Other advanced algorithms using the same projection primitives include:
- Relaxed Averaged Alternating Reflections (RAAR)
- Difference Map (DM)

These methods, borrowed from optics, escape poor nonconvex local minima more reliably than plain alternating projections and are often used in early iterations before switching to FGLA for rapid final convergence (Peer et al., 2022).
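For reference, one common parameterization of the RAAR update, writing $P_{\mathcal{C}_1}$, $P_{\mathcal{C}_2}$ for the magnitude and consistency projections (this follows the standard optics formulation and is an assumption, not quoted from Peer et al., 2022):

$$X^{(n+1)} = \frac{\beta}{2}\Big(R_{\mathcal{C}_2} R_{\mathcal{C}_1} + I\Big) X^{(n)} + (1-\beta)\, P_{\mathcal{C}_1} X^{(n)}, \qquad R_{\mathcal{C}_i} = 2P_{\mathcal{C}_i} - I,$$

where $\beta \in (0,1]$ interpolates between averaged alternating reflections ($\beta = 1$) and the plain magnitude projection.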
Recent architectures such as Deep Griffin-Lim Iteration (DeGLI) interleave GLA layers with a residual DNN, producing high-quality phase estimates with an order of magnitude fewer iterations (Masuyama et al., 2019).
3. Online, Real-Time, and Application-Specific Griffin-Lim Variants
Originally defined as an offline (entire spectrogram) method, GLA and its variants have been adapted for real-time applications:
- RTISI (Real-Time Iterative Spectrogram Inversion) computes projections frame by frame, enforcing amplitude and STFT-consistency constraints over buffers of still-modifiable ("fluid") and committed ("frozen") frames (Peer et al., 2023).
- Flexible online frameworks permit FGLA, AGLA, RAAR, DM and similar strategies to be deployed in streaming speech synthesis and enhancement.
For text-to-speech, FGLA has demonstrated strong reductions in synthesis delay and improved MOS scores relative to GLA and GAN-based vocoders (Sharma et al., 2020). GLA also serves as a core component in neural audio diffusion pipelines, with GLA-Grad correcting intermediate samples to maintain spectral consistency with the mel-spectrogram conditioning (Liu et al., 2024).
4. Error Analysis, Convergence, and Empirical Performance
GLA monotonically decreases the magnitude-error objective and converges to a fixed point, but guarantees only local optimality. Both FGLA and AGLA exhibit empirical speedups: FGLA typically delivers a 2× acceleration, while AGLA attains up to a 3.5× speedup and a higher final SSNR (Nenov et al., 2023a, 2023b).
Per iteration, all projection-based schemes have similar computational complexity, dominated by the STFT and iSTFT ($\mathcal{O}(MN \log N)$ for $M$ frames of length $N$). Advanced variants (RAAR, DM) incur slightly higher per-iteration cost due to multiple projections.
Empirical comparisons:
| Algorithm | Iterations to 20 dB SSNR | Final SSNR (1000 iterations) |
|---|---|---|
| GLA | ~300 | ~22 dB |
| FGLA | ~150 | ~24 dB |
| AGLA | ~80 | 27–30 dB |
Further, hybrid DM→FGLA approaches have been shown to produce the highest speech quality (PESQ ≈ 3.73) in fewer iterations (Peer et al., 2022).
5. Generalizations: Partial Phase Knowledge, Multiple Sources, and Alternative Objectives
GLA readily handles partial phase knowledge via a phase-inpainting scheme: at each projection, known phases are enforced while unknowns are iteratively updated (Krémé et al., 2018).
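The phase-inpainting step can be sketched as a masked variant of the magnitude projection; the function and argument names below are ours, and the exact update in (Krémé et al., 2018) may differ in detail:

```python
import numpy as np

def project_partial_phase(X, S, known_phase, known_mask):
    """Magnitude projection with phase inpainting (illustrative sketch):
    impose |X| = S everywhere; where the phase is known (known_mask is
    True) enforce it, elsewhere keep the current iterate's phase.
    """
    phase = np.where(known_mask, known_phase, np.angle(X))
    return S * np.exp(1j * phase)
```

Substituting this operator for $P_{\mathcal{C}_1}$ in any of the projection-based iterations yields the phase-inpainting variant.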
More generally, the multi-source Griffin–Lim algorithm (MSGLA) incorporates geometric constraints (laws of sines/cosines) for speech enhancement under noise. By alternating between GLA-style consistency enforcement and geometric phase updates (using noise magnitude or phase), MSGLA resolves ambiguity and achieves improved noise suppression and SI-SNR over classical GLA and DNN predictors (Ho et al., 2025).
Convex relaxation of the phase retrieval problem—e.g., STliFT—casts magnitude consistency as a trace-minimization SDP. This approach achieves higher exact-recovery rates in the noiseless regime, but is computationally demanding for long signals; segmentation and sparse structure exploitation are recommended (Sun et al., 2012).
6. Integration with Neural and GAN-Based Systems
Phase-aware music super-resolution exploits modified GLA to jointly employ GAN-predicted high-frequency magnitudes and preserved low-frequency magnitude/phase information. The iterative projection constrains the reconstructed waveform to be spectrally consistent, outperforming naive phase-flip and yielding sharper transients in audio (Hu et al., 2020).
GLA and its acceleration schemes have been integrated into deep learning models (DeGLI, GLA-Grad, hybrid schemes), offering a tradeoff between iteration count and computational cost, with quality superior to equivalent classical iterations (Masuyama et al., 2019, Liu et al., 2024).
7. Practical Guidelines, Limitations, and Tuning
Recommended hyperparameters:
- Typical iteration counts: GLA ≈ 60 and FGLA ≈ 30 for ~20 kHz audio; fewer for lower-dimensional inputs.
- Momentum/relaxation coefficients: an FGLA momentum of $\alpha \approx 0.9$–$0.99$ for maximum acceleration; the AGLA parameter schedules of Nenov et al. perform best empirically.
- Initialization: zero-phase, random, or noisy phase; PGHI warm-starts can improve convergence.
- Stopping: a fixed iteration budget, or a threshold on the relative change of the magnitude error.
- Limitations: GLA/variants can stall in poor minima, especially with large hop sizes or low redundancy. In optics/speech, DM/RAAR or SDP-based methods sometimes outperform GLA in these cases.
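The stopping rule from the guidelines above can be sketched as a generic driver loop; the function name and tolerance default are illustrative:

```python
import numpy as np

def run_with_stopping(step, X0, S, max_iter=200, tol=1e-4):
    """Iterate `step` (any projection-based update: GLA, FGLA, ...) until
    the relative change of the normalized magnitude error falls below
    `tol`, or the iteration budget is exhausted.
    Returns the final iterate and the number of iterations performed.
    """
    X, err_prev = X0, np.inf
    for n in range(max_iter):
        X = step(X)
        # Normalized magnitude error against the target spectrogram S.
        err = np.linalg.norm(np.abs(X) - S) / np.linalg.norm(S)
        if abs(err_prev - err) < tol * max(err, 1e-12):
            return X, n
        err_prev = err
    return X, max_iter
```

A fixed budget alone suffices for real-time use; the relative-change test mainly saves iterations in offline reconstruction.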
A flexible online framework permits real-time implementation of any projection-based variant with similar per-frame complexity (Peer et al., 2023).
The Griffin-Lim algorithm and its extensions remain central in phase retrieval, serving both as reliable baselines and as the projection component in state-of-the-art neural architectures and real-time vocoders for audio, speech, and music signal reconstruction (Sharma et al., 2020, Peer et al., 2022, Nenov et al., 2023a, 2023b, Ho et al., 2025, Masuyama et al., 2019, Hu et al., 2020, Liu et al., 2024, Peer et al., 2023, Sun et al., 2012, Krémé et al., 2018).