
MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model

Published 24 Sep 2025 in eess.AS and cs.SD (arXiv:2509.19881v2)

Abstract: Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.

Summary

  • The paper presents a novel coarse-to-fine masking strategy in a generative model that refines speech tokens for improved clarity and reduced error.
  • It integrates BigCodec with a TF-GridNet encoder and a BLSTM corrector to balance token prediction and computational efficiency.
  • Results demonstrate superior noise reduction, enhanced WER performance, and lower computational requirements on diverse noisy datasets.

MAGE: Coarse-to-Fine Speech Enhancer with Masked Generative Model

Introduction

MAGE introduces an innovative approach to speech enhancement by employing a masked generative model with a coarse-to-fine masking strategy. The primary focus is achieving efficient enhancement while maintaining perceptual quality, a significant challenge in the field due to the inherent trade-offs between computational efficiency and speech clarity. Prior approaches often rely on random masking strategies, leading to inefficiencies and redundancy, motivating the need for the more strategic masking implemented in MAGE.

MAGE is built upon the generative capabilities of Qwen2.5-0.5B, a powerful LLM, and BigCodec, a neural speech codec; together they enable the conversion of audio signals into discrete representations for effective enhancement (Figure 1).

Figure 1: A comparison of DNS OVL scores on a real-world test set against model size for several different methods.

Neural Encodec and Encoders

The foundation of MAGE lies in extracting robust discrete representations with BigCodec, which provides stable tokenization at 80 tokens per second. The distorted audio is converted into complex spectrograms and passed through a TF-GridNet block, which projects the spectrogram into a conditioning embedding space. Additionally, speaker identity is extracted with a pretrained speaker encoder, ensuring the model captures the acoustic characteristics vital for personalized enhancement.
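As a rough illustration, the front-end rates described above can be sketched with simple bookkeeping (a hypothetical sketch assuming 16 kHz audio; the hop size follows the STFT settings reported in the paper, and the helper names are our own):

```python
# Hypothetical bookkeeping for MAGE's front end (helper names are ours).
# Assumes 16 kHz audio, an STFT hop of 100 samples per the paper's settings,
# and BigCodec emitting 80 discrete tokens per second.

def num_stft_frames(n_samples: int, hop: int = 100) -> int:
    """Frames produced by a centered STFT: one frame per hop, plus one."""
    return n_samples // hop + 1

def num_codec_tokens(duration_s: float, tokens_per_second: int = 80) -> int:
    """Length of the discrete BigCodec token sequence for a clip."""
    return round(duration_s * tokens_per_second)

sr = 16_000
print(num_stft_frames(sr))    # 161 conditioning frames for one second
print(num_codec_tokens(1.0))  # 80 tokens for one second
```

Note the mismatch in rates: the conditioning spectrogram is roughly twice as dense as the token sequence, so the conditioning embeddings must be aligned to the token grid inside the model.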

Mask Generative Model (MGM)

MAGE employs a Mask Generative Model (MGM) that applies a token masking strategy over multiple denoising steps, transforming the token sequence step by step. The masking follows a scarcity-aware strategy: frequent tokens are prioritized in early steps, and rarer tokens are refined later, with masking probabilities scheduled via a cosine distribution (Figure 2).

Figure 2: Training pipeline and model design of MAGE. Target audio is first converted into a sequence of tokens using a Neural Encodec. These tokens are then masked according to their distribution to form a coarse-to-fine masking strategy.
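The denoising loop can be sketched as follows (a minimal illustration of iterative masked decoding with a cosine schedule, not the paper's exact code; `predict` and `confidence` are stand-ins for the masked LLM and its per-token scores):

```python
import math

def cosine_mask_ratio(step: int, total: int) -> float:
    """Fraction of tokens left masked after `step` of `total` steps."""
    return math.cos(0.5 * math.pi * step / total)

def iterative_decode(seq_len, total_steps, predict, confidence):
    MASK = None  # None marks a still-masked position
    tokens = [MASK] * seq_len
    for step in range(total_steps):
        tokens = predict(tokens)  # fill every masked slot
        n_mask = int(cosine_mask_ratio(step + 1, total_steps) * seq_len)
        # re-mask the least-confident positions for the next refinement pass
        order = sorted(range(seq_len), key=lambda i: confidence(tokens, i))
        for i in order[:n_mask]:
            tokens[i] = MASK
    return tokens
```

At the final step the cosine ratio reaches zero, so every position is committed and the full token sequence can be decoded back to audio.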

Coarse-to-Fine Strategy and Corrector

The coarse-to-fine (CTF) masking strategy improves generalization by making predictions for less frequent tokens more reliable. It computes a per-token frequency vector and adjusts masking probabilities with an IDF-like score, yielding a schedule that naturally progresses from frequent to rare token prediction and ensures balanced learning.
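The IDF-like rarity score could be computed along these lines (a sketch under our own choice of normalization; the paper's exact formula may differ):

```python
import math
from collections import Counter

def idf_scores(corpus, vocab_size):
    """IDF-like rarity score per token id; higher = rarer (sketch only)."""
    n_docs = len(corpus)  # treat each token sequence as one "document"
    df = Counter()        # document frequency f^i of each token
    for seq in corpus:
        df.update(set(seq))
    return {t: math.log(n_docs / (1 + df.get(t, 0))) for t in range(vocab_size)}

corpus = [[0, 0, 1], [0, 2], [0, 1]]
scores = idf_scores(corpus, 3)
# Token 0 appears in every sequence, so it scores lowest and is predicted
# in the early (coarse) steps; rare token 2 is refined last.
print(sorted(scores, key=scores.get))  # [0, 1, 2]
```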

Additionally, MAGE incorporates a lightweight corrector module that further refines predictions. It employs a BLSTM architecture to detect and revise low-confidence tokens through iterative refinement, enhancing perceptual quality and minimizing error accumulation (Figure 3).

Figure 3: Ablation study on the number of inference steps for CTF and CTF + Corrector. The overall performance is measured using DNSMOS-OVL on Real Recording DNS dataset.
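The corrector's role can be illustrated with a simple thresholding sketch (hypothetical: the paper scores confidence with a 4-layer BLSTM, whereas here the per-token probabilities are taken as given, and the MASK sentinel is our own convention):

```python
MASK = -1  # sentinel for a re-masked position (our convention)

def corrector_pass(tokens, probs, threshold=0.5):
    """Re-mask positions whose confidence falls below `threshold`
    so the generator can refine them on the next pass (sketch only)."""
    return [MASK if p < threshold else t for t, p in zip(tokens, probs)]

print(corrector_pass([5, 7, 9], [0.9, 0.3, 0.8]))  # [5, -1, 9]
```

Only the re-masked positions are re-predicted, which keeps the extra inference cost of the corrector small.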

Experimental Setup

MAGE was trained using a reduced Qwen2.5-0.5B model, optimized with LoRA techniques to balance size and performance, achieving a compact size of 200M parameters. Training involved diverse datasets, including LibriSpeech and DNS Challenge, augmented with various noise types and reverberation effects, to simulate realistic audio conditions. The evaluation covered common metrics such as SIG, BAK, OVL, SSIM, and impact on downstream ASR tasks with WER calculations.
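Selective layer retention, as described in the paper (keep only the odd-numbered layers, counting the first layer as layer 1), can be sketched as:

```python
def retain_odd_layers(layers):
    """Keep layers 1, 3, 5, ... (1-indexed), halving a stack of 2k layers."""
    return [layer for i, layer in enumerate(layers, start=1) if i % 2 == 1]

# e.g. pruning a hypothetical 24-layer transformer stack leaves 12 layers
print(len(retain_odd_layers(list(range(24)))))  # 12
```

Combined with LoRA fine-tuning, this is how the 0.5B-parameter backbone is reduced to the reported 200M parameters.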

Results and Analysis

The results on the DNS Challenge reveal MAGE’s competitiveness against larger models, providing superior SIG scores and robust OVL performance across varied acoustic conditions. Particularly noteworthy is its efficiency advantage, achieving comparable perceptual quality with significantly lower computational requirements.

MAGE also improves WER on the noisy LibriSpeech dataset, demonstrating the value of enhanced audio for speech recognition. Ablation studies show that MAGE's band-aware TF-GridNet encoder balances computational efficiency and perceptual quality, underscoring the benefits of modeling cross-band spectral dependencies.

Conclusion

MAGE puts forward a compact, efficient solution to speech enhancement, successfully balancing perceptual quality and computational demands. The CTF and corrector strategies establish MAGE as a robust generative model suitable for real-world applications. Future research directions could explore multilingual applications, joint training with ASR/TTS models, and further integration with multimodal inputs, extending MAGE's utility and adaptability.


Explain it Like I'm 14

MAGE: What this paper is about

This paper introduces MAGE, a smart tool that cleans up noisy speech so it sounds clearer to people and to machines. It uses a “generative” approach, which means it learns what clean speech should look like and then tries to rebuild it from a messy recording. The big idea is to make this process both high-quality and efficient, so it works well in real life, not just in labs.

The main goals and questions

The researchers wanted to solve three main problems:

  • How can we make speech enhancement sound natural and clear, not robotic or muffled?
  • How can we make it fast enough and small enough to run on normal hardware?
  • How can we avoid the common mistakes that happen when models try to fix noise in different real-world situations?

They asked: Can a smarter “masking” strategy (deciding which parts to guess and when) and a simple “corrector” (a checker that fixes low-confidence guesses) improve quality and speed at the same time?

How MAGE works, in simple terms

Think of cleaning up noisy speech like fixing a blurry, puzzle-like picture:

  • First, the audio is turned into small building blocks called “tokens” (like Lego pieces). A speech codec called BigCodec does this, creating about 80 tokens per second.
  • MAGE is a language-model-like system that learns how these tokens should look for clean speech.
  • It uses “masking,” which is like covering certain puzzle pieces and asking the model to guess them using the surrounding context.

Here are the key ideas explained with everyday analogies:

  • Masked generative model: Imagine a fill-in-the-blanks game for sound. The model hides some pieces and predicts what they should be to make the speech clean.
  • Coarse-to-fine (CTF) masking: Instead of hiding tokens randomly, MAGE starts by fixing the most common, easy parts first, then tackles rare, tricky parts later. It’s like painting the background of a picture first, then adding the fine details. This saves time and helps the model generalize to new types of noise.
  • Corrector module: After the model makes predictions, a lightweight “proofreader” checks for low-confidence pieces and asks the model to re-try only those parts. This prevents errors from piling up and improves the final sound.
  • Band-aware encoder (TF-GridNet): This component looks at different frequency bands (think bass vs. treble) and how they interact, helping the model understand both the voice and the noise more efficiently.
  • Small but smart: MAGE starts with a general-purpose LLM (Qwen2.5-0.5B) and trims it down to about 200 million parameters by keeping only selected layers. It uses LoRA (a way to fine-tune big models cheaply) to train on a single GPU.

What they found and why it matters

The team tested MAGE on two major benchmarks:

  • DNS Challenge: A widely used test set for noisy and real recordings.
  • Noisy LibriSpeech: Clean audiobooks mixed with different types of noise.

They measured:

  • SIG (signal quality), BAK (background noise), and OVL (overall quality), from DNSMOS.
  • WER (Word Error Rate), which shows how well an automatic speech recognizer understands the enhanced audio.

Main results:

  • MAGE achieves state-of-the-art perceptual quality compared to bigger, more expensive models.
  • The CTF strategy clearly improves overall quality, especially on real recordings and non-reverberant audio.
  • The corrector makes the system more stable during inference (prediction time), further boosting scores.
  • On Noisy LibriSpeech, MAGE reduces WER to about 23.45%, better than other top generative methods (lower WER means the speech is easier for machines to understand).

Why this matters:

  • People get cleaner, more natural-sounding speech in everyday environments like homes, streets, or offices.
  • Machines (like voice assistants, captioning systems, or transcription tools) make fewer mistakes because the speech is clearer.
  • The model is compact and efficient, making it practical for real-world apps and devices.

What this could change in the future

This research shows that you don’t need massive models or slow methods to get great audio quality. By:

  • Fixing common parts first, then refining rare details,
  • And adding a simple “proofreader” to catch mistakes,

MAGE bridges the gap between high-quality sound and practical speed. This can impact:

  • Better phone calls and online meetings,
  • More accurate voice assistants and transcription,
  • Easier data collection for building speech systems,
  • Future multilingual and streaming enhancements,
  • Joint systems that combine enhancement with speech recognition and text-to-speech.

In short, MAGE is a smart, efficient way to clean up speech that helps both humans and machines understand audio better.

Glossary

  • ASR (Automatic Speech Recognition): A system that transcribes spoken audio into text; used for evaluating intelligibility improvements. "We also use the released ASR model to compute WER, reflecting intelligibility gains."
  • BAK (Background Intrusiveness): A DNSMOS metric indicating how intrusive background noise is in the enhanced audio. "SIG (signal distortion), BAK (background intrusiveness), and OVL (overall quality) from ITU-T P.835"
  • Band-Aware Encoder: An encoder that models frequency bands explicitly to capture cross-band spectral dependencies. "Besides, speaker identity is extracted by a Band-Aware Encoder and a pretrained Speaker Encoder, enabling it to capture the acoustic characteristics."
  • BigCodec: A neural speech codec providing discrete tokens and high-quality reconstruction for audio language modeling. "we adopt BigCodec, which provides stable tokenization and high-quality reconstruction using a single codebook with 80 tokens per second."
  • BLSTM (Bidirectional Long Short-Term Memory): A recurrent neural network variant that processes sequences in both forward and backward directions. "we train a lightweight 4-layer BLSTM corrector that identifies low-confidence tokens and re-masks them."
  • Codebook: The discrete set of indices used by a neural codec to represent audio tokens. "using a single codebook with 80 tokens per second."
  • Coarse-to-Fine (CTF) masking strategy: A scarcity-aware schedule that masks frequent tokens earlier and rare tokens later to improve efficiency and generalization. "we introduce a coarse-to-fine (CTF) masking strategy."
  • Corrector module: A lightweight post-processing component that detects low-confidence tokens and re-masks them for iterative refinement. "We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement."
  • Cosine schedule: A masking probability schedule based on a cosine function used across denoising steps. "where p(i) follows a cosine schedule."
  • DNS Challenge: A benchmark dataset and evaluation framework for noise suppression and speech enhancement. "Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines."
  • DNSMOS: A non-intrusive objective metric suite (SIG/BAK/OVL) that estimates perceptual speech quality. "DNSMOS scores (SIG/BAK/OVL) and speaker cosine similarity (SSIM) are reported."
  • Document frequency: The count of how often a token appears across the training corpus, used to compute rarity for masking. "where f^i is the document frequency of token x^i in the training corpus."
  • HuBERT: A self-supervised speech representation model used as a strong but compute-intensive encoder baseline. "SSL models, such as HuBERT, deliver strong performance but are computationally intensive."
  • IDF-like score (Inverse Document Frequency-like): A rarity measure computed from document frequency to prioritize masking of rare tokens later. "We then calculate an IDF-like score:"
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method for LLMs. "We finetune the masked LLM from Qwen2.5-0.5B using LoRA."
  • Mask Generative Model (MGM): A generative framework that reconstructs masked tokens iteratively, conditioned on context and acoustic information. "the MGM learns to reconstruct masked tokens over N denoising steps."
  • Neural Encodec: A neural audio compression model that converts target audio into discrete tokens for modeling. "Target audio is first converted into sequence of tokens using a Neural Encodec."
  • Non-autoregressive mode (attention): An attention configuration that predicts tokens without relying on past outputs in sequence order. "attention is configured in a non-autoregressive mode."
  • OVL (Overall Quality): A DNSMOS metric capturing the overall perceptual quality of the enhanced audio. "SIG (signal distortion), BAK (background intrusiveness), and OVL (overall quality) from ITU-T P.835"
  • PESQ (Perceptual Evaluation of Speech Quality): An intrusive metric assessing speech quality by comparing enhanced audio to clean references. "Discriminative models directly map noisy inputs to clean signals by optimizing losses aligned with intrusive metrics such as SI-SDR, PESQ, and STOI."
  • Qwen2.5-0.5B: A LLM backbone from which MAGE is finetuned and selectively pruned for efficiency. "The LLM is a reduced Qwen2.5-0.5B, where only the odd-numbered layers are kept (layer 1 is the first), and attention is configured in a non-autoregressive mode."
  • Short-time Fourier Transform (STFT): A time-frequency analysis method used to obtain complex spectrograms for conditioning. "The MAGE speech encoder applies Short-time Fourier Transform with n_fft=256, window=256, hop_size=100"
  • SIG (Signal Distortion): A DNSMOS metric reflecting the amount of distortion introduced to the speech signal. "SIG (signal distortion), BAK (background intrusiveness), and OVL (overall quality) from ITU-T P.835"
  • Speaker embedding: A vector representation of speaker identity used to condition the generative model. "concatenated with the speaker embedding, which is projected from ee using a lightweight adaptor."
  • Speaker Similarity (SSIM): A cosine-similarity-based measure comparing speaker embeddings to assess identity preservation. "Speaker Similarity (SSIM), computed as the cosine similarity between speaker embeddings from Wespeaker."
  • STOI (Short-Time Objective Intelligibility): An intrusive intelligibility metric estimating how understandable speech is after enhancement. "Discriminative models directly map noisy inputs to clean signals by optimizing losses aligned with intrusive metrics such as SI-SDR, PESQ, and STOI."
  • TF-GridNet: A lightweight architecture that models full- and sub-band frequency interactions for conditioning. "we leverage a TF-GridNet block, which efficiently models cross-band frequency interactions while remaining lightweight."
  • WavLM: A large-scale self-supervised model providing discrete token representations for language-model-driven speech enhancement. "Discrete representations facilitate language-model-driven SE, as demonstrated by SELM, which leverages WavLM, and MaskSR"
  • Word Error Rate (WER): A standard ASR metric measuring transcription error rate, used to quantify intelligibility. "We also use the released ASR model to compute WER, reflecting intelligibility gains."

