
AV-HuBERT: Multimodal Speech Representation

Updated 29 January 2026
  • AV-HuBERT is a self-supervised audio-visual framework that jointly models speech acoustics and lip movements using dual-stream encoders fused via a Transformer.
  • The approach employs masked prediction with k-means clustering and modality dropout, significantly reducing labeled data requirements while enhancing robustness.
  • Quantitative benchmarks in AVSR, lip reading, and speech enhancement confirm its state-of-the-art performance, showcasing improvements in WER, EER, and perceptual metrics.

Audio-Visual HuBERT (AV-HuBERT) is a self-supervised speech representation learning framework designed to jointly model audio and visual signals—specifically, speech acoustics and synchronous lip movements—using a masked prediction paradigm generalized from HuBERT. AV-HuBERT has become a foundational model for audio-visual automatic speech recognition (AVSR), lip reading, speaker verification, speech enhancement/separation, and more, facilitating robust multimodal speech modeling with reduced labeled data requirements. Its approach combines discrete hidden-unit discovery (via clustering) with deep Transformer-based fusion, producing contextualized joint representations effective across a range of classification and regression tasks (Shi et al., 2022).

1. Model Architecture and Training Paradigm

AV-HuBERT utilizes a dual-stream architecture: a temporal convolutional audio encoder processes raw waveforms or log-Mel features (typically at 16 kHz), while a visual encoder—based on 3D convolutional layers and a modified ResNet-18—extracts features from cropped mouth-region frames (e.g., 96×96, 25 fps) (Shi et al., 2022, López, 22 Jan 2026). The two encoder outputs are temporally synchronized (downsampled and aligned) and concatenated per frame. This fused sequence is passed into a Transformer encoder backbone (e.g., 12 layers for Base; 24 layers for Large), which models long-range, cross-modal context (Shi et al., 2022, Shahzad et al., 2023).
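The frame-rate bookkeeping behind this fusion can be sketched in a few lines. The rates below (100 Hz audio features stacked 4× to match 25 fps video) follow the setup described above; the feature dimensions are illustrative assumptions, not the exact model configuration.

```python
# Sketch of AV-HuBERT-style temporal alignment (illustrative, not the official code).
# Audio features at 100 Hz are stacked 4x to match the 25 fps video stream,
# then the two streams are concatenated per frame.

AUDIO_FEAT_HZ = 100      # audio feature frames per second
STACK = 4                # consecutive audio frames stacked per step
VIDEO_FPS = 25           # mouth-region crops per second
AUDIO_DIM = 104          # stacked audio feature dimension (assumed)
VIDEO_DIM = 512          # ResNet-18 output dimension (assumed)

def fused_sequence_shape(duration_s: float) -> tuple:
    """Return (num_frames, feature_dim) of the concatenated AV sequence."""
    audio_steps = int(duration_s * AUDIO_FEAT_HZ) // STACK
    video_steps = int(duration_s * VIDEO_FPS)
    # Both streams end up at a common 25 Hz rate before concatenation.
    num_frames = min(audio_steps, video_steps)
    return num_frames, AUDIO_DIM + VIDEO_DIM

print(fused_sequence_shape(2.0))  # 2 s clip -> (50, 616)
```

Stacking four 100 Hz audio frames per step is what lets the audio stream line up one-to-one with 25 fps video before the Transformer sees the fused sequence.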

The self-supervised training objective is masked multimodal cluster prediction: random spans of audio, video, or both modalities are masked, and the model predicts discrete pseudo-labels (cluster assignments) for masked frames. These targets are produced by running k-means on MFCCs or intermediate features, refined iteratively across training stages (Shi et al., 2022). The loss is cross-entropy over masked time steps:

$$L_{\text{mask}} = -\sum_{t \in M} \log P(c_t \mid x_{\text{masked}})$$

where $M$ denotes the set of masked indices and $c_t$ the target cluster for frame $t$ (López, 22 Jan 2026).
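The loss reduces to summing negative log-probabilities of the target clusters over masked frames only. A minimal sketch, with hypothetical toy probabilities:

```python
import math

def masked_cluster_loss(log_probs, targets, masked_idx):
    """Cross-entropy over masked frames: L = -sum_{t in M} log P(c_t | x_masked).

    log_probs:  per-frame dicts mapping cluster id -> log-probability
    targets:    per-frame target cluster ids c_t (k-means pseudo-labels)
    masked_idx: set M of masked frame indices
    """
    return -sum(log_probs[t][targets[t]] for t in masked_idx)

# Toy example: 3 frames, 2 clusters (numbers are made up for illustration).
log_probs = [
    {0: math.log(0.9), 1: math.log(0.1)},
    {0: math.log(0.2), 1: math.log(0.8)},
    {0: math.log(0.5), 1: math.log(0.5)},
]
targets = [0, 1, 0]
loss = masked_cluster_loss(log_probs, targets, masked_idx={0, 1})
print(round(loss, 4))  # -(log 0.9 + log 0.8) ≈ 0.3285
```

Unmasked frames contribute nothing to the loss, which forces the model to infer the masked content from cross-modal context rather than copy it.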

Additional regularization includes modality dropout and “mask-by-substitution” augmentation: for some examples, only one of the two modalities is used, and splices of visual input from mismatched utterances are employed, enforcing robustness and true cross-modal learning (Shi et al., 2022, Hsu et al., 2022).
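Modality dropout can be sketched as a per-example random choice of which streams to keep; the probabilities below are illustrative, not the schedule used in the papers.

```python
import random

def modality_dropout(audio_feat, video_feat, p_both=0.5, p_audio=0.25, seed=None):
    """Randomly zero one modality so the model cannot rely on either alone.

    With prob p_both keep both streams; with prob p_audio keep audio only;
    otherwise keep video only. Dropped streams are replaced by zeros.
    (Hypothetical probabilities, for illustration.)
    """
    rng = random.Random(seed)
    r = rng.random()
    zeros_a = [0.0] * len(audio_feat)
    zeros_v = [0.0] * len(video_feat)
    if r < p_both:
        return audio_feat, video_feat
    elif r < p_both + p_audio:
        return audio_feat, zeros_v
    else:
        return zeros_a, video_feat
```

Because either stream can vanish at training time, the Transformer must learn representations that remain predictive from audio alone, video alone, or both.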

A variant using Conformer encoders (local convolution + self-attention) replaces standard Transformers and yields further performance and robustness improvements (Ren et al., 2023).

2. Pretraining Pipeline and Cluster Refinement

AV-HuBERT pretraining leverages large-scale unlabeled datasets (e.g., LRS3, VoxCeleb2) (Shi et al., 2022), extracting both audio and aligned video. Pretraining progresses through multiple iterations: each stage produces cluster assignments via k-means on the output of a selected Transformer layer from the previous stage. The number of clusters typically grows across iterations (e.g., 100→1000→2000), and the cluster targets transition from purely audio-based to joint audiovisual features, increasing the task’s granularity (Shi et al., 2022).
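The pseudo-label generation step is plain k-means over frame features. A toy 1-D version, purely illustrative (the real pipeline clusters MFCCs or Transformer-layer features):

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Tiny 1-D k-means illustrating pseudo-label (cluster target) generation.

    In AV-HuBERT the same idea runs over high-dimensional frame features;
    this sketch clusters scalars only.
    """
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        # Assignment step: each frame gets the nearest center's id.
        assign = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: move each center to its cluster mean (skip empty clusters).
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            if members:
                centers[j] = sum(members) / len(members)
    return assign, centers

# Frames whose features sit near 0.0 vs 10.0 receive distinct pseudo-labels.
values = [0.1, 0.2, 0.0, 9.9, 10.1, 10.0]
labels, centers = kmeans_1d(values, k=2)
```

Each pretraining iteration reruns this clustering on features from the previous model, so the targets sharpen from acoustic (MFCC-based) to audiovisual as training progresses.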

Modality dropout is used during training, sometimes presenting only one modality (audio or video). This strategy is critical for robust unimodal and cross-modal transfer, and for avoiding degenerate solutions where one modality dominates. In the generalized version (u-HuBERT), this enables the same model to perform well with audio, visual, or AV input, supporting zero-shot cross-modal deployment (Hsu et al., 2022).

All self-supervised parameters are updated using Adam or AdamW optimizers, with custom schedules (warmup and decay), large batch sizes, and standard dropout/layernorm (Shi et al., 2022, Ren et al., 2023).

3. Downstream Fine-Tuning and Performance

After pretraining, AV-HuBERT can be fine-tuned for a variety of downstream tasks:

  • End-to-end AVSR and Lip Reading: A decoder (CTC or sequence-to-sequence) is added on top of the frozen or partially unfrozen Transformer stack, predicting phones, subword units, or words. Fine-tuning occurs with as little as 30 hours of labeled data, yet AV-HuBERT achieves or surpasses state-of-the-art (e.g., 32.5% WER for lip reading on LRS3 using 30 hours vs 33.6% SOTA with 31K hours; 1.4% WER for AVSR in clean LRS3) (Shi et al., 2022, Shi et al., 2022).
  • Speaker Verification: The model’s representations are either pooled in an ELMo-style weighted scheme or a CLS token is used, yielding speaker verification equal error rates (EER) of 2.4% (audio+lip, clean) and dramatically improved robustness under noise (EER drops 63% across 20 noisy conditions) (Shi et al., 2022).
  • Speech Enhancement/Separation: Layer-weighted AV-HuBERT embeddings are input to speech enhancement or mask prediction heads (e.g., BLSTM or attention-based modules), outperforming both unimodal and multimodal baselines on measures such as PESQ, STOI, and SI-SNR (Chern et al., 2022). For target speech extraction, pre-trained AV-HuBERT Transformer layers (sometimes just a slice) are repurposed as powerful audio-visual cue encoders (Wu et al., 2024).
  • Non-ASR Applications: AV-HuBERT features are leveraged for deepfake detection via audio-visual synchrony analysis (Shahzad et al., 2023), dysarthric speech reconstruction (Chen et al., 2024), and talking-face generation/evaluation, where frozen AV-HuBERT serves as both a lip-sync expert and evaluation metric generator (Yaman et al., 2024).
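The ELMo-style pooling used for speaker verification and enhancement above is a softmax-weighted mix of per-layer features. A hypothetical sketch (the weights would be learned jointly with the downstream head):

```python
import math

def layer_weighted_pooling(layer_feats, weights):
    """ELMo-style weighted combination of Transformer layer outputs.

    layer_feats: list of L per-layer feature vectors (each of dimension D)
    weights:     L scalar logits, softmax-normalized before mixing
    Illustrative sketch, not the exact downstream-head implementation.
    """
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    norm = [e / total for e in exps]
    dim = len(layer_feats[0])
    return [sum(norm[l] * layer_feats[l][d] for l in range(len(layer_feats)))
            for d in range(dim)]

# Equal logits reduce to a plain average of the layers.
pooled = layer_weighted_pooling([[1.0, 2.0], [3.0, 4.0]], weights=[0.0, 0.0])
print(pooled)  # [2.0, 3.0]
```

Letting the downstream task learn the layer weights matters because different layers of the pretrained stack specialize differently (e.g., phonetic vs. speaker information).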

Empirical gains across domains include increased label efficiency (up to 10× reduction in required labeled data), higher intelligibility and naturalness for disordered speech, and improved lip-sync metrics and robustness in both classification and regression tasks (Shi et al., 2022, Chern et al., 2022, Yaman et al., 2024, Chen et al., 2024).

4. Quantitative Analysis and Perceptual Benchmarks

AV-HuBERT’s multisensory integration properties have been benchmarked against human perception. For McGurk-style incongruent AV fusion, AV-HuBERT and humans display closely matched auditory-dominance rates (32.0% for AV-HuBERT, 31.8% for humans), but the model shows a deterministic bias with higher phonetic fusion rates (68.0% vs. 47.7% human), indicating categorical output with negligible entropy and no “other” responses, in contrast to the stochastic error profiles of human listeners (López, 22 Jan 2026).

In temporal analysis, AV-HuBERT’s phoneme information emerges only about 20 ms earlier than in the audio-only HuBERT, attributed to audio-frame stacking and fusion stride, but it fails to capture natural human-like AV asynchrony (humans exploit 100–300 ms early visual information in speech) (Wang et al., 25 Jun 2025). This suggests a limitation in biofidelity for modeling temporal dynamics.

5. Extensions, Variants, and Specializations

Several architectural variants and adaptations have been developed:

  • Conformer-based AV-HuBERT: Replacing the Transformer with Conformer blocks (self-attention + convolution) offers improved robustness and context modeling, lowering WER/CER in both English and Mandarin and under noise (Ren et al., 2023).
  • Unified-Mixture Modalities: u-HuBERT extends AV-HuBERT to unified, mixed-modal pretraining, enabling zero-shot generalization across audio, visual, and mixed inputs with high accuracy (Hsu et al., 2022).
  • Non-AVSR Applications: AV-HuBERT embeddings are adapted for deepfake detection (AV-Lip-Sync+), target speech extraction, and talking-face generation. In speech enhancement/separation and dysarthric speech reconstruction, learned AV-HuBERT representations are fine-tuned and repurposed, transferring robust crossmodal speech cues (Chern et al., 2022, Wu et al., 2024, Chen et al., 2024).
  • Evaluation Metrics and Lip Sync: AV-HuBERT outputs are used to define new synchronization metrics (e.g., AVS_u, AVS_m, AVS_v) for both unsupervised and supervised assessment in talking-face generation (Yaman et al., 2024).

6. Limitations, Biofidelity, and Open Research Questions

AV-HuBERT’s self-supervised training enables robust integration of audio-visual cues, but several limitations persist:

  • Determinism vs. Stochasticity: AV-HuBERT displays rigid, high-confidence (low entropy) integration, lacking neural variability or the psychophysical diversity observed in human listeners (López, 22 Jan 2026).
  • Temporal Dynamics: The model undercaptures natural AV asynchrony, likely due to frame alignment strategies and audio-heavy cluster labeling, resulting in only a superficial lead of visual predictions over audio (Wang et al., 25 Jun 2025).
  • Requirements for Human-Like Integration: Potential directions include injecting stochasticity into attention layers, variational/Bayesian decoding, explicit noise at the fusion layers, uncertainty-modulated modality weighting, semi-supervised or multi-task predictive objectives for better asynchrony modeling, and multi-rate or cross-modal Transformer designs (López, 22 Jan 2026, Wang et al., 25 Jun 2025).

Some domain adaptation challenges remain for deployment across less standardized speaker populations, non-English languages, and non-frontal or low-quality visual streams (Shi et al., 2022, Chern et al., 2022).

7. Applications and Impact across Domains

AV-HuBERT has advanced performance and label efficiency across core tasks:

| Application | Task Type | Quantitative Advance (Selected) |
|---|---|---|
| AVSR | Sequence labeling | 1.4% WER (433 h) (Shi et al., 2022), robust under noise |
| Lip Reading | Sequence labeling | 26.9% WER (433 h, self-trained) (Shi et al., 2022) |
| Speaker Verification | Embedding | 2.4% EER (noise-robust) (Shi et al., 2022) |
| Speech Enhancement/SS | Regression | +0.06 PESQ, +0.5 dB SI-SNR vs. prior (Chern et al., 2022) |
| Deepfake Detection | Classification | SOTA on FakeAVCeleb/DeepfakeTIMIT (Shahzad et al., 2023) |
| Dysarthric Rec. | Regression | −8.5% PER, −8.2% WER (UASpeech) (Chen et al., 2024) |
| Talking Face Gen/Eval | Synchronization | AVS_u ↑ 0.508 vs 0.301 baseline (Yaman et al., 2024) |

AV-HuBERT’s pre-trained models, open-source implementations, and benchmark results have made it a standard for multimodal speech modeling. It acts as both a backbone for sequence models and a transferable feature extractor for crossmodal regression, classification, and evaluation pipelines.


AV-HuBERT thus exemplifies a paradigm shift in multimodal speech modeling: large-scale self-supervision, cluster prediction, and Transformer fusion enable robust, efficient, and generalizable representations that underpin leading approaches in audio-visual recognition, enhancement, speaker analysis, clinical speech restoration, and synchrony evaluation (Shi et al., 2022, Shi et al., 2022, Chern et al., 2022, Chen et al., 2024). Limitations relating to temporal alignment, biofidelity, and neural stochasticity remain open for future work.
