FunCineForge: Zero-Shot Multimodal Movie Dubbing
- FunCineForge is a unified framework for zero-shot movie dubbing that integrates automated dataset construction with multimodal language model architectures.
- It employs full-face cues, explicit speaker switching, and chain-of-thought corrections to achieve robust lip-sync, timbre transfer, and emotional expressiveness.
- The system overcomes data scarcity and noisy annotations, delivering high-fidelity dubbing across monologue, dialogue, and multi-speaker cinematic scenes.
FunCineForge is a unified framework for zero-shot movie dubbing, coupling an automated large-scale dataset construction pipeline (CineDub-CN) with a multimodal LLM (MLLM)-based dubbing architecture. The system addresses three limitations of prior work: data scarcity, noisy annotations, and over-reliance on lip-only cues, enabling high-fidelity dubbing across diverse cinematic scenarios, including monologue, narration, dialogue, and multi-speaker scenes. FunCineForge emphasizes precise lip-sync, robust timbre transfer (including explicit speaker switching), and emotional expressiveness, especially under naturalistic, live-action scene conditions (Liu et al., 21 Jan 2026).
1. Challenges in Multimodal Movie Dubbing
Prior approaches in movie dubbing encountered two core obstacles: the scarcity and noise of high-quality multimodal datasets, and SOTA dubbing models’ over-reliance on lip region cues. Existing datasets were typically monologue-only, limited to at most 10,000 clips (≈7 hours), manually annotated, and characterized by high ASR word error rates. Current models performed audio-visual alignment using only cropped lip images, rendering them ineffective in scenes involving occlusion, rapid shot changes, multi-speaker exchanges, or low-resolution frames.
These deficiencies stymied the development of dubbing models capable of simultaneously achieving accurate lip sync, high-quality speech, speaker identity preservation, and emotional naturalness across the complexity of cinematic content (Liu et al., 21 Jan 2026).
2. Automated Dataset Construction: CineDub-CN
FunCineForge introduces an end-to-end, fully automated pipeline to convert long-form TV episodes into richly annotated, structured multimodal dubbing datasets. The process incorporates:
- Video Standardization and Segmentation: Detection of speech-active regions using FSMN-Monophone VAD, transcription using FunASR, and segmentation into sentence-level SRT-aligned clips (~11 seconds average).
- Vocal Separation and Filtering: Mel-RoFormer separates vocals from background, with overlap-speech detection ensuring single-speaker purity.
- Audio-Visual Speaker Diarization: Extraction of audio embeddings (CAM++), 25 Hz speech tokenization (CosyVoice 3), periodic video frame sampling, face/lip detection (TalkNet-ASD, CurricularFace, HPMDubber), and clustering (as in 3D-Speaker) yield precise time-aligned RTTM speaker labels.
- Multimodal Chain-of-Thought (CoT) Correction: Gemini-2.5-Pro processes clean audio, transcripts, and diarization tuples, rectifies ASR/diarization errors, and infers high-level “clue” annotations (character age, gender, timbre, emotion). Clips exhibiting extreme ASR corrections or diarization inconsistencies are filtered out.
The result is CineDub-CN: derived from ≈200 series, covering >6,000 hours (7.2 TB), yielding 1,559,172 clips and ~4,700 hours of speech, with substantial representation of non-neutral emotions (41.8%) and scene diversity (Liu et al., 21 Jan 2026).
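The sentence-level segmentation step above can be illustrated as a simple interval-merging pass over VAD speech regions. The following Python is a minimal sketch under assumed parameters: the `max_gap` and `max_len` thresholds are hypothetical illustrations, not values from the paper.

```python
def merge_vad_intervals(intervals, max_gap=0.5, max_len=15.0):
    """Merge adjacent VAD speech intervals (start, end), in seconds, into
    sentence-level clips: gaps shorter than max_gap are bridged, but a
    merged clip never exceeds max_len."""
    clips = []
    for start, end in sorted(intervals):
        if clips and start - clips[-1][1] <= max_gap and end - clips[-1][0] <= max_len:
            # Close the small silence gap by extending the current clip.
            clips[-1] = (clips[-1][0], max(clips[-1][1], end))
        else:
            # Gap too large (or clip too long): start a new clip.
            clips.append((start, end))
    return clips
```

In the real pipeline, each merged clip would then be cut from the video, aligned against the SRT transcript, and passed to vocal separation and diarization.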
3. Model Architecture: Multimodal Alignment and Flow Matching
FunCineForge’s dubbing model comprises two principal components:
- A. MLLM with Multimodal Alignment:
Inputs integrate face frames (downsampled from 25 to 5 fps), lip crops, the script with “clue” text (defining speaker attributes and emotional tone), a scene-type encoding, and timestamp–speaker tuples. Dedicated encoders extract paired representations: CurricularFace for faces, HPMDubber for lips, BPE tokenizers for text, and a custom TST for timestamps. Supervision is threefold: (1) a voice activity loss, (2) a speech token loss, and (3) a contrastive lip loss.
- B. Flow Matching with Explicit Speaker Switching:
Utilizing a CosyVoice 3-based DiT diffusion transformer, the system extracts reference speaker embeddings (CAM++) and inserts speaker tokens at segment boundaries, enabling causal speaker switching. The conditional flow matching loss takes the standard linear-interpolant form
$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\,x_0,\,x_1}\left[\left\lVert v_\theta(x_t, t, c) - (x_1 - x_0)\right\rVert^2\right],$$
where $x_t = (1-t)\,x_0 + t\,x_1$ and the condition $c$ concatenates text and speaker embeddings.
Mel-spectrogram outputs are vocoded to waveform with HiFiGAN, and the total training loss combines the MLLM supervision terms with the flow-matching objective. Training uses AdamW with a batch size of 20k tokens on 8 × A100 GPUs, initializing from CosyVoice3-0.5B and running for ≥20 epochs on CineDub-CN (Liu et al., 21 Jan 2026).
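As a concrete illustration of the flow-matching objective, the following numpy sketch computes the standard conditional flow matching training target for a linear interpolant path. It assumes the common CFM formulation; the paper's actual velocity predictor is the DiT transformer, which is not reproduced here.

```python
import numpy as np

def interpolant(x0, x1, t):
    """Point x_t on the straight-line probability path at time t in [0, 1]."""
    return (1 - t) * x0 + t * x1

def cfm_loss(v_pred, x0, x1, t):
    """Conditional flow matching loss for a linear path:
    x_t = (1 - t) * x0 + t * x1, target velocity = x1 - x0.
    v_pred is the network's velocity prediction at (x_t, t, condition)."""
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))
```

In training, `x1` would be the ground-truth mel features, `x0` Gaussian noise, and the condition (concatenated text and speaker embeddings, including the boundary speaker tokens) is consumed by the velocity network rather than the loss itself.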
4. Experimental Protocols and Performance Metrics
Evaluation spans monologue, narration, dialogue, and multi-speaker scenes. Metrics include:
- Audio Quality: MCD-DTW/MCD-DTW-SL (lower is better), UTMOS (higher is better).
- Pronunciation: CER, WER (Whisper-Large-v3).
- Lip-sync: LSE-D (error, lower is better), LSE-C (confidence, higher is better; via SyncNet).
- Speaker: SPK-TL (timing/leakage, lower is better), SPK-SIM (cosine similarity), EMO-SIM (emotion2vec cosine), ES-MOS (human rating).
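The similarity-based metrics above (SPK-SIM, EMO-SIM) reduce to cosine similarity between embedding vectors; only the embedding model differs (CAM++ speaker embeddings vs. emotion2vec emotion embeddings). A minimal sketch:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors, as used (up to the
    choice of embedding model) for SPK-SIM and EMO-SIM."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```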
Excerpted Quantitative Results
| Scene / Model | MCD-DTW↓ | UTMOS↑ | LSE-C↑ | LSE-D↓ | SPK-TL↓ | SPK-SIM↑ | EMO-SIM↑ | ES-MOS↑ |
|---|---|---|---|---|---|---|---|---|
| FunCineForge (monologue) | 4.52 | 3.98 | 8.72 | 3.82 | 0.088 | 76.50% | 74.50% | 3.80 |
| InstructDubber | 5.05 | 3.82 | 8.08 | 7.93 | 0.156 | 74.53% | 72.86% | 3.83 |
| Speaker2Dubber (V2C+Chem+GRID) | 9.80 | 3.42 | 5.63 | 12.58 | 0.307 | 63.05% | 46.33% | — |
On the excerpted monologue results, FunCineForge achieves lower MCD-DTW than InstructDubber (4.52 vs. 5.05), higher UTMOS (+0.16), less than half the LSE-D (3.82 vs. 7.93), a ∼44% reduction in SPK-TL, and ∼2 points higher SPK-SIM. High alignment and timbre integrity are maintained in dialogue and multi-speaker conditions.
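The relative gains for the monologue comparison can be recomputed directly from the excerpted table values:

```python
# Headline deltas from the excerpted monologue row
# (FunCineForge vs. InstructDubber).
mcd_gain = 5.05 - 4.52            # MCD-DTW improvement (dB); lower MCD is better
utmos_gain = 3.98 - 3.82          # UTMOS improvement; higher is better
lse_d_ratio = 3.82 / 7.93         # LSE-D ratio; < 0.5 means less than half the error
spk_tl_drop = 1 - 0.088 / 0.156   # relative SPK-TL reduction (~44%)
```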
5. Qualitative Analysis
FunCineForge outperforms lip-only approaches when faced with occlusion and shot changes, using full-face cues and timestamp supervision. In multi-speaker, fast-switch contexts, SPK-TL remains ≤0.09, indicating minimal truncation/leakage. Emotion transfer is robust, with high ES-MOS scores (e.g., 4.03 in high-emotion scenes versus ∼3.7 for baselines). Explicit speaker switching yields sharp timbre transitions at dialogue boundaries, minimizing leakage, as visualized in supplementary demos (Liu et al., 21 Jan 2026).
6. Conclusions and Future Directions
FunCineForge delivers the first large-scale, richly annotated Chinese television dubbing dataset and an MLLM-based model with explicit multimodal, temporal, and speaker supervision. It empirically surpasses prior SOTA in audio quality, lip-sync, timbre consistency, and emotional transfer across complex cinematic settings. Identified limitations and future directions include extending dataset coverage to multilingual and cross-cultural resources, investigating joint video-to-speech and autonomous video animation models, and exploring efficient, on-device real-time dubbing architectures (Liu et al., 21 Jan 2026).