CineDub-CN: Chinese TV Dubbing Dataset
- The Chinese Television Dubbing Dataset (CineDub-CN) is a large-scale, richly annotated corpus derived from over 6,000 hours of TV content using a four-stage automated pipeline.
- It provides comprehensive multimodal annotations including timestamped transcripts, speaker diarization, and visual lip-sync data to support advanced research in dubbing and emotion transfer.
- Rigorous quality controls and detailed paralinguistic metadata make CineDub-CN a strong resource for research in automated movie dubbing and zero-shot multi-speaker alignment.
The Chinese Television Dubbing Dataset refers to CineDub-CN, a large-scale, richly annotated corpus constructed through FunCineForge, an end-to-end automated pipeline for generating high-quality multimodal dubbing data from Chinese television series. This corpus is designed to facilitate advanced research in speech synthesis, audio-visual alignment, speaker identity modeling, and emotion transfer in zero-shot movie dubbing across complex, multi-speaker cinematic scenes (Liu et al., 21 Jan 2026).
1. End-to-End Data Production Pipeline
The CineDub-CN dataset is generated via a fully automated four-stage workflow that transforms over 6,000 hours of raw television footage into structured and annotated audio-visual clips. The pipeline stages are as follows:
- Video Standardization and Segmentation: All raw series are re-encoded to MP4 with standardized codecs, removing opening and ending credits. Speech-active regions are identified using a long-sequence–optimized FSMN-Monophone VAD. These intervals are fed into FunASR for automatic speech recognition, and transcripts are segmented at punctuation boundaries to create sentence-level SRT files with timestamps. Clips are trimmed to a maximum of 60 seconds.
- Vocal Separation and Overlap Filtering: Mel-RoFormer separates vocal and instrumental tracks in the audio. An overlapping-speech detector discards clips with concurrent multi-party speech, ensuring downstream diarization only processes single-speaker intervals.
- Audio-Visual Speaker Diarization: CAM++ generates speaker embeddings at 25 Hz, while CosyVoice 3 tokenizes clean speech. On the video side, frames sampled at 5 fps are analyzed: faces are detected (using a lightweight detector), scored for speaking activity by TalkNet-ASD, and processed by a FAN-based keypoint localizer for lip/face crops. CurricularFace computes normalized face embeddings. These modalities are fused via a 3D-Speaker inspired joint clustering to yield RTTM diarization files labeled with speaker, gender, and age bracket.
- Multimodal Chain-of-Thought Correction: Gemini 2.5-Pro, a multimodal LLM, performs chain-of-thought correction on transcripts and diarization tuples, reconciling mis-segmentations, duplicate identities, and extracting paralinguistic attributes such as timbre and emotion. A bidirectional verification step filters out clips with excessive transcript errors (Levenshtein edit distance > 50%) or unsolvable diarization conflicts. Clips are labeled by scene type based on speaker count and visible face data.
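The first stage's punctuation-boundary segmentation into timestamped SRT entries can be sketched as follows. The character-level timing input and helper names are illustrative assumptions, not the pipeline's actual FunASR interfaces:

```python
import re

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segment_to_srt(chars):
    """Split (char, start, end) timing triples at punctuation into SRT blocks.

    `chars` mimics character-level ASR timestamps; the real pipeline consumes
    FunASR output, whose exact format may differ.
    """
    blocks, sent, t0 = [], [], None
    for ch, start, end in chars:
        if t0 is None:
            t0 = start
        sent.append(ch)
        if re.match(r"[。！？，]", ch):  # sentence-level punctuation boundary
            blocks.append((t0, end, "".join(sent)))
            sent, t0 = [], None
    if sent:  # trailing text without closing punctuation
        blocks.append((t0, chars[-1][2], "".join(sent)))
    return "\n".join(
        f"{i}\n{srt_timestamp(a)} --> {srt_timestamp(b)}\n{text}\n"
        for i, (a, b, text) in enumerate(blocks, start=1)
    )
```

In this sketch, each punctuation mark closes one SRT block spanning from the first character after the previous boundary to the punctuation mark itself; the real pipeline additionally enforces the 60-second clip cap.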
2. Corpus Scale and Statistical Breakdown
CineDub-CN comprises 1,559,172 clips, amounting to 7.2 TB of data and more than 4,700 hours of effective speech after filtering. Key statistical properties include:
- Average Clip Length: 11.02 seconds (with variability up to 60 seconds).
- Emotional Speech: 41.8% of clips contain non-neutral emotional speech.
- Instruction Text: Average “clue”/instruction length is 62 words (variance 25).
- Scene-Type Distribution: Approximately 22% monologue, 18% narration, 35% dialogue, and 25% multi-speaker scenes.
Speaker demographic coverage is broad, with clustering identifying five age groups (child, teenager, adult, middle-aged, elderly), and a near-even gender split (55% female, 45% male). The adult and middle-aged groups constitute 45% and 30% of clips, respectively.
| Statistic | Value/Distribution | Note |
|---|---|---|
| Total number of clips | 1,559,172 | After all filtering |
| Effective speech hours | 4,700+ | Post-filtering |
| Average clip length | 11.02 s | Max 60 s |
| Emotional speech share | 41.8 % | Non-neutral |
| Gender split | 55% female / 45% male | Clustered speaker IDs |
| Main age groups | adult 45%, middle-aged 30% | Remaining: child, teenage, elderly |
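The approximate clip counts implied by the stated scene-type shares can be derived directly; this is illustrative arithmetic only, since the source reports percentages rather than per-type counts:

```python
# Approximate clip counts per scene type, derived from the reported shares.
total_clips = 1_559_172
shares = {"monologue": 0.22, "narration": 0.18,
          "dialogue": 0.35, "multi-speaker": 0.25}

for scene, share in shares.items():
    print(f"{scene:>13}: ~{round(total_clips * share):,} clips")
```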
3. Annotation Structure and Schema
CineDub-CN provides multi-level, multimodal annotations for each clip, supporting granular research queries in dubbing and related fields:
- Audio: Clean vocal track (16 kHz WAV), separated instrumental track, and SRT transcripts with word-level timestamps, corrected by LLM-based reasoning.
- Speaker Diarization: RTTM files with tuples (start, end, speakerID, gender, age), frame-aligned at 25 fps.
- Visual Crops: JPEG sequences of cropped face and lip regions, sampled at 5 fps for each relevant frame.
- Clue Instructions (JSON): Structured metadata containing:
- Character profile (gender, age group, timbre descriptors)
- Emotional tone descriptors (e.g., “stern”, “joyful”)
- Long-span and short-span fields indicating higher-level scene or character context
- Scene Category: An integer label denoting whether the segment is a monologue, narration, dialogue, or multi-speaker scene.
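A clue-instruction record following the schema above might look like the following; the field names are hypothetical illustrations, and the released JSON may use different keys:

```python
import json

# Hypothetical clue-instruction record mirroring the schema described above;
# field names are illustrative, not taken from the released dataset.
clue_json = """
{
  "character_profile": {"gender": "female", "age_group": "adult",
                        "timbre": ["bright", "slightly husky"]},
  "emotion": "stern",
  "long_span": "Courtroom confrontation between the two leads.",
  "short_span": "She cuts the witness off mid-sentence."
}
"""

clue = json.loads(clue_json)
print(clue["emotion"])  # stern
```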
4. Objective and Subjective Data Quality Assessment
Quality is systematically evaluated through quantitative and human subjective metrics:
- Pronunciation and Transcript Quality: Character Error Rate (CER) and Word Error Rate (WER) computed using Whisper-large-v3. Corrected transcripts achieve CER ≈ 0.94%, versus uncorrected CER ≈ 4.53%, calculated as:

$$\mathrm{CER} = \frac{S + D + I}{N}$$

where $S$, $D$, $I$ are the numbers of substitutions, deletions, and insertions, and $N$ is the reference character count.
- Lip-Sync Accuracy: Measured by Lip-Sync Error Distance (LSE-D ≈ 5.95) and Confidence (LSE-C ≈ 8.35) via SyncNet.
- Speaker Timing and Consistency: SPK-TL metric (Speaker Truncation/Leakage, with GT=0.000 indicating perfect alignment), SPK-SIM (cosine similarity between speaker embeddings, GT=100%), EMO-SIM (emotion2vec similarity, GT=100%).
- Human Ratings: Emotion/style Mean Opinion Score (ES-MOS) on ground-truth clips is 3.94 out of 5.00.
- Filtering Criteria: Clips are discarded if transcript edit distance to ASR exceeds 50% or diarization errors are irreconcilable. Segments with overlapping speech are excluded at the filtering stage.
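The CER formula above (and the 50% edit-distance filter) reduces to a Levenshtein distance over characters divided by the reference length. A minimal sketch, using literal example strings rather than actual Whisper-large-v3 hypotheses:

```python
def cer(reference: str, hypothesis: str) -> float:
    """CER = (S + D + I) / N, i.e. Levenshtein distance over reference length."""
    n = len(reference)
    # Dynamic programming over edit operations; prev[j] holds the cost of
    # turning reference[:i-1] into hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, start=1):
        curr = [i]
        for j, hc in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (rc != hc)))   # substitution or match
        prev = curr
    return prev[-1] / n if n else 0.0

# One substituted character over a six-character reference.
print(f"{cer('今天天气不错', '今天天汽不错'):.2%}")  # 16.67%
```

The same normalized distance, applied between an ASR transcript and its LLM-corrected counterpart with a 0.5 threshold, implements the discard rule stated above.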
5. Dataset Distribution and Access
The CineDub-CN dataset (4,700+ hours of effective speech) is publicly available for non-commercial research under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. The full 7.2 TB release includes:
- Formats: Video (MP4), audio (WAV, 16 kHz), transcript (SRT, UTF-8), diarization (RTTM), clue instructions (JSON), and metadata (CSV).
- Splitting: Dataset splits are by series, with 4 clips per series held out for testing (one per scene type), 5% of remaining stratified as validation, and the remainder for training.
- Acquisition: Download is enabled via a GitHub repository, with shell and Python tooling for parallel batch fetching and dataset loading.
```shell
git clone https://anonymous.4open.science/w/FunCineForge.git
cd CineDub-CN
bash download_clips.sh  # parallel wget for MP4, WAV, SRT, RTTM, JSON
```
```python
from funcineforge import CineDubCN

ds = CineDubCN(split='train', root='/data/CineDub-CN/')
for clip in ds:
    video = clip.video
    audio = clip.audio
    transcript = clip.transcript
    diar = clip.diarization
    clues = clip.clues
```
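The series-level split rule stated above can be sketched as follows. Here `clips` is a hypothetical list of records with `series` and `scene_type` fields, and the selection logic (plain random validation sampling rather than the stratification the release describes) is illustrative rather than the released tooling:

```python
import random
from collections import defaultdict

def split_dataset(clips, seed=0):
    """Hold out 4 test clips per series (one per scene type), then ~5% validation."""
    rng = random.Random(seed)
    by_series = defaultdict(list)
    for clip in clips:
        by_series[clip["series"]].append(clip)

    test, rest = [], []
    for series_clips in by_series.values():
        rng.shuffle(series_clips)
        seen = set()
        for clip in series_clips:
            # Take the first clip encountered for each of the four scene types.
            if clip["scene_type"] not in seen and len(seen) < 4:
                seen.add(clip["scene_type"])
                test.append(clip)
            else:
                rest.append(clip)

    # NOTE: simple random sampling here; the release uses stratified sampling.
    rng.shuffle(rest)
    n_val = round(0.05 * len(rest))
    return rest[n_val:], rest[:n_val], test  # train, validation, test
```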
6. Significance for Multimodal Dubbing Research
CineDub-CN resolves persistent bottlenecks in multimodal dubbing research by providing large-scale, timestamp-aligned, and deeply annotated Chinese television data, with validated quality in both the audio and visual domains. Its multi-tier annotations—spanning audio, transcript, visual, diarization, and paralinguistic “clue” metadata—support research tasks such as:
- End-to-end dubbing and voice synthesis with accurate lip sync and emotion transfer
- Automated speaker and face identification in cinematic scenes
- Conditioning models on high-level paralinguistic and character metadata
- Benchmarking zero-shot dubbing in diverse multi-speaker, emotionally varied scenarios
Its rigorous filtering, multimodal LLM corrections, and quantitative quality controls make CineDub-CN a strong resource and benchmark for advancing robust, expressive, and controllable automated movie dubbing (Liu et al., 21 Jan 2026).