CineDub-CN: Chinese TV Dubbing Dataset
- The Chinese Television Dubbing Dataset (CineDub-CN) is a large-scale, richly annotated corpus derived from over 6,000 hours of TV content using a four-stage automated pipeline.
- It provides comprehensive multimodal annotations including timestamped transcripts, speaker diarization, and visual lip-sync data to support advanced research in dubbing and emotion transfer.
- Rigorous quality controls and detailed paralinguistic metadata make CineDub-CN a strong resource for research in automated movie dubbing and zero-shot multi-speaker alignment.
The Chinese Television Dubbing Dataset refers to CineDub-CN, a large-scale, richly annotated corpus constructed through FunCineForge, an end-to-end automated pipeline for generating high-quality multimodal dubbing data from Chinese television series. This corpus is designed to facilitate advanced research in speech synthesis, audio-visual alignment, speaker identity modeling, and emotion transfer in zero-shot movie dubbing across complex, multi-speaker cinematic scenes (Liu et al., 21 Jan 2026).
1. End-to-End Data Production Pipeline
The CineDub-CN dataset is generated via a fully automated four-stage workflow that transforms over 6,000 hours of raw television footage into structured and annotated audio-visual clips. The pipeline stages are as follows:
- Video Standardization and Segmentation: All raw series are re-encoded to MP4 with standardized codecs, removing opening and ending credits. Speech-active regions are identified using a long-sequence–optimized FSMN-Monophone VAD. These intervals are fed into FunASR for automatic speech recognition, and transcripts are segmented at punctuation boundaries to create sentence-level SRT files with timestamps. Clips are trimmed to a maximum of 60 seconds.
- Vocal Separation and Overlap Filtering: Mel-RoFormer separates vocal and instrumental tracks in the audio. An overlapping-speech detector discards clips with concurrent multi-party speech, ensuring downstream diarization only processes single-speaker intervals.
- Audio-Visual Speaker Diarization: CAM++ generates speaker embeddings at 25 Hz, while CosyVoice 3 tokenizes clean speech. On the video side, frames sampled at 5 fps are analyzed: faces are detected (using a lightweight detector), scored for speaking activity by TalkNet-ASD, and processed by a FAN-based keypoint localizer for lip/face crops. CurricularFace computes normalized face embeddings. These modalities are fused via a 3D-Speaker inspired joint clustering to yield RTTM diarization files labeled with speaker, gender, and age bracket.
- Multimodal Chain-of-Thought Correction: Gemini 2.5-Pro, a multimodal LLM, performs chain-of-thought correction on transcripts and diarization tuples, reconciling mis-segmentations, duplicate identities, and extracting paralinguistic attributes such as timbre and emotion. A bidirectional verification step filters out clips with excessive transcript errors (Levenshtein edit distance > 50%) or unsolvable diarization conflicts. Clips are labeled by scene type based on speaker count and visible face data.
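The first stage's punctuation-boundary segmentation into timestamped SRT entries can be sketched as follows. The character-level timing input and helper names are illustrative assumptions, not the pipeline's actual FunASR interfaces:

```python
import re

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segment_to_srt(chars):
    """Split (char, start, end) timing triples at punctuation into SRT blocks.

    `chars` mimics character-level ASR timestamps; the real pipeline consumes
    FunASR output, whose exact format may differ.
    """
    blocks, sent, t0 = [], [], None
    for ch, start, end in chars:
        if t0 is None:
            t0 = start
        sent.append(ch)
        if re.match(r"[。！？，]", ch):  # sentence-level punctuation boundary
            blocks.append((t0, end, "".join(sent)))
            sent, t0 = [], None
    if sent:  # trailing text without closing punctuation
        blocks.append((t0, chars[-1][2], "".join(sent)))
    return "\n".join(
        f"{i}\n{srt_timestamp(a)} --> {srt_timestamp(b)}\n{text}\n"
        for i, (a, b, text) in enumerate(blocks, start=1)
    )
```

In this sketch, each punctuation mark closes one SRT block spanning from the first character after the previous boundary to the punctuation mark itself; the real pipeline additionally enforces the 60-second clip cap.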
2. Corpus Scale and Statistical Breakdown
CineDub-CN comprises 1,559,172 clips, amounting to 7.2 TB of data and more than 4,700 hours of effective speech after filtering. Key statistical properties include:
- Average Clip Length: 11.02 seconds (with variability up to 60 seconds).
- Emotional Speech: 41.8% of clips contain non-neutral emotional speech.
- Instruction Text: Average “clue”/instruction length is 62 words (variance 25).
- Scene-Type Distribution: Approximately 22% monologue, 18% narration, 35% dialogue, and 25% multi-speaker scenes.
Speaker demographic coverage is broad, with clustering identifying five age groups (child, teenager, adult, middle-aged, elderly), and a near-even gender split (55% female, 45% male). The adult and middle-aged groups constitute 45% and 30% of clips, respectively.
| Statistic | Value/Distribution | Note |
|---|---|---|
| Total number of clips | 1,559,172 | After all filtering |
| Effective speech hours | 4,700+ | Post-filtering |
| Average clip length | 11.02 s | Max 60 s |
| Emotional speech share | 41.8 % | Non-neutral |
| Gender split | 55% female / 45% male | Clustered speaker IDs |
| Main age groups | adult 45%, middle-aged 30% | Remaining: child, teenage, elderly |
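The approximate clip counts implied by the stated scene-type shares can be derived directly; this is illustrative arithmetic only, since the source reports percentages rather than per-type counts:

```python
# Approximate clip counts per scene type, derived from the reported shares.
total_clips = 1_559_172
shares = {"monologue": 0.22, "narration": 0.18,
          "dialogue": 0.35, "multi-speaker": 0.25}

for scene, share in shares.items():
    print(f"{scene:>13}: ~{round(total_clips * share):,} clips")
```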
3. Annotation Structure and Schema
CineDub-CN provides multi-level, multimodal annotations for each clip, supporting granular research queries in dubbing and related fields:
- Audio: Clean vocal track (16 kHz WAV), separated instrumental track, and SRT transcripts with word-level timestamps, corrected by LLM-based reasoning.
- Speaker Diarization: RTTM files with tuples (start, end, speakerID, gender, age), frame-aligned at 25 fps.
- Visual Crops: JPEG sequences of cropped face and lip regions, sampled at 5 fps for each relevant frame.
- Clue Instructions (JSON): Structured metadata containing:
- Character profile (gender, age group, timbre descriptors)
- Emotional tone descriptors (e.g., “stern”, “joyful”)
- Long-span and short-span fields indicating higher-level scene or character context
- Scene Category: An integer label denoting whether the segment is a monologue, narration, dialogue, or multi-speaker scene.
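A clue-instruction record following the schema above might look like the following; the field names are hypothetical illustrations, and the released JSON may use different keys:

```python
import json

# Hypothetical clue-instruction record mirroring the schema described above;
# field names are illustrative, not taken from the released dataset.
clue_json = """
{
  "character_profile": {"gender": "female", "age_group": "adult",
                        "timbre": ["bright", "slightly husky"]},
  "emotion": "stern",
  "long_span": "Courtroom confrontation between the two leads.",
  "short_span": "She cuts the witness off mid-sentence."
}
"""

clue = json.loads(clue_json)
print(clue["emotion"])  # stern
```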
4. Objective and Subjective Data Quality Assessment
Quality is systematically evaluated through quantitative and human subjective metrics:
- Pronunciation and Transcript Quality: Character Error Rate (CER) and Word Error Rate (WER) computed using Whisper-large-v3. Corrected transcripts achieve CER ≈ 0.94%, versus uncorrected CER ≈ 4.53%, calculated as:

$$\mathrm{CER} = \frac{S + D + I}{N}$$

where $S$, $D$, $I$ are the numbers of substitutions, deletions, and insertions, and $N$ is the reference character count.
- Lip-Sync Accuracy: Measured by Lip-Sync Error Distance (LSE-D ≈ 5.95) and Confidence (LSE-C ≈ 8.35) via SyncNet.
- Speaker Timing and Consistency: SPK-TL metric (Speaker Truncation/Leakage, with GT=0.000 indicating perfect alignment), SPK-SIM (cosine similarity between speaker embeddings, GT=100%), EMO-SIM (emotion2vec similarity, GT=100%).
- Human Ratings: Emotion/style Mean Opinion Score (ES-MOS) on ground-truth clips is 3.94 out of 5.00.
- Filtering Criteria: Clips are discarded if transcript edit distance to ASR exceeds 50% or diarization errors are irreconcilable. Segments with overlapping speech are excluded at the filtering stage.
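The CER formula above (and the 50% edit-distance filter) reduces to a Levenshtein distance over characters divided by the reference length. A minimal sketch, using literal example strings rather than actual Whisper-large-v3 hypotheses:

```python
def cer(reference: str, hypothesis: str) -> float:
    """CER = (S + D + I) / N, i.e. Levenshtein distance over reference length."""
    n = len(reference)
    # Dynamic programming over edit operations; prev[j] holds the cost of
    # turning reference[:i-1] into hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, start=1):
        curr = [i]
        for j, hc in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (rc != hc)))   # substitution or match
        prev = curr
    return prev[-1] / n if n else 0.0

# One substituted character over a six-character reference.
print(f"{cer('今天天气不错', '今天天汽不错'):.2%}")  # 16.67%
```

The same normalized distance, applied between an ASR transcript and its LLM-corrected counterpart with a 0.5 threshold, implements the discard rule stated above.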
5. Dataset Distribution and Access
The CineDub-CN dataset (4,700+ hours of effective speech) is publicly available for non-commercial research under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. The full 7.2 TB release includes:
- Formats: Video (MP4), audio (WAV, 16 kHz), transcript (SRT, UTF-8), diarization (RTTM), clue instructions (JSON), and metadata (CSV).
- Splitting: Dataset splits are by series, with 4 clips per series held out for testing (one per scene type), 5% of remaining stratified as validation, and the remainder for training.
- Acquisition: Download is enabled via a GitHub repository, with shell and Python tooling for parallel batch fetching and dataset loading.
```shell
git clone https://anonymous.4open.science/w/FunCineForge.git
cd CineDub-CN
bash download_clips.sh  # parallel wget for MP4, WAV, SRT, RTTM, JSON
```
```python
from funcineforge import CineDubCN

ds = CineDubCN(split='train', root='/data/CineDub-CN/')
for clip in ds:
    video = clip.video
    audio = clip.audio
    transcript = clip.transcript
    diar = clip.diarization
    clues = clip.clues
```
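The series-level split rule stated above can be sketched as follows. Here `clips` is a hypothetical list of records with `series` and `scene_type` fields, and the selection logic (plain random validation sampling rather than the stratification the release describes) is illustrative rather than the released tooling:

```python
import random
from collections import defaultdict

def split_dataset(clips, seed=0):
    """Hold out 4 test clips per series (one per scene type), then ~5% validation."""
    rng = random.Random(seed)
    by_series = defaultdict(list)
    for clip in clips:
        by_series[clip["series"]].append(clip)

    test, rest = [], []
    for series_clips in by_series.values():
        rng.shuffle(series_clips)
        seen = set()
        for clip in series_clips:
            # Take the first clip encountered for each of the four scene types.
            if clip["scene_type"] not in seen and len(seen) < 4:
                seen.add(clip["scene_type"])
                test.append(clip)
            else:
                rest.append(clip)

    # NOTE: simple random sampling here; the release uses stratified sampling.
    rng.shuffle(rest)
    n_val = round(0.05 * len(rest))
    return rest[n_val:], rest[:n_val], test  # train, validation, test
```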
6. Significance for Multimodal Dubbing Research
CineDub-CN resolves persistent bottlenecks in multimodal dubbing research by providing large-scale, timestamp-aligned, and deeply annotated Chinese television data, with validated quality in both the audio and visual domains. Its multi-tier annotations—spanning audio, transcript, visual, diarization, and paralinguistic “clue” metadata—support research tasks such as:
- End-to-end dubbing and voice synthesis with accurate lip sync and emotion transfer
- Automated speaker and face identification in cinematic scenes
- Conditioning models on high-level paralinguistic and character metadata
- Benchmarking zero-shot dubbing in diverse multi-speaker, emotionally varied scenarios
Its rigorous filtering, multimodal LLM corrections, and quantitative quality controls make CineDub-CN a strong resource and benchmark for advancing robust, expressive, and controllable automated movie dubbing (Liu et al., 21 Jan 2026).