
BanglaFake: Bengali Deepfake Corpus

Updated 13 January 2026
  • BanglaFake is a comprehensive Bengali deepfake speech corpus featuring over 25,000 utterances, approximately balanced between authentic and synthesized audio.
  • It employs an advanced VITS-based TTS pipeline and standardized preprocessing to generate high-quality, reproducible deepfake speech data.
  • The resource supports robust model evaluation with stratified splits, MFCC feature analysis, and promising deep learning detection benchmarks.

The BanglaFake Speech Corpus is a publicly available, large-scale dataset designed for research on Bengali deepfake audio detection. Addressing the paucity of resources for Bengali, a low-resource language, BanglaFake comprises over 25,000 utterances balanced between genuine and synthesized speech. The corpus is constructed using studio-quality and crowd-sourced recordings for real speech and advanced neural text-to-speech (TTS) models for deepfakes, with standardized preprocessing and rich metadata. As the first resource of its scale for Bengali, it underpins systematic benchmarking and development of detection algorithms, enables rigorous evaluation of cross-lingual approaches, and facilitates domain adaptation for voice-based security systems (Fahad et al., 16 May 2025, Samu et al., 25 Dec 2025).

1. Corpus Composition and Data Sources

BanglaFake contains 25,520 audio utterances, split into 12,260 genuine and 13,260 deepfake samples. Recording durations are standardized at approximately 6–7 seconds per utterance, with a sample rate of 22,050 Hz (later resampled to 16 kHz for model compatibility). Real speech is sourced from:

  • SUST TTS Corpus: 10,000 phonetically balanced Bengali utterances recorded in controlled conditions.
  • Mozilla Common Voice: 3,260 utterances from five Bengali speakers, encompassing more diverse recording conditions.

The dataset covers most Bengali phonemes, vowel length distinctions, conjunct consonants, and typical tonal/prosodic variations. Genuine samples come from seven distinct speakers (gender and regional breakdown not specified), while all deepfake utterances are synthesized with a single male TTS voice, leaving the synthetic portion with markedly less speaker diversity (Fahad et al., 16 May 2025, Samu et al., 25 Dec 2025).

2. Deepfake Generation Pipeline and Speech Synthesis

Synthetic utterances are produced using a VITS-based TTS pipeline trained on the SUST TTS Corpus. The VITS architecture integrates variational inference and adversarial learning:

  • The posterior encoder extracts latent codes from linear-scale STFT spectrograms.
  • The prior encoder with a normalizing flow models the latent distribution, enabling expressive synthesis.
  • A stochastic duration predictor governs phoneme timing using a flow-based model.
  • The HiFi-GAN decoder serves as the vocoder, generating the raw waveform with multi-period discriminators.

The text-to-phoneme conversion is executed via a rule-based grapheme-to-phoneme (G2P) system with lexicon lookup, followed by the generation pipeline: text encoding → latent alignment with Monotonic Alignment Search (MAS) → HiFi-GAN waveform synthesis. Each output is normalized, clipped to 6–7 seconds, and saved in WAV format. Notably, no explicit augmentation techniques (e.g., noise, codec, reverberation) are applied to synthetic samples (Fahad et al., 16 May 2025).
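The front end of this pipeline (rule-based G2P with lexicon lookup, falling back to per-grapheme rules) can be sketched as follows. The lexicon entries, fallback rules, and phoneme symbols below are illustrative placeholders, not the corpus's actual G2P tables, and the neural stages (VITS, MAS alignment, HiFi-GAN) are stubbed out:

```python
# Hypothetical lexicon: word -> phoneme sequence (illustrative only).
LEXICON = {
    "বাংলা": ["b", "a", "ŋ", "l", "a"],
}

# Fallback grapheme-to-phoneme rules (illustrative only).
G2P_RULES = {"ব": "b", "া": "a", "ং": "ŋ", "ল": "l"}

def g2p(word: str) -> list[str]:
    """Lexicon lookup first; fall back to per-grapheme rules."""
    if word in LEXICON:
        return LEXICON[word]
    return [G2P_RULES.get(ch, ch) for ch in word]

def synthesize(text: str) -> list[str]:
    """Pipeline skeleton: G2P -> (VITS latent alignment) -> (vocoder).

    Only the G2P front end is implemented here; a real system would
    pass the phonemes through the trained VITS model and HiFi-GAN
    decoder to produce a waveform."""
    return [p for word in text.split() for p in g2p(word)]
```

The lexicon-first design lets irregular pronunciations override the general rules, which is the usual motivation for combining lookup with rule-based G2P.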

3. Preprocessing, Splitting, and Corpus Organization

During downstream experimentation, all samples are resampled to 16 kHz. Utterance durations are normalized—either truncated or zero-padded—to 5.0 seconds for compatibility with detection models. File formats adhere to the LJ Speech convention, with CSV metadata mapping filenames to transcripts and labels ("real" or "fake"). The recommended directory structure supports stratified train/validation/test splits (typically 70/15/15 by utterance), balancing the real-vs-fake ratio and maintaining speaker independence between splits wherever possible. Class distribution is near-uniform across splits, with all metadata and audio files organized for reproducibility (Samu et al., 25 Dec 2025).
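The duration-normalization step (truncate or zero-pad each 16 kHz waveform to exactly 5.0 seconds) can be written as a small utility. This is a minimal sketch, not the authors' released preprocessing code:

```python
import numpy as np

SAMPLE_RATE = 16_000      # target rate after resampling
TARGET_SECONDS = 5.0      # fixed clip length used by the detection models

def normalize_duration(wav: np.ndarray, sr: int = SAMPLE_RATE,
                       seconds: float = TARGET_SECONDS) -> np.ndarray:
    """Truncate or zero-pad a mono waveform to exactly `seconds` long."""
    target = int(sr * seconds)
    if len(wav) >= target:
        return wav[:target]            # truncate long clips
    return np.pad(wav, (0, target - len(wav)))  # zero-pad short clips
```

Fixing the clip length this way gives every model input the same shape (80,000 samples at 16 kHz), at the cost of discarding audio beyond 5 s.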

| Split      | # Real Samples | # Fake Samples | Typical Use              |
|------------|----------------|----------------|--------------------------|
| Training   | ~8,582         | ~9,282         | Model fitting            |
| Validation | ~1,839         | ~1,989         | Hyperparameter selection |
| Test       | ~1,839         | ~1,989         | Final performance eval   |
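A stratified 70/15/15 split like the one above can be produced with standard-library Python alone. This sketch preserves the real/fake ratio within each split but omits the speaker-independence constraint, which requires grouping by speaker ID first:

```python
import random

def stratified_split(items, labels, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split items into (train, val, test), preserving the label ratio
    within each split. Speaker grouping is not handled in this sketch."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)
    splits = ([], [], [])
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(ratios[0] * len(group))
        n_val = int(ratios[1] * len(group))
        splits[0].extend(group[:n_train])
        splits[1].extend(group[n_train:n_train + n_val])
        splits[2].extend(group[n_train + n_val:])
    return splits
```

Shuffling and slicing per class (rather than globally) is what keeps the real-vs-fake ratio near-uniform across all three splits.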

4. Feature Extraction and Analysis

Acoustic analysis relies heavily on Mel-Frequency Cepstral Coefficients (MFCCs) and spectrogram-based representations:

  • MFCCs are computed with 25 ms frames and 10 ms hop, using 40 mel-filter banks. The MFCC formula is:

$$\mathrm{MFCC}_m = \sum_{n=1}^{N} \log\lvert X_n\rvert^2 \,\cos\!\left[\frac{\pi m}{N}\,(n + 0.5)\right]$$

where $X_n$ is the STFT coefficient at mel-bin $n$ and $N$ is the total number of filters.

  • For visualization and separability analysis, 13-dimensional MFCC vectors (framewise averages) are used as t-SNE input. Standard t-SNE settings (perplexity = 30, learning rate = 200, 1,000 iterations) apply; the cost function $C$ combines pairwise distances in high- and low-dimensional space per standard t-SNE methodology.
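The DCT in the MFCC formula above maps log filterbank energies to cepstral coefficients. A direct NumPy transcription for a single frame's energies, following the document's indexing (sum over $n = 1 \dots N$), with illustrative defaults:

```python
import numpy as np

def mfcc_from_filterbank(log_energies: np.ndarray,
                         n_coeffs: int = 13) -> np.ndarray:
    """Apply the cosine transform above to a frame's log mel-filterbank
    energies (log|X_n|^2 for n = 1..N), returning `n_coeffs` MFCCs."""
    N = len(log_energies)
    n = np.arange(1, N + 1)  # filter indices n = 1..N, as in the formula
    return np.array([
        np.sum(log_energies * np.cos(np.pi * m / N * (n + 0.5)))
        for m in range(n_coeffs)
    ])
```

Note that for $m = 0$ the cosine term is 1 everywhere, so the first coefficient reduces to the total log energy of the frame; a full pipeline (framing, windowing, mel filtering) is left to a library such as librosa.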

Key findings include substantial overlap between real and fake MFCC clusters, with synthetic high-density regions aligning with authentic speech, indicating that distinguishing real from fake is non-trivial and requires discriminative machine learning models (Fahad et al., 16 May 2025).

5. Subjective Quality Assessment

Quality evaluation uses a five-point Mean Opinion Score (MOS) protocol, rated by 30 native Bangla speakers (18 M, 12 F, aged 20–25). For each of 10 unseen sentences (presented in both real and synthetic forms), raters score (a) Naturalness ("Does it sound human-like?") and (b) Intelligibility ("Is the content clearly understood?"). Robust-MOS is computed by discarding each utterance's highest and lowest score before averaging.

  • Naturalness Robust-MOS: 3.40
  • Intelligibility Robust-MOS: 4.01

These scores indicate that synthetic speech is rated as moderately natural and highly intelligible, simulating realistic adversarial detection scenarios. No other subjective metrics are reported; analogous studies yield 95% confidence intervals for MOS of ±0.1–0.2 (Fahad et al., 16 May 2025).
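The Robust-MOS trimming rule described above (discard each utterance's single highest and lowest rating, then average) is straightforward to implement; a minimal sketch:

```python
def robust_mos(scores: list[float]) -> float:
    """Robust-MOS: drop the single highest and lowest rating for an
    utterance, then average the remaining scores."""
    if len(scores) <= 2:
        raise ValueError("need more than two ratings to trim")
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)
```

Trimming the extremes makes the score robust to a single outlier rater in either direction, which is the motivation for using it over a plain mean.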

6. Benchmarks, Evaluation Protocols, and Detection Results

Detection model evaluation on BanglaFake incorporates standard metrics:

  • Accuracy: $(TP + TN)/(TP + TN + FP + FN)$
  • Precision, recall, F1-score, Equal Error Rate (EER), and AUC
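Of these metrics, EER is the least standard to compute by hand: it is the operating point where the false-acceptance and false-rejection rates cross. A minimal threshold-sweep sketch, assuming (as an illustrative convention) that higher scores indicate "fake" and labels are 1 for fake, 0 for real:

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER via a simple threshold sweep (not an optimized implementation).

    Returns the FAR/FRR midpoint at the threshold where the two rates
    are closest to equal."""
    eer, best_gap = 1.0, np.inf
    for t in np.sort(np.unique(scores)):
        pred_fake = scores >= t
        far = np.mean(pred_fake[labels == 0])   # real accepted as fake
        frr = np.mean(~pred_fake[labels == 1])  # fake missed as real
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)
```

Production evaluations typically interpolate the ROC curve (e.g., via scikit-learn's `roc_curve`) rather than sweeping raw score thresholds, but the crossing-point logic is the same.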

Zero-shot detection (cross-lingual transfer) using pretrained models—Wav2Vec2-XLSR-53, Whisper, PANNsCNN14, WavLM, and Audio Spectrogram Transformer—yields limited performance (e.g., Wav2Vec2-XLSR-53: 53.80% accuracy, 56.60% AUC, 46.20% EER). In contrast, fine-tuned deep learning models significantly outperform zero-shot baselines (e.g., ResNet18: 79.17% accuracy, 84.37% AUC, 24.35% EER), demonstrating the corpus's utility for advancing Bengali-specific detection (Samu et al., 25 Dec 2025).

Suggested approaches for benchmarking include MFCC+GMM/SVM pipelines, CNNs on spectrogram inputs, and self-supervised models (wav2vec 2.0), together with cross-lingual detectors trained on ASVspoof. Best practices dictate stratified speaker-independent splits, balanced class ratios, and robustness enhancements via augmentation (noise, codec, reverberation) (Fahad et al., 16 May 2025).

7. Limitations, Accessibility, and Research Directions

The corpus's synthetic speech is limited to a single male TTS voice, reducing speaker diversity and potentially constraining generalization. Fixed utterance lengths and the exclusive use of neural TTS for deepfakes preclude evaluation on adversarial or voice conversion-based forgeries. Researchers are encouraged to extend the resource with multi-speaker TTS, female and regional voices, adversarial samples, and data captured under varied acoustic environments.

BanglaFake is distributed openly via HuggingFace, with metadata and code on GitHub, facilitating community-driven improvement and enabling reproducible benchmarks (Fahad et al., 16 May 2025, Samu et al., 25 Dec 2025).


BanglaFake serves as the foundation for Bengali deepfake audio detection research, providing standardized resources, challenging detection scenarios due to high-quality synthesis, and supporting methodological innovation in both feature engineering and deep learning. Continued expansion to address speaker diversity and adversarial robustness represents a critical trajectory for future work in low-resource speech forensics.
