- The paper presents two independent verification subsystems built on pre-trained models: one that validates the spoken phrase and one that validates the speaker.
- The experiments show that fine-tuned Whisper-PMFA substantially outperforms randomly initialized models, though it still trails pre-trained ResNets after domain adaptation, underscoring the importance of large-scale pre-training.
- The study employs diverse datasets and advanced training protocols, achieving a MinDCF of 0.0358 and placing the system at the forefront of the TdSV Challenge.
Overview of Bilingual Text-dependent Speaker Verification with Pre-trained Models for TdSV Challenge 2024
This paper details the development and evaluation of speaker verification systems designed for the Iranian division of the Text-dependent Speaker Verification (TdSV) Challenge 2024. The central objective of TdSV is to determine whether a specific phrase was uttered by a target speaker, which requires combining phrase verification with speaker verification. The authors address this task with two independent subsystems that capitalize on the discriminative power of pre-trained models.
Methodology and System Architecture
The approach comprises separate systems for phrase and speaker verification. For phrase verification, a classifier rejects trials containing an incorrect phrase: the authors took a cross-lingual speech representation model that had been fine-tuned for bilingual (Persian and English) speech recognition and fine-tuned it further to classify the challenge's specific phrases. For speaker verification, two models were used: a pre-trained ResNet293 augmented through domain adaptation, and Whisper-PMFA, a model derived from the Whisper automatic speech recognition (ASR) system. Whisper-PMFA was evaluated specifically to measure its effectiveness against randomly initialized models.
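The two-stage decision implied by this design can be sketched as a simple cascade. The function name, threshold, and score conventions below are illustrative assumptions, not details from the paper:

```python
import math

def score_trial(phrase_probs, claimed_phrase, speaker_score, phrase_threshold=0.5):
    """Hypothetical cascade combining the two independent subsystems.

    phrase_probs: dict mapping phrase IDs to phrase-classifier probabilities.
    claimed_phrase: phrase ID the trial claims was spoken.
    speaker_score: similarity score from the speaker verification subsystem.
    """
    # Stage 1: the phrase classifier rejects trials with the wrong phrase.
    if phrase_probs.get(claimed_phrase, 0.0) < phrase_threshold:
        return -math.inf  # hard rejection, independent of the speaker score
    # Stage 2: otherwise the speaker verification score decides the trial.
    return speaker_score
```

Because the subsystems are independent, each can be trained and swapped out separately, which is the modularity the paper emphasizes.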
Experimental results show that, when fine-tuned for speaker verification, Whisper-PMFA substantially surpasses randomly initialized ResNet models; however, it still lags behind pre-trained ResNets after domain adaptation, underscoring the contribution of large-scale pre-training to model efficacy. The authors report a MinDCF of 0.0358 on the evaluation subset, placing their top-performing model at the forefront of the challenge.
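MinDCF, the metric reported above, is the detection cost function minimized over all decision thresholds. A minimal sketch of the normalized metric follows; the cost parameters (`p_target`, `c_miss`, `c_fa`) are common defaults assumed here, not necessarily the challenge's official settings:

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost, swept over candidate thresholds."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    best = float("inf")
    # every score is a candidate threshold; +inf covers "reject everything"
    for t in np.concatenate([target_scores, nontarget_scores, [np.inf]]):
        p_miss = np.mean(target_scores < t)    # targets wrongly rejected
        p_fa = np.mean(nontarget_scores >= t)  # impostors wrongly accepted
        dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
        best = min(best, dcf)
    # normalize by the cost of the best trivial (always accept/reject) system
    return best / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

A perfectly separating system scores 0, and a system no better than always rejecting scores 1, so the reported 0.0358 indicates near-complete separation of target and impostor trials at the optimal threshold.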
Dataset and Training Protocol
The systems were trained predominantly on the DeepMine dataset, supplemented with data from VoxCeleb 1, Common Voice, and LibriSpeech, in accordance with the challenge constraints. Training strategies included full training on the diverse datasets to build robust embeddings, and domain adaptation to close the gap to in-domain data. AAM-Softmax with sub-centers and the inter-top-K penalty was key to the reported performance. Speaker embeddings were further fine-tuned on in-domain data, yielding improved system accuracy.
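The loss components named above can be illustrated with a minimal logit computation. This is a generic sketch of AAM-Softmax with sub-centers and an inter-top-K penalty; the scale, margins, sub-center count, and K below are illustrative values, not the paper's settings:

```python
import numpy as np

def aam_subcenter_logits(x, W, labels, s=32.0, m=0.2, K=3, topk=5, m_inter=0.06):
    """Sketch of AAM-Softmax logits with sub-centers and an inter-top-K penalty.

    x: (B, D) embeddings; W: (C*K, D) weights, K consecutive rows per class;
    labels: (B,) target class indices. Hyper-parameters are illustrative.
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    B, C = x.shape[0], W.shape[0] // K
    cos = (x @ W.T).reshape(B, C, K).max(axis=2)  # sub-center pooling: best center
    theta = np.arccos(np.clip(cos, -1 + 1e-7, 1 - 1e-7))
    rows = np.arange(B)
    logits = cos.copy()
    # additive angular margin on the target class (makes the target harder)
    logits[rows, labels] = np.cos(theta[rows, labels] + m)
    # inter-top-K: raise the logits of the K hardest non-target classes
    masked = cos.copy()
    masked[rows, labels] = -np.inf
    hard = np.argpartition(-masked, topk, axis=1)[:, :topk]
    logits[rows[:, None], hard] = np.cos(theta[rows[:, None], hard] - m_inter)
    return s * logits
```

Sub-center pooling tolerates intra-class noise (each class keeps K prototypes and the closest one wins), while the inter-top-K term sharpens decision boundaries against the most confusable speakers.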
Implications and Future Directions
The findings underline the potential of leveraging pre-trained ASR models such as Whisper for speaker verification tasks. The clean separation of phrase and speaker verification into independent subsystems points to architectures that balance simplicity with performance. Given the gains from domain adaptation and large-scale pre-training, future work could extend such pre-trained frameworks to broader multilingual settings and more varied acoustic conditions, supporting more general voice-driven security applications.
In conclusion, the research demonstrates an effective integration of pre-trained models for speaker verification without joint modeling of text and speaker identity. This could inform future work on modular yet high-performing designs for speaker verification systems, and the independent use of pre-trained components sets a precedent for efficiently capitalizing on large-scale data when building robust systems.