SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

Published 8 Aug 2025 in cs.SD and cs.AI | (2508.06372v1)

Abstract: The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal LLM for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM demonstrates strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism effectively ensures robust SDR performance of SpeakerLM across diverse speaker registration conditions and varying numbers of registered speakers.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a unified SDR framework that integrates audio encoding with LLM processing to overcome error propagation in cascaded systems.
It employs a four-stage training strategy, including ASR and joint fine-tuning with LoRA, to enhance performance across diverse SDR scenarios.
Experimental results demonstrate superior accuracy on metrics like CER and saCER, validating the model in both in-domain and out-of-domain tests.

Overview

The paper introduces SpeakerLM, an end-to-end model that combines LLMs with audio processing to perform Speaker Diarization and Recognition (SDR). The approach integrates Speaker Diarization (SD) and Automatic Speech Recognition (ASR) into a unified framework using a multimodal LLM (MLLM). This model overcomes the limitations inherent in traditional cascaded systems by enabling simultaneous handling of both audio encoding and speaker attribution.

Model Architecture

SpeakerLM utilizes a sophisticated architectural setup to achieve joint SDR. It consists of an audio encoder from a pre-trained SenseVoice-large model, which is followed by a Transformer-based projector to align audio embeddings with the text embedding space. The audio features are then processed by a linguistic model, Qwen2.5-7B-Instruct, a capable LLM. Speaker registration is handled flexibly by using pre-trained speaker embeddings, which are processed through a linear layer for dimensional compatibility.

Figure 1: Overall architecture of the proposed SpeakerLM. When speaker registration is not performed, each speaker in the transcript is identified by an anonymous ID. With speaker registration enabled, speakers are labeled by their actual names.

The system architecture supports three speaker registration modes:

No-Regist: where no prior speaker information is registered.
Match-Regist: where every speaker appearing in the audio is pre-registered.
Over-Regist: where more speakers are registered than appear in the audio.
Figure 2: Overview of the proposed three different speaker registration mechanisms: (a) No-Regist (no speaker registration) (b) Match-Regist (exact actual speakers in the audio are pre-registered) (c) Over-Regist (more speakers are registered than actual speakers in the audio).

Training Strategy

SpeakerLM employs a four-stage training strategy to maximize performance across different SDR scenarios:

ASR Training Stage: Using ASR data to optimize the model for accurate speech recognition, using a LoRA technique to preserve LLM's linguistic capabilities during adaptation.
Simulated SDR Training: Using synthetic SDR data to align audio-text modalities quickly and effectively.
Real Data Training: Fine-tuning with real-world SDR data to capture practical acoustic characteristics.
Joint Fine-tuning: Comprehensive fine-tuning across all components, leveraging LoRA to enhance joint optimization.

Experimental Results

The model's performance was tested on in-domain and out-of-domain datasets, illustrating its capacity to outperform state-of-the-art SDR systems. Specifically, SpeakerLM demonstrated remarkable improvements in metrics such as Character Error Rate (CER), concatenated minimum permutation CER (cpCER), and speaker-attributed CER (saCER).

Figure 3: The performance of SpeakerLM on test sets under No-Regist condition across different training stages.

Figure 4: The saCER of SpeakerLM under Over-Regist condition with different numbers of over-registered speakers.

SpeakerLM showed robustness and scalability, with superior results in both scenarios with and without speaker registration. It particularly excels by minimizing error propagation issues that affect traditional cascaded SDR frameworks.

Implications and Future Work

SpeakerLM's application demonstrates the potential of MLLMs in SDR, enabling systems that handle complex multi-speaker environments efficiently. This model's architecture shows promise for future enhancements, such as refining speaker attribution with more advanced registration mechanisms or integrating additional modalities like visual data for richer context recognition.

The system's ability to scale with data suggests potential for expanding its application to various languages and environments, potentially developing more generalized models for global adoption in automatic transcription systems.

Conclusion

SpeakerLM presents a significant advancement in the domain of SDR systems by combining multimodal processing with language modeling. By using an end-to-end approach, it mitigates issues arising from error propagation found in cascaded systems and provides a flexible framework adaptable to diverse speaker settings. The paper sets a benchmark for further exploration into multimodal modeling, highlighting the increasing role of LLMs in audio processing tasks.

Markdown Report Issue