
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models

Published 7 Jan 2024 in eess.AS, cs.LG, and cs.SD | arXiv:2401.03506v11

Abstract: In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLMs) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented as a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone conversation dataset, and rel. 44.9% on the Callhome English dataset.


Summary

  • The paper introduces DiarizationLM, which leverages finetuned LLMs to post-process the joint outputs of ASR and speaker diarization systems and significantly reduce speaker diarization errors.
  • It employs a modular framework with a Transcript-Preserving Speaker Transfer (TPST) algorithm to transfer speaker labels accurately without retraining existing models.
  • Experiments demonstrate up to a 55.5% relative reduction in word diarization error rate (WDER) on benchmark datasets such as Fisher and Callhome.

DiarizationLM: Speaker Diarization Post-Processing with LLMs

The paper "DiarizationLM: Speaker Diarization Post-Processing with LLMs" introduces DiarizationLM, a framework that leverages LLMs to enhance the outputs of speaker diarization systems. The framework aims to improve the readability of diarized transcripts and to reduce the word diarization error rate (WDER) by applying an LLM as a post-processing step. DiarizationLM can be applied to existing automatic speech recognition (ASR) and speaker diarization systems without retraining either component.

Framework Overview

The DiarizationLM framework comprises several modules that process the outputs of the ASR and speaker diarization systems. These outputs are transformed into a compact textual format, embedded in a prompt, and passed to an optionally finetuned LLM, whose completion yields the refined diarization result. Because the framework operates purely on text, it is modular and can accommodate any off-the-shelf ASR and speaker diarization models.

Figure 1: Diagram of the proposed DiarizationLM framework.

Transcript-Preserving Speaker Transfer

A crucial component of DiarizationLM is the Transcript-Preserving Speaker Transfer (TPST) algorithm, which ensures that speaker labels are accurately transferred from source sequences (model outputs) to target sequences (ground truth). This maintains the integrity of ASR transcripts while ensuring consistency in speaker allocation, even amidst discrepancies in word sequences from both systems.
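The core idea can be sketched as: align the two word sequences, then copy speaker labels across the alignment. Below is a minimal Python sketch that uses `difflib` alignment in place of the paper's Levenshtein-based alignment and omits its Hungarian-style optimal speaker mapping; function and variable names are illustrative, not the paper's implementation:

```python
from difflib import SequenceMatcher

def tpst(src_words, src_spk, tgt_words):
    """Transfer speaker labels from a source word sequence to a target
    word sequence, so the target keeps its own words but inherits the
    source's speaker labels wherever the transcripts align."""
    tgt_spk = [None] * len(tgt_words)
    matcher = SequenceMatcher(None, src_words, tgt_words, autojunk=False)
    # Copy labels across every exactly-matching block of words.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            tgt_spk[block.b + k] = src_spk[block.a + k]
    # Words that did not align inherit the preceding assigned label
    # (falling back to the first assigned label at the start).
    last = next((s for s in tgt_spk if s is not None), 1)
    for i, s in enumerate(tgt_spk):
        if s is None:
            tgt_spk[i] = last
        else:
            last = s
    return tgt_spk
```

Here a word such as "i'm" in the target that has no exact match in the source ("i am") still receives a plausible label from its aligned neighbors, which is the property that keeps the ASR transcript intact while labels are transferred.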

Prompt Construction and LLM Integration

The framework's prompt builder serializes the diarization results into compact textual prompts, and a completion parser converts the LLM's output back into word and speaker sequences. Both steps are designed to preserve the ASR-transcribed words, mitigating errors introduced during the transfer. TPST is then applied to the parsed completion so that speaker labels are assigned consistently.
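As an illustration, such a compact textual format can be produced by grouping consecutive same-speaker words behind a single speaker token. The token spelling (`<speaker:k>`) and the completion suffix below are assumptions for the sketch, not the paper's verbatim choices:

```python
def build_prompt(words, labels, prefix="", suffix=" --> "):
    """Serialize word-level diarization output into a compact textual
    prompt: runs of words by the same speaker are merged behind one
    speaker token, keeping the representation short for the LLM."""
    segments = []
    for word, spk in zip(words, labels):
        if segments and segments[-1][0] == spk:
            segments[-1][1].append(word)   # extend the current run
        else:
            segments.append([spk, [word]])  # start a new speaker run
    body = " ".join(f"<speaker:{s}> " + " ".join(ws) for s, ws in segments)
    return prefix + body + suffix
```

For example, `build_prompt(["good", "morning", "hi", "there"], [1, 1, 2, 2])` yields `"<speaker:1> good morning <speaker:2> hi there --> "`; the parser on the other side would invert this serialization.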

Implementation and Results

The experiments use a finetuned PaLM 2-S model and yield significant reductions in WDER: a relative improvement of 55.5% on the Fisher corpus and 44.9% on the Callhome dataset. These results underscore the efficacy of DiarizationLM in mitigating diarization errors. The approach does, however, require significant computing resources for LLM finetuning, alongside the TPST-based data preparation.
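For reference, WDER measures the fraction of words attributed to the wrong speaker. A simplified sketch over already word-aligned label sequences (the full metric also accounts for ASR substitutions and an optimal speaker-label mapping between reference and hypothesis):

```python
def wder(ref_spk, hyp_spk):
    """Simplified word diarization error rate: the fraction of aligned
    words whose hypothesis speaker label disagrees with the reference.
    Assumes labels are already mapped to a common speaker numbering."""
    assert len(ref_spk) == len(hyp_spk), "sequences must be word-aligned"
    errors = sum(r != h for r, h in zip(ref_spk, hyp_spk))
    return errors / len(ref_spk)
```

Under this definition, a relative reduction of 55.5% means the post-processed WDER is roughly 0.445 times the baseline system's WDER on the same data.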


Discussion and Future Work

DiarizationLM demonstrates LLMs' potential in enhancing diarization through semantic correction, achieving notable error reductions without retraining underlying ASR or speaker diarization models. Future work may explore multilingual adaptations, broader domain evaluations, and integration with alternative diarization approaches (e.g., end-to-end systems or unsupervised clustering).

Conclusion

The DiarizationLM framework innovatively applies LLMs to optimize speaker diarization processes, achieving substantial improvements in error rates without altering foundational ASR or diarization models. This methodological advancement offers promising directions for refining diarization with advanced natural language processing tools, particularly in complex audio transcription tasks.
