
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models

Published 7 Jan 2024 in eess.AS, cs.LG, and cs.SD | arXiv:2401.03506v11

Abstract: In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLMs) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented as a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone conversation dataset, and rel. 44.9% on the Callhome English dataset.


Summary

  • The paper introduces DiarizationLM, which leverages finetuned LLMs to post-process the joint outputs of ASR and speaker diarization systems and significantly reduce speaker diarization errors.
  • It employs a modular framework with a Transcript-Preserving Speaker Transfer (TPST) algorithm to transfer speaker labels accurately without retraining existing models.
  • Experiments demonstrate up to a 55.5% relative reduction in word diarization error rate (WDER) on benchmark datasets such as Fisher and Callhome.

DiarizationLM: Speaker Diarization Post-Processing with LLMs

The paper "DiarizationLM: Speaker Diarization Post-Processing with LLMs" introduces DiarizationLM, a framework that leverages LLMs to enhance the outputs of speaker diarization systems. The framework aims to improve the readability of diarized transcripts and to reduce the word diarization error rate (WDER) by applying an LLM as a post-processing step. DiarizationLM can be applied to existing automatic speech recognition (ASR) and speaker diarization systems without retraining either component.

Framework Overview

The DiarizationLM framework comprises several modules that process the outputs of the ASR and speaker diarization systems. These outputs are transformed into a compact textual format, embedded in a prompt, and passed to an optionally finetuned LLM, whose completion yields the refined diarization result. Because the framework operates purely on text, it is modular and can accommodate any off-the-shelf ASR and speaker diarization models.

Figure 1: Diagram of the proposed DiarizationLM framework.

Transcript-Preserving Speaker Transfer

A crucial component of DiarizationLM is the Transcript-Preserving Speaker Transfer (TPST) algorithm, which ensures that speaker labels are accurately transferred from source sequences (model outputs) to target sequences (ground truth). This maintains the integrity of ASR transcripts while ensuring consistency in speaker allocation, even amidst discrepancies in word sequences from both systems.
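The core idea can be sketched as: align the two word sequences, then copy speaker labels across the alignment. Below is a minimal Python sketch that uses `difflib` alignment in place of the paper's Levenshtein-based alignment and omits its Hungarian-style optimal speaker mapping; function and variable names are illustrative, not the paper's implementation:

```python
from difflib import SequenceMatcher

def tpst(src_words, src_spk, tgt_words):
    """Transfer speaker labels from a source word sequence to a target
    word sequence, so the target keeps its own words but inherits the
    source's speaker labels wherever the transcripts align."""
    tgt_spk = [None] * len(tgt_words)
    matcher = SequenceMatcher(None, src_words, tgt_words, autojunk=False)
    # Copy labels across every exactly-matching block of words.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            tgt_spk[block.b + k] = src_spk[block.a + k]
    # Words that did not align inherit the preceding assigned label
    # (falling back to the first assigned label at the start).
    last = next((s for s in tgt_spk if s is not None), 1)
    for i, s in enumerate(tgt_spk):
        if s is None:
            tgt_spk[i] = last
        else:
            last = s
    return tgt_spk
```

Here a word such as "i'm" in the target that has no exact match in the source ("i am") still receives a plausible label from its aligned neighbors, which is the property that keeps the ASR transcript intact while labels are transferred.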

Prompt Construction and LLM Integration

The framework's prompt builder serializes the diarization results into compact textual prompts, and a completion parser converts the LLM's output back into word and speaker sequences. Both steps are designed to preserve the ASR-transcribed words, mitigating errors introduced during the transfer. TPST is then applied to the parsed completion so that speaker labels are assigned consistently.
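As an illustration, such a compact textual format can be produced by grouping consecutive same-speaker words behind a single speaker token. The token spelling (`<speaker:k>`) and the completion suffix below are assumptions for the sketch, not the paper's verbatim choices:

```python
def build_prompt(words, labels, prefix="", suffix=" --> "):
    """Serialize word-level diarization output into a compact textual
    prompt: runs of words by the same speaker are merged behind one
    speaker token, keeping the representation short for the LLM."""
    segments = []
    for word, spk in zip(words, labels):
        if segments and segments[-1][0] == spk:
            segments[-1][1].append(word)   # extend the current run
        else:
            segments.append([spk, [word]])  # start a new speaker run
    body = " ".join(f"<speaker:{s}> " + " ".join(ws) for s, ws in segments)
    return prefix + body + suffix
```

For example, `build_prompt(["good", "morning", "hi", "there"], [1, 1, 2, 2])` yields `"<speaker:1> good morning <speaker:2> hi there --> "`; the parser on the other side would invert this serialization.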

Implementation and Results

The experiments use a finetuned PaLM 2-S model and yield significant reductions in WDER: a relative improvement of 55.5% on the Fisher corpus and 44.9% on the Callhome dataset. These results underscore the efficacy of DiarizationLM in mitigating diarization errors. The approach does, however, require significant computing resources for LLM finetuning, alongside the TPST-based data preparation.
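For reference, WDER measures the fraction of words attributed to the wrong speaker. A simplified sketch over already word-aligned label sequences (the full metric also accounts for ASR substitutions and an optimal speaker-label mapping between reference and hypothesis):

```python
def wder(ref_spk, hyp_spk):
    """Simplified word diarization error rate: the fraction of aligned
    words whose hypothesis speaker label disagrees with the reference.
    Assumes labels are already mapped to a common speaker numbering."""
    assert len(ref_spk) == len(hyp_spk), "sequences must be word-aligned"
    errors = sum(r != h for r, h in zip(ref_spk, hyp_spk))
    return errors / len(ref_spk)
```

Under this definition, a relative reduction of 55.5% means the post-processed WDER is roughly 0.445 times the baseline system's WDER on the same data.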


Discussion and Future Work

DiarizationLM demonstrates LLMs' potential in enhancing diarization through semantic correction, achieving notable error reductions without retraining underlying ASR or speaker diarization models. Future work may explore multilingual adaptations, broader domain evaluations, and integration with alternative diarization approaches (e.g., end-to-end systems or unsupervised clustering).

Conclusion

The DiarizationLM framework innovatively applies LLMs to optimize speaker diarization processes, achieving substantial improvements in error rates without altering foundational ASR or diarization models. This methodological advancement offers promising directions for refining diarization with advanced natural language processing tools, particularly in complex audio transcription tasks.
