BER: Balanced Error Rate For Speaker Diarization

Published 8 Nov 2022 in cs.SD, cs.CL, and eess.AS | (2211.04304v1)

Abstract: DER is the primary metric to evaluate diarization performance while facing a dilemma: the errors in short utterances or segments tend to be overwhelmed by longer ones. Short segments, e.g., "yes" or "no," still have semantic information. Besides, DER overlooks errors in less-talked speakers. Although JER balances speaker errors, it still suffers from the same dilemma. Considering all those aspects, duration error, segment error, and speaker-weighted error constituting a complete diarization evaluation, we propose a Balanced Error Rate (BER) to evaluate speaker diarization. First, we propose a segment-level error rate (SER) via connected sub-graphs and adaptive IoU threshold to get accurate segment matching. Second, to evaluate diarization in a unified way, we adopt a speaker-specific harmonic mean between duration and segment, followed by a speaker-weighted average. Third, we analyze our metric via the modularized system, EEND, and the multi-modal method on real datasets. SER and BER are publicly available at https://github.com/X-LANCE/BER.

Summary

The paper introduces a novel metric, BER, which combines segment-level error rate (SER) with duration and speaker-weighted errors to improve evaluation.
It applies graph-based segment matching with adaptive IoU thresholds to accurately assess diarization performance despite arbitrary segmentation.
Experimental results show that BER exposes errors in short utterances and less-spoken speakers that duration-based metrics under-weight.

The paper presents a new metric, Balanced Error Rate (BER), for evaluating speaker diarization systems. BER targets a limitation shared by Diarization Error Rate (DER) and Jaccard Error Rate (JER): both are computed from durations, so errors in short utterances are under-weighted, and DER additionally under-weights speakers who talk little. BER introduces a segment-level error rate (SER) and combines it with a duration error per speaker, then averages across speakers in a unified framework.

Motivation and Problematic Areas in Conventional Metrics

Speaker diarization, the task of identifying "who spoke when", is usually scored with DER or JER. Both metrics are duration-based, so they give little weight to errors in short utterances, and DER also gives little weight to speakers who talk for only a small share of the recording. The result is a skewed evaluation: long segments dominate the error total and hide failures on short segments, which can still carry important semantic information.
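A small worked example makes the dilemma concrete. The numbers below are invented for illustration and do not come from the paper; the duration-based rate follows the usual DER definition (missed speech, false alarm, and speaker confusion time divided by total reference speech time).

```python
def duration_error_rate(missed, false_alarm, confusion, total_ref):
    """Duration-based error in the style of DER (all arguments in seconds)."""
    return (missed + false_alarm + confusion) / total_ref


# Hypothetical recording: one 60 s monologue plus ten 0.5 s backchannels.
ref_durations = [60.0] + [0.5] * 10

# Hypothetical system output: the monologue is perfect, every short segment is missed.
missed_durations = [0.5] * 10

der = duration_error_rate(sum(missed_durations), 0.0, 0.0, sum(ref_durations))
segment_miss_rate = len(missed_durations) / len(ref_durations)

print(f"duration-based error: {der:.3f}")               # 0.077
print(f"fraction of segments missed: {segment_miss_rate:.3f}")  # 0.909
```

In this toy case the duration-based score looks good (under 8%) even though 10 of the 11 reference segments were lost, which is the gap a segment-level metric is meant to expose.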

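The CDER metric attempted to address some of these limitations by scoring at the segment level. However, it matches segments with a fixed IoU threshold, which biases the result when segment lengths differ: the same absolute boundary deviation lowers the IoU of a short segment far more than that of a long one, so the tolerance is effectively tighter for short segments and looser for long ones.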
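The effect of a fixed threshold can be seen with a few lines of arithmetic. The sketch below is illustrative only: the `FIXED_THRESHOLD` value of 0.5 and the segment times are assumptions chosen for the example, not values taken from CDER or from the paper.

```python
def iou(a, b):
    """Intersection over union of two time intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union


FIXED_THRESHOLD = 0.5  # assumed value, for illustration only

# The same 0.2 s shift applied to a short and a long reference segment.
short_ref, short_hyp = (0.0, 0.5), (0.2, 0.7)
long_ref, long_hyp = (0.0, 10.0), (0.2, 10.2)

short_iou = iou(short_ref, short_hyp)  # about 0.43
long_iou = iou(long_ref, long_hyp)     # about 0.96

print(short_iou >= FIXED_THRESHOLD)  # False: the short segment is rejected
print(long_iou >= FIXED_THRESHOLD)   # True: the long segment is accepted
```

An identical boundary error is therefore judged differently depending only on segment length, which motivates a threshold that adapts to the segment.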

Proposed Metrics: SER and BER

The paper introduces SER as a segment-level metric utilizing graph-based segment matching.

Figure 1: Graph-based segment matching demonstrates handling arbitrary segmentation with adaptive IoU.

Segment Error Rate (SER): Utilizes connected sub-graphs and adaptive IoU thresholds to accurately match segments between the reference and hypothesis. This methodology facilitates the treatment of arbitrary hypothesized segmentation without merging adjacent segments.
Balanced Error Rate (BER): Integrates SER with duration and speaker-weighted errors. The BER calculation involves a harmonic mean between duration and segment errors for each speaker, followed by a weighted average to provide an overarching assessment.
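The segment matching idea can be sketched as follows for a single speaker. This is a simplified illustration, not the authors' implementation: the `adaptive_threshold` rule (tolerating a fixed boundary deviation `tol`), the decision to ignore hypothesis-only components, and all function names are assumptions made for the example. Segments within each list are assumed not to overlap each other.

```python
from collections import defaultdict


def overlap(a, b):
    """Overlap in seconds between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def adaptive_threshold(duration, tol=0.1):
    """Hypothetical rule: the required IoU shrinks as the reference gets shorter."""
    return max(0.0, (duration - 2 * tol) / (duration + 2 * tol))


def connected_components(ref, hyp):
    """Group reference and hypothesis segments that are linked by time overlap."""
    adj = defaultdict(set)
    for i, r in enumerate(ref):
        for j, h in enumerate(hyp):
            if overlap(r, h) > 0:
                adj[("r", i)].add(("h", j))
                adj[("h", j)].add(("r", i))
    nodes = [("r", i) for i in range(len(ref))] + [("h", j) for j in range(len(hyp))]
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        seen.add(node)
        stack, comp = [node], []
        while stack:  # depth-first traversal of one sub-graph
            cur = stack.pop()
            comp.append(cur)
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(comp)
    return components


def segment_error_rate(ref, hyp, tol=0.1):
    """Fraction of reference segments whose sub-graph fails the IoU test."""
    if not ref:
        raise ValueError("reference must contain at least one segment")
    errors = 0
    for comp in connected_components(ref, hyp):
        r = [ref[i] for kind, i in comp if kind == "r"]
        h = [hyp[j] for kind, j in comp if kind == "h"]
        if not r:
            continue  # hypothesis-only component: ignored in this sketch
        inter = sum(overlap(a, b) for a in r for b in h)
        r_dur = sum(end - start for start, end in r)
        h_dur = sum(end - start for start, end in h)
        if inter / (r_dur + h_dur - inter) < adaptive_threshold(r_dur, tol):
            errors += len(r)
    return errors / len(ref)


# Toy data: the long reference segment is split in two by the hypothesis,
# one short segment is badly placed, and one short segment is missed.
ref = [(0.0, 10.0), (12.0, 12.5), (20.0, 20.4)]
hyp = [(0.1, 5.0), (5.0, 9.9), (12.3, 13.5)]
ser = segment_error_rate(ref, hyp)
print(ser)
```

In this toy case the long segment still counts as matched even though the hypothesis splits it into two pieces, because both pieces fall in the same sub-graph; the two short segments are counted as errors, giving a rate of 2/3.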

The evaluation framework is laid out in Algorithm 1, detailing speaker-specific and overall error computations.
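The combination step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: it assumes each speaker already has a duration error and a segment error in [0, 1], gives every speaker equal weight in the final average, and omits any extra terms the paper may apply (for example, for falsely detected speakers). The function names and the error values are hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative values; defined as 0 when both are 0."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def balanced_error(per_speaker):
    """Average of per-speaker harmonic means.

    per_speaker maps a speaker label to (duration_error, segment_error).
    Equal weighting across speakers is an assumption of this sketch.
    """
    if not per_speaker:
        raise ValueError("at least one speaker is required")
    scores = [harmonic_mean(d, s) for d, s in per_speaker.values()]
    return sum(scores) / len(scores)


# Hypothetical per-speaker errors: speaker A has good duration coverage but
# many segment-level errors; speaker B is mediocre on both.
errors = {"A": (0.10, 0.50), "B": (0.40, 0.40)}
ber = balanced_error(errors)
print(f"{ber:.3f}")  # 0.283
```

Because each speaker contributes one score regardless of talk time, a speaker with little speech influences the result as much as a dominant one. Note that a harmonic mean sits closer to the smaller of its two inputs, so the behavior at the extremes depends on details this sketch does not cover.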

Experimental Evaluation

Evaluation Setup

The proposed metrics are evaluated on several publicly available datasets, including AMI, CALLHOME, DIHARD2, VoxConverse, and MSDWild, using different systems such as a modularized system, VBx, and EEND-VC. Overlapped speech is included in scoring and no forgiveness collar is applied around segment boundaries, so boundary errors count in full.

Results and Discussion

Metric Comparisons Across Datasets

The results indicate that BER gives a more complete picture than existing metrics. On the AMI dataset, for instance, JER is low while BER is high; the gap comes from numerous false alarms in short segments, which duration-based measures barely penalize.

Figure 2: Overview of SER and BER across various datasets and systems.

System Comparison

Applying the metrics to different diarization systems reveals contrasts that a single duration-based score would hide. The VBx system improves on a modularized system in both duration and segment evaluation. EEND-VC, by contrast, shows a higher DER but a better SER, which suggests that it discriminates segments well yet struggles with an arbitrary number of speakers. BER is useful precisely because it surfaces such differences.

Conclusion

BER and SER address shortcomings of existing diarization metrics by adding segment-level analysis and by accounting for differences in speaker participation and utterance length. Evaluated across several datasets and systems, BER offers a balanced view of system performance that can inform further development of diarization methods.
