Word-level probability that accounts for multiple tokenizations in Whisper

Derive a word-level probability or uncertainty estimate for Whisper that correctly marginalizes over all valid tokenization sequences corresponding to the same spoken word (e.g., different spacings, casing, or subword splits), rather than aggregating token log-probabilities for a single tokenization.

Background

The authors compute word-level uncertainty from Whisper by aggregating token log-probabilities mapped to each word. They observe that the same spoken word can correspond to multiple valid token sequences (e.g., different casing or subword splits), which means probability mass is spread across tokenizations.

They explicitly state that they did not account for this and defer it to future work, indicating a need for a principled approach to marginalize over tokenizations when estimating word-level probabilities or uncertainties.
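The core idea can be sketched without a real model: enumerate the tokenizations that all render to the same word, score each with the chain rule over conditional token log-probabilities, and combine them with log-sum-exp. The toy log-probability table and token strings below are hypothetical stand-ins for Whisper's decoder outputs, not real vocabulary entries or model scores.

```python
import math

# Toy conditional next-token log-probability table standing in for
# Whisper's decoder. Keys are token prefixes; values map the next token
# to its log-probability. Purely illustrative numbers.
TOY_LOGPROBS = {
    (): {" cat": math.log(0.40), " Cat": math.log(0.15),
         "Cat": math.log(0.05), " C": math.log(0.10)},
    (" C",): {"at": math.log(0.80)},
}

def sequence_logprob(tokens):
    """Sum conditional token log-probs for one tokenization (chain rule)."""
    total, prefix = 0.0, ()
    for tok in tokens:
        total += TOY_LOGPROBS[prefix][tok]
        prefix = prefix + (tok,)
    return total

def word_logprob(tokenizations):
    """Marginalize over tokenizations: log sum_i p(seq_i), via log-sum-exp
    for numerical stability."""
    logps = [sequence_logprob(t) for t in tokenizations]
    m = max(logps)
    return m + math.log(sum(math.exp(lp - m) for lp in logps))

# All token sequences that surface-normalize to the spoken word "cat".
variants = [(" cat",), (" Cat",), ("Cat",), (" C", "at")]
p_word = math.exp(word_logprob(variants))
print(f"p(word='cat') = {p_word:.3f}")  # 0.40 + 0.15 + 0.05 + 0.10*0.80 = 0.680
```

In practice the hard part deferred by the authors is enumerating (or beam-approximating) the set of valid tokenizations for each word in Whisper's BPE vocabulary; the marginalization step itself is the log-sum-exp shown here.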

References

"For example, ' cat', ' Cat', 'Cat' and ' C'+'at' are different token sequences in Whisper, and the probability of the spoken word 'cat' is distributed between them. We didn't take this into account, leaving it for a future work."

Pisets: A Robust Speech Recognition System for Lectures and Interviews (2601.18415 - Bondarenko et al., 26 Jan 2026) in Whisper scores (Section: Uncertainty modeling)