Word-level probability that accounts for multiple tokenizations in Whisper
Derive a word-level probability or uncertainty estimate for Whisper that correctly marginalizes over all valid tokenization sequences corresponding to the same spoken word (e.g., different spacings, casing, or subword splits), rather than aggregating token log-probabilities for a single tokenization.
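The marginalization can be sketched as follows: instead of taking the probability of a single tokenization of a word, sum the probability mass over every tokenization that renders to the same word. This is a minimal illustration with made-up sequence log-probabilities (not real Whisper scores); in practice each sequence log-probability would come from summing the model's token log-probs along that tokenization.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical log-probabilities for token sequences that all render
# to the spoken word "cat" (illustrative numbers only).
tokenization_logprobs = {
    (" cat",): math.log(0.60),
    (" Cat",): math.log(0.15),
    ("Cat",): math.log(0.05),
    (" C", "at"): math.log(0.10),
}

# Single-tokenization estimate: mass of the most likely tokenization only.
single_logprob = max(tokenization_logprobs.values())

# Marginalized estimate: total mass across all valid tokenizations.
marginal_logprob = logsumexp(list(tokenization_logprobs.values()))

print(round(math.exp(single_logprob), 2))    # 0.6
print(round(math.exp(marginal_logprob), 2))  # 0.9
```

The gap between the two estimates (0.60 vs. 0.90 here) is exactly the probability mass that a single-tokenization score discards; enumerating the candidate tokenizations for a given word (casing and leading-space variants, subword splits) is the open part of the problem.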
References
For example, ' cat', ' Cat', 'Cat' and ' C'+'at' are different token sequences in Whisper, and the probability of the spoken word 'cat' is distributed between them. We didn't take this into account, leaving it for a future work.
— Pisets: A Robust Speech Recognition System for Lectures and Interviews
(2601.18415 - Bondarenko et al., 26 Jan 2026) in Whisper scores (Section: Uncertainty modeling)