Optimality of UTF-8 as a byte-level output representation for ASR

Determine whether UTF-8 byte encoding is optimal as a byte-level output representation for end-to-end automatic speech recognition (ASR) systems.

Background

Byte-level output representations are widely used in multilingual end-to-end ASR to limit the output vocabulary size, and UTF-8 is a common choice because it is compact and universal. However, UTF-8 is a variable-length prefix code (one to four bytes per character) that was not designed for machine learning. Many byte sequences are not valid UTF-8 strings, so an ASR model that emits bytes carries the additional burden of avoiding or repairing such sequences.
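The two properties above, variable length and the existence of invalid byte sequences, can be illustrated with a minimal Python sketch. It uses only the standard library; the helper names `utf8_lengths`, `is_valid_utf8`, and `count_valid_pairs` are hypothetical and introduced here for illustration, not taken from the paper.

```python
def utf8_lengths(text: str) -> list[int]:
    """Return the number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def count_valid_pairs() -> int:
    """Count how many of the 65536 possible two-byte sequences are valid UTF-8."""
    return sum(
        is_valid_utf8(bytes([first, second]))
        for first in range(256)
        for second in range(256)
    )


# Characters from different scripts occupy different numbers of bytes.
print(utf8_lengths("aé中😀"))

# A complete three-byte character is valid; a truncated one is not.
print(is_valid_utf8("中".encode("utf-8")))
print(is_valid_utf8("中".encode("utf-8")[:2]))

# A lone continuation byte is also invalid.
print(is_valid_utf8(b"\x80"))

# Share of all two-byte sequences that a byte-level model could emit safely.
print(count_valid_pairs() / 65536)
```

Under these assumptions, fewer than a third of all two-byte sequences decode successfully, which gives a rough sense of how much of the output space a byte-level model must learn to avoid.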

Hsiao et al. (2024) propose a data-driven alternative: a byte-level representation learned with a vector-quantized auto-encoder and optimized for ASR. They note that it remains unclear whether UTF-8 is optimal for ASR, which motivates investigating how UTF-8 compares with learned representations.

References

While UTF-8 has proven to be an effective output representation for ASR, it is unclear whether it is optimal.

Optimizing Byte-level Representation for End-to-end ASR (2406.09676 - Hsiao et al., 2024) in Section 1 (Introduction)