Identify or develop the optimal neural audio codec for general-purpose self-supervised speech representation learning

Determine which neural audio codec — specifically, which combination of architecture, training methodology, and quantization strategy — produces discrete unit sequences best suited for general-purpose self-supervised speech representation learning across diverse downstream speech tasks, when these units serve as the exclusive input during pre-training.

Background

The paper introduces Codec2Vec, a self-supervised speech representation learning framework that operates exclusively on discrete units generated by neural audio codecs. The authors demonstrate competitive performance on the SUPERB benchmark while achieving substantial efficiency gains in storage and training time.

Their experiments show that the choice of codec significantly affects downstream performance: discrete units from DAC outperform those from a 16 kHz EnCodec variant. This variability suggests that a codec's architecture, training objectives, and quantization strategy all influence how well the resulting discrete units support general-purpose representation learning, motivating the need to identify or develop an optimal codec for this purpose.
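To make the quantization dimension of this question concrete, the sketch below illustrates residual vector quantization (RVQ), the multi-stage quantization strategy used by codecs such as EnCodec and DAC, and shows how it turns continuous latent frames into the discrete unit sequences that frameworks like Codec2Vec consume. All sizes and the random codebooks are illustrative toys, not the actual codec parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(latents, codebooks):
    """Residual vector quantization: each codebook quantizes the
    residual left over by the previous stage, emitting one code
    index per stage per frame."""
    residual = latents.copy()
    codes = []
    for cb in codebooks:
        # Nearest-neighbour codebook lookup for every frame.
        dists = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # Later codebooks only see what this stage failed to capture.
        residual = residual - cb[idx]
    return np.stack(codes, axis=0)  # shape: (num_codebooks, num_frames)

# Toy setup: 100 latent frames of dimension 8, quantized by a stack
# of 4 codebooks with 16 entries each (hypothetical sizes).
latents = rng.normal(size=(100, 8))
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
codes = rvq_encode(latents, codebooks)
print(codes.shape)  # (4, 100): a discrete unit sequence per codebook
```

The design choice at issue in the research question is visible here: the number of codebooks, their size, and how residuals are shared across stages determine how much phonetic versus acoustic detail each code stream carries, which in turn shapes how useful the streams are as pre-training targets.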

References

Identifying or developing the optimal codec—whose discrete units are ideally suited for general-purpose speech representation learning—remains an open research question.

Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs  (2511.16639 - Tseng et al., 20 Nov 2025) in Section: Discussion and Limitations, bullet “Codec Selection”