Unknown audio integration architecture of GPT-4o Advanced Voice Mode

Determine whether OpenAI’s GPT-4o Advanced Voice Mode processes audio by integrating an external audio encoder and adapter pipeline (e.g., Whisper-based encoders with Q-former or adapter integration, as in SALMONN, Qwen-Audio, LLaSM, and DiVA) or instead represents audio via discrete audio tokens, and characterize the model’s internal audio processing architecture accordingly.

Background

The paper reviews common architectures for audio language models (ALMs), noting that many open-source systems integrate Whisper-based audio encoders with LLMs via adapters or Q-formers, and sometimes refine instruction-following via distillation. GPT-4o’s Advanced Voice Mode supports speech-to-speech interaction, but its architecture is undisclosed.

Within this context, the authors explicitly state uncertainty about whether GPT-4o follows the audio integration approach typical of other ALMs or instead models discrete audio tokens, leaving its internal audio processing pipeline unknown.
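The two integration styles contrasted above can be illustrated with a toy NumPy sketch. Everything here is a hypothetical assumption for illustration (the dimensions, the random adapter weights, and the codebook), not a detail of GPT-4o or of any named system:

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Encoder + adapter style: continuous audio features (e.g., from a
# Whisper-style encoder) are projected into the LLM's embedding space
# and consumed as "soft" input embeddings.
T, d_audio, d_llm = 50, 128, 512                 # frames, encoder dim, LLM dim (toy values)
audio_feats = rng.standard_normal((T, d_audio))  # stand-in for encoder output
W_adapter = rng.standard_normal((d_audio, d_llm)) * 0.02  # stand-in for a learned adapter
soft_prompt = audio_feats @ W_adapter            # (T, d_llm), fed to the LLM

# (2) Discrete-token style: features are quantized against a codebook,
# and the resulting integer ids enter the LLM like ordinary text tokens.
K = 1024                                         # toy codebook size
codebook = rng.standard_normal((K, d_audio))
dists = ((audio_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
audio_tokens = dists.argmin(axis=1)              # (T,) nearest-codeword ids

print(soft_prompt.shape)   # (50, 512)
print(audio_tokens.shape)  # (50,)
```

The open question for GPT-4o is precisely which of these two interfaces (continuous adapter embeddings vs. discrete token ids) its audio pathway uses.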

References

It is uncertain if GPT-4o follows the audio integration methods used by other ALMs or instead models discrete audio tokens \citep{nguyen2024spiritlminterleavedspokenwritten, rubenstein2023audiopalmlargelanguagemodel}.

Best-of-N Jailbreaking  (2412.03556 - Hughes et al., 2024) in Appendix: Case Study: Audio, Section “ALM Architecture Details”