Unknown audio integration architecture of GPT-4o Advanced Voice Mode
Determine whether OpenAI’s GPT-4o Advanced Voice Mode processes audio through an external audio encoder and adapter pipeline (e.g., a Whisper-based encoder with Q-Former or adapter integration, as in SALMONN, Qwen-Audio, LLaSM, and DiVA) or instead represents audio as discrete audio tokens, and characterize the model’s internal audio-processing architecture accordingly.
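To make the two candidate architectures concrete, the following toy sketch contrasts them. All names, dimensions, and weights here are hypothetical placeholders (random matrices standing in for a Whisper-style encoder, a learned adapter, and a quantizer codebook); neither path reflects GPT-4o's actual, undisclosed design.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8       # toy LLM hidden size (assumption)
N_AUDIO_TOKENS = 4  # toy discrete-audio codebook size (assumption)

# --- Path A: external encoder + adapter (SALMONN / Qwen-Audio style) ---
# A frozen audio encoder yields continuous features; a small adapter
# projects them into the LLM embedding space as "soft" input vectors.
def audio_encoder(waveform: np.ndarray) -> np.ndarray:
    """Toy stand-in for a Whisper-style encoder: 4 samples per frame."""
    frames = waveform.reshape(-1, 4)
    return frames @ rng.standard_normal((4, 6))   # (n_frames, 6)

def adapter(features: np.ndarray) -> np.ndarray:
    """Toy linear adapter mapping encoder features to LLM embeddings."""
    return features @ rng.standard_normal((6, EMBED_DIM))

# --- Path B: discrete audio tokens (AudioPaLM / SpiRit-LM style) ---
# Audio features are quantized to codebook indices, so the LLM consumes
# them exactly like ordinary text token ids.
codebook = rng.standard_normal((N_AUDIO_TOKENS, 6))

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Nearest-neighbor quantization to integer audio token ids."""
    dists = ((features[:, None, :] - codebook[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)                   # (n_frames,) ids
```

The observable difference is the LLM-facing interface: Path A hands the model continuous embedding vectors, while Path B hands it integer token ids drawn from a shared vocabulary.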
References
It is uncertain whether GPT-4o follows the audio integration methods used by other ALMs or instead adopts discrete audio token modeling \citep{nguyen2024spiritlminterleavedspokenwritten, rubenstein2023audiopalmlargelanguagemodel}.
— Best-of-N Jailbreaking
(2412.03556 - Hughes et al., 2024) in Appendix: Case Study: Audio, Section “ALM Architecture Details”