Voxtral Small: 24.3B Multimodal Transformer
- Voxtral Small is a 24.3B-parameter multimodal Transformer that integrates audio and text processing to transcribe, translate, and answer queries from up to 40 minutes of continuous audio.
- Its architecture combines a frozen Whisper encoder, a two-layer audio-language adapter, and a Mistral Small decoder, achieving competitive benchmarks in ASR, translation, and speech understanding.
- The model supports long-form audio processing with a 32K-token context window and is optimized for efficient deployment via Apache 2.0 licensing and 4-bit quantization.
Voxtral Small is a 24.3-billion-parameter multimodal Transformer designed for integrated spoken-audio and text understanding: it can transcribe, translate, and answer open-domain queries about up to 40 minutes of continuous audio within a 32K-token context window. The architecture combines a frozen Whisper large-v3 speech encoder, a two-layer audio-language adapter, and a Mistral Small 3.1 LLM decoder. Voxtral Small achieves state-of-the-art results across transcription, translation, and speech-understanding benchmarks, while maintaining highly competitive text-only performance. The model is released under the Apache 2.0 license, with public weights and source code.
1. Architecture and Parameterization
Voxtral Small follows a three-stage sequential pipeline:
- Audio Encoder: A frozen Whisper large-v3 module (640M parameters) encodes 128-bin log-Mel spectrograms into embeddings at 50 Hz. Input waveforms are divided into non-overlapping 30-second windows, with positional encodings reset per chunk; each chunk therefore yields T = 1,500 encoder frames (30 s × 50 Hz).
- Audio-Language Adapter: A two-layer MLP (≈52M parameters) acts as both projector and temporal down-sampler. It maps encoder frames into the decoder's embedding space and selects every fourth frame along the time axis, reducing the audio token rate from 50 Hz to 12.5 Hz: 1,500 frames become 375 tokens per 30-second chunk, e.g., 30,000 tokens for a 40-minute input.
- Language Decoder: An auto-regressive Transformer using the Mistral Small 3.1 24B backbone (≈22.9B parameters plus 670M in text token embeddings) handles multimodal inference and generation.
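As a shape-level check, the encoder-to-adapter pipeline above can be sketched as follows. The encoder width (1280, Whisper large-v3) and the decoder embedding size (5120) are assumed values for illustration, and subsampling before projection is one plausible ordering of the adapter's two operations:

```python
import numpy as np

# Shape-level sketch of the encoder -> adapter pipeline; dimensions are
# assumptions (1280 for Whisper large-v3, 5120 for the decoder), not
# confirmed model hyperparameters.
D_AUDIO, D_TEXT, STRIDE = 1280, 5120, 4

rng = np.random.default_rng(0)
W1 = 0.01 * rng.standard_normal((D_AUDIO, D_AUDIO))  # first MLP layer
W2 = 0.01 * rng.standard_normal((D_AUDIO, D_TEXT))   # projects into decoder space

def adapter(frames: np.ndarray) -> np.ndarray:
    h = frames[::STRIDE]   # keep every 4th frame: 50 Hz -> 12.5 Hz
    h = np.tanh(h @ W1)    # hidden layer (tanh stands in for the real activation)
    return h @ W2          # (T/4, D_TEXT) audio tokens for the decoder

chunk = rng.standard_normal((1500, D_AUDIO))  # one 30 s window: 1500 frames at 50 Hz
tokens = adapter(chunk)
print(tokens.shape)  # (375, 5120)
```

Each 30-second chunk thus contributes 375 audio tokens to the decoder's input sequence.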
Parameter Breakdown
| Component | Parameter Count | Description |
|---|---|---|
| Audio Encoder | 640M | Frozen Whisper large-v3 |
| Audio-Language Adapter | 52M | 2-layer MLP downsampler |
| Text Embeddings | 670M | Token embedding layer |
| Language Decoder | 22.9B | Mistral Small 3.1 backbone |
| Total | 24.3B | |
The architecture is engineered to accommodate up to 40 minutes of audio and a few thousand text tokens within the 32K-model context window without significant performance loss (Liu et al., 17 Jul 2025).
2. Training Methodology and Objectives
Pretraining
Pretraining leverages a large corpus of (audio, transcript) pairs from public and proprietary sources, segmented via voice-activity detection (VAD). Each audio chunk is paired with a transcript, either annotated or pseudo-labeled. Two patterns are mixed with equal probability:
- Audio-to-Text Repetition (<repeat> mode): the input is the audio of segment n and the target is that same segment's transcript, enforcing local audio-text alignment.
- Cross-Modal Continuation (<next> mode): the input is the audio of segment n and the target is the transcript of the following segment n+1, fostering discourse continuity.
The pretraining loss is the standard token-wise cross-entropy over the text targets:

$$\mathcal{L}_{\text{pretrain}} = -\sum_{t}\log p_{\theta}\!\left(y_{t}\mid y_{<t},\,A\right)$$

where $A$ is the input audio and $y$ the target text sequence.
Pretraining initially restricts optimization to the adapter; Whisper and decoder are frozen. Text-only LLM pretraining (cross-entropy on text corpora) is interleaved to maintain strong unimodal performance.
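A minimal sketch of how the two pretraining patterns could assemble training examples; the segment contents and control-token placement are illustrative assumptions, not the paper's exact data format:

```python
# Illustrative assembly of the <repeat> and <next> pretraining patterns;
# segment contents and token placement are assumptions for illustration.
segments = [("<audio_0>", "hello there"),
            ("<audio_1>", "how are you"),
            ("<audio_2>", "fine thanks")]

def make_example(i, mode):
    audio, text = segments[i]
    if mode == "repeat":   # target repeats the transcript of the same audio chunk
        return {"input": [audio, "<repeat>"], "target": text}
    if mode == "next":     # target is the transcript of the following chunk
        return {"input": [audio, "<next>"], "target": segments[i + 1][1]}
    raise ValueError(mode)

print(make_example(0, "repeat"))  # target: "hello there"
print(make_example(0, "next"))    # target: "how are you"
```

Mixing both patterns with equal probability trains the model to both align audio with its own text and continue discourse across modality boundaries.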
Supervised Finetuning
Synthetic and real tasks spanning long-form audio (up to 40 minutes) are generated, including QA, summarization, and translation; TTS conversion augments text instruction datasets with audio, while real recordings are mixed in to reduce TTS-only bias. The objective remains token-wise cross-entropy. A special <transcribe> mode triggers verbatim ASR.
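The <transcribe> mode and instruction tasks can be sketched as a sample constructor; the field names and prompt layout below are hypothetical, chosen only to illustrate the distinction:

```python
# Hypothetical SFT sample constructor; "prompt"/"target" layout is illustrative.
# "<transcribe>" is the special mode from the text that triggers verbatim ASR.
def make_sft_sample(audio, instruction=None, target=""):
    mode = "<transcribe>" if instruction is None else instruction
    return {"prompt": [audio, mode], "target": target}

asr = make_sft_sample("<audio_0>", target="verbatim transcript of the clip")
qa = make_sft_sample("<audio_0>",
                     instruction="Summarize the speaker's argument.",
                     target="The speaker argues that ...")
print(asr["prompt"][1], "|", qa["prompt"][1])
```

Both sample types share the same cross-entropy objective; only the prompt decides whether the model transcribes or answers.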
Preference Alignment (Online DPO)
Direct Preference Optimization (DPO) is applied in an online setting. Candidate responses are sampled at temperature τ, scored by a reward model on ASR transcriptions, and the loss

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}\!\left[\log\sigma\!\left(\beta\left(\log\frac{\pi_{\theta}(y_{w}\mid x)}{\pi_{\text{ref}}(y_{w}\mid x)} - \log\frac{\pi_{\theta}(y_{l}\mid x)}{\pi_{\text{ref}}(y_{l}\mid x)}\right)\right)\right]$$

over preferred/rejected pairs $(y_w, y_l)$ is optimized to sharpen output quality and reduce hallucinations.
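The per-pair DPO objective can be evaluated numerically; this is a generic (offline-form) sketch with made-up log-probabilities and an assumed β = 0.1:

```python
import math

# Generic DPO loss for one preference pair. The log-probabilities below are
# made-up illustrative numbers; beta = 0.1 is an assumed hyperparameter.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # -log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)])
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # numerically stable -log(sigmoid(margin))

# Preferred response gained log-likelihood vs. the reference; rejected lost it.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
print(round(loss, 4))  # 0.5981
```

The loss shrinks as the policy widens the likelihood margin between the preferred and rejected responses relative to the frozen reference model.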
3. Empirical Performance
Voxtral Small establishes new state-of-the-art results on a battery of open and closed speech and multimodal benchmarks:
Speech Recognition (WER %)
- LibriSpeech Test Clean: 1.53% (vs. Whisper v3 1.84%, GPT-4o mini 1.92%)
- LibriSpeech Test Other: 3.14% (vs. 3.66%, 4.70%)
- Mozilla CV (non-Arabic avg): ≈5.7% (vs. 8.2%, 11.1%)
- FLEURS (all languages avg): ≈5.1% (vs. 5.9%, 6.2%)
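The WER figures above are word-level edit distance divided by reference length; a minimal implementation for spot-checking transcripts:

```python
# Word error rate: Levenshtein distance over word sequences, normalized by
# the reference length (the metric behind the percentages above).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```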
Speech Translation (BLEU)
- en→de: 47.0 (vs. 44.5 GPT-4o mini, 44.6 Gemini 2.5)
- en→fr: 57.3 (vs. 52.7, 53.9)
- de→en: 56.6 (vs. 51.8, 39.4)
- it→en: 46.8 (vs. 41.5, 31.8)
Speech Understanding Accuracy (%)
- Average: 77.4%
- Llama QA: 71.7% (vs. 74.3% GPT-4o mini, 66.3% Gemini 2.5)
- Openbook QA: 88.4% (vs. 83.7%, 94.7%)
- GSM8K: 89.7% (vs. 90.8%, 94.2%)
- In-house SU: 86.6% (vs. 80.0%, 88.6%)
Text-only performance matches the Mistral 3.1 backbone to within statistical noise, enabling drop-in multimodal and unimodal operation (Liu et al., 17 Jul 2025).
4. Extended Context Window and Long-Form Audio Processing
Processing up to 40 minutes of audio within a single inference pass requires:
- Chunk-Wise Encoder Attention: Each 30-second audio window is processed independently (attention complexity O(T²) per chunk, with T = 1,500 frames), and chunk embeddings are concatenated.
- 4× Temporal Downsampling: The adapter reduces the token count by a factor of 4, yielding 12.5 audio tokens per second. For 2,400 s (40 min) this gives 30,000 audio tokens, leaving nearly 2,000 text tokens within the 32K-token decoder window.
Total decoder self-attention cost scales as O(L²) in the combined sequence length L. This enables 40-minute audio plus conversation with no significant ASR/understanding loss compared to non-downsampled input. A plausible implication is that further downsampling (beyond 4×) would be detrimental, incurring >1% WER loss on non-English benchmarks.
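The token-budget arithmetic (50 Hz encoder frames, 30-second chunks, 4× downsampling, 32K-token context) reduces to a few lines:

```python
# Token budget for long-form audio: 50 Hz frames, 30 s chunks, 4x
# downsampling, 32K-token decoder context (all values from the text).
FRAME_HZ, CHUNK_S, STRIDE, CONTEXT = 50, 30, 4, 32_000

def audio_tokens(seconds: float) -> int:
    chunks = -(-seconds // CHUNK_S)  # ceiling: windows are padded to 30 s
    return int(chunks) * (CHUNK_S * FRAME_HZ) // STRIDE

used = audio_tokens(40 * 60)   # 80 chunks x 375 tokens per chunk
print(used, CONTEXT - used)    # 30000 2000
```

Forty minutes of audio consumes 30,000 of the 32K decoder positions, leaving roughly 2,000 for the text conversation.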
5. Practical Usage, Licensing, and Deployment
- License: Apache 2.0, with no usage fees; weights and code hosted at https://huggingface.co/mistralai/Voxtral-Small-24B-2507.
- Size/Memory: 24.3B parameters (~48 GB fp16). 4-bit quantization reduces RAM to ≈12 GB, enabling single-GPU (A6000/RTX 4090) hosting.
- Recommended Hardware: ≥24 GB VRAM for quantized inference; ≥48 GB VRAM (two GPUs/model-parallel) for fp16. CPU-only possible but impractically slow.
- PyTorch Inference Example:
```python
import librosa
import torch
from transformers import VoxtralProcessor, VoxtralForConditionalGeneration

processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Small-24B-2507")
model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Small-24B-2507",
    device_map="auto",
    load_in_4bit=True,  # 4-bit quantization for single-GPU hosting
)

speech, sr = librosa.load("long_audio.mp3", sr=16000)  # resample to 16 kHz
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0]))
```
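The memory figures in this section follow directly from the parameter count; a quick check (decimal GB, weights only, ignoring KV cache and activations):

```python
# Weight memory for 24.3B parameters at different precisions (decimal GB).
params = 24.3e9
fp16_gb = params * 2 / 1e9    # 2 bytes per parameter
int4_gb = params * 0.5 / 1e9  # 4 bits per parameter
print(f"fp16 ~{fp16_gb:.1f} GB, 4-bit ~{int4_gb:.2f} GB")  # fp16 ~48.6 GB, 4-bit ~12.15 GB
```

This is why fp16 hosting needs ≥48 GB of VRAM while the 4-bit quantized model fits on a single 24 GB consumer GPU.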
6. Limitations and Prospects
Several constraints and future directions are noted:
- Chunk Padding: 30-second padding is retained to prevent language-specific ASR degradation (e.g., +0.5% WER for French if removed).
- Downsampling Limit: >4× downsampling leads to significant WER increases for non-English audio.
- Preference Optimization Trade-Off: Online DPO improves speech understanding (SU) benchmarks (+1.7% judge score) but slightly increases English short-form WER (6.31% → 6.50%); future schedules may mitigate this.
- Open Directions: Planned enhancements include cross-chunk attention (removing fixed 30s boundaries), larger contexts via sparse/routing attention, integration of video and document images, and lighter backbone variants (sub-10B parameters) for even broader local deployment applicability.
Voxtral Small is the first open-weights model capable of simultaneously processing 40-minute audio and engaging in fluent, factually rigorous conversation across speech and text modalities, providing an open, locally runnable platform for multimodal research and application (Liu et al., 17 Jul 2025).