Voxtral Small: 24.3B Multimodal Transformer
- Voxtral Small is a 24.3B-parameter multimodal Transformer that integrates audio and text processing to transcribe, translate, and answer queries from up to 40 minutes of continuous audio.
- Its architecture combines a frozen Whisper encoder, a two-layer audio-language adapter, and a Mistral Small decoder, achieving competitive benchmarks in ASR, translation, and speech understanding.
- The model supports long-form audio processing with a 32K-token context window and is optimized for efficient deployment via Apache 2.0 licensing and 4-bit quantization.
Voxtral Small is a 24.3-billion-parameter multimodal Transformer designed for integrated spoken-audio and text understanding: it can transcribe, translate, and answer open-domain queries about up to 40 minutes of continuous audio within a 32K-token context window. The architecture combines a frozen Whisper large-v3 speech encoder, a two-layer audio-language adapter, and a Mistral Small 3.1 LLM decoder. Voxtral Small achieves state-of-the-art results across transcription, translation, and speech-understanding benchmarks, while maintaining highly competitive text-only performance. The model is released under the Apache 2.0 license, with public weights and source code.
1. Architecture and Parameterization
Voxtral Small follows a three-stage sequential pipeline:
- Audio Encoder: A frozen Whisper large-v3 module (640M parameters) encodes 128-bin log-Mel spectrograms into embeddings at 50 Hz. Input waveforms are divided into non-overlapping 30-second windows, with positional encodings reset per chunk; each chunk therefore yields T = 1,500 encoder frames (30 s × 50 Hz).
- Audio-Language Adapter: A two-layer MLP (≈52M parameters) acts as both projector and temporal down-sampler. It maps encoder frames into the decoder's embedding space and selects every fourth frame along the time axis, reducing the audio token rate from 50 Hz to 12.5 Hz: 1,500 frames become 375 tokens per 30-second chunk, e.g., 30,000 tokens for a 40-minute input.
- Language Decoder: An auto-regressive Transformer using the Mistral Small 3.1 24B backbone (≈22.9B parameters plus 670M in text token embeddings) handles multimodal inference and generation.
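As a shape-level check, the encoder-to-adapter pipeline above can be sketched as follows. The encoder width (1280, Whisper large-v3) and the decoder embedding size (5120) are assumed values for illustration, and subsampling before projection is one plausible ordering of the adapter's two operations:

```python
import numpy as np

# Shape-level sketch of the encoder -> adapter pipeline; dimensions are
# assumptions (1280 for Whisper large-v3, 5120 for the decoder), not
# confirmed model hyperparameters.
D_AUDIO, D_TEXT, STRIDE = 1280, 5120, 4

rng = np.random.default_rng(0)
W1 = 0.01 * rng.standard_normal((D_AUDIO, D_AUDIO))  # first MLP layer
W2 = 0.01 * rng.standard_normal((D_AUDIO, D_TEXT))   # projects into decoder space

def adapter(frames: np.ndarray) -> np.ndarray:
    h = frames[::STRIDE]   # keep every 4th frame: 50 Hz -> 12.5 Hz
    h = np.tanh(h @ W1)    # hidden layer (tanh stands in for the real activation)
    return h @ W2          # (T/4, D_TEXT) audio tokens for the decoder

chunk = rng.standard_normal((1500, D_AUDIO))  # one 30 s window: 1500 frames at 50 Hz
tokens = adapter(chunk)
print(tokens.shape)  # (375, 5120)
```

Each 30-second chunk thus contributes 375 audio tokens to the decoder's input sequence.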
Parameter Breakdown
| Component | Parameter Count | Description |
|---|---|---|
| Audio Encoder | 640M | Frozen Whisper large-v3 |
| Audio-Language Adapter | 52M | 2-layer MLP downsampler |
| Text Embeddings | 670M | Token embedding layer |
| Language Decoder | 22.9B | Mistral Small 3.1 backbone |
| Total | 24.3B | |
The architecture is engineered to accommodate up to 40 minutes of audio and a few thousand text tokens within the 32K-model context window without significant performance loss (Liu et al., 17 Jul 2025).
2. Training Methodology and Objectives
Pretraining
Pretraining leverages a large corpus of (audio, transcript) pairs from public and proprietary sources, segmented via voice-activity detection (VAD). Each audio chunk is paired with a transcript, either annotated or pseudo-labeled. Two patterns are mixed with equal probability:
- Audio-to-Text Repetition (<repeat> mode): the input is the audio of segment n and the target is that same segment's transcript, enforcing local audio-text alignment.
- Cross-Modal Continuation (<next> mode): the input is the audio of segment n and the target is the transcript of the following segment n+1, fostering discourse continuity.
The pretraining loss is the standard token-wise cross-entropy over the text targets:

$$\mathcal{L}_{\text{pretrain}} = -\sum_{t}\log p_{\theta}\!\left(y_{t}\mid y_{<t},\,A\right)$$

where $A$ is the input audio and $y$ the target text sequence.
Pretraining initially restricts optimization to the adapter; Whisper and decoder are frozen. Text-only LLM pretraining (cross-entropy on text corpora) is interleaved to maintain strong unimodal performance.
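A minimal sketch of how the two pretraining patterns could assemble training examples; the segment contents and control-token placement are illustrative assumptions, not the paper's exact data format:

```python
# Illustrative assembly of the <repeat> and <next> pretraining patterns;
# segment contents and token placement are assumptions for illustration.
segments = [("<audio_0>", "hello there"),
            ("<audio_1>", "how are you"),
            ("<audio_2>", "fine thanks")]

def make_example(i, mode):
    audio, text = segments[i]
    if mode == "repeat":   # target repeats the transcript of the same audio chunk
        return {"input": [audio, "<repeat>"], "target": text}
    if mode == "next":     # target is the transcript of the following chunk
        return {"input": [audio, "<next>"], "target": segments[i + 1][1]}
    raise ValueError(mode)

print(make_example(0, "repeat"))  # target: "hello there"
print(make_example(0, "next"))    # target: "how are you"
```

Mixing both patterns with equal probability trains the model to both align audio with its own text and continue discourse across modality boundaries.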
Supervised Finetuning
Synthetic and real tasks spanning long-form audio (up to 40 minutes) are generated, including QA, summarization, and translation; TTS conversion augments text instruction datasets with audio, while real recordings are mixed in to reduce TTS-only bias. The objective remains token-wise cross-entropy. A special <transcribe> mode triggers verbatim ASR.
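The <transcribe> mode and instruction tasks can be sketched as a sample constructor; the field names and prompt layout below are hypothetical, chosen only to illustrate the distinction:

```python
# Hypothetical SFT sample constructor; "prompt"/"target" layout is illustrative.
# "<transcribe>" is the special mode from the text that triggers verbatim ASR.
def make_sft_sample(audio, instruction=None, target=""):
    mode = "<transcribe>" if instruction is None else instruction
    return {"prompt": [audio, mode], "target": target}

asr = make_sft_sample("<audio_0>", target="verbatim transcript of the clip")
qa = make_sft_sample("<audio_0>",
                     instruction="Summarize the speaker's argument.",
                     target="The speaker argues that ...")
print(asr["prompt"][1], "|", qa["prompt"][1])
```

Both sample types share the same cross-entropy objective; only the prompt decides whether the model transcribes or answers.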
Preference Alignment (Online DPO)
Direct Preference Optimization (DPO) is applied in an online setting. Candidate responses are sampled at temperature τ, scored by a reward model on ASR transcriptions, and the loss

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}\!\left[\log\sigma\!\left(\beta\left(\log\frac{\pi_{\theta}(y_{w}\mid x)}{\pi_{\text{ref}}(y_{w}\mid x)} - \log\frac{\pi_{\theta}(y_{l}\mid x)}{\pi_{\text{ref}}(y_{l}\mid x)}\right)\right)\right]$$

over preferred/rejected pairs $(y_w, y_l)$ is optimized to sharpen output quality and reduce hallucinations.
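The per-pair DPO objective can be evaluated numerically; this is a generic (offline-form) sketch with made-up log-probabilities and an assumed β = 0.1:

```python
import math

# Generic DPO loss for one preference pair. The log-probabilities below are
# made-up illustrative numbers; beta = 0.1 is an assumed hyperparameter.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # -log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)])
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # numerically stable -log(sigmoid(margin))

# Preferred response gained log-likelihood vs. the reference; rejected lost it.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
print(round(loss, 4))  # 0.5981
```

The loss shrinks as the policy widens the likelihood margin between the preferred and rejected responses relative to the frozen reference model.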
3. Empirical Performance
Voxtral Small establishes new state-of-the-art results on a battery of open and closed speech and multimodal benchmarks:
Speech Recognition (WER %)
- LibriSpeech Test Clean: 1.53% (vs. Whisper v3 1.84%, GPT-4o mini 1.92%)
- LibriSpeech Test Other: 3.14% (vs. 3.66%, 4.70%)
- Mozilla CV (non-Arabic avg): ≈5.7% (vs. 8.2%, 11.1%)
- FLEURS (all languages avg): ≈5.1% (vs. 5.9%, 6.2%)
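The WER figures above are word-level edit distance divided by reference length; a minimal implementation for spot-checking transcripts:

```python
# Word error rate: Levenshtein distance over word sequences, normalized by
# the reference length (the metric behind the percentages above).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```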
Speech Translation (BLEU)
- en→de: 47.0 (vs. 44.5 GPT-4o mini, 44.6 Gemini 2.5)
- en→fr: 57.3 (vs. 52.7, 53.9)
- de→en: 56.6 (vs. 51.8, 39.4)
- it→en: 46.8 (vs. 41.5, 31.8)
Speech Understanding Accuracy (%)
- Average: 77.4%
- Llama QA: 71.7% (vs. 74.3% GPT-4o mini, 66.3% Gemini 2.5)
- Openbook QA: 88.4% (vs. 83.7%, 94.7%)
- GSM8K: 89.7% (vs. 90.8%, 94.2%)
- In-house SU: 86.6% (vs. 80.0%, 88.6%)
Text-only performance matches the Mistral 3.1 backbone to within statistical noise, enabling drop-in multimodal and unimodal operation (Liu et al., 17 Jul 2025).
4. Extended Context Window and Long-Form Audio Processing
Processing up to 40 minutes of audio within a single inference pass requires:
- Chunk-Wise Encoder Attention: Each 30-second audio window is processed independently (attention complexity O(T²) per chunk, with T = 1,500 frames), and chunk embeddings are concatenated.
- 4× Temporal Downsampling: The adapter reduces the token count by a factor of 4, yielding 12.5 audio tokens per second. For 2,400 s (40 min) this gives 30,000 audio tokens, leaving nearly 2,000 text tokens within the 32K-token decoder window.
Total decoder self-attention cost scales as O(L²) in the combined sequence length L. This enables 40-minute audio plus conversation with no significant ASR/understanding loss compared to non-downsampled input. A plausible implication is that further downsampling (beyond 4×) would be detrimental, incurring >1% WER loss on non-English benchmarks.
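The token-budget arithmetic (50 Hz encoder frames, 30-second chunks, 4× downsampling, 32K-token context) reduces to a few lines:

```python
# Token budget for long-form audio: 50 Hz frames, 30 s chunks, 4x
# downsampling, 32K-token decoder context (all values from the text).
FRAME_HZ, CHUNK_S, STRIDE, CONTEXT = 50, 30, 4, 32_000

def audio_tokens(seconds: float) -> int:
    chunks = -(-seconds // CHUNK_S)  # ceiling: windows are padded to 30 s
    return int(chunks) * (CHUNK_S * FRAME_HZ) // STRIDE

used = audio_tokens(40 * 60)   # 80 chunks x 375 tokens per chunk
print(used, CONTEXT - used)    # 30000 2000
```

Forty minutes of audio consumes 30,000 of the 32K decoder positions, leaving roughly 2,000 for the text conversation.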
5. Practical Usage, Licensing, and Deployment
- License: Apache 2.0, with no usage fees; weights and code hosted at https://huggingface.co/mistralai/Voxtral-Small-24B-2507.
- Size/Memory: 24.3B parameters (~48 GB fp16). 4-bit quantization reduces RAM to ≈12 GB, enabling single-GPU (A6000/RTX 4090) hosting.
- Recommended Hardware: ≥24 GB VRAM for quantized inference; ≥48 GB VRAM (two GPUs/model-parallel) for fp16. CPU-only possible but impractically slow.
- PyTorch Inference Example:
```python
import librosa
import torch
from transformers import VoxtralProcessor, VoxtralForConditionalGeneration

processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Small-24B-2507")
model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Small-24B-2507",
    device_map="auto",
    load_in_4bit=True,  # 4-bit quantization for single-GPU hosting
)

speech, sr = librosa.load("long_audio.mp3", sr=16000)  # resample to 16 kHz
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0]))
```
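The memory figures in this section follow directly from the parameter count; a quick check (decimal GB, weights only, ignoring KV cache and activations):

```python
# Weight memory for 24.3B parameters at different precisions (decimal GB).
params = 24.3e9
fp16_gb = params * 2 / 1e9    # 2 bytes per parameter
int4_gb = params * 0.5 / 1e9  # 4 bits per parameter
print(f"fp16 ~{fp16_gb:.1f} GB, 4-bit ~{int4_gb:.2f} GB")  # fp16 ~48.6 GB, 4-bit ~12.15 GB
```

This is why fp16 hosting needs ≥48 GB of VRAM while the 4-bit quantized model fits on a single 24 GB consumer GPU.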
6. Limitations and Prospects
Several constraints and future directions are noted:
- Chunk Padding: 30-second padding is retained to prevent language-specific ASR degradation (e.g., +0.5% WER for French if removed).
- Downsampling Limit: >4× downsampling leads to significant WER increases for non-English audio.
- Preference Optimization Trade-Off: Online DPO improves speech understanding (SU) benchmarks (+1.7% judge score) but slightly increases English short-form WER (6.31% → 6.50%); future schedules may mitigate this.
- Open Directions: Planned enhancements include cross-chunk attention (removing fixed 30s boundaries), larger contexts via sparse/routing attention, integration of video and document images, and lighter backbone variants (sub-10B parameters) for even broader local deployment applicability.
Voxtral Small is the first open-weights model capable of simultaneously processing 40-minute audio and engaging in fluent, factually rigorous conversation across speech and text modalities, providing an open, locally runnable platform for multimodal research and application (Liu et al., 17 Jul 2025).