GLM-4.5V: Advanced Multimodal Intelligence
- GLM-4.5V denotes two state-of-the-art multimodal systems, a speech-language and a vision-language architecture developed in parallel, both built on unified token representations.
- It employs innovative pre-training and reinforcement learning with curriculum sampling to optimize performance across speech recognition, visual grounding, and multimodal reasoning tasks.
- The model demonstrates competitive results on industry benchmarks, highlighting its scalability, real-time processing capabilities, and potential for future AI advancements.
GLM-4.5V refers to two distinct, state-of-the-art open-source architectures developed in parallel by the THUDM and Zhipu AI research groups, targeting spoken-language intelligence (Zeng et al., 2024) and advanced multimodal vision–language reasoning (Team et al., 1 Jul 2025), respectively. Both models are positioned at the forefront of large-scale, instruction-tuned multimodal systems and exemplify trends in scalable model design, unified token representations, and reinforcement learning-driven optimization for diverse downstream tasks.
1. Model Architectures
Speech-Language GLM-4.5V
The THUDM GLM-4.5V is built on the GLM-4-9B-Base 9B-parameter text Transformer, augmented with an ultra-low bitrate (175 bps), single-codebook speech token vocabulary and ancillary modules for end-to-end, real-time spoken dialogue (Zeng et al., 2024). The architecture introduces a VQ-VAE bottleneck at the mid-layer of a Whisper-Large-V3 ASR encoder, reducing the frame rate from 50 Hz to 12.5 Hz and mapping each audio segment (80 ms at 16 kHz) to a discrete token ID. The backbone's input embedding layer is extended with speech-token IDs; the main Transformer (attention, FFNs) is unmodified, so a single autoregressive backbone models interleaved cross-modal token streams.
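The rates quoted above are mutually consistent, as a quick arithmetic check shows (the codebook-size bound at the end is inferred from the bitrate and is not stated in the source):

```python
# Sanity-check the speech tokenizer's rates described above.
sample_rate_hz = 16_000   # input audio sampling rate
frame_rate_hz = 12.5      # token rate after the VQ-VAE bottleneck
bitrate_bps = 175         # reported ultra-low codec bitrate

# Each discrete token covers 80 ms of audio:
segment_ms = 1000 / frame_rate_hz
assert segment_ms == 80.0

# Raw audio samples summarized by a single token:
samples_per_token = sample_rate_hz / frame_rate_hz   # 1280 samples

# Bits carried per token at 175 bps, which bounds the codebook size:
bits_per_token = bitrate_bps / frame_rate_hz         # 14 bits
codebook_upper_bound = 2 ** int(bits_per_token)      # 16384 entries

print(samples_per_token, bits_per_token, codebook_upper_bound)
```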
Generation alternates between autoregressive text and speech token predictions within the unified Transformer backbone, enabling joint modeling of text ($t$) and speech ($s$):

$$P(x_{1:T}) = \prod_{i=1}^{T} P\left(x_i \mid x_{<i}\right),$$

where each $x_i$ ranges over the union of the text and speech token vocabularies.
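As a concrete illustration of a unified token space, speech IDs can simply extend the text embedding table; the vocabulary sizes below are hypothetical placeholders, not the model's actual values:

```python
# Hypothetical sizes for illustration only.
TEXT_VOCAB = 151_552      # assumed base text vocabulary size
SPEECH_VOCAB = 16_384     # assumed speech codebook size

def speech_token_id(code: int) -> int:
    """Map a VQ-VAE codebook index into the extended embedding table:
    speech IDs are appended after the text IDs."""
    assert 0 <= code < SPEECH_VOCAB
    return TEXT_VOCAB + code

# The backbone then models one stream whose tokens may be either kind:
ids = [17, 42, speech_token_id(0), speech_token_id(933)]
assert all(0 <= i < TEXT_VOCAB + SPEECH_VOCAB for i in ids)
```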
Vision-Language GLM-4.5V
The Zhipu AI GLM-4.5V (sometimes styled GLM-4.5V-AIR) incorporates an AIMv2-Huge ViT vision encoder (supplemented with 3D convolutions for temporally-aware video processing, and RoPE for positional encoding) connected to a Mixture-of-Experts (MoE) language head with ~106B total, ~12B activated parameters (12B-A MoE). Visual features are projected via an MLP adapter into the LLM’s token space (Team et al., 1 Jul 2025).
Key architectural features:
- Flexible support for both chain-of-thought ("thinking") and flat response generation via think/nothink mode tokens.
- Temporal grounding in video streams using time-index tokens inserted after each frame.
- Extended 3D-RoPE in the LLM head for improved spatial reasoning.
- Absolute positional embeddings can be interpolated for variable-resolution/image aspect ratios.
- Full-context modeling up to 8,192 tokens (32K–128K for GLM-4.6V).
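The variable-resolution interpolation of absolute positional embeddings can be sketched as follows, using SciPy's cubic spline resampling as a stand-in for the model's exact bicubic kernel (a simplified illustration, not the released implementation):

```python
import numpy as np
from scipy.ndimage import zoom

def interp_pos_embed(pos: np.ndarray, new_hw: tuple) -> np.ndarray:
    """Resample an (H, W, D) grid of absolute positional embeddings to a
    new (H', W') grid, leaving the feature dimension D untouched."""
    h, w, d = pos.shape
    nh, nw = new_hw
    # order=3 selects cubic spline interpolation along spatial axes.
    return zoom(pos, (nh / h, nw / w, 1.0), order=3)

# e.g. a pretrained 16x16 grid adapted to a 24x10 (non-square) image:
pretrained = np.random.default_rng(0).normal(size=(16, 16, 64))
adapted = interp_pos_embed(pretrained, (24, 10))
assert adapted.shape == (24, 10, 64)
```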
2. Pre-training Strategies
Speech-Language (THUDM)
Data sourcing and alignment use a combination of:
- Interleaved synthetic speech–text data via a text-to-token TTS model (∼70%)
- Unsupervised real speech (∼10%)
- Paired ASR/TTS supervised data (∼5%)
- Text-only corpora (∼15%)
Together, these sources make up the model's full pre-training corpus (Zeng et al., 2024).
The training objective leverages joint sequences of the form:

$$x = (t_1, s_1, t_2, s_2, \ldots),$$

interleaving text spans $t_j$ with their aligned speech-token spans $s_j$ under the standard next-token cross-entropy loss. Optimization employs AdamW with learning-rate warmup followed by linear decay, an 8,192-token context, and large aggregate batches.
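A minimal sketch of the warmup-then-linear-decay schedule follows; the peak and final learning rates are placeholders, not the values used for GLM-4.5V:

```python
def lr_at(step: int, total: int, warmup: int,
          peak: float, final: float = 0.0) -> float:
    """Linear warmup to `peak`, then linear decay toward `final`."""
    if step < warmup:
        return peak * step / max(warmup, 1)
    frac = (step - warmup) / max(total - warmup, 1)
    return peak + (final - peak) * min(frac, 1.0)

# Placeholder hyperparameters for illustration:
schedule = [lr_at(s, total=1000, warmup=100, peak=3e-4) for s in range(1000)]
assert schedule[0] == 0.0              # starts from zero
assert max(schedule) == schedule[100]  # peaks at the end of warmup
```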
Vision-Language (Zhipu AI)
Pre-training draws on billions of multimodal samples:
- Image-caption pairs (filtered, concept-balanced)
- Interleaved web/academic multimodal text (MINT, MMC4, OmniCorpus)
- OCR datasets (synthetic + real), grounding annotations, GUI screenshots
- Video-caption corpora, processed with human metadata verification
- 50M mixed-format instruction-tuning samples
Autoregressive next-token modeling is applied over interleaved visual and text tokens:

$$\mathcal{L}_{\text{AR}} = -\sum_{i} \log P_\theta\left(x_i \mid x_{<i}\right),$$

with an auxiliary MoE load-balancing loss in the standard form

$$\mathcal{L}_{\text{aux}} = \alpha\, N \sum_{e=1}^{N} f_e\, p_e,$$

where $f_e$ is the fraction of tokens routed to expert $e$ and $p_e$ the mean router probability assigned to it.
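The load-balancing term can be sketched as follows, assuming the standard Switch-style formulation $\alpha N \sum_e f_e p_e$ (the exact coefficient $\alpha$ used for GLM-4.5V is an assumption here):

```python
import numpy as np

def moe_aux_loss(router_probs: np.ndarray, alpha: float = 0.01) -> float:
    """Switch-style load-balancing loss: alpha * N * sum_e f_e * p_e,
    where f_e is the fraction of tokens routed (argmax) to expert e
    and p_e is the mean router probability for expert e."""
    n_tokens, n_experts = router_probs.shape
    assignments = router_probs.argmax(axis=1)
    f = np.bincount(assignments, minlength=n_experts) / n_tokens
    p = router_probs.mean(axis=0)
    return float(alpha * n_experts * (f * p).sum())

# Perfectly balanced routing over 4 experts and 8 tokens:
balanced = np.tile(np.eye(4), (2, 1))
# f_e = p_e = 0.25 for every expert, so the loss equals alpha.
assert abs(moe_aux_loss(balanced) - 0.01) < 1e-12
```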
3. Reinforcement Learning with Curriculum Sampling (RLCS)
GLM-4.5V employs RLCS to optimize multimodal reasoning. Difficulty tiers, stratified by pass@k scores, drive sample selection via a smooth exponential-weighted policy targeting items within the model's learning frontier:

$$w(d) \propto \exp\left(-\beta\,\lvert d - c\rvert\right),$$

where $d$ is the offline difficulty and $c$ the current competence. The RLCS loop evaluates, rewards, and re-weights rollout samples based on:
- STEM QA: verifiable answer-correctness rewards
- OCR: transcription-accuracy rewards
- Grounding: localization-overlap rewards
This approach yields a +3–5% performance uplift versus random sampling and mitigates RL instability by dynamically matching curriculum to model competence (Team et al., 1 Jul 2025).
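Under the exponential frontier weighting described above, sampling can be sketched as follows (the sharpness parameter `beta` is an assumed hyperparameter, not a value from the source):

```python
import math
import random

def curriculum_weights(difficulties, competence, beta=4.0):
    """Exponential weighting w(d) ∝ exp(-beta * |d - c|): items near the
    model's current competence c dominate the sampling distribution."""
    w = [math.exp(-beta * abs(d - competence)) for d in difficulties]
    z = sum(w)
    return [x / z for x in w]

# Difficulty tiers stratified offline from pass@k (higher = harder):
tiers = [0.1, 0.3, 0.5, 0.7, 0.9]
probs = curriculum_weights(tiers, competence=0.5)
assert max(probs) == probs[2]  # the frontier tier is sampled most often

batch = random.choices(tiers, weights=probs, k=32)
```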
4. Evaluation and Capabilities
Speech-Language Tasks
GLM-4.5V attains state-of-the-art or highly competitive results in end-to-end ASR/TTS, spoken QA, and conversational assessment:
| Model | LibriSpeech WER (%, clean/other) | AISHELL-1 CER (%) | Topic-StoryCloze (S→T, %) | GenQA / UTMOS |
|---|---|---|---|---|
| Whisper-Large-V3 | 2.50 / 4.53 | 9.31 | — | — |
| CosyVoice | — / — | — | — | — |
| GLM-4.5V | 2.82 / 7.66 | 2.46 | 93.6 | 5.40 / 4.45 |
On benchmarks such as StoryCloze (76.3%) and TriviaQA (39.1%), performance exceeds comparably sized models. Human-rated General/Knowledge QA scores and UTMOS (4.45) confirm competitive conversational naturalness (Zeng et al., 2024).
Vision-Language and Multimodal Reasoning
GLM-4.5V demonstrates advanced competence across 42 vision–language benchmarks, including STEM (65.2% on MMMU Pro), GUI agents (35.8% OSWorld), coding (82.2% Design2Code), video reasoning (80.7% Video MMMU with subtitles), and visual grounding (91.3% RefCOCO-avg):
| Category | Benchmark | GLM-4.5V | Qwen2.5-VL-72B | Gemini-2.5-Flash |
|---|---|---|---|---|
| STEM | MMMU Pro | 65.2% | 51.1% | — |
| GUI Agents | OSWorld | 35.8% | 8.8% | — |
| Coding | Design2Code | 82.2% | 41.9% | — |
| Video | Video MMMU (w/ subtitles) | 80.7% | 79.1% | — |
| Visual Grounding | RefCOCO-avg | 91.3% | 90.3% | — |
Ablations show +1–2% gains in multimodal RL stability and accuracy by omitting KL loss. Across evaluated tasks, the model matches or outperforms closed-source Gemini-2.5-Flash in 22 out of 42 cases (Team et al., 1 Jul 2025).
5. Inference Pipeline and Usage
Speech-Language
- Real-time streaming: user audio is discretized by the speech tokenizer, processed by the unified Transformer in an alternating text/speech-token regime (typically 13 text : 26 speech tokens).
- Voice control: special span tokens in the prompt specify emotion (`<style=happy>`), speech rate (`<rate=1.2>`), and dialect (`<dialect=Shanghai>`).
- Output: text may be display-bypassed or passed to a HiFi-GAN-based waveform decoder for speech synthesis; latency is bounded by the duration of the initial block.
- Open-source code and checkpoints via GitHub and Hugging Face (Zeng et al., 2024).
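The alternating 13:26 regime above can be sketched as a decoding plan (a simplified illustration of the interleave schedule, not the production scheduler):

```python
def interleave_plan(n_chunks: int,
                    text_per_chunk: int = 13,
                    speech_per_chunk: int = 26):
    """Yield ('text' | 'speech', count) blocks in the alternating
    regime described above: 13 text tokens, then 26 speech tokens."""
    for _ in range(n_chunks):
        yield ("text", text_per_chunk)
        yield ("speech", speech_per_chunk)

plan = list(interleave_plan(2))
assert plan == [("text", 13), ("speech", 26)] * 2

# At the 12.5 Hz token rate, each 26-token speech block covers
# 26 / 12.5 ≈ 2.08 s of audio, so waveform synthesis can begin
# while later text is still being generated.
```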
Vision-Language
- Accepts arbitrarily sized images or videos (resolution/aspect-agnostic) via bicubic-interpolated absolute embeddings and time-indexed video tokens.
- Sequence lengths up to 8,192 tokens (extending to 131,072 for GLM-4.6V).
- vLLM and SGLang backends for text/video, with planned tool-invocation structured via XML spans.
- Released resources and documentation maintained at https://github.com/zai-org/GLM-V (Team et al., 1 Jul 2025).
6. Limitations and Future Development
Key limitations:
- RL with verifier rewards does not enforce coherence of output reasoning, potentially generating correct answers with erroneous or underspecified intermediate chains.
- RL instability may occur with poor-quality rewards or data; entropy and KL losses require careful tuning to avoid collapse.
- Visual models exhibit error propensity in occluded/cluttered scenes, and tool-use hallucinations may arise.
- Resource concerns in RL: scale, compute, and web-scale data biases pose practical and ethical challenges.
Proposed directions under investigation:
- Reward models that assess intermediate chain-of-thought steps for contradiction and hallucination.
- Adversarial and subset-based validation to detect RL reward gaming.
- Diagnostic multimodal benchmarks for hallucination and multi-step planning (Team et al., 1 Jul 2025).
A plausible implication is that holistic, curriculum-driven RL, unified token spaces, and explicit alignment strategies represent convergent design trends for future large-scale, open-source multimodal models.
References
- "GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot" (Zeng et al., 2024)
- "GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning" (Team et al., 1 Jul 2025)