GLM-4.5V: Advanced Multimodal Intelligence

Updated 27 January 2026
  • GLM-4.5V denotes two parallel state-of-the-art open-source efforts, a speech-language and a vision-language architecture, both built on unified token representations.
  • It employs innovative pre-training and reinforcement learning with curriculum sampling to optimize performance across speech recognition, visual grounding, and multimodal reasoning tasks.
  • The model demonstrates competitive results on industry benchmarks, highlighting its scalability, real-time processing capabilities, and potential for future AI advancements.

GLM-4.5V refers to two distinct, state-of-the-art open-source architectures developed in parallel by the THUDM and Zhipu AI research groups, targeting spoken-language intelligence (Zeng et al., 2024) and advanced multimodal vision–language reasoning (Team et al., 1 Jul 2025), respectively. Both models are positioned at the forefront of large-scale, instruction-tuned multimodal systems and exemplify trends in scalable model design, unified token representations, and reinforcement learning-driven optimization for diverse downstream tasks.

1. Model Architectures

Speech-Language GLM-4.5V

The THUDM GLM-4.5V is built on the GLM-4-9B-Base 9B-parameter text Transformer, augmented with an ultra-low bitrate (175 bps), single-codebook speech token vocabulary and ancillary modules for end-to-end, real-time spoken dialogue (Zeng et al., 2024). The architecture introduces a VQ-VAE bottleneck at the mid-layer of a Whisper-Large-V3 ASR encoder, reducing frame rate from 50 Hz to 12.5 Hz and mapping each audio segment (80 ms at 16 kHz) to a discrete token ID. The backbone’s input embedding layer is extended for speech-token IDs; the main Transformer (attention, FFNs) is unmodified, yielding bidirectional capacity for cross-modal token streams.
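The rate and bitrate figures above can be checked with a few lines of arithmetic; the implied 16,384-entry speech vocabulary is an inference from the stated numbers, not a figure quoted in the text:

```python
# Sanity check of the speech tokenizer arithmetic stated above. The vocabulary
# size is inferred from 175 bps at 12.5 Hz; it is not quoted in the text.
frame_rate_hz = 50 / 4                   # 50 Hz Whisper features reduced 4x -> 12.5 Hz
bits_per_token = 175 / frame_rate_hz     # 14.0 bits per speech token
vocab_size = 2 ** int(bits_per_token)    # 16,384 possible token IDs
segment_ms = 1000 / frame_rate_hz        # 80.0 ms of audio per token at 16 kHz
```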

Generation alternates between autoregressive text and speech token predictions within the unified Transformer backbone, enabling joint modeling of \{w_t\} (text) and \{s_t\} (speech):

L_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}),

where x_t ranges over the union of text and speech tokens.
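A minimal sketch of this objective, assuming a single interleaved stream of text and speech tokens scored by hypothetical per-step probabilities:

```python
import math

def lm_loss(token_probs):
    """Summed negative log-likelihood over every position; text and speech
    tokens are treated identically under the unified vocabulary."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical P(x_t | x_<t) values for a 4-token mixed text/speech sequence.
loss = lm_loss([0.9, 0.5, 0.7, 0.8])
```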

Vision-Language GLM-4.5V

The Zhipu AI GLM-4.5V (sometimes styled GLM-4.5V-AIR) incorporates an AIMv2-Huge ViT vision encoder (supplemented with 3D convolutions for temporally-aware video processing, and RoPE for positional encoding) connected to a Mixture-of-Experts (MoE) language head with ~106B total, ~12B activated parameters (12B-A MoE). Visual features are projected via an MLP adapter into the LLM’s token space (Team et al., 1 Jul 2025).

Key architectural features:

  • Flexible support for both chain-of-thought ("thinking") and flat response generation via /nothink mode tokens.
  • Temporal grounding in video streams via time-index tokens inserted after each frame.
  • Extended 3D-RoPE in the LLM head for improved spatial reasoning.
  • Absolute positional embeddings can be interpolated to handle variable resolutions and image aspect ratios.
  • Full-context modeling up to 8,192 tokens (32K–128K for GLM-4.6V).
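The time-index mechanism from the feature list can be sketched as follows; the token format, frame rate, and frame contents are illustrative assumptions, not the model's actual vocabulary:

```python
def interleave_time_tokens(frame_tokens, fps=2.0):
    """Append a timestamp token after each video frame's patch tokens,
    giving the LLM an explicit temporal index for grounding."""
    seq = []
    for i, tokens in enumerate(frame_tokens):
        seq.extend(tokens)
        seq.append(f"<t={i / fps:.1f}s>")   # hypothetical time-index token
    return seq

# Two frames of placeholder patch tokens sampled at 2 fps.
seq = interleave_time_tokens([["f0a", "f0b"], ["f1a"]])
# -> ["f0a", "f0b", "<t=0.0s>", "f1a", "<t=0.5s>"]
```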

2. Pre-training Strategies

Speech-Language (THUDM)

Data sourcing and alignment use a combination of:

  • Interleaved synthetic speech–text data via a text-to-token TTS model (∼70%)
  • Unsupervised real speech (∼10%)
  • Paired ASR/TTS supervised data (∼5%)
  • Text-only corpora (∼15%)
This yields a total pre-training volume of ~10^{12} tokens.

The training objective leverages joint sequences of the form:

[\dots, \texttt{<speech>}, s_1, \dots, s_m, \texttt{</speech>}, w_{i+1}, \dots]

Optimization employs AdamW with warmup followed by linear decay from 6 \times 10^{-5} to 6 \times 10^{-6}, an 8,192-token context, and large aggregate batches.
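The schedule reduces to a simple function of the training step; warmup length and total steps below are illustrative, and only the peak and final rates come from the text:

```python
def lr_at(step, total_steps, warmup_steps, peak=6e-5, final=6e-6):
    """Linear warmup to `peak`, then linear decay to `final`."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak + (final - peak) * frac
```

An AdamW optimizer would have its learning rate set to `lr_at(step, ...)` before each update.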

Vision-Language (Zhipu AI)

Pre-training draws on billions of multimodal samples:

  • Image-caption pairs (filtered, concept-balanced)
  • Interleaved web/academic multimodal text (MINT, MMC4, OmniCorpus)
  • OCR datasets (synthetic + real), grounding annotations, GUI screenshots
  • Video-caption corpora, processed with human metadata verification
  • 50M mixed-format instruction-tuning samples

Autoregressive next-token modeling is applied over interleaved visual and text tokens:

L_{\text{LM}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, V)

with an auxiliary MoE load-balancing loss:

L_{\text{balance}} = \lambda \sum_{e=1}^{E} \left( \frac{u_e}{N} - \frac{1}{E} \right)^2, \quad \lambda = 10^{-4}
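A direct transcription of this loss, assuming u_e counts the tokens routed to expert e out of N routed tokens in the batch:

```python
def balance_loss(expert_counts, n_tokens, lam=1e-4):
    """Penalize the squared deviation of each expert's load share u_e/N
    from the uniform share 1/E."""
    n_experts = len(expert_counts)
    return lam * sum((u / n_tokens - 1 / n_experts) ** 2
                     for u in expert_counts)
```

With perfectly uniform routing the loss is exactly zero; routing collapse onto a single expert maximizes it.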

3. Reinforcement Learning with Curriculum Sampling (RLCS)

GLM-4.5V employs RLCS to optimize multimodal reasoning. Difficulty tiers—stratified based on pass@k scores—drive sample selection via a smooth exponential-weighted policy targeting items within the model’s learning frontier:

p_t(k) \propto \exp\!\left( -\frac{|d_{\mathrm{model}}(k) - d_{\mathrm{off}}(k)|}{\tau} \right)

where d_{\mathrm{off}}(k) is the offline difficulty and d_{\mathrm{model}}(k) the model's current competence on item k. The RLCS loop evaluates, rewards, and re-weights rollout samples based on:

  • STEM QA: r = \mathbb{I}[\text{boxed\_answer} = \text{gt}]
  • OCR: r = 1 - \frac{d_{\text{edit}}(\text{ans}, \text{gt})}{\max(|\text{ans}|, |\text{gt}|)}
  • Grounding: r = \frac{\#\{\mathrm{IoU} > \tau\}}{\#\mathrm{boxes}}
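Each verifier reward above can be transcribed in a few lines; the edit-distance term uses a plain Levenshtein implementation, and all inputs are toy examples:

```python
def stem_reward(boxed_answer, gt):
    # Exact-match indicator on the boxed final answer.
    return 1.0 if boxed_answer == gt else 0.0

def ocr_reward(ans, gt):
    # 1 - normalized Levenshtein distance (single-row DP).
    m, n = len(ans), len(gt)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (ans[i - 1] != gt[j - 1]))
    return 1 - d[n] / max(m, n) if max(m, n) else 1.0

def grounding_reward(ious, iou_thresh=0.5):
    # Fraction of predicted boxes whose IoU clears the threshold.
    return sum(iou > iou_thresh for iou in ious) / len(ious)
```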

This approach yields a +3–5% performance uplift versus random sampling and mitigates RL instability by dynamically matching curriculum to model competence (Team et al., 1 Jul 2025).
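The curriculum weighting itself is a one-liner per item; the difficulty scores and temperature below are illustrative:

```python
import math

def curriculum_weights(d_model, d_off, tau=0.2):
    """Unnormalized sampling weight per item: exp(-|d_model - d_off| / tau).
    Items whose offline difficulty matches the model's current competence
    receive the highest weight."""
    return [math.exp(-abs(dm - do) / tau) for dm, do in zip(d_model, d_off)]

# Item 0 sits on the learning frontier; item 1 is far too hard.
w = curriculum_weights([0.5, 0.5], [0.5, 0.95])
```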

4. Evaluation and Capabilities

Speech-Language Tasks

GLM-4.5V attains state-of-the-art or highly competitive results in end-to-end ASR/TTS, spoken QA, and conversational assessment:

Model            | LibriSpeech WER (%) | AISHELL-1 CER (%) | Topic-StoryCloze (S→T, %) | GenQA / UTMOS
Whisper-Large-V3 | 2.50 / 4.53         | 9.31              | —                         | —
CosyVoice        | —                   | —                 | —                         | — / —
GLM-4.5V         | 2.82 / 7.66         | 2.46              | 93.6                      | 5.40 / 4.45

On benchmarks such as StoryCloze (76.3%) and TriviaQA (39.1%), performance exceeds that of comparably sized models. Human-rated General/Knowledge QA and UTMOS (4.45) confirm competitive conversational naturalness (Zeng et al., 2024).

Vision-Language and Multimodal Reasoning

GLM-4.5V demonstrates advanced competence across 42 vision–language benchmarks, including STEM (65.2% on MMMU Pro), GUI agents (35.8% OSWorld), coding (82.2% Design2Code), video reasoning (80.7% Video MMMU with subtitles), and visual grounding (91.3% RefCOCO-avg):

Category         | Benchmark        | GLM-4.5V | Qwen2.5-VL-72B | Gemini-2.5-Flash
STEM             | MMMU Pro         | 65.2%    | 51.1%          | —
GUI Agents       | OSWorld          | 35.8%    | 8.8%           | —
Coding           | Design2Code      | 82.2%    | 41.9%          | —
Video            | MMMU w/subtitles | 80.7%    | 79.1%          | —
Visual Grounding | RefCOCO-avg      | 91.3%    | 90.3%          | —

Ablations show +1–2% gains in multimodal RL stability and accuracy by omitting KL loss. Across evaluated tasks, the model matches or outperforms closed-source Gemini-2.5-Flash in 22 out of 42 cases (Team et al., 1 Jul 2025).

5. Inference Pipeline and Usage

Speech-Language

  • Real-time streaming: user audio is discretized by the speech tokenizer, processed by the unified Transformer in an alternating text/speech-token regime (typically 13 text : 26 speech tokens).
  • Voice control: Special span tokens in the prompt for emotion (<style=happy>), speech rate (<rate=1.2>), and dialect (<dialect=Shanghai>).
  • Output: text can be displayed directly and/or passed to a HiFi-GAN-based waveform decoder for speech synthesis. Latency is bounded by the initial block size (b = 0.8 s).
  • Open-source code and checkpoints via GitHub and Hugging Face (Zeng et al., 2024).
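The alternating decode regime above can be sketched with placeholder tokens, using chunk sizes from the 13 text : 26 speech ratio:

```python
def interleave_stream(text_tokens, speech_tokens, n_text=13, n_speech=26):
    """Emit n_text text tokens, then n_speech speech tokens, repeating
    until both streams are drained."""
    out, ti, si = [], 0, 0
    while ti < len(text_tokens) or si < len(speech_tokens):
        out += text_tokens[ti:ti + n_text]
        ti += n_text
        out += speech_tokens[si:si + n_speech]
        si += n_speech
    return out
```

In real streaming the two streams are produced incrementally by the same backbone; here they are pre-built lists for illustration.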

Vision-Language

  • Accepts arbitrarily sized images or videos (resolution/aspect-agnostic) via bicubic-interpolated absolute embeddings and time-indexed video tokens.
  • Sequence lengths up to 8,192 tokens (extending to 131,072 for GLM-4.6V).
  • vLLM and SGLang backends for text/video, with planned tool-invocation structured via XML spans.
  • Released resources and documentation maintained at https://github.com/zai-org/GLM-V (Team et al., 1 Jul 2025).
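Resolution-agnostic input hinges on resizing the learned positional-embedding table; the sketch below substitutes 1-D linear interpolation for the bicubic scheme described above, to keep it dependency-free:

```python
def interpolate_pos_emb(pos_emb, new_len):
    """Resize a table of positional embeddings (lists of floats) from
    len(pos_emb) positions to new_len positions."""
    old_len = len(pos_emb)
    out = []
    for i in range(new_len):
        # Map the new index onto the old table's coordinate space.
        x = i * (old_len - 1) / (new_len - 1) if new_len > 1 else 0.0
        lo = int(x)
        hi = min(lo + 1, old_len - 1)
        f = x - lo
        out.append([a * (1 - f) + b * f
                    for a, b in zip(pos_emb[lo], pos_emb[hi])])
    return out
```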

6. Limitations and Future Development

Key limitations:

  • RL with verifier rewards does not enforce coherence of output reasoning, potentially generating correct answers with erroneous or underspecified intermediate chains.
  • RL instability may occur with poor-quality rewards or data; entropy and KL losses require careful tuning to avoid collapse.
  • Visual models exhibit error propensity in occluded/cluttered scenes, and tool-use hallucinations may arise.
  • Resource concerns in RL: scale, compute, and web-scale data biases pose practical and ethical challenges.

Proposed directions under investigation converge on holistic, curriculum-driven RL, unified token spaces, and explicit alignment strategies as shared design trends for future large-scale, open-source multimodal models.

References

  • "GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot" (Zeng et al., 2024)
  • "GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning" (Team et al., 1 Jul 2025)
