GLM-4.5V: Advanced Multimodal Intelligence

Updated 27 January 2026
  • GLM-4.5V denotes two parallel state-of-the-art open-source efforts, a speech-language and a vision-language architecture, both built on unified token representations.
  • It employs innovative pre-training and reinforcement learning with curriculum sampling to optimize performance across speech recognition, visual grounding, and multimodal reasoning tasks.
  • The model demonstrates competitive results on industry benchmarks, highlighting its scalability, real-time processing capabilities, and potential for future AI advancements.

GLM-4.5V refers to two distinct, state-of-the-art open-source architectures developed in parallel by the THUDM and Zhipu AI research groups, targeting spoken-language intelligence (Zeng et al., 2024) and advanced multimodal vision–language reasoning (Team et al., 1 Jul 2025), respectively. Both models are positioned at the forefront of large-scale, instruction-tuned multimodal systems and exemplify trends in scalable model design, unified token representations, and reinforcement learning-driven optimization for diverse downstream tasks.

1. Model Architectures

Speech-Language GLM-4.5V

The THUDM GLM-4.5V is built on the GLM-4-9B-Base 9B-parameter text Transformer, augmented with an ultra-low bitrate (175 bps), single-codebook speech token vocabulary and ancillary modules for end-to-end, real-time spoken dialogue (Zeng et al., 2024). The architecture introduces a VQ-VAE bottleneck at the mid-layer of a Whisper-Large-V3 ASR encoder, reducing frame rate from 50 Hz to 12.5 Hz and mapping each audio segment (80 ms at 16 kHz) to a discrete token ID. The backbone’s input embedding layer is extended for speech-token IDs; the main Transformer (attention, FFNs) is unmodified, yielding bidirectional capacity for cross-modal token streams.
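The rate and bitrate figures above can be checked with a few lines of arithmetic; the implied 16,384-entry speech vocabulary is an inference from the stated numbers, not a figure quoted in the text:

```python
# Sanity check of the speech tokenizer arithmetic stated above. The vocabulary
# size is inferred from 175 bps at 12.5 Hz; it is not quoted in the text.
frame_rate_hz = 50 / 4                   # 50 Hz Whisper features reduced 4x -> 12.5 Hz
bits_per_token = 175 / frame_rate_hz     # 14.0 bits per speech token
vocab_size = 2 ** int(bits_per_token)    # 16,384 possible token IDs
segment_ms = 1000 / frame_rate_hz        # 80.0 ms of audio per token at 16 kHz
```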

Generation alternates between autoregressive text and speech token predictions within the unified Transformer backbone, enabling joint modeling of \{w_t\} (text) and \{s_t\} (speech):

L_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}),

where x_t ranges over the union of text and speech tokens.
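A minimal sketch of this objective, assuming a single interleaved stream of text and speech tokens scored by hypothetical per-step probabilities:

```python
import math

def lm_loss(token_probs):
    """Summed negative log-likelihood over every position; text and speech
    tokens are treated identically under the unified vocabulary."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical P(x_t | x_<t) values for a 4-token mixed text/speech sequence.
loss = lm_loss([0.9, 0.5, 0.7, 0.8])
```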

Vision-Language GLM-4.5V

The Zhipu AI GLM-4.5V (sometimes styled GLM-4.5V-AIR) incorporates an AIMv2-Huge ViT vision encoder (supplemented with 3D convolutions for temporally-aware video processing, and RoPE for positional encoding) connected to a Mixture-of-Experts (MoE) language head with ~106B total, ~12B activated parameters (12B-A MoE). Visual features are projected via an MLP adapter into the LLM’s token space (Team et al., 1 Jul 2025).

Key architectural features:

  • Flexible support for both chain-of-thought ("thinking") and flat response generation via /nothink mode tokens.
  • Temporal grounding in video streams via time-index tokens inserted after each frame.
  • Extended 3D-RoPE in the LLM head for improved spatial reasoning.
  • Absolute positional embeddings can be interpolated to handle variable resolutions and image aspect ratios.
  • Full-context modeling up to 8,192 tokens (32K–128K for GLM-4.6V).
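The time-index mechanism from the feature list can be sketched as follows; the token format, frame rate, and frame contents are illustrative assumptions, not the model's actual vocabulary:

```python
def interleave_time_tokens(frame_tokens, fps=2.0):
    """Append a timestamp token after each video frame's patch tokens,
    giving the LLM an explicit temporal index for grounding."""
    seq = []
    for i, tokens in enumerate(frame_tokens):
        seq.extend(tokens)
        seq.append(f"<t={i / fps:.1f}s>")   # hypothetical time-index token
    return seq

# Two frames of placeholder patch tokens sampled at 2 fps.
seq = interleave_time_tokens([["f0a", "f0b"], ["f1a"]])
# -> ["f0a", "f0b", "<t=0.0s>", "f1a", "<t=0.5s>"]
```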

2. Pre-training Strategies

Speech-Language (THUDM)

Data sourcing and alignment use a combination of:

  • Interleaved synthetic speech–text data via a text-to-token TTS model (∼70%)
  • Unsupervised real speech (∼10%)
  • Paired ASR/TTS supervised data (∼5%)
  • Text-only corpora (∼15%)
This yields a total pre-training volume of ~10^{12} tokens.

The training objective leverages joint sequences of the form:

[\dots, \texttt{<speech>}, s_1, \dots, s_m, \texttt{</speech>}, w_{i+1}, \dots]

Optimization employs AdamW with warmup followed by linear decay from 6 \times 10^{-5} to 6 \times 10^{-6}, an 8,192-token context, and large aggregate batches.
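The schedule reduces to a simple function of the training step; warmup length and total steps below are illustrative, and only the peak and final rates come from the text:

```python
def lr_at(step, total_steps, warmup_steps, peak=6e-5, final=6e-6):
    """Linear warmup to `peak`, then linear decay to `final`."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak + (final - peak) * frac
```

An AdamW optimizer would have its learning rate set to `lr_at(step, ...)` before each update.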

Vision-Language (Zhipu AI)

Pre-training draws on billions of multimodal samples:

  • Image-caption pairs (filtered, concept-balanced)
  • Interleaved web/academic multimodal text (MINT, MMC4, OmniCorpus)
  • OCR datasets (synthetic + real), grounding annotations, GUI screenshots
  • Video-caption corpora, processed with human metadata verification
  • 50M mixed-format instruction-tuning samples

Autoregressive next-token modeling is applied over interleaved visual and text tokens:

L_{\text{LM}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, V)

with an auxiliary MoE load-balancing loss:

L_{\text{balance}} = \lambda \sum_{e=1}^{E} \left( \frac{u_e}{N} - \frac{1}{E} \right)^2, \quad \lambda = 10^{-4}
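A direct transcription of this loss, assuming u_e counts the tokens routed to expert e out of N routed tokens in the batch:

```python
def balance_loss(expert_counts, n_tokens, lam=1e-4):
    """Penalize the squared deviation of each expert's load share u_e/N
    from the uniform share 1/E."""
    n_experts = len(expert_counts)
    return lam * sum((u / n_tokens - 1 / n_experts) ** 2
                     for u in expert_counts)
```

With perfectly uniform routing the loss is exactly zero; routing collapse onto a single expert maximizes it.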

3. Reinforcement Learning with Curriculum Sampling (RLCS)

GLM-4.5V employs RLCS to optimize multimodal reasoning. Difficulty tiers—stratified based on pass@k scores—drive sample selection via a smooth exponential-weighted policy targeting items within the model’s learning frontier:

p_t(k) \propto \exp\!\left( -\frac{|d_{\mathrm{model}}(k) - d_{\mathrm{off}}(k)|}{\tau} \right)

where d_{\mathrm{off}}(k) is the offline difficulty and d_{\mathrm{model}}(k) the model's current competence on item k. The RLCS loop evaluates, rewards, and re-weights rollout samples based on:

  • STEM QA: r = \mathbb{I}[\text{boxed\_answer} = \text{gt}]
  • OCR: r = 1 - \frac{d_{\text{edit}}(\text{ans}, \text{gt})}{\max(|\text{ans}|, |\text{gt}|)}
  • Grounding: r = \frac{\#\{\mathrm{IoU} > \tau\}}{\#\mathrm{boxes}}
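Each verifier reward above can be transcribed in a few lines; the edit-distance term uses a plain Levenshtein implementation, and all inputs are toy examples:

```python
def stem_reward(boxed_answer, gt):
    # Exact-match indicator on the boxed final answer.
    return 1.0 if boxed_answer == gt else 0.0

def ocr_reward(ans, gt):
    # 1 - normalized Levenshtein distance (single-row DP).
    m, n = len(ans), len(gt)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (ans[i - 1] != gt[j - 1]))
    return 1 - d[n] / max(m, n) if max(m, n) else 1.0

def grounding_reward(ious, iou_thresh=0.5):
    # Fraction of predicted boxes whose IoU clears the threshold.
    return sum(iou > iou_thresh for iou in ious) / len(ious)
```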

This approach yields a +3–5% performance uplift versus random sampling and mitigates RL instability by dynamically matching curriculum to model competence (Team et al., 1 Jul 2025).
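The curriculum weighting itself is a one-liner per item; the difficulty scores and temperature below are illustrative:

```python
import math

def curriculum_weights(d_model, d_off, tau=0.2):
    """Unnormalized sampling weight per item: exp(-|d_model - d_off| / tau).
    Items whose offline difficulty matches the model's current competence
    receive the highest weight."""
    return [math.exp(-abs(dm - do) / tau) for dm, do in zip(d_model, d_off)]

# Item 0 sits on the learning frontier; item 1 is far too hard.
w = curriculum_weights([0.5, 0.5], [0.5, 0.95])
```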

4. Evaluation and Capabilities

Speech-Language Tasks

GLM-4.5V attains state-of-the-art or highly competitive results in end-to-end ASR/TTS, spoken QA, and conversational assessment:

Model            | LibriSpeech WER (%) | AISHELL-1 CER (%) | Topic-StoryCloze (S→T, %) | GenQA / UTMOS
Whisper-Large-V3 | 2.50 / 4.53         | 9.31              | —                         | —
CosyVoice        | —                   | —                 | —                         | — / —
GLM-4.5V         | 2.82 / 7.66         | 2.46              | 93.6                      | 5.40 / 4.45

On benchmarks such as StoryCloze (76.3%) and TriviaQA (39.1%), performance exceeds that of comparably sized models. Human-rated General/Knowledge QA and UTMOS (4.45) confirm competitive conversational naturalness (Zeng et al., 2024).

Vision-Language and Multimodal Reasoning

GLM-4.5V demonstrates advanced competence across 42 vision–language benchmarks, including STEM (65.2% on MMMU Pro), GUI agents (35.8% OSWorld), coding (82.2% Design2Code), video reasoning (80.7% Video MMMU with subtitles), and visual grounding (91.3% RefCOCO-avg):

Category         | Benchmark        | GLM-4.5V | Qwen2.5-VL-72B | Gemini-2.5-Flash
STEM             | MMMU Pro         | 65.2%    | 51.1%          | —
GUI Agents       | OSWorld          | 35.8%    | 8.8%           | —
Coding           | Design2Code      | 82.2%    | 41.9%          | —
Video            | MMMU w/subtitles | 80.7%    | 79.1%          | —
Visual Grounding | RefCOCO-avg      | 91.3%    | 90.3%          | —

Ablations show +1–2% gains in multimodal RL stability and accuracy by omitting KL loss. Across evaluated tasks, the model matches or outperforms closed-source Gemini-2.5-Flash in 22 out of 42 cases (Team et al., 1 Jul 2025).

5. Inference Pipeline and Usage

Speech-Language

  • Real-time streaming: user audio is discretized by the speech tokenizer, processed by the unified Transformer in an alternating text/speech-token regime (typically 13 text : 26 speech tokens).
  • Voice control: Special span tokens in the prompt for emotion (<style=happy>), speech rate (<rate=1.2>), and dialect (<dialect=Shanghai>).
  • Output: text can be displayed directly and/or passed to a HiFi-GAN-based waveform decoder for speech synthesis. Latency is bounded by the initial block size (b = 0.8 s).
  • Open-source code and checkpoints via GitHub and Hugging Face (Zeng et al., 2024).
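The alternating decode regime above can be sketched with placeholder tokens, using chunk sizes from the 13 text : 26 speech ratio:

```python
def interleave_stream(text_tokens, speech_tokens, n_text=13, n_speech=26):
    """Emit n_text text tokens, then n_speech speech tokens, repeating
    until both streams are drained."""
    out, ti, si = [], 0, 0
    while ti < len(text_tokens) or si < len(speech_tokens):
        out += text_tokens[ti:ti + n_text]
        ti += n_text
        out += speech_tokens[si:si + n_speech]
        si += n_speech
    return out
```

In real streaming the two streams are produced incrementally by the same backbone; here they are pre-built lists for illustration.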

Vision-Language

  • Accepts arbitrarily sized images or videos (resolution/aspect-agnostic) via bicubic-interpolated absolute embeddings and time-indexed video tokens.
  • Sequence lengths up to 8,192 tokens (extending to 131,072 for GLM-4.6V).
  • vLLM and SGLang backends for text/video, with planned tool-invocation structured via XML spans.
  • Released resources and documentation maintained at https://github.com/zai-org/GLM-V (Team et al., 1 Jul 2025).
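Resolution-agnostic input hinges on resizing the learned positional-embedding table; the sketch below substitutes 1-D linear interpolation for the bicubic scheme described above, to keep it dependency-free:

```python
def interpolate_pos_emb(pos_emb, new_len):
    """Resize a table of positional embeddings (lists of floats) from
    len(pos_emb) positions to new_len positions."""
    old_len = len(pos_emb)
    out = []
    for i in range(new_len):
        # Map the new index onto the old table's coordinate space.
        x = i * (old_len - 1) / (new_len - 1) if new_len > 1 else 0.0
        lo = int(x)
        hi = min(lo + 1, old_len - 1)
        f = x - lo
        out.append([a * (1 - f) + b * f
                    for a, b in zip(pos_emb[lo], pos_emb[hi])])
    return out
```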

6. Limitations and Future Development

Key limitations:

  • RL with verifier rewards does not enforce coherence of output reasoning, potentially generating correct answers with erroneous or underspecified intermediate chains.
  • RL instability may occur with poor-quality rewards or data; entropy and KL losses require careful tuning to avoid collapse.
  • Visual models exhibit error propensity in occluded/cluttered scenes, and tool-use hallucinations may arise.
  • Resource concerns in RL: scale, compute, and web-scale data biases pose practical and ethical challenges.

Proposed directions under investigation converge on holistic, curriculum-driven RL, unified token spaces, and explicit alignment strategies as shared design trends for future large-scale, open-source multimodal models.

References

  • "GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot" (Zeng et al., 2024)
  • "GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning" (Team et al., 1 Jul 2025)
