
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Published 1 Jul 2025 in cs.CV, cs.AI, and cs.LG | (2507.01006v6)

Abstract: We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. A brief overview is available at https://z.ai/blog/glm-4.6v. Code, models and more information are released at https://github.com/zai-org/GLM-V.

Summary

  • The paper introduces a scalable reinforcement learning framework that enables dual 'thinking' and 'non-thinking' inference in unified vision–language models.
  • It employs a transformer-based architecture with ViT encoding, MLP visual projection, and chain-of-thought capabilities to support multimodal reasoning over images and videos.
  • The models achieve state-of-the-art results on 42 benchmarks, demonstrating superior cross-domain generalization and parameter efficiency.

Versatile Multimodal Reasoning via Scalable Reinforcement Learning: The GLM-4.5V and GLM-4.1V-Thinking Models

Model Design and Multimodal Architecture

The GLM-V series (GLM-4.1V-Thinking, GLM-4.5V, GLM-4.6V) embodies a highly unified transformer-based vision-language architecture optimized for scalable multimodal reasoning and generalization. The architectural core comprises a ViT encoder (AIMv2-Huge initialization), an MLP-based visual projector, and an LLM that acts as a unified multimodal decoder. Video input support is natively integrated through 3D convolutions and time index token encoding, allowing the model to retain temporal coherence and operate on videos at native resolutions and aspect ratios. 2D-RoPE and 3D-RoPE modifications enable spatial and temporal reasoning over variable aspect ratios and high-resolution images, while bicubic interpolation adapts absolute position embeddings for arbitrary input shapes.

Figure 1: Shared GLM-V architecture: ViT encoder, MLP projector, and large language decoder provide unified image/video/text processing and multimodal reasoning, with explicit temporal encoding for videos.
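The temporal encoding described above can be illustrated in miniature: interleave each frame's patch tokens with a time-index token so the decoder knows when each frame occurs. The token spellings and per-frame layout below are illustrative assumptions, not the model's actual vocabulary.

```python
def build_video_token_sequence(num_frames, patches_per_frame, timestamps):
    """Interleave per-frame patch placeholders with time-index tokens so the
    decoder sees when each frame occurs (a sketch; the real model inserts
    learned time-index tokens between frame patch embeddings)."""
    assert len(timestamps) == num_frames
    tokens = []
    for i, t in enumerate(timestamps):
        tokens.append(f"<t={t:.1f}s>")  # hypothetical time-index token
        tokens.extend(f"<frame{i}_patch{j}>" for j in range(patches_per_frame))
    return tokens

# Two frames sampled at 0.0s and 0.5s, three patch tokens each
seq = build_video_token_sequence(2, 3, [0.0, 0.5])
```

Because the timestamps travel with the tokens, the model can reason about frame timing even when frames are sampled at irregular intervals.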

The GLM-4.5V (106B-A12B MoE model) and GLM-4.1V-Thinking (9B dense model) instantiate these capabilities at two different computational scales, with GLM-4.5V supporting dual “thinking” and “non-thinking” output modes via prompt flags, facilitating both efficient and chain-of-thought (CoT) inference.

Data Curation for Broad Reasoning and Robust Perception

Substantial effort is devoted to data curation across multiple axes of vision-language capability. Large-scale, high-quality image-caption data is filtered using heuristics, CLIP-based relevance, concept-balanced resampling, and factual recaptioning (see Figure 2), producing a corpus with minimal hallucination and enhanced density of factual content.

Figure 2: Recaptioning pipeline outputs more factual and less noisy image-text pairs, mitigating the effect of web-scale hallucinations.

Interleaved image-text document alignment leverages both web and academic corpora with custom pipelines targeting high-information-density content (scientific illustrations, diagrams, GUI screenshots, etc.). OCR data is synthesized and mined from synthetic renderings, natural scenes (extracted with PaddleOCR), and structured academic documents. Visual grounding is supported by 40M LAION-GLIPv2 natural image annotations and 140M GUI-specific question-answer pairs, and video data is aggregated, filtered, and de-duplicated with a human-in-the-loop approach for compositionality and caption faithfulness.

Domain-specific instruction and long-chain-of-thought formats are meticulously enforced for RL cold start, using special tags for answer boxing and reasoning, improving parsing and reward precision, and enabling verifiable reasoning evaluations. The corpus covers general visual reasoning, STEM, document analysis, GUI agent trajectories, and more, maximizing cross-domain interaction during RL.
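The concept-balanced resampling mentioned above can be illustrated with a toy weighting scheme: samples of over-represented concepts are down-weighted so rare concepts survive resampling. Inverse-frequency weighting is a common approach and an assumption here; the report does not specify the exact formula.

```python
from collections import Counter

def balanced_weights(concepts):
    """Give each sample a weight inversely proportional to the frequency of
    its concept label, so rare concepts are not drowned out when resampling."""
    freq = Counter(concepts)
    weights = [1.0 / freq[c] for c in concepts]
    total = sum(weights)
    return [w / total for w in weights]  # normalized sampling distribution

# Three samples of a common concept vs. one rare concept:
# the rare sample receives far more probability mass per sample.
w = balanced_weights(["cat", "cat", "cat", "axolotl"])
```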

Supervised Fine-Tuning and Dual-Mode Inference

Supervised fine-tuning (SFT) bridges foundation pretraining and RL, strategically switching from knowledge-injection to reasoning-style alignment. For verifiable and open-ended tasks, SFT datasets are strictly formatted for explicit reasoning/answer demarcation, and mixture-of-experts architectures are used with rigorous parallelism and balance optimization. Both multimodal and text-only long-form reasoning data are used to maintain language and cross-modal consistency.

Distinctively, GLM-4.5V/4.6V support explicit “thinking” and “non-thinking” inference by prompt flag, allowing resource-constrained, high-throughput decoding as well as CoT-style traceable reasoning.
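A minimal sketch of how such a mode switch might be wired into prompt construction, assuming simple chat-style templating; the tag names and flag wording are illustrative, not the model's actual chat template.

```python
def build_prompt(question, thinking=True):
    """Assemble a chat prompt whose system text toggles chain-of-thought.
    The tag names and instructions are illustrative assumptions."""
    if thinking:
        system = ("Reason step by step inside <think>...</think>, "
                  "then give the final answer inside <answer>...</answer>.")
    else:
        system = ("Answer directly inside <answer>...</answer> "
                  "without a <think> section.")
    return f"[system] {system}\n[user] {question}\n[assistant]"

fast = build_prompt("What is 2 + 2?", thinking=False)   # high-throughput mode
slow = build_prompt("What is 2 + 2?", thinking=True)    # traceable CoT mode
```

Keeping both modes behind a single flag lets the same deployed model serve latency-sensitive traffic and auditable reasoning traffic.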

Reinforcement Learning with Curriculum Sampling (RLCS): Data, Reward, and Stability

The paper’s central technical contribution is a reinforcement learning recipe that scales across multimodal domains without catastrophic forgetting or reward hacking. RLCS combines curriculum learning with adaptive online and offline sample-difficulty ranking and dynamic expansion ratios (using EMA tracking), filtering out overly easy or overly hard samples during rollouts and maximizing meaningful gradient throughput.
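The EMA-based difficulty bookkeeping can be sketched as follows; the decay constant and the 0.1–0.9 usefulness band are illustrative assumptions, as the report does not publish these exact values.

```python
class CurriculumSampler:
    """Track a per-sample solve rate with an exponential moving average and
    keep only samples in a useful difficulty band (a sketch of the RLCS idea;
    the decay and band thresholds are illustrative assumptions)."""

    def __init__(self, decay=0.9, low=0.1, high=0.9):
        self.decay, self.low, self.high = decay, low, high
        self.solve_rate = {}  # sample id -> EMA of rollout success

    def update(self, sample_id, success_fraction):
        prev = self.solve_rate.get(sample_id, success_fraction)
        self.solve_rate[sample_id] = (
            self.decay * prev + (1 - self.decay) * success_fraction
        )

    def is_useful(self, sample_id):
        rate = self.solve_rate.get(sample_id, 0.5)
        # Drop samples that are trivially easy or hopelessly hard:
        # neither extreme produces a useful gradient signal.
        return self.low < rate < self.high

cs = CurriculumSampler()
cs.update("q1", 1.0)   # always solved -> drifts into the "too easy" zone
cs.update("q2", 0.5)   # mixed outcomes -> stays in the useful band
```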

A comprehensive domain-specific reward infrastructure is constructed (see Figure 3), using extraction policies (rule-based/LLM-extraction), content- and style-aware validators, and task-specific checking (numeric, textual, action, reasoning structure, etc.). The reward pipeline is engineered for robustness: flaws in a single domain’s verifier can poison or collapse RL training, as empirically demonstrated.

Figure 3: Reward hacking in multimodal RL: noisy or insecure verifiers cause reward inflation and performance collapse despite progress in other sub-domains, necessitating resistant verifiers for stable RL.

RL training employs GRPO optimization with the KL and entropy penalties disabled, and applies large batches with rollouts dynamically packed and sequenced to maximize compute utilization. Per-sample loss yields higher stability than per-token loss, and top-p sampling with p = 1 is empirically optimal for RL consistency.
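A group-relative advantage in the GRPO style can be sketched as below; mean-centering with per-group standard-deviation scaling is the common GRPO formulation and is assumed here, and, in line with the setup above, no KL or entropy term appears.

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: each rollout's reward relative to the mean of
    its group, scaled by the group's standard deviation (a common
    formulation; the exact normalization is an assumption). No KL or
    entropy penalty term appears, matching the training setup described."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against all-identical reward groups
    return [(r - mean) / std for r in rewards]

# Four rollouts of the same prompt: two solved, two failed
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are computed within each prompt's rollout group, no separate value network is needed, which is part of what makes the recipe cheap to scale.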

Cross-Domain Generalization and Mutual Capability Transfer

Empirical results show extensive cross-domain transfer and mutual benefit during RL, as visualized in Figure 4. Training in one domain, such as STEM, results in measurable improvement in disjoint tasks such as visual grounding and general VQA; joint (mix-all) training yields further increases except in domains with extreme specificity (grounding, GUI-agent), where targeted multi-domain schedules remain necessary.

Figure 4: RL-induced cross-domain generalization: single-domain RL yields positive performance transfer, while mixed RL maximizes collective advancement, visualized as performance deltas per domain.

Further, GLM-4.5V’s use of scalable RL, robust reward design, and comprehensive domain scheduling achieves strong mutual reinforcement and capability alignment, a prerequisite for deploying universal multimodal reasoning agents.

Benchmarking and Numerical Results

GLM-4.5V and GLM-4.1V-Thinking are comprehensively benchmarked on 42 tasks across VQA, STEM (MMMU, MathVista, LogicVista), OCR/chart/document, long-document understanding, grounding, spatial reasoning, GUI agents, coding (Design2Code, Flame-React-Eval), and video understanding. On nearly all open benchmarks, GLM-4.5V outperforms state-of-the-art open-source models such as Qwen2.5-VL-72B, Step-3-321B, InternVL3, and Kimi-VL—even matching or exceeding closed-source Gemini-2.5-Flash on 22 benchmarks (see Figure 5).

Figure 5: GLM-4.5V achieves near-linear scaling from 9B to 106B, outperforming prior open models and matching closed-source Gemini-2.5-Flash on challenging benchmarks.

Notably, GLM-4.1V-9B-Thinking, despite only 9B parameters, outperforms Qwen2.5-VL-72B (72B parameters) on 29 out of 42 tasks, demonstrating the efficacy of the architecture and training regime for parameter efficiency. Case studies in the appendix show detailed qualitative reasoning traces on challenging tasks: GUI recognition and planning (Figure 6), front-end code generation from UI (Figure 7), video scene and action interpretation (Figures 8–10), chart QA (Figure 8), spatial and document reasoning (Figures 17–18), and code debugging (Figure 9).

Limitations and Perspective

Despite the robust capabilities, GLM-4.5V inherits persistent issues of current VLMs: (1) reward models for RL only assess outcomes, not the reasoning path, thus permitting “correct answers” with hallucinated justification; (2) RL-induced instabilities remain, though mitigated via reward and data pipeline improvements; (3) perceptual bottlenecks, especially in cluttered or ambiguous inputs, can still undermine downstream reasoning. Addressing intermediate-chain reward modeling and adversarial reward hacking remains an open challenge.

From a broader perspective, the demonstrated positive transfer in RL across modalities suggests new directions for joint training and “capability bootstrapping”—eyeing hybrid workflows where, e.g., visual coding tasks could improve text-only coding generalization. As model performance saturates mainstream benchmarks, diagnostic datasets explicitly targeting reasoning chain hallucination, reward shortcutting, and robust, grounded inference are critical future milestones.

Conclusion

GLM-4.5V and GLM-4.1V-Thinking, as presented, expand the frontier of open vision-LLMs under a unified reasoning-centric RL regime. By integrating robust architectural innovations, principled data curation, comprehensive reward engineering, and curriculum-informed RL, these models deliver superior parameter efficiency, strong cross-domain generalization, and state-of-the-art results on a broad spectrum of multimodal benchmarks. These findings motivate future research on reward design for reasoning-path supervision, adversarial robustness in RL, and continuously scalable, alignment-aware multimodal models.


Reference: "GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning" (2507.01006)

Explain it Like I'm 14

What is this paper about?

This paper introduces GLM-4.1V-Thinking, a “vision-language” AI model. That means it can look at pictures or videos (vision) and read or write text (language) at the same time, then use both to reason and solve problems. The team’s main goal is to make the model not just see and read, but also “think” better—like solving science and math problems, understanding long documents, reading charts, using websites, and even writing code.

What questions were the researchers trying to answer?

The researchers focused on a few big questions, explained simply:

  • How can we train an AI to reason across many kinds of tasks—like math, science, charts, documents, videos, and GUIs (web/app screens)—instead of just recognizing what’s in a picture?
  • Can a smaller, open model perform as well as, or better than, much larger models on tough tasks?
  • How do we use feedback (like a coach’s grading) to steadily improve the AI’s reasoning without it learning bad shortcuts?

How did they build and train the model?

Think of the model like a team:

  • The “eyes”: a vision encoder that turns images and videos into useful signals.
  • A “translator”: a small adapter that helps visual signals talk to the language part.
  • The “brain”: a large language model that understands and reasons with combined visual and text information.

To make the model strong and versatile, the team trained it in three main stages:

1) Pre-training: building a strong foundation

They fed the model massive amounts of carefully cleaned and balanced multimodal data:

  • Image–text pairs with accurate captions (they filtered low-quality data and even “recaptioned” noisy descriptions to be clearer and more factual).
  • Interleaved image–text pages from websites and books (like real documents where text and images appear together).
  • OCR data (images with text), including synthetic documents and real-world photos with text on signs or pages.
  • Grounding data (teaching the model to point to exactly where something is in an image or a user interface).
  • Video data with fine-grained notes about actions, camera motion, and text in scenes.

They also taught the “eyes” to handle:

  • Very wide/tall images and high resolutions.
  • Videos, by marking frame order and timing so the model understands what happens when.

Finally, they trained the model to handle long inputs—like long PDFs—with up to 32,768 tokens (a lot of text).

2) Supervised fine-tuning (SFT): teaching it how to “show its work”

Before using feedback-based training, they taught the model to write out its thinking steps and final answers in a clean format:

  • The model writes a “thinking” section (<think> ... </think>), then a clear final answer (<answer> ... </answer>).
  • For problems with a definite answer, it puts the final result in a special box (marked with begin/end tokens). This makes it easy to find and check the answer later.

This step helps the model learn to reason in organized steps, which makes later training steadier and more effective.

3) Reinforcement learning (RL): improving with feedback

This is like giving the model practice problems plus a coach who scores its answers:

  • RL with verifiable rewards (RLVR): When a task has a clear right answer (like a math result, a count from a chart, or a specific location in an image), the system checks the boxed answer against the ground truth and gives a reward.
  • RL with human feedback (RLHF): For open-ended tasks (like instructions or explanations), a reward model scores how good the answer is.

Key ideas that made RL work well:

  • Curriculum-style sampling (RLCS): The system picks tasks and examples that match the model’s current skill—like a teacher choosing the right difficulty at the right time, so learning is faster and more stable.
  • Strong, domain-specific “graders”: They built careful checkers for each type of task (math, charts, OCR, grounding, video, GUIs, etc.). This prevents “reward hacking,” where the model finds sneaky ways to trick the grader without truly solving the task.
  • Standardized output with boxed answers: Makes checking correctness reliable and avoids extraction mistakes.
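A tiny sketch of what one such grader could look like for math-style tasks, checking the boxed answer against the ground truth; the box-delimiter token names and the numeric tolerance are assumptions for illustration.

```python
import re

def grade_boxed_answer(model_output, ground_truth, tol=1e-6):
    """Pull the final answer out of the special box and compare it to the
    ground truth numerically; reward 1.0 on a match, 0.0 otherwise.
    The <|begin_of_box|>/<|end_of_box|> token names are assumptions."""
    m = re.search(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", model_output, re.S)
    if not m:
        return 0.0  # unparseable output earns no reward
    try:
        return 1.0 if abs(float(m.group(1).strip()) - ground_truth) < tol else 0.0
    except ValueError:
        return 0.0  # box contents are not a number

score = grade_boxed_answer(
    "<think>3 * 4 = 12</think>"
    "<answer><|begin_of_box|>12<|end_of_box|></answer>",
    12.0,
)
```

Because the grader only ever looks inside the box, the model cannot earn reward by burying a guess in a long explanation, which is one reason the standardized format matters.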

What did they find?

  • The 9B-parameter open model (GLM-4.1V-9B-Thinking) reached or beat the performance of some much larger models on many benchmarks:
    • It beat Qwen2.5-VL-7B on nearly all of the 28 public benchmarks evaluated.
    • It matched or outperformed the much larger Qwen2.5-VL-72B on 18 benchmarks.
    • On long document understanding and STEM reasoning, it showed competitive or even better performance than closed-source models like GPT-4o in certain tests.
  • Examples of strong results (explained simply):
    • Visual reasoning tests (e.g., MMStar, AI2D): high scores that surpass some larger models.
    • Long documents: better at reading and reasoning over multi-page PDFs.
    • GUI agents: better at answering questions and acting on websites.
    • Coding from visual UIs: much stronger than some baselines.
  • Reinforcement learning gave clear boosts—up to about +7.3% improvement on some tasks.
  • Training across multiple domains helped each domain: practicing on math could also improve chart reading, for example.

Why this matters: These results show that smart training strategies and solid feedback systems can make a smaller, open model perform like much bigger ones on tough, real-world tasks.

Why does this matter?

  • Practical power in a smaller, open package: This model can do a lot—solve STEM problems with images, read long PDFs, understand charts, work with videos, help with web-based tasks, and generate code for user interfaces—without being gigantic.
  • Better “thinking,” not just “seeing”: It doesn’t just label pictures; it reasons through them step by step.
  • Safer, more reliable training: The paper shows how important it is to design good, cheat-resistant graders for RL. This leads to more trustworthy models.
  • Helpful for many users:
    • Students: explaining diagrams, charts, and textbook pages.
    • Professionals: reading long reports, extracting information from documents, analyzing visuals, or automating web tasks.
    • Developers and researchers: the team open-sourced the 9B reasoning model and the base model, plus reward systems and code, so others can build on this work.

In short, GLM-4.1V-Thinking shows that with the right data, careful formats, smart sampling, and strong feedback, an open, mid-sized model can become a versatile “multimodal thinker” that handles complex, real-world problems across pictures, text, and video.
