Papers
Topics
Authors
Recent
Search
2000 character limit reached

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Published 29 Apr 2026 in cs.CV | (2604.26752v1)

Abstract: We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a LLM. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

Summary

  • The paper introduces a foundation model unifying vision, language, and action through a novel CogViT encoder and multimodal multi-token prediction head.
  • It leverages a two-stage pretraining with dual teacher models and RL-based multitask optimization, yielding significant benchmark improvements.
  • The study demonstrates practical agentic applications and robust toolchain integration, outlining future challenges and directions for scalable multimodal agents.

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Introduction

GLM-5V-Turbo represents a substantial advance in the design of foundation models explicitly optimized for unified perception, reasoning, and action in multimodal agentic settings (2604.26752). Rather than treating vision-language processing as modular or auxiliary to language-only models, the framework prioritizes native integration of heterogeneous inputs, supporting robust end-to-end performance in scenarios requiring image, video, GUI, and textual context as encountered by practical digital agents.

Model Architecture and Multimodal Innovations

At the core of GLM-5V-Turbo is CogViT, a novel vision transformer-based encoder, optimized for sample-efficient multimodal perception. CogViT employs a two-stage pretraining paradigm: initial distillation-based masked image modeling aligns visual feature spaces via dual teacher models (SigLIP2 for semantic, DINOv3 for texture), followed by large-batch contrastive image-text pretraining across a bilingual 8B-image corpus to unify semantic representation and support cross-lingual understanding (see (2604.26752) for full data details). Architectural innovations include NaFlex for input scalability and QK-Norm for attention stability and logit regularization. Figure 1

Figure 1: Performance comparison of CogViT with other state-of-the-art vision encoders across general and fine-grained multimodal tasks.

The multimodal multi-token prediction (MMTP) head is another pivotal component, generalizing multi-token decoding to settings where image tokens and text must be handled seamlessly. An extensive ablation confirmed that the <|image|> placeholder token design provides superior optimization stability and hardware scaling, as opposed to direct embedding propagation or full masking strategies. This design decouples visual and textual token embedding while preserving rich positional signals for downstream decoding. Figure 2

Figure 2: MMTP design: using a learnable <|image|> token yields lower training loss compared to passing direct visual embeddings to the head.

Training Paradigm: Multimodal RL and Hierarchical Task Coverage

GLM-5V-Turbo training is characterized by its comprehensive RL paradigm over 30+ multimodal and agentic task families. Pretraining utilizes blended datasets of natural images, instructional prompts, and rich interleaved data (image-text, coding, UI layouts, video, OCR, spatialgrounding, STEM, etc.), integrating multimodal representations from inception. Subsequent supervised finetuning and RL strategies leverage domain-heterogeneous reward signals, advanced memory management, distributed batch construction, and topology-aware partitioning for scalable and stable optimization of variable-length, multi-turn, and tool-invoking agent trajectories.

The empirical gains are robust across perception and reasoning domains: image grounding (+4.8% RefCOCO-avg, +3.2% PointBench), video understanding (+5.6% MVBench), 3D grounding (+7.7% SUNRGBD), OCR (+4.2% OCRBench), chart understanding (+7.7% CharXiv), and reasoning tasks (+1.8% STEM datasets). RL-based multitask optimization mitigates domain interference compared to pure SFT and supports transfer and stabilization of complex policy patterns.

Multimodal Agentic Ecosystem and Toolchain Expansion

GLM-5V-Turbo extends its agentic capacity both via an expansive in-house toolchain (encompassing recognition, web and visual search, browser, and image processing operations) and deep integration with external agent frameworks, including Claude Code and AutoClaw. These integrations enable a shift from static predictions to dynamic, interactive perception-planning-execution loops—examples include web asset harvesting, GUI exploration, full-stack code generation, and multimodal tool-augmented research.

The model’s performance is substantiated by strong results on challenging multimodal agentic and tool-use benchmarks: 30.7 on ImageMining, 51.9 on BrowseComp-VL, 72.9 on MMSearch, 75.7 on AndroidWorld, 87.0/80.7 on PinchBench, and 94.8 on Design2Code, outperforming prior models including Claude Opus 4.6 on multimodal code generation. Notably, these results are attained while preserving or improving text-only coding benchmarks, demonstrating no adverse cross-task trade-off. Figure 3

Figure 3: Evaluation of GLM-5V-Turbo on multimodal coding, tool-use, and GUI agent benchmarks.

Figure 4

Figure 4: Evaluation of GLM-5V-Turbo on text-only coding and claw agent benchmarks.

Qualitative Agentic and Content Creation Capabilities

Numerous qualitative demonstrations highlight the breadth of agentic workflows enabled by GLM-5V-Turbo. These include multimodal stock analysis through tool-chained information aggregation, high-fidelity website re-creation from live GUI or PRD documents, and full-stack web design with intricate visual features (parallax, dark mode, dynamic checkout UI). The model is also leveraged for deep research reports with interleaved multimodal evidence, slide and report generation from academic sources, visual searching (e.g., deep image-based question answering), and automated asset collection. Figure 5

Figure 5: Stock analysis via multimodal aggregation and professional report generation.

Figure 6

Figure 6: Full-stack e-commerce website design and implementation, including advanced UI and interactive features.

Figure 7

Figure 7: Automatic website generation for a research paper, integrating extracted text and figures in a structured web format.

Practical Development Lenses and Systemic Implications

The authors articulate key lenses influencing agentic model development:

  • Perception is foundational: Systematic integration of perceptual tasks (e.g., frontend coding, SVG, grounding) directly improves high-level reasoning and observed agentic robustness.
  • Hierarchical optimization: Distributing agentic optimization across perception, action, and trajectory-hierarchy outperforms resource-intensive monolithic long-horizon RL.
  • Specification and verification: Effective agentic benchmarks and training pipelines require clear task boundaries, robust outcome verification, and feedback loops that decouple reward signaling from noisy execution paths.

Benchmarks such as Vision2Web are cited as exemplars of specification/verification co-design, ensuring measurement and optimization are meaningfully aligned.

Outstanding Challenges and Future Directions

The systemic challenges outlined highlight the complexity of achieving generalizable emergence in agentic policies:

  • Autonomous strategy discovery remains highly constrained by the diversity of provided trajectories; overcoming local minima in policy space demands methods for greater diversity or novel self-improving search.
  • Long-horizon context management for multimodal agents is a bottleneck, as neither naive truncation nor text-centric memory scaling adequately preserve spatial and temporal dependencies in images or videos.
  • Model-agent harness co-design is now central: overall capability boundaries, task decomposition, and actualized intelligence depend as much on the orchestration and tool-memory layer as the core neural architecture.

These open directions imply the next generation of agentic systems will necessitate advances in multimodal memory architectures, flexible skill/task decomposition, and collaborative co-evolution of model and harness.

Conclusion

GLM-5V-Turbo demonstrates that integrating perception, reasoning, and action within a unified multimodal model—via architectural, training, and ecosystem advances—enables robust agentic workflows without sacrificing unimodal specialization. The comprehensive empirical and qualitative portfolio, together with the explicit articulation of development lenses and systemic challenges, positions this work as a salient reference point for the progression toward natively multimodal, truly interactive AI agents (2604.26752). Future work in large multimodal agentic systems will require advances not only in foundational model scaling, but in multimodal context management, policy emergence, and tighter model-harness integration for robust, autonomous operation across complex, dynamic digital environments.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

There was an error generating the whiteboard.

Explain it Like I'm 14

What is this paper about?

This paper introduces GLM-5V-Turbo, an AI model that can understand and use both words and visuals (like images, videos, webpages, and app screens). Think of it as a smart assistant with eyes and hands, not just a voice: it can look at pictures, read screens, plan what to do, use tools on the internet, and write code. The big idea is to make visual understanding a built‑in part of the model’s thinking and actions, not an add‑on.

What questions were they trying to answer?

The team focused on simple, practical questions:

  • How do we build an AI that sees, thinks, and acts in real software and web environments—not just chats?
  • How can the “seeing” part (vision) and the “thinking” part (language/reasoning) learn together so the model can plan and use tools better?
  • How can we train this safely and efficiently across many types of tasks (recognizing objects, reading screens, searching the web, writing code, clicking buttons, etc.)?

How did they build and train it?

They combined several ideas so the model’s “eyes,” “brain,” and “hands” work smoothly together.

A new set of “eyes” for the model: CogViT

  • CogViT is a vision encoder—the part of the model that looks at images and turns them into something the AI can understand.
  • It was trained in two steps: 1) Practice filling in missing parts of images (like a smart puzzle) while learning from strong “teacher” models. 2) Learn to match images with the right text descriptions across different sizes and languages.
  • Result: better at fine details (like small text on a chart) and spatial layout (where things are on a page).

Faster, image-friendly “talking”: MMTP

  • MMTP (Multimodal Multi-Token Prediction) helps the model predict several words at once, which makes it faster.
  • Challenge: how do you handle image information in a system designed for text? The team used a simple “placeholder” image token (like a special symbol that means “there’s an image here”) so the model can stay fast and stable while still using what it saw.

Training by “practice with feedback” (reinforcement learning)

  • After basic training, the model practiced over 30 kinds of tasks: spotting things in pictures, reading charts, solving math problems with images, using tools, navigating apps, and writing code.
  • It received feedback from automatic checkers (“judges”) to learn what worked.
  • This broad practice helps the model improve across many areas at once and learn general strategies (not just memorize one task).

A stronger training “gym”

  • They redesigned their training system so it can:
    • Handle mixed tasks (single-step and multi-step).
    • Overlap steps to save time (while one batch runs, another is checked).
    • Manage memory well for pictures and videos (which are big).
    • Balance workloads so long or short videos don’t slow everything down.

More tools and real-world integrations

  • The model can call visual tools (crop images, draw boxes, read webpages, search by image, etc.) and connect to agent frameworks like Claude Code and AutoClaw. This turns it from a “chatbot” into an actor that can see a screen, decide next steps, and execute them.

A new test: ImageMining

  • ImageMining checks whether a model can “think with images” and “search deeply with images.”
  • Tasks often require zooming in on a detail, cropping, then searching the web with that visual clue—like a detective chasing leads through pictures, not just text.

What did they find, and why is it important?

GLM-5V-Turbo performed strongly on many real tasks where seeing and acting matter. In plain terms:

  • It’s good at turning designs into working code, and it keeps strong text-only coding ability.
  • It can use tools to search the web with both text and images, read webpages, and gather evidence from visuals.
  • It can act as a GUI agent: understanding app screens, clicking the right buttons, and following multi-step plans.

Why it matters:

  • Many real jobs (researching online, building websites, analyzing PDFs, exploring apps) are visual and interactive. A model that can see, plan, and act directly on screens is much closer to a helpful digital coworker than one that only chats.

Here are a few concrete capabilities this model showed:

  • Multimodal coding: It can look at a UI design and generate clean frontend code that matches the layout and style.
  • Visual tool use: It can crop images, highlight elements, and search by image to track down facts or sources.
  • GUI tasks: It can navigate app or website interfaces step by step, using screenshots and structured actions.
  • Deep research: It can gather text and images from multiple sources, then produce a well-structured, evidence-backed report or slide deck.

What does this mean for the future?

If AI can natively see and act, it can help with:

  • Building software from mockups and specs faster.
  • Doing “deep research” that uses figures, charts, and screenshots—not just text.
  • Automating multi-step computer tasks (like a careful assistant who follows on-screen instructions).

The paper also shares lessons and challenges:

  • Perception first: Better “seeing” leads to better decisions. Many failures start with misreading a screen or an image.
  • Learn in layers: Training small skills (like “find the button”) helps big skills (like “finish the whole task”) work better and more reliably.
  • Clear goals and checks: For long, multi-step tasks, you need clear instructions and reliable ways to verify success.
  • Open problems remain:
    • Strategy discovery: How can the model invent better plans rather than copy human examples?
    • Memory for visuals: Images and videos are big—how do we remember the right visual details over long tasks?
    • Model + harness: Real ability depends on both the model and the system around it (tools, memory, verification), which need to evolve together.

In short, GLM-5V-Turbo is a strong step toward AI that doesn’t just talk—it sees, plans, and gets things done in real, visual computer environments.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated, concrete list of what the paper leaves missing, uncertain, or unexplored, to guide future research:

  • CogViT transparency and reproducibility:
    • Missing architectural specifics (e.g., parameter counts, patch sizes, depth/width, training steps) and ablation studies (e.g., effect of QK-Norm, Muon, NaFlex) to enable independent replication.
    • Absent quantitative, controlled comparisons against SOTA encoders on standard public leaderboards with exact metrics and datasets.
  • Two-stage vision pretraining design choices:
    • No ablations on teacher selection or mixing (SigLIP2 vs DINOv3 vs alternatives), masking ratios, or data-mixture proportions to justify the adopted recipe.
    • Effects of the 8B bilingual image-text corpus composition (domains, quality filtering, licensing) on generalization, bias, and legality are unreported.
  • MMTP with shared <|image|> token:
    • Unquantified information loss from collapsing all visual tokens to a single learnable token at the MTP head; unclear impact on tasks requiring fine-grained visual conditioning during decoding.
    • Ablation is only on a 0.5B model; it remains unknown whether the observed stability/performance benefits hold at larger scales and across task families.
    • No comparison of MMTP inference latency/throughput vs. alternatives (e.g., direct visual embedding pass-through) in real agent loops.
  • Video and temporal modeling:
    • The paper reports MVBench gains but does not describe how video is represented (frame sampling, temporal encoders, aggregation) or how memory over time is handled.
    • Lack of evaluations in streaming or real-time video settings and on long-duration temporal reasoning benchmarks.
  • Multi-task RL coverage vs. forgetting:
    • Catastrophic forgetting in capabilities not covered by RL is acknowledged but unmeasured; no mitigation (e.g., rehearsal, EWC, constrastive regularizers) is evaluated.
    • No quantitative study of how RL task coverage breadth/weighting shapes generalization boundaries or how proxy tasks transfer across domains.
  • Reward modeling and verifier reliability:
    • Sparse detail on reward aggregation strategies, quality control, and calibration of model-based judges; susceptibility to reward hacking and verifier exploitation remains untested.
    • No cross-judge agreement studies, error taxonomies, or sensitivity analyses showing how verifier noise affects RL outcomes.
  • Hierarchical optimization vs. monolithic training:
    • Although advocated, no controlled experiments compare hierarchical curricula against monolithic end-to-end training on identical tasks/data to quantify stability/efficiency gains.
  • Tool-use generalization and robustness:
    • The model integrates multiple proprietary tools (zai_*), but generalization to unseen tool APIs, schema changes, and tool failures is not evaluated.
    • Robustness to dynamic webpages, DOM drift, CAPTCHAs, network errors, and adversarial page content (e.g., prompt injection in HTML/JS) is untested.
  • Long-horizon multimodal memory:
    • While identified as a bottleneck, no concrete mechanisms or experiments are provided for visually faithful memory/summary, retrieval, or compression over extended trajectories.
    • No benchmarks or metrics for evaluating multimodal memory fidelity (preservation of layout, spatial relations, or temporal states) under tight context budgets.
  • Safety, security, and alignment in agentic settings:
    • No systematic evaluation of defenses against prompt injection, data exfiltration, sensitive-content handling, or unsafe code execution when browsing/using tools.
    • Absence of red-teaming results, refusal behaviors, or policy-compliance metrics in multimodal, tool-enabled contexts.
  • Biases and fairness:
    • Cross-lingual coverage is limited to Chinese-English; performance and bias across other languages, scripts, and regions are unexamined.
    • Tools such as person recognition raise fairness/privacy concerns; the paper provides no bias audits or mitigation strategies.
  • Efficiency and cost:
    • Large-batch (64K) pretraining and large-scale RL infrastructure are described, but compute budgets, energy consumption, and cost-performance trade-offs are not reported.
    • End-to-end agent latency (planning + perception + tool calls) and throughput in practical deployments are unquantified.
  • Data and benchmark openness:
    • Core pretraining and RL datasets (beyond ImageMining) are not released or sufficiently described for replication; potential contamination with evaluation sets is not addressed.
    • ImageMining’s scale (217 cases) and curation process may limit generality; leakage checks, inter-annotator agreement, and difficulty calibration are not reported.
  • Interpretation of gains:
    • Many reported improvements are relative to the authors’ prior models or in harness-specific conditions; controlled head-to-head comparisons with public baselines under identical tool/internet settings are lacking.
    • Limited error analysis to attribute gains to perception, planning, or harness policies; no ablation on the harness’s contribution to benchmark scores.
  • Model–harness entanglement:
    • While acknowledged, the paper does not provide systematic methodology to disentangle model vs. harness contributions (e.g., swapping harness components, standardized harness-agnostic probes).
    • No guidance on how harness designs should co-evolve with model capability regimes or how to benchmark across differing harnesses fairly.
  • Generalization beyond GUIs and web:
    • Transfer to non-web GUIs (e.g., desktop apps with custom rendering), embedded systems, or AR/VR interfaces is unexplored.
    • Physical-world perception–action loops (robotics, egocentric vision) are out of scope but central to “native” agentic grounding.
  • Error handling and recovery:
    • The agent’s strategies for detecting, explaining, and recovering from tool failures, OCR errors, or misperception are not characterized or benchmarked.
  • Hallucination and factuality:
    • Although self-critique is mentioned for GUI tuning, there is no quantitative evaluation of hallucination rates in multimodal generation, especially for visually grounded claims.
  • Explainability and strategy discovery:
    • Open question remains on how to induce and verify emergent agentic strategies (sub-agent decomposition, multi-agent collaboration); no experiments or metrics are proposed.
  • Security and privacy of training data:
    • Legal/ethical handling of the 8B bilingual image-text corpus (copyright, PII, biometric data) is not documented; privacy-preserving training practices are unspecified.
  • Scaling laws and transfer:
    • No scaling analyses (data/model/compute) for multimodal RL or MMTP; transfer learning behavior across modalities/tasks as a function of scale remains unknown.
  • Deployment reliability:
    • Absence of stress tests under extreme conditions (low-res/occluded images, noisy OCR, device constraints) and of monitoring/rollback strategies for production agents.
  • Formal evaluation protocols:
    • Lack of standardized, controlled evaluation protocols that fix tool availability, network state, and judge policies to ensure reproducibility across labs.
  • Release artifacts:
    • It is unclear which components (model weights, CogViT encoder, RL Gym, verifiers, tools) will be released, hindering community verification and extension.

These gaps highlight concrete avenues for future work: publish reproducible specs and datasets; quantify MMTP trade-offs; develop and evaluate multimodal memory and safety defenses; run controlled, harness-agnostic benchmarks; and study scaling, bias, and robustness systematically.

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that can be built today using GLM-5V-Turbo’s capabilities, tools, and integrations (e.g., Z.ai tools, Claude Code, AutoClaw/OpenClaw). Each item notes relevant sectors and key dependencies that may impact feasibility.

Industry

  • Multimodal RPA and GUI automation for operations
    • Sectors: software, IT operations, customer support, finance back-office
    • What: Automate complex desktop/web workflows by perceiving on-screen elements and executing actions via AutoClaw/OpenClaw; navigate dashboards, browser-based CRMs/ERPs, and forms; validate outcomes with screenshots.
    • Tools/workflows: AutoClaw + GLM-5V-Turbo as vision-language controller; BrowseComp-VL style browsing; zai_read_webpage, zai_load_image_from_url; model-guided screenshotting and on-image annotations (e.g., zai_draw_image_bounding_boxes).
    • Dependencies/assumptions: Stable browser/OS harness; permissioned access to systems; clear task specification and verifiers; organization-approved automation policies; GPU availability for vision-heavy inference.
  • Design-to-code and rapid web/app prototyping
    • Sectors: software, product design, martech
    • What: Convert UI mockups/screenshots into HTML/CSS/JS; reproduce websites from screenshots; generate slide decks and web pages from PRDs and visual references.
    • Tools/workflows: Design2Code capability; Vision2Web-style specification and workflow-based verification; zai_generate_web_html, submit_plan, apply_edits, zai_generate_web_outline/slide_html.
    • Dependencies/assumptions: Consistent UI assets; agreed verification (visual/functional); integration with Claude Code for filesystem/terminal operations; rights to reuse referenced media.
  • Enterprise document processing and analytics with visual grounding
    • Sectors: finance, consulting, compliance, legal, healthcare admin
    • What: Parse reports with charts/tables, extract data via OCR/visual grounding, and produce interleaved summaries or briefings where figures are embedded alongside text.
    • Tools/workflows: OCR/Chart understanding gains (CharXiv, OCRBench); multimodal deep research tools (zai_dr_open_url_mm, zai_dr_visit_img); figure/table cropping and insertion; interleaved markdown generation.
    • Dependencies/assumptions: Data privacy controls; document variety (layouts, languages); access to internal repositories; human-in-the-loop verification for regulated contexts.
  • Visual product and media search by image
    • Sectors: e-commerce, media, DAM (digital asset management)
    • What: Identify products from photos, find visually similar items, locate original sources or usage rights.
    • Tools/workflows: zai_search_web_by_image, zai_search_similar_images, zai_search_web_images; on-image cropping before search to isolate entities; “Deep-Wide-Search” strategy validated by ImageMining.
    • Dependencies/assumptions: Internet access; search API quotas; content licensing policies.
  • Automated UI testing and visual regression checks
    • Sectors: software QA, DevOps
    • What: Use fine-grained perception and grounding to locate UI elements, execute actions, and compare UI states across builds; detect visual layout and interaction regressions.
    • Tools/workflows: GUI-agent competencies (OSWorld/AndroidWorld); image annotation and pointing; trajectory-level action prediction; workflow-based verification for pass/fail.
    • Dependencies/assumptions: Stable test environments; golden baselines; CI/CD integration.
  • Code assistant with visual context
    • Sectors: software engineering
    • What: Pair text-only coding strength with multimodal cues (e.g., screenshots of failing UIs, design references) for bug fixes, feature implementation, and repo exploration.
    • Tools/workflows: Claude Code integration; CC-Bench-V2 capabilities; repository exploration and local execution; UI-to-code for front-end parity.
    • Dependencies/assumptions: Secure local runtime; guardrails for file/system access; developer review.
  • Competitive/market intelligence reports with embedded evidence
    • Sectors: corporate strategy, marketing, consulting
    • What: Generate text-image interleaved reports featuring screenshots, figures, and web-captured evidence; create slide decks from multimodal sources.
    • Tools/workflows: Multimodal deep research loop (planning → browsing → screenshotting → extraction → synthesis); zai_dr_search, images_lens; automated slide/outline generation.
    • Dependencies/assumptions: Web access; content usage rights; company style templates.

Academia

  • Figure-aware literature reviews and teaching materials
    • What: Compile surveys and lecture slides that preserve diagrams, charts, and tables; extract and annotate key visual evidence from papers.
    • Tools/workflows: Document-grounded generation; figure cropping/insertion; interleaved markdown/PPT generation.
    • Dependencies/assumptions: Access to publishers’ content; proper citation and fair-use policies.
  • Benchmarking and agent evaluation with ImageMining
    • What: Use ImageMining to evaluate and compare multimodal search and perception-planning-execution loops; probe tool-usage precision (cropping, magnification).
    • Tools/workflows: ImageMining benchmark; multimodal tool-use tasks; clear task specs and verifiers.
    • Dependencies/assumptions: Reproducible harness; standardized tool stacks; dataset licensing.
  • Prototyping multimodal RL systems
    • What: Adopt the VLM RL Gym for building heterogeneous single-/multi-step tasks with unified reward orchestration; experiment with hierarchical optimization and on-policy distillation.
    • Tools/workflows: Unified task/reward abstraction; asynchrony and stage-overlap pipeline; memory and load-balancing techniques for ViT-heavy workloads.
    • Dependencies/assumptions: GPU clusters; verifier infrastructure (rule-based/model-based judges); engineering capacity.

Policy/Government

  • Multimodal document triage and public briefings
    • What: Process policy documents, reports, and public datasets with embedded charts/maps; generate briefings that ground claims in extracted visuals.
    • Tools/workflows: OCR/Chart understanding; interleaved report and PPT generation.
    • Dependencies/assumptions: Records management policies; provenance tracking; human oversight in high-stakes outputs.
  • Fact-checking and OSINT with image-grounded search
    • What: Verify events, locations, and sources via image-based geolocation and cross-site retrieval.
    • Tools/workflows: Spatio-temporal reasoning; zai_search_web_by_image; localized cropping to refine queries.
    • Dependencies/assumptions: Secure connectivity; audit trails; handling of sensitive content.

Daily Life

  • Personal research and presentation drafting
    • What: Create school/work presentations or blogs from mixed sources (webpages, PDFs, images), embedding screenshots/figures automatically.
    • Tools/workflows: Interleaved markdown/PPT generation; multimodal browsing and evidence gathering.
    • Dependencies/assumptions: Content rights; internet access.
  • On-the-go image recognition and search
    • What: Identify plants/landmarks, find where an image came from, or locate similar visuals.
    • Tools/workflows: zai_recognize_plant/location/person; web-by-image search; in-image cropping first for accuracy.
    • Dependencies/assumptions: Mobile capture quality; network access; regional content policies.
  • Accessibility helpers for on-screen content
    • What: Describe GUIs, charts, and images to assist users with visual impairments; guide step-by-step interactions.
    • Tools/workflows: GUI grounding and pointing; screen comprehension; narrated instructions.
    • Dependencies/assumptions: Reliable screen-capture; latency constraints; safety and privacy requirements.

Long-Term Applications

These opportunities require further research, scaling, or system design (e.g., multimodal-native memory, strategy emergence, tighter harness co-design) before broad deployment.

Industry

  • End-to-end autonomous enterprise agents across heterogeneous apps
    • Sectors: finance ops, customer success, IT automation
    • What: Agents that manage multi-application workflows with on-the-fly planning, verifiable subgoal decomposition, and adaptive tool-use over long horizons.
    • Likely enablers: Hierarchical optimization pipelines; robust task specifications and verifiers; model–harness co-design for reliability.
    • Dependencies/assumptions: Auditable decision logs; policy-compliant automation; long context and memory beyond text-centric summaries.
  • Multimodal-native memory services
    • Sectors: platform/ML infrastructure
    • What: Memory systems that compact and retrieve “what was seen” (layout, spatial relations, video temporal change) alongside “what was said,” enabling long-horizon multimodal reasoning.
    • Likely enablers: Visual state summarization and retrieval; ViT-aware memory compaction; cross-modal indexing.
    • Dependencies/assumptions: New algorithms for visual compression without losing task-relevant detail; standardized APIs.
  • AR/VR copilot overlays for task guidance
    • Sectors: field service, manufacturing, healthcare workflows
    • What: Real-time visual perception and step-by-step overlays for procedures, inspections, and troubleshooting.
    • Likely enablers: CogViT’s fine-grained spatial perception; on-device or edge inference; stable latency and safety tooling.
    • Dependencies/assumptions: Wearable hardware; secure on-prem inference; domain-specific validation.
  • Robotics and embodied agents
    • Sectors: logistics, manufacturing, service robots
    • What: Transfer fine-grained perception and hierarchical planning to physical manipulation and navigation.
    • Likely enablers: Integration with robot stacks; sensor fusion beyond images (depth, tactile); safety verifiers.
    • Dependencies/assumptions: Real-world control policies; sim-to-real transfer; rigorous safety certification.
  • Fully automated web development from rich specifications
    • Sectors: software, no-/low-code platforms
    • What: Build production-grade sites/apps from PRDs, mockups, assets, and reference pages with workflow-based verification.
    • Likely enablers: Vision2Web-style pipelines; stronger functional and visual verification loops; tighter CI integration.
    • Dependencies/assumptions: Testable specs; human QA gates; governance for code quality and security.

Academia

  • Strategy-emergent agent training
    • What: Methods that let agents discover qualitatively new planning and action patterns beyond human seeded trajectories; use multi-task RL with diverse cold starts.
    • Likely enablers: Diverse trajectory generation; on-policy distillation; curriculum and hierarchical RL.
    • Dependencies/assumptions: Compute and data diversity; robust evaluation protocols to detect genuine strategy shifts.
  • Standardized, controlled benchmarks for long-horizon multimodal tasks
    • What: Datasets and harnesses that pair clear task specs with reliable, multi-stage verifiers to produce reusable optimization signals.
    • Likely enablers: Expansion of ImageMining-style datasets and Vision2Web’s workflow-based verification.
    • Dependencies/assumptions: Community consensus on specs; reproducible tool stacks; licensing.

Policy/Government

  • Auditable autonomous decision systems for public services
    • What: Multimodal agents executing complex back-office and citizen-facing workflows with verifiable decision trails and robust guardrails.
    • Likely enablers: Clear task definitions and outcome verifiers; model-harness co-shaping of capability boundaries; safety governance frameworks.
    • Dependencies/assumptions: Regulatory approvals; privacy-by-design; red-teaming for multimodal failure modes.
  • Privacy-preserving, on-prem visual agents
    • What: Deploy agents that process sensitive images/documents on secure infrastructure with strict data residency.
    • Likely enablers: Efficient ViT/memory management (MMTP, topology-aware partitioning) adapted for edge/on-prem clusters.
    • Dependencies/assumptions: Hardware investment; MLOps maturity; compliance audits.

Daily Life

  • Persistent personal desktop agents
    • What: Agents that manage apps on your behalf over weeks/months, remembering visual states and prior decisions across sessions.
    • Likely enablers: Multimodal-native memory; reliable GUI grounding across software updates; safe action verifiers.
    • Dependencies/assumptions: OS-level permissions; trust and safety guardrails; local compute or private-cloud options.
  • Lifelong personal knowledge bases built from screenshots, videos, and documents
    • What: Organize, retrieve, and reason over a visual history (e.g., receipts, dashboards, whiteboard photos).
    • Likely enablers: Visual indexing and summarization; privacy-preserving storage; cross-device sync.
    • Dependencies/assumptions: Storage/security; user controls for retention; effective retrieval UX.

Notes on Cross-Cutting Assumptions and Dependencies

  • Toolchain availability: Many “zai_” tools are accessible via Z.ai; alternatives or custom tools may be needed in private deployments.
  • Connectivity and quotas: Multimodal deep search relies on web access and API limits; offline or on-prem deployments will need local corpora and search.
  • Accuracy and safety: Fine-grained perception errors can propagate to planning and execution; high-stakes use requires human oversight and verifiers.
  • Language coverage: Pretraining emphasizes Chinese–English; performance may vary for other languages without additional adaptation.
  • IP and content rights: Image scraping and embedding in outputs must follow licensing and fair-use policies.
  • Compute: Vision-heavy workloads demand GPUs and optimized memory pipelines (e.g., MMTP, topology-aware partitioning) for scale and latency.

Glossary

  • ablation study: An experimental analysis where components are systematically removed or altered to assess their impact on performance. "Empirically, according to the ablation study on a 0.5B model, the <|image|>-based design achieves lower training loss and more stable convergence than directly using visual embeddings."
  • activation memory: The memory required to store intermediate activations during neural network forward/backward passes. "This prevents activation memory from scaling linearly with the number of images in the naïve way, and substantially reduces runtime memory pressure while preserving overall computational efficiency."
  • agentic capability: A model’s ability to autonomously plan, decide, and act in interactive environments. "As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, GUIs."
  • all-to-all communication: A distributed systems communication pattern where each node exchanges data with all others. "After load balancing across DP groups, precise dispatch is carried out through asynchronous all-to-all communication, so that each rank receives only the partition it actually needs."
  • attention computation: The process of computing attention scores in transformer architectures. "Additionally, we introduce QK-Norm to normalize query and key vectors before attention computation, effectively mitigating logit explosion and ensuring stability at scale."
  • bilingual (Chinese-English) image-text corpus: A dataset containing aligned images and texts in two languages, used for cross-lingual training. "…utilizing an 8-billion bilingual (Chinese-English) image-text corpus to enhance cross-lingual understanding."
  • bin-packing: Grouping variable-sized items into bins to optimize resource utilization, here for batching variable sequences. "For the variable-length sequences produced during rollout, we additionally perform joint bin-packing over both sequence length and ViT token count, leading to better-balanced micro-batches for both compute and memory pressure."
  • bidirectional distributed implementation: A distributed training setup that processes data in both directions to improve efficiency. "…using the sigmoid-based SigLIP loss, combined with a bidirectional distributed implementation for efficiency;"
  • contrastive image-text pretraining: Training to align visual and textual representations by bringing matched pairs closer in embedding space and pushing mismatched pairs apart. "The second stage shifts to contrastive image-text pretraining to align visual and textual features in a shared embedding space."
  • context budget: The available capacity (e.g., tokens) for storing past context in models with limited context windows. "Compared with text, images and especially videos consume context budget much more aggressively, making them expensive to retain over long trajectories."
  • context parallelism: A parallelism strategy that partitions sequences across devices to scale context length or throughput. "…this design remains naturally compatible with existing partitioning strategies such as sequence parallelism and context parallelism…"
  • cosine decay schedule: A learning rate schedule that decays the rate following a cosine function. "We optimize with Muon optimizer with a cosine decay schedule."
  • CPU offloading: Moving certain computations or tensors to CPU memory to reduce GPU memory pressure. "…combining targeted recomputation with CPU offloading."
  • CP (context parallelism) partitioning: Partitioning workloads by segments of the input sequence across devices. "To address this, we move CP and TP partitioning upstream into the data-loading stage and align partition boundaries with downsample groups…"
  • cross-modal alignment: Ensuring representations from different modalities (e.g., vision and language) are compatible or aligned. "To balance representation learning with cross-modal alignment, we employ a two-stage pretraining recipe."
  • data-source tag: Metadata attached to training samples indicating their origin to enable per-source metric tracking. "To improve observability in mixed-task training, each sample also carries a data-source tag, allowing source-specific metrics such as reward and pass@k to be aggregated across parallel groups and reported separately."
  • DINOv3: A self-supervised vision model used as a teacher for extracting texture features. "…dual teacher models: SigLIP2 for semantic representations and DINOv3 for texture features."
  • distillation-based masked image modeling: A training approach where a student reconstructs masked image regions using teacher-provided features. "In the first stage, we use distillation-based masked image modeling to strengthen visual representations."
  • DP (data parallel) groups: Replicas of model parameters across devices processing different mini-batches, used for scaling training. "After load balancing across DP groups, precise dispatch is carried out through asynchronous all-to-all communication…"
  • embedding space: The vector space in which data from one or more modalities are represented. "The second stage shifts to contrastive image-text pretraining to align visual and textual features in a shared embedding space."
  • end-to-end verification: Validating the entire pipeline of an agentic process from input to final outcome. "…highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification."
  • fine-grained understanding: Detailed perception of objects, attributes, and spatial relations in images beyond coarse semantics. "We develop CogViT, a novel parameter-efficient vision encoder tailored for multimodal perception and downstream agent-oriented tasks. It delivers strong capabilities in general object recognition, fine-grained understanding, as well as geometric and spatial perception."
  • foundation models: Large pre-trained models designed to be adapted for a wide range of downstream tasks. "Recent advances in foundation models have driven a shift from language understanding to agentic real-world interaction…"
  • GUI grounding: Linking textual references to specific GUI elements or locations on a screen. "…spanning element perception, GUI grounding, single-step action prediction, and trajectory-level action prediction…"
  • harness design: The surrounding system components (tools, memory, workflows) that enable a model to act effectively. "…the growing entanglement between model capability and harness design."
  • hierarchical optimization: Training across multiple levels of task granularity to build complex capabilities efficiently. "Agent capability can be more efficiently built through hierarchical optimization."
  • image grounding (2D): Associating textual phrases with specific 2D regions in an image. "…the model improves on tasks such as 2D image grounding and pointing…"
  • image special token (<|image|>): A shared learnable placeholder token representing visual inputs to simplify processing. "Compared with directly passing visual embeddings to the MTP head, using the <|image|> token removes the need to propagate visual embeddings across pipeline-parallel stages…"
  • logit explosion: Extremely large pre-softmax values that can destabilize training. "…effectively mitigating logit explosion and ensuring stability at scale."
  • long-horizon tasks: Tasks requiring many steps or extended interactions to complete. "The key to constructing, evaluating, and optimizing end-to-end long-horizon tasks lies in clear task specification, reliable outcome verification, and controlled evaluation procedures."
  • micro-batches: Small sub-batches processed sequentially to fit memory constraints during large-batch training. "…leading to better-balanced micro-batches for both compute and memory pressure."
  • model-based judges: Learned evaluators used to assess outputs and provide rewards or feedback. "…while model-based judges are invoked asynchronously through APIs; their outputs are then combined into rewards through configurable aggregation strategies…"
  • module-specific learning rates: Different learning rates assigned to distinct model components. "We continue to optimize with Muon, assigning module-specific learning rates and decay schedules to the vision, text, and projection components."
  • multi-step tasks: Tasks that require multiple sequential actions or stages. "…support both single-step and multi-step tasks…"
  • multi-token prediction (MTP): Predicting multiple future tokens per step to improve efficiency or training signals. "In standard text-only MTP, prefix tokens can be passed into the MTP head directly through token IDs and embedded with the word embedding layer."
  • Multimodal Multi-Token Prediction (MMTP): An extension of MTP that handles both text and visual inputs. "We propose Multimodal Multi-Token Prediction (MMTP), a multimodal extension of multi-token prediction (MTP)…"
  • multi-task RL: Reinforcement learning conducted over multiple task types jointly. "This multi-task RL setting also exhibits several properties that we have consistently observed…"
  • NaFlex: A variable-resolution input scheme preserving aspect ratios for vision encoders. "…replacing the fixed 224 × 224 resolution with the NaFlex scheme to process variable-size inputs while preserving original aspect ratios;"
  • OCR: Optical Character Recognition, extracting text from images. "…OCR (+4.2% on OCRBench)…"
  • on-policy distillation: Transferring behaviors from the current policy to a student or consolidated model during RL. "Taken together, these observations suggest that multi-task collaborative RL, including on-policy distillation, is not merely a tool for improving individual capabilities…"
  • pipeline-parallel stages: Splitting a model across devices by layers to process different micro-batches in a pipeline. "…using the <|image|> token removes the need to propagate visual embeddings across pipeline-parallel stages…"
  • positional information: Encoding of token or patch positions used to preserve ordering or spatial relationships. "The third preserves visual positional information, but replaces all visual tokens with a shared learnable <|image|> special token…"
  • projection components: Layers that map modality-specific features into a shared representation space. "…assigning module-specific learning rates and decay schedules to the vision, text, and projection components."
  • proxy tasks: Auxiliary tasks used to train capabilities that transfer to more complex targets. "Even when a target capability cannot be easily formulated directly as an RL task, semantically or structurally related proxy tasks may provide useful optimization signals."
  • QK-Norm: Normalization applied to query and key vectors in attention to stabilize training. "Additionally, we introduce QK-Norm to normalize query and key vectors before attention computation…"
  • reference model: A fixed or separately maintained model used for comparisons or as a baseline during RL. "For the reference model, parameters remain resident on CPU memory, are asynchronously prefetched to GPU immediately before reference forward, and are released right after use…"
  • reinforcement learning (RL): A learning framework where agents learn by receiving rewards from their actions’ outcomes. "In the agent era, training infrastructure faces much stricter demands on both efficiency and stability, especially in large-scale multi-task multimodal reinforcement learning (RL)."
  • rollout inference: Generating trajectories or action sequences by running the current policy during RL. "We restructure the training pipeline to decouple rollout inference, reward evaluation, batch construction and weight transfer…"
  • rule-based verifiers: Deterministic evaluators that check outputs against explicit rules. "Rule-based verifiers are executed locally and synchronously…"
  • sequence parallelism: Distributing different parts of a long sequence across devices to increase effective context or throughput. "…compatible with existing partitioning strategies such as sequence parallelism and context parallelism…"
  • SigLIP loss: A sigmoid-based contrastive loss used to align image and text representations. "…using the sigmoid-based SigLIP loss, combined with a bidirectional distributed implementation for efficiency;"
  • SigLIP2: A contrastive vision-LLM used as a teacher for semantic representations. "…dual teacher models: SigLIP2 for semantic representations and DINOv3 for texture features."
  • student ViT: A Vision Transformer model trained to learn from teacher features. "Specifically, we train the student ViT to reconstruct the masked regions…"
  • supervised fine-tuning (SFT): Refining a pre-trained model on labeled data for specific tasks. "Compared with the cross-domain trade-offs often seen in SFT, RL tends to show weaker interference across domains…"
  • TP (tensor parallelism) partitioning: Splitting individual tensor computations across devices to parallelize large layers. "To address this, we move CP and TP partitioning upstream into the data-loading stage and align partition boundaries with downsample groups…"
  • topology-aware partitioning: Partitioning that accounts for hardware or data topology to minimize communication and imbalance. "For visual inputs such as long videos, where sequence lengths vary significantly, we further introduce a topology-aware partitioning and dynamic load-balancing scheme."
  • trajectory-level action prediction: Predicting sequences of actions across an entire interaction trajectory. "…spanning element perception, GUI grounding, single-step action prediction, and trajectory-level action prediction…"
  • unified VLM RL Gym: A common environment interface for training vision-LLMs with RL across diverse tasks. "We build a unified VLM RL Gym that provides a consistent environment interface for both single-step and multi-step tasks…"
  • ViT (Vision Transformer): A transformer-based architecture for processing images by dividing them into patches. "Specifically, we train the student ViT to reconstruct the masked regions…"
  • visual embeddings: Vector representations of visual inputs produced by a vision encoder or projector. "Compared with directly passing visual embeddings to the MTP head, using the <|image|> token removes the need to propagate visual embeddings across pipeline-parallel stages…"
  • vision encoder: A neural network component that converts images into feature embeddings. "We develop CogViT, a novel parameter-efficient vision encoder tailored for multimodal perception and downstream agent-oriented tasks."
  • vision-LLM (VLM): A model jointly processing visual and textual inputs. "We build a unified VLM RL Gym that provides a consistent environment interface for both single-step and multi-step tasks…"

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 10 tweets with 139 likes about this paper.