
Beyond Language Modeling: An Exploration of Multimodal Pretraining

Published 3 Mar 2026 in cs.CV | (2603.03276v1)

Abstract: The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.

Summary

  • The paper demonstrates that semantic encoders outperform VAEs for both visual generation and understanding, maintaining competitive text perplexity.
  • The study shows that strategically mixing image, text, and video data yields complementary improvements in VQA and world modeling tasks.
  • The paper reveals that dynamic MoE architectures efficiently balance modality scaling asymmetry, enabling compute-optimal multimodal performance.

Authoritative Summary of "Beyond Language Modeling: An Exploration of Multimodal Pretraining" (2603.03276)

Introduction

"Beyond Language Modeling: An Exploration of Multimodal Pretraining" addresses the foundational limitations of text-only pretraining and systematically investigates the principles, design tradeoffs, and scaling behaviors underpinning unified multimodal foundation models. Unlike prevailing paradigms that fine-tune language-pretrained backbones for multimodality, this work employs controlled, from-scratch pretraining experiments in the Transfusion framework, decoupling language and vision dynamics. The study focuses on five axes: visual representation, data composition, world modeling emergence, architectural specialization, and scaling laws. The goal is to empirically clarify the interactions and scaling relationships between modalities, providing actionable guidance for the development of native multimodal models. Figure 1

Figure 1: Overview of the study, showing the unified architecture and five axes of investigation: visual representations, data, world modeling, architecture, and scaling.

Visual Representation Ablations

The prevailing assumption in the literature is that visual understanding and generation require separate representations: semantic encoders for discriminative tasks, and VAE latents for generative tasks. "Beyond Language Modeling" rigorously tests this dogma by comparing VAE-based latents (SD-VAE, FLUX.1), semantic encoders (SigLIP 2, DINOv2, WebSSL), and raw pixel inputs.

Results demonstrate that Representation Autoencoders (RAE, e.g., SigLIP 2) outperform traditional VAEs across both visual generation (DPGBench, GenEval) and understanding (VQA), while maintaining text perplexity comparable to text-only training. Semantic encoders yield unified representations that are optimal for both tasks and scale favorably with increased MoE sparsity, while VAE performance saturates (Figure 2).

Figure 2: RAE (SigLIP 2) outperforms VAE for both generation and understanding, achieving superior performance on key metrics.

Implication: The architectural dichotomy separating understanding and generation is not justified—semantic representations suffice for unified multimodal models, simplifying design and improving scalability.

Impact of Data Mix and Co-training Synergy

The study meticulously quantifies how visual and language data interact during joint pretraining. The experiments reveal that visual data (especially raw video) does not negatively impact language modeling; in fact, video co-training matches or improves text perplexity relative to text-only baselines, indicating compatibility rather than competition.

The introduction of image-text data sources is critical for visual generation and understanding capabilities, but can marginally degrade text perplexity due to distributional shifts in captions. Detailed analysis suggests that mixing strategically curated image-text sources (MetaCLIP for image-to-text, Shutterstock for text-to-image) yields complementary improvements across text and vision metrics (Figure 3).

Figure 3: Diverse image-text sources provide distinct strengths; combining I/T data strategically outperforms single-source approaches.

Moreover, multimodal co-training yields clear synergy: adding text tokens to vision budgets consistently enhances both visual generation and VQA accuracy, outperforming unimodal pretraining even with less in-domain data (Figure 4).

Figure 4: Multimodal co-training consistently improves generation quality across token budgets.

Emergence of World Modeling

The paper extends unified multimodal pretraining to world modeling, framing navigation actions as text tokens and predicting future visual states. Ablation studies demonstrate that general multimodal pretraining (e.g., unsupervised video) provides far greater gains in navigation accuracy (ATE, RPE) than simply scaling domain-specific trajectory data.

Surprisingly, competitive world modeling performance saturates with as little as 1% domain-specific data, showcasing strong transfer from general multimodal capabilities. The model attains zero-shot controllability: it can autonomously generate navigation rollouts from free-form natural language commands without any architectural adaptations (Figure 5).

Figure 5: The unified model predicts future trajectories conditioned on both WASD and natural language actions.

Implication: General multimodal pretraining unlocks embodied reasoning and world modeling, opening the pathway to more generalizable agents in physical environments.

Architectural Innovations: Dynamic Capacity Separation via MoE

The authors systematically evaluate Mixture-of-Experts (MoE) architectures, highlighting the necessity of learnable, dynamic capacity separation between modalities. MoE models are shown to scale multimodal performance efficiently; increasing expert granularity and sparsity improves both language (lower perplexity) and vision (lower diffusion loss, higher GenEval scores) at fixed compute.

Emergent expert specialization is observed: MoE naturally allocates more experts to text than vision, reflecting their scaling asymmetry. Modality-focused shared experts further boost performance versus global shared experts.

Interestingly, visual generation and understanding activate highly overlapping sets of vision experts, negating the need for hard separation within the architecture (Figure 6).

Figure 6: Expert specialization evolves across network depth, with modal-specific experts forming organically.

Figure 7: Visual generation and understanding tasks share the same experts (Pearson r ≥ 0.90).

MoE outperforms both dense and hand-crafted separation strategies (e.g., modality-specific FFN, Mixture-of-Transformers) on all metrics, including WISE factual generation benchmarks (Figure 8).

Figure 8: Semantic encoders with MoE capture substantially more world knowledge, with 3–4x better scores than VAE-based models.

Scaling Laws: Compute-Optimal Allocation and Asymmetry

IsoFLOP scaling studies reveal a core asymmetry: vision is considerably more data-hungry than language. Dense models cannot efficiently satisfy the data demands of both modalities simultaneously. For vision, the optimal data-to-parameter exponent (b) is much higher than that for language (a).
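
In the usual compute-optimal framing (a hedged reconstruction from the exponents discussed here, not necessarily the paper's exact parameterization), the optimal parameter count $N^{*}$ and training-token count $D^{*}$ grow as power laws of compute $C$:

```latex
N^{*}(C) \propto C^{a}, \qquad D^{*}(C) \propto C^{b}, \qquad a + b \approx 1
```

A larger b for vision than for language states the asymmetry directly: at matched compute, vision's optimal budget shifts toward tokens rather than parameters.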

MoE models harmonize this gap by decoupling total parameters from active compute and sparsely allocating capacity. MoE narrows the vision-language scaling exponent gap from 0.10 (dense) to 0.05 (MoE), enabling efficient joint scaling and compute-optimal performance across modalities (Figure 9).

Figure 9: Compute-optimal model and token count projections highlight vision’s growing data demand as scale increases.

Figure 10: MoE scaling laws reduce vision-language asymmetry and deliver competitive multimodal performance.

Implication: Sparse MoE architectures are vital not only for efficiency but for structural flexibility, enabling fundamentally different scaling behaviors for vision and language to coexist within a single unified foundation model.

Practical and Theoretical Implications, Future Directions

This work clarifies previously obscured dynamics in multimodal pretraining and debunks common beliefs about inherent modality competition and architectural necessity for representation separation. Practical recommendations include adopting unified RAE-based encoders, strategically mixing diverse data sources, employing MoE for scalable multimodal routing, and leveraging general video data for world modeling capabilities.

Theoretical implications concern the strong evidence of emergent world modeling from general multimodal pretraining and the structural role of sparsity in balancing asymmetric scaling laws. The findings anticipate the convergence of multimodal and world models; as models scale and data sources diversify (especially unlabeled video), unified models are expected to achieve native embodied reasoning and richer generative control without specialized architectures.

Conclusion

"Beyond Language Modeling: An Exploration of Multimodal Pretraining" provides empirical guidance for foundational multimodal models, demonstrating that with appropriate representation, data composition, and architectural choices (notably MoE with semantic encoders), vision and language modalities can be unified without mutual degradation. Multimodal synergy, emergent world modeling, and asymmetry harmonization via sparsity collectively establish the criteria for future foundation models that transcend text and pave the road toward general-purpose, environment-grounded AI.

Explain it Like I'm 14

What this paper is about (in simple terms)

This paper explores how to build one big AI model that can understand and create both words and visuals (like images and videos) at the same time. Instead of starting from a language-only model and adding vision later, the researchers train everything together from scratch. They want to see what actually works best when teaching a model about the real world, not just text.

What questions the researchers asked

They focused on five big questions, phrased simply:

  • What is the best way to represent pictures and video inside the model so it’s good at both understanding them and creating them?
  • Does mixing different kinds of data (text, videos, image–text pairs, and action–video sequences) help or hurt the model?
  • Can training on these mixed data sources naturally teach the model “world modeling” (predicting how the world changes), without special tricks?
  • Which model designs work best when handling multiple modalities (words and visuals) at once?
  • How should we scale (grow) these models—do language and vision need the same amount of data and parameters, or are they different?

How they did it (methods in everyday language)

Think of the model as a single brain that can process sequences of tokens. Text is already a sequence of tokens (words and subwords). Images and videos are turned into sequences too, using an “encoder” that converts each picture/frame into compact tokens.

  • Learning text: The model does “next-token prediction.” That means it guesses the next word, one by one, like finishing a sentence.
  • Learning visuals: The model uses a diffusion-like process, which you can imagine as learning to remove noise from a blurry picture step by step until it becomes clear again. In training, they add noise to image tokens and teach the model to clean it up.
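
The noise-and-denoise step above can be sketched in a few lines; this is a minimal flow-matching-style training step on continuous image tokens, with all names illustrative rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_training_step(image_tokens, denoiser):
    """One noise-prediction step: corrupt the tokens, ask the model to
    recover the noise, and score it with mean squared error."""
    t = rng.uniform(0.0, 1.0)                      # random noise level
    noise = rng.standard_normal(image_tokens.shape)
    # Linear interpolation between clean tokens and pure noise.
    noisy = (1.0 - t) * image_tokens + t * noise
    pred = denoiser(noisy, t)                      # model's noise estimate
    return float(np.mean((pred - noise) ** 2))

# Exercise the step with a trivial "denoiser" that always predicts zeros.
tokens = rng.standard_normal((16, 8))
loss = diffusion_training_step(tokens, lambda x, t: np.zeros_like(x))
```

Running this step over many batches is what teaches the model to "clean up" visual tokens, while the language head is trained with ordinary next-token cross-entropy on the same sequence.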

To make visual tokens, they test several encoders:

  • VAE: a classic image compressor.
  • Semantic encoders (like SigLIP 2, DINOv2): these focus on the meaning of the image.
  • Raw pixels: feeding the model the image directly in small patches.

They also try a Representation Autoencoder (RAE) setup: use a strong semantic encoder for understanding and pair it with a decoder that can reconstruct images, so the same representation works for both understanding and generation.
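
A toy version of that RAE pairing, with linear maps standing in for the frozen semantic encoder (e.g., SigLIP 2) and the learned reconstruction decoder (illustrative only; the real system uses a diffusion decoder):

```python
import numpy as np

rng = np.random.default_rng(0)

class RepresentationAutoencoder:
    """Toy RAE: a frozen 'semantic encoder' plus a decoder fit to
    reconstruct inputs from the same representation."""

    def __init__(self, pixel_dim, latent_dim):
        # Frozen random projection stands in for a pretrained encoder.
        self.enc = rng.standard_normal((latent_dim, pixel_dim)) / np.sqrt(pixel_dim)
        self.dec = np.zeros((pixel_dim, latent_dim))  # trainable decoder

    def encode(self, x):
        return x @ self.enc.T

    def decode(self, z):
        return z @ self.dec.T

    def fit_decoder(self, images):
        # Least-squares fit of the decoder weights; a real RAE trains a
        # generative decoder instead.
        z = self.encode(images)
        w, *_ = np.linalg.lstsq(z, images, rcond=None)
        self.dec = w.T

rae = RepresentationAutoencoder(pixel_dim=32, latent_dim=8)
images = rng.standard_normal((64, 32))
rae.fit_decoder(images)
recon = rae.decode(rae.encode(images))
```

The point of the construction: the same latent feeds both directions, understanding (read the latent) and generation (decode the latent), so no second visual representation is needed.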

Data mixtures they train on:

  • Plain text.
  • Raw videos (without captions).
  • Image–text pairs (pictures with captions).
  • Action-conditioned video (sequences where a stated action leads to a next view), used for “world modeling.”

Model design:

  • A single decoder-only Transformer backbone processes both text and visual tokens.
  • They compare:
    • Shared layers vs. modality-specific sub-layers (so text tokens and vision tokens can get tailored processing).
    • Mixture-of-Experts (MoE): imagine a big team of specialists (experts). For each token, a “router” picks a few experts to process it. This gives lots of total capacity but low active cost per token.
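
A minimal sketch of that router, assuming top-k gating over FFN experts (names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def moe_forward(tokens, gate_w, experts, top_k=2):
    """Route each token to its top_k highest-scoring experts and mix
    their outputs with softmax-normalized gate weights."""
    scores = tokens @ gate_w                       # (n_tokens, n_experts)
    chosen = np.argsort(scores, axis=1)[:, -top_k:]
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        sel = chosen[i]
        w = np.exp(scores[i, sel])
        w /= w.sum()                               # renormalize over selected
        for weight, e in zip(w, sel):
            out[i] += weight * experts[e](tok)
    return out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))
gate_w = rng.standard_normal((4, 4))
experts = [lambda x: x for _ in range(4)]          # identity experts, for demo
mixed = moe_forward(tokens, gate_w, experts)
```

Because only top_k experts run per token, total capacity can grow with the number of experts while the active cost per token stays fixed, which is the property the paper's scaling results rely on.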

Scaling analysis:

  • IsoFLOP analysis: keep compute the same and see how changing data/model size affects results. This helps reveal whether vision or language benefits more from more data or more parameters.
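
The core trick behind an IsoFLOP sweep can be sketched as follows: at fixed compute, loss versus log model size is roughly parabolic, and the parabola's minimum gives the compute-optimal size (a simplified reconstruction, not the paper's exact fitting procedure):

```python
import numpy as np

def isoflop_optimum(model_sizes, losses):
    """Fit a parabola to loss vs. log(model size) for one compute budget
    and return the model size at the minimum."""
    x = np.log(np.asarray(model_sizes, dtype=float))
    a, b, _ = np.polyfit(x, np.asarray(losses, dtype=float), 2)
    return float(np.exp(-b / (2.0 * a)))  # argmin of a*x^2 + b*x + c

# Synthetic sweep whose true optimum is 1e9 parameters.
sizes = [1e8, 3e8, 1e9, 3e9, 1e10]
losses = [(np.log(n) - np.log(1e9)) ** 2 + 2.0 for n in sizes]
n_opt = isoflop_optimum(sizes, losses)
```

Repeating this across compute budgets, separately for the language and vision losses, yields the per-modality exponents that expose the scaling asymmetry.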

Key terms explained:

  • Next-token prediction: guessing the next word in a sentence.
  • Diffusion/flow matching: learning to turn a noisy picture back into a clean one, step by step.
  • Encoder/decoder: encoder turns images into compact tokens; decoder turns tokens back into images.
  • MoE (Mixture-of-Experts): many small “specialist” networks; the model learns which ones to use for each token.

What they found (main results)

Here are the big takeaways, explained simply:

  • One visual representation can do it all. Using an RAE-style approach with a semantic encoder (like SigLIP 2) gives great performance for both understanding images and generating them. You don’t need separate visual systems for the two tasks.
  • Text and vision help each other. Adding video and image–text data doesn’t hurt the model’s language ability much, and often helps vision a lot—especially for text-to-image tasks where better language understanding improves how well the picture matches the prompt.
  • Diverse, general training beats narrow, specialized training. For tasks like VQA (answering questions about images) and world modeling (predicting how the next view changes after an action), adding general data (text, video, image–text) helps more than just pouring in more task-specific data. In fact, the model can learn most of the “world modeling” skill from general video, needing only a tiny bit of action-labeled data to align it.
  • World modeling emerges naturally. By representing actions as plain text (numbers or phrases like “go forward”), the same multimodal model can predict future frames and even follow free-form language commands (like “move into the light”), without special action modules.
  • Mixture-of-Experts makes scaling efficient and smart. MoE lets the model grow a lot of total capacity without costing more per token, and it naturally learns specialization:
    • More experts focus on language (language tends to need more parameters).
    • Vision experts are more “general-purpose” across the denoising steps and are shared between image understanding and image generation.
    • Having some modality-specific shared experts (one for text, one for vision) improves performance further.
  • Scaling asymmetry: vision vs. language. Vision is more data-hungry; language is more parameter-hungry. In other words, to get better at vision you mainly need more diverse visual data, while language benefits more from larger model capacity. MoE helps balance these needs nicely in a single model.
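
The "actions as plain text" idea mentioned above needs no special machinery; a serializer like this is enough (the tag format and field order are illustrative, not the paper's exact scheme):

```python
def action_to_text(action, magnitude):
    """Render a navigation action as ordinary text so the language head
    can condition on it without a dedicated action module."""
    return f"<action> {action} {magnitude:.2f} </action>"

# Low-level keys and free-form commands share one textual interface.
low_level = action_to_text("forward", 2.0)
free_form = "move toward the light"   # plain language works as-is
```

Because actions live in the same token stream as everything else, the model can follow either precise controls or loose natural-language commands with no architectural change.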

Why these results matter:

  • A single, simpler visual representation reduces complexity and makes training and inference easier.
  • Unified training from scratch reveals true multimodal learning dynamics, instead of being biased by starting from a language-only model.
  • The synergy between text and visuals means we can build stronger, more flexible models by training them together, not separately.

Why it matters (implications and impact)

  • Simpler and stronger multimodal models: Using one shared visual representation (RAE with a semantic encoder) means fewer parts to manage and better performance in both understanding and generation.
  • Better use of data: The visual world is huge and endless, while high-quality text is limited. Training on videos and image–text pairs taps into that rich signal and teaches the model about real-world dynamics, not just words.
  • Emergent world modeling: With mostly general multimodal training plus a small amount of action-aligned data, models can learn to predict and plan in the physical world—an important step toward helpful agents and robotics.
  • Smarter scaling with MoE: Because language and vision scale differently, MoE provides a practical way to grow unified models efficiently, giving language the capacity it needs while feeding vision the data it craves.
  • A roadmap for future models: The paper offers concrete design tips—use RAE-like visual representations, train on diverse multimodal data, represent actions as text, and adopt MoE with some modality-aware choices—to build truly unified AI systems that see, talk, and act.

In short

Train one model on both words and visuals from the start. Use a unified visual representation, mix diverse data, and scale with Mixture-of-Experts. You’ll get a model that reads, sees, creates images, answers questions about them, and can even predict how the world changes—often with surprisingly little task-specific data.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following list identifies concrete gaps, uncertainties, and unexplored directions that remain open based on the paper’s methods, analyses, and evaluations.

  • “From scratch” multimodal pretraining is confounded by frozen, pretrained visual encoders/decoders (e.g., SigLIP 2, RAE decoders). Quantify how much capability is inherited versus learned by training encoders/decoders end-to-end from random initialization and comparing to the frozen setup.
  • RAE-based decoders are used off-the-shelf without reconstruction/fidelity analysis. Measure reconstruction quality, calibration, and downstream impact of decoder mismatch across domains; assess whether end-to-end finetuning of the decoder improves generation and VQA.
  • Video is encoded frame-by-frame with an image encoder at 1 FPS and block-wise masking. Evaluate true spatiotemporal video latents (e.g., video VAEs/RAEs), higher frame rates, and temporal attention priors to test whether temporal modeling limitations cap world-modeling performance.
  • Visual diffusion is trained with independent t per frame and bidirectional within-frame attention. Ablate alternative schedules (shared t across a clip, correlated noise across frames) and causal 2D/temporal attention to test their effects on coherence and controllability in video rollouts.
  • The choice and scheduling of loss weights (λ_flow vs. λ_LM) are ad hoc. Perform sensitivity analyses and curriculum schedules to determine robust weighting strategies that minimize modality interference while maximizing synergy.
  • The masking design gives bidirectional attention within frames but causal language masking. Investigate whether other masking schemes (e.g., fully causal visual masking, local 2D causal windows, bi-causal fusion layers) change cross-modal interference or alignment.
  • The definition of “vision tokens” and token-to-FLOP mapping is under-specified. Provide a standardized token accounting and compute normalization across visual representations to ensure fair IsoFLOP comparisons and reproducibility.
  • Raw pixel training is only briefly tested and underperforms. Explore whether 2D priors (relative position bias, local attention, convolutional stems), larger scale, and adapted noise schedules can close the gap to semantic encoders for both generation and understanding.
  • The superiority claim of RAE (semantic) latents over VAEs is based on limited metrics. Include broader image/video generation metrics (e.g., FID/KID, CLIPScore, human A/B tests), and test captioning/retrieval to more comprehensively validate representation choice.
  • GenEval/DPGBench may not fully capture generation quality across styles and prompts. Add human evaluations, robust text-image alignment tests, compositional generalization stress-tests, and out-of-distribution prompts.
  • VQA is evaluated after finetuning; zero-shot or few-shot results from pretraining are not reported. Quantify how much capability is present natively versus acquired via finetuning, especially for cross-modal reasoning.
  • Vision understanding evaluation is limited to VQA. Add captioning, retrieval, referential grounding, detection/segmentation, and long-form visual reasoning to assess general representation quality.
  • The reported synergy (language helping vision) may reflect improved text conditioning rather than general visual improvements. Control by evaluating unconditional generation and non-text-conditioned visual tasks.
  • World modeling is demonstrated only for navigation. Test transfer to manipulation, multimodal affordance prediction, physical dynamics forecasting, and multi-step planning tasks with varying horizons and partial observability.
  • Actions are represented as text tokens; no direct comparison to continuous vector action representations under identical setups. Quantify any performance gaps and hybrid schemes (text + continuous) for robustness and precision.
  • Planning uses CEM over predicted frames with LPIPS to the goal image. Compare to policy learning, latent-space planning objectives, success-rate metrics, obstacle-avoidance constraints, and closed-loop control in simulation and on real robots.
  • Language-driven navigation is shown qualitatively. Provide quantitative evaluations of language instruction following (e.g., success rates, path optimality, compliance under ambiguous commands) and robustness to linguistic variability.
  • Scaling asymmetry and IsoFLOP analysis are summarized but methodological details and full scaling curves are not presented. Release full fits, uncertainty bands, cross-dataset validations, and tests at larger model/data scales to confirm the asymmetry.
  • MoE is applied only to FFN layers with specific granularities; attention-layer MoE and alternative routing (token/frame-level, modality-aware gates, time-aware routing) remain unexplored.
  • Router stability, expert collapse, and modality imbalance under high sparsity are not analyzed. Report load-balance metrics, gate temperature effects, auxiliary loss sensitivities, and mitigation strategies.
  • Expert specialization analysis finds little timestep specialization; it remains unknown whether explicit timestep-conditional experts or noise-level conditioning could improve high-frequency detail or denoising efficiency.
  • Shared-expert designs are limited (one shared expert per modality). Explore multiple shared experts, depth-varying shared experts, and adaptive shared-capacity schedules to optimize capacity allocation over depth and modalities.
  • Data mixture scheduling is static. Test curricula, staged mixing, and adaptive sampling policies that respond to learning signals to further enhance cross-modal synergy and reduce distributional conflicts (e.g., caption text vs. pretraining corpus).
  • Safety, bias, privacy, and data governance are not discussed. Audit training data, measure bias/representation harms, evaluate safety in grounded tasks (e.g., unsafe action suggestions), and establish mitigation/alignment protocols for multimodal deployment.
  • Throughput, memory footprint, and wall-clock efficiency for long sequences, high FPS video, and MoE routing overhead are not reported. Profile and optimize for production constraints, and document efficiency trade-offs vs. accuracy.
  • Contamination checks are absent. Verify that evaluation sets (e.g., GenEval/DPGBench prompts, VQA benchmarks) are not present in the training mixtures (MetaCLIP/Shutterstock/etc.), and quantify any contamination risks.
  • Multilingual and cross-lingual capabilities are not evaluated; tokenization is LLaMA-3 BPE English-centric. Assess non-English performance, code-switching, and cross-lingual grounding in image/video tasks.
  • Multi-turn, interleaved multimodal dialogues and long-context reasoning are not tested despite the masking support. Evaluate conversational grounding, reference resolution across long sequences, and memory over extended video-text interactions.

Practical Applications

Below are actionable applications that follow from the paper’s findings on unified multimodal pretraining (Transfusion with diffusion for vision and next-token for language), Representation Autoencoders (RAE), multimodal data composition, emergent world modeling, and Mixture-of-Experts (MoE) architectures. Each item notes target sectors, likely tools/workflows, and feasibility dependencies.

Immediate Applications

The following can be piloted or deployed with current tooling and compute budgets similar to those used in the paper (2–3B active parameters; existing RAE decoders; standard MoE frameworks).

  • Unify vision stacks using RAE-based semantic encoders for both understanding and generation
    • Sectors: software/AI platforms, creative media, advertising, e-commerce
    • What: Replace dual-representation pipelines (e.g., VAE for T2I + CLIP/SigLIP for VQA) with a single high-dimensional semantic encoder (e.g., SigLIP 2) and an RAE decoder to support both text-to-image generation and image understanding.
    • Tools/workflows: SigLIP 2 or DINOv2 latents + RAE decoder; 25-step Euler sampler; classifier-free guidance (CFG≈3); FlexAttention for mixed masking
    • Dependencies/assumptions: Access to RAE decoders; license compatibility for encoders (e.g., SigLIP 2); image resolutions ~224 px; quality hinges on data composition and prompt alignment
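
The "25-step Euler sampler; CFG≈3" recipe can be sketched like this, where `denoise(x, t, cond)` is any velocity estimator and `cond=None` means unconditional (all names are assumptions, not a real API):

```python
import numpy as np

def cfg_euler_sample(denoise, shape, steps=25, cfg=3.0, seed=0):
    """Integrate from pure noise (t=1) toward data (t=0) with Euler steps,
    steering each step by classifier-free guidance."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v_cond = denoise(x, t0, "prompt")     # conditional estimate
        v_uncond = denoise(x, t0, None)       # unconditional estimate
        v = v_uncond + cfg * (v_cond - v_uncond)
        x = x + (t1 - t0) * v                 # Euler step
    return x

# Smoke test with a toy estimator that ignores the condition.
sample = cfg_euler_sample(lambda x, t, c: -x, (4, 4))
```
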
  • Improve text-to-image alignment by co-training/grafting extra text tokens into T2I training
    • Sectors: creative tools, marketing, game asset pipelines
    • What: Add text-only tokens alongside vision during pretraining or continued pretraining to boost prompt fidelity and compositional alignment (synergy effect in GenEval).
    • Tools/workflows: Multimodal batches with λ_LM≈1.0, λ_flow≈3.0; data schedulers that keep a steady text stream
    • Dependencies/assumptions: Sufficient high-quality text (DCLM or equivalent); prompt distribution aligned with downstream prompts
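
Those weights imply a batch objective of the form (notation mine; the paper's per-token normalization may differ):

```latex
\mathcal{L} = \lambda_{\mathrm{LM}}\,\mathcal{L}_{\mathrm{next\text{-}token}}
            + \lambda_{\mathrm{flow}}\,\mathcal{L}_{\mathrm{diffusion}},
\qquad \lambda_{\mathrm{LM}} \approx 1.0,\ \lambda_{\mathrm{flow}} \approx 3.0
```
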
  • Task-targeted image–text data curation: decouple I2T vs T2I sources
    • Sectors: MLOps/data engineering, foundation model providers
    • What: Use web-crawled captions (e.g., MetaCLIP) for understanding (I2T) and high-aesthetic sources (e.g., Shutterstock) for generation (T2I); avoid recaption distributions that drift far from your language corpus when language quality is critical.
    • Tools/workflows: Caption-distribution diagnostics (e.g., cosine distance to pretraining corpus); per-objective dataset routing
    • Dependencies/assumptions: Rights-cleared access to aesthetic datasets; monitoring language perplexity on in-domain corpora to detect drift
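
One way to implement the caption-distribution diagnostic above: embed a sample of captions and a sample of the pretraining corpus with any sentence embedder, then compare mean embeddings (a sketch; the embedder choice and the use of means are assumptions):

```python
import numpy as np

def caption_drift(caption_embs, corpus_embs):
    """Cosine distance between the mean caption embedding and the mean
    corpus embedding; larger values flag caption sources likely to
    degrade language perplexity."""
    a = caption_embs.mean(axis=0)
    b = corpus_embs.mean(axis=0)
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cosine)

# Identical distributions -> zero drift; orthogonal ones -> drift of 1.
same = caption_drift(np.ones((4, 3)), np.ones((5, 3)))
far = caption_drift(np.tile([1.0, 0.0, 0.0], (3, 1)),
                    np.tile([0.0, 1.0, 0.0], (3, 1)))
```
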
  • Drop-in modality-specific FFNs or MoE layers to improve performance at fixed compute
    • Sectors: cloud AI, enterprise AI systems, open-source model stacks
    • What: Replace shared FFNs with modality-specific FFNs or adopt MoE with per-modality shared experts to improve both language perplexity and visual metrics without increasing active FLOPs.
    • Tools/workflows: Top-k gated MoE (e.g., Switch/DeepSpeed-MoE/Tutel/GShard); per-modality shared expert configuration; granularity around G≈16
    • Dependencies/assumptions: Stable routing with load-balancing loss; inference-time MoE support in serving stack; QA for token-level routing correctness
  • Action-as-text interfaces for simulators and lab robotics
    • Sectors: robotics (R&D), simulation platforms, academic labs
    • What: Encode actions as textual tokens (e.g., WASD, numeric deltas) for video-conditioned next-state prediction and zero-shot planning via Cross-Entropy Method (CEM); use LPIPS to evaluate rollouts.
    • Tools/workflows: NWM-style training sequences; CEM (N≈120, horizon≈8); LPIPS-based planning objective
    • Dependencies/assumptions: Simulator access and logged trajectories; diffusive vision head and a semantic encoder; real-time constraints modest (planning loop may be slow)
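
The CEM planning loop described above, sketched over discrete action sequences; in the paper the cost would be LPIPS between predicted rollout frames and the goal image, while here `rollout_cost` is any callable (defaults mirror the reported N≈120, horizon≈8):

```python
import numpy as np

def cem_plan(rollout_cost, horizon=8, pop=120, elites=12, iters=4,
             num_actions=4, seed=0):
    """Cross-Entropy Method: sample candidate action sequences, keep the
    cheapest, refit the sampling distribution, repeat."""
    rng = np.random.default_rng(seed)
    probs = np.full((horizon, num_actions), 1.0 / num_actions)
    for _ in range(iters):
        cands = np.stack([
            [rng.choice(num_actions, p=probs[t]) for t in range(horizon)]
            for _ in range(pop)
        ])
        costs = np.array([rollout_cost(seq) for seq in cands])
        elite = cands[np.argsort(costs)[:elites]]
        for t in range(horizon):
            counts = np.bincount(elite[:, t], minlength=num_actions) + 1e-3
            probs[t] = counts / counts.sum()       # refit with smoothing
    return probs.argmax(axis=1)

# Toy cost: prefer action 0 everywhere (stands in for LPIPS-to-goal).
plan = cem_plan(lambda seq: int(np.sum(seq != 0)))
```

Each candidate sequence would be scored by rolling it through the world model and comparing predicted frames to the goal; the loop is gradient-free, which is why it works directly on top of a frozen multimodal predictor.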
  • Preview and plan visual navigation in creative/XR pipelines
    • Sectors: real estate virtual tours, cinematography previsualization, game engines, XR
    • What: Use language-described camera moves to roll out future frames for storyboards or virtual tours; leverage the emergent world modeling behavior.
    • Tools/workflows: Text-conditioned denoising rollouts; action prompts (“pan left 10°,” “move forward 2 m”)
    • Dependencies/assumptions: Consistent scene encoding (frames at 1 FPS or similar); acceptable latency for multi-step diffusion
  • Data-budgeting and collection guidance: vision is more data-hungry than language
    • Sectors: MLOps, data strategy, foundation model teams
    • What: Prioritize large-scale video ingestion (unlabeled is valuable) and maintain text streams; expect diminishing returns from over-scaling task-specific datasets vs. broad multimodal pretraining.
    • Tools/workflows: IsoFLOP-informed planning; token-budget schedulers; out-of-domain validation (e.g., “Notes” perplexity)
    • Dependencies/assumptions: Storage/ETL for large video corpora; data governance for video rights/privacy
  • Rapid VQA improvements through multimodal pretraining + light finetuning
    • Sectors: enterprise support tools, accessibility apps, education
    • What: Pretrain on mixed video/image–text + small VQA finetune to surpass larger VQA-only models; deploy for document/image assistance and classroom tools.
    • Tools/workflows: Cambrian-7M finetuning recipe; SigLIP 2 + RAE latents; few-epoch VQA fine-tunes
    • Dependencies/assumptions: VQA datasets matching target domain; evaluation across bias/safety benchmarks for deployment
  • Leaner infrastructure by training from scratch when LLM dependence is costly or restricted
    • Sectors: startups, regulated industries
    • What: Train a 2–3B-parameter backbone from scratch with multimodal data if licensing/compliance prevents using large pretrained LLMs; maintain solid VQA/generation without inherited LLM weights.
    • Tools/workflows: Decoder-only architecture; Transfusion-style joint training; FlexAttention hybrid masking
    • Dependencies/assumptions: Compute budget similar to paper; careful data composition to avoid language degradation

Long-Term Applications

These require additional research, scaling, engineering, or validation (e.g., larger models, higher frame rates, safety, latency optimizations, domain certification).

  • Language-driven robotics and embodied AI with minimal in-domain data
    • Sectors: logistics/warehousing, home robotics, field robotics
    • What: Use unified multimodal models to plan and execute actions from free-form language in real environments; leverage video-heavy pretraining and a small amount of task data.
    • Tools/workflows: On-robot MoE backbones; closed-loop planning replacing CEM with learned planners; safety monitors
    • Dependencies/assumptions: Real-time control; robust sim2real transfer; safety and regulatory compliance; streaming sensors at higher frame rates than the 1 FPS used in pretraining
  • General-purpose world models for AR/VR navigation and assistance
    • Sectors: XR/AR navigation, accessibility, tourism
    • What: Predict and render future views to assist navigation or preview layouts; interpret natural language commands to guide movement and gaze.
    • Tools/workflows: Low-latency diffusion or distillation to fast predictors; on-device MoE dispatch; caching/foveated rendering
    • Dependencies/assumptions: Efficient distillation of diffusion to real-time models; battery/compute constraints; precise SLAM integration
  • Cross-domain vision–language assistants that both understand and generate rich media
    • Sectors: office productivity, customer support, education
    • What: Multimodal copilots that analyze images/screens and generate images or step-by-step visual guides from text queries; unified representation reduces complexity and latency.
    • Tools/workflows: Unified encoders with RAE; MoE for efficient scaling; instruction tuning and tool-use integration
    • Dependencies/assumptions: Guardrails for hallucinations; privacy-preserving on-device or hybrid processing; UI/UX for visual generation feedback
  • Raw-pixel training at scale to reduce encoder dependence and further unify modeling
    • Sectors: foundational AI research, hyperscalers
    • What: Train directly on pixels with x-pred parameterization at massive scale to close the gap with semantic encoders for generation while retaining strong understanding.
    • Tools/workflows: Larger backbones; improved noise schedules; curriculum on resolutions
    • Dependencies/assumptions: Substantially more compute; improved optimization stability; scalable data pipelines for high-res video
  • Multisensor world models (audio, depth, proprioception) with MoE specialization
    • Sectors: robotics, smart devices, autonomous systems
    • What: Extend architecture to additional modalities; exploit MoE to specialize experts by modality and fuse later layers (observed separate-then-integrate pattern).
    • Tools/workflows: Modality-routed MoE with per-modality shared experts; late-fusion layers; self-supervised alignment
    • Dependencies/assumptions: Time-synchronized datasets; serving support for multi-branch routing; calibration across sensors
  • Safety-critical forecasting and planning in transportation and healthcare
    • Sectors: autonomous driving, medical imaging, surgical robotics
    • What: Predict future states (e.g., frames, scans) and plan actions from language or control specifications; unify perception and generation for counterfactuals.
    • Tools/workflows: Verified planners; robust uncertainty estimation; regulatory-grade evaluation suites
    • Dependencies/assumptions: High-frame-rate, high-resolution data; stringent validation, interpretability, and governance; domain-specific finetuning
  • Cinematography and virtual production: text-guided camera motion and scene evolution
    • Sectors: film/TV, game development, virtual production
    • What: Author complex camera paths and scene changes via natural language; preview multiple trajectories rapidly for creative iteration.
    • Tools/workflows: Fast rollouts via diffusion distillation; integration with DCC tools (e.g., Unreal/Unity); script-to-scene pipelines
    • Dependencies/assumptions: Latency reduction; consistent asset/style control; IP rights for training corpora
  • Energy- and compute-aware scaling strategies informed by modality asymmetry
    • Sectors: cloud providers, policy/governance
    • What: Allocate compute toward more video tokens (data-hungry vision) while using MoE to supply larger parameter capacity for language without increasing active compute.
    • Tools/workflows: IsoFLOP planning dashboards; autoscaling MoE expert pools; dataset/compute co-optimization
    • Dependencies/assumptions: Accurate measurement of energy per training regime; standardized reporting; organizational buy-in
  • Data governance and provenance frameworks for large-scale video pretraining
    • Sectors: policy, legal, data marketplaces
    • What: Establish rights/consent frameworks, provenance tracking, and watermarking for massive video ingestion needed by data-hungry vision components.
    • Tools/workflows: Data lineage systems; policy-compliant collection; synthetic augmentation pipelines
    • Dependencies/assumptions: Evolving legal norms for video; cross-institutional standards and audits
  • Scalable multimodal serving with token-level routing
    • Sectors: inference platforms, edge/cloud hybrid systems
    • What: Production-grade serving for MoE multimodal models with per-token routing and dynamic expert activation to keep latency flat while scaling capacity.
    • Tools/workflows: Partition-aware KV-cache + MoE dispatch; load balancing; expert hot-swapping
    • Dependencies/assumptions: Mature serving frameworks with MoE support; observability for routing and expert health

Notes on overarching assumptions and risks:

  • Compute and data: Vision benefits most from more diverse and larger video datasets; ensure rights and privacy. Models >10B total parameters (with MoE) may be needed for top-tier results.
  • Distribution shift: Caption distributions can degrade language perplexity; monitor and curate text sources. Instruction tuning still needed for assistant-style behavior.
  • Latency: Diffusion-based generation can be slow; distillation and specialized samplers are required for interactive use.
  • Safety and alignment: Multimodal systems can hallucinate or misinterpret prompts; add guardrails, red-teaming, and domain-specific evaluation before deployment.

Glossary

  • Absolute Trajectory Error (ATE): A metric that quantifies the deviation between a predicted navigation trajectory and the ground-truth path. "and report the absolute trajectory error (ATE) and relative pose error (RPE)."
  • Action-conditioned video: Video data where visual sequences are paired with corresponding actions used to condition predictions. "and even action-conditioned video."
  • Autoregressive: A modeling approach that predicts the next element in a sequence conditioned on all previous elements. "We train a single autoregressive model with next-token prediction for text and next-visual state prediction with flow matching."
  • Block-wise causal mask: An attention masking scheme where tokens within the same block (e.g., image frame) attend bidirectionally while maintaining causal structure across blocks. "visual data uses a block-wise causal mask where each video frame (or static image) is a separate block."
  • BPE tokenizer: Byte Pair Encoding; a subword tokenization method commonly used in LLMs. "Text is tokenized with a standard BPE tokenizer from LLaMA-3~\citep{grattafiori2024llama3} and trained with next-token prediction."
  • Cambrian-7M: A large-scale dataset for visual question answering used for finetuning and evaluation. "All VQA results are reported with 1 epoch of finetuning on the Cambrian-7M dataset."
  • Causal mask: An attention mask that prevents a token from attending to future tokens, supporting autoregressive generation. "Language modeling uses a standard causal mask."
  • CC12M: A large image dataset used for validation and measuring diffusion loss. "diffusion loss on the held-out CC12M~\citep{changpinyo2021conceptual} validation set"
  • CFG: Classifier-Free Guidance; a guidance strength parameter in diffusion models to balance fidelity and diversity. "We use a fixed CFG of 3.0 to run all image generation evaluations and do not tune this value."
  • Classifier-free guidance: A technique in diffusion models that improves conditioning without a classifier by mixing conditional and unconditional predictions. "enable classifier-free guidance at inference time~\citep{ho2022classifier}."
  • Coefficient of Variation (CV): A normalized measure of dispersion used here to assess expert specialization across diffusion timesteps. "We test this by measuring the Coefficient of Variation (CV) of expert selection rates across 10 timestep bins"
  • Cross-Entropy: A loss function for training probabilistic models, especially in language modeling. "standard autoregressive cross-entropy:"
  • Cross-Entropy Method (CEM): A stochastic optimization technique used for planning by iteratively sampling and refining action sequences. "For evaluation, we perform zero-shot planning using the Cross-Entropy Method (CEM)~\citep{rubinstein1997optimization}"
  • DCLM: A large-scale web text corpus used for pretraining and evaluation. "large-scale web text (DCLM~\citep{li2024datacomp})"
  • Decoder-only Transformer: A Transformer architecture composed solely of decoder blocks, used for autoregressive generation. "We use a decoder-only Transformer backbone~\citep{vaswani2017attention}"
  • DINOv2-L: A large self-supervised visual encoder providing semantic image representations. "DINOv2-L~\citep{Dinov2} and WebSSL-L~\citep{fan2025scaling}."
  • Diffusion: A generative modeling framework that iteratively denoises latent variables to produce data samples. "using next-token prediction for language and diffusion for vision"
  • Diffusion Forcing: A training strategy aligning denoising dynamics by injecting independent noise schedules per image/frame. "This independent noise injection aligns the training dynamics with Diffusion Forcing~\citep{chen2024diffusion}."
  • Diffusion loss: The training objective measuring error in predicted denoising velocity or target in diffusion models. "diffusion loss on the held-out CC12M~\citep{changpinyo2021conceptual} validation set"
  • DPGBench: A benchmark suite for evaluating text-to-image generation quality. "image generation score on DPGBench~\citep{hu2024ella} and GenEval~\citep{ghosh2023geneval}"
  • Encoder-agnostic: An architecture design that can accept visual tokens from any encoder without modification. "Our architecture is encoder-agnostic and can accept any visual input."
  • Euler sampler: A numerical sampler for diffusion inference, here using 25 steps for denoising. "For visual generation, the model denoises using a 25-step Euler sampler."
  • Feed-Forward Network (FFN): The per-token MLP component in Transformer blocks that expands and contracts hidden dimensions. "modality-specific FFNs (one for text and one for vision tokens) instead of shared FFNs by default."
  • FlexAttention: An attention implementation enabling flexible masking strategies for multimodal sequences. "implemented via FlexAttention~\citep{dong2024flex}."
  • FLUX.1: A visual autoencoding model providing low-dimensional latents for image generation tasks. "We study three families of encoders. First, we study VAEs from Stable Diffusion (SD-VAE)~\citep{LDM} and FLUX.1~\citep{flux}."
  • Flow matching: A training paradigm for diffusion that learns velocity fields to transform noise into data. "Following flow matching formulations~\citep{fm,rf}, we sample $t\sim\mathcal{U}[0,1]$"
  • GenEval: A benchmark for evaluating semantic and compositional alignment in text-conditioned image generation. "image generation score on DPGBench~\citep{hu2024ella} and GenEval~\citep{ghosh2023geneval}"
  • Granularity (G): The ratio setting expert size vs. count in MoE, controlling how finely capacity is partitioned. "We first analyze the role of expert granularity $G$"
  • IsoFLOP analysis: A controlled scaling methodology that keeps compute (FLOPs) constant to derive scaling laws. "Through IsoFLOP analysis, we compute scaling laws for both modalities"
  • I/T (Image–Text) data: Paired image–caption data used to learn visual understanding and text-conditioned generation. "I/T data enables visual capabilities."
  • LLaMA-3: A modern LLM whose tokenizer is used for text preprocessing. "from LLaMA-3~\citep{grattafiori2024llama3}"
  • Load-balancing loss: An auxiliary regularizer in MoE that encourages uniform expert utilization. "despite the use of an auxiliary load-balancing loss that encourages uniform utilization."
  • LPIPS: Learned Perceptual Image Patch Similarity; a perceptual metric for image similarity used in planning. "minimize the LPIPS distance~\citep{lpips}"
  • MetaCLIP: A large-scale image–text dataset derived from web data for paired training. "paired image--text data (MetaCLIP~\citep{xu2023demystifying} and in-house Shutterstock)"
  • Mixture-of-Experts (MoE): A sparse neural architecture routing tokens to specialized expert networks to increase capacity efficiently. "Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization."
  • Modality-specific FFN: Separate FFN layers per modality to improve specialized processing within a unified backbone. "we use modality-specific FFNs (one for text and one for vision tokens) instead of shared FFNs by default."
  • Navigation World Model (NWM): A framework and dataset for predicting future visual states from actions in navigation tasks. "We extend our evaluation to the Navigation World Model (NWM) setting"
  • Noise schedule: The time-dependent noise weighting used in diffusion training that controls difficulty of denoising. "we shift the noise schedule towards the noisier end of the spectrum"
  • Per-Modality Shared Expert: A MoE design where each modality has a dedicated always-active shared expert. "Per-Modality Shared Experts outperform Global Shared Experts."
  • Perplexity (PPL): A measure of LLM uncertainty or quality on a text corpus. "text perplexity (PPL) on the held-out DCLM~\citep{li2024datacomp} validation set"
  • Representation Autoencoder (RAE): A high-dimensional latent autoencoder enabling diffusion-based generation and strong visual understanding. "Representation Autoencoder (RAE) provides an optimal unified visual representation"
  • Relative Pose Error (RPE): A metric for evaluating accuracy of relative motion estimation in navigation. "and report the absolute trajectory error (ATE) and relative pose error (RPE)."
  • Scaling asymmetry: The phenomenon where different modalities scale differently with data or parameters (vision needs more data). "uncover a scaling asymmetry: vision is significantly more data-hungry than language."
  • Scaling laws: Empirical relationships showing how performance improves with data, compute, or parameters. "compute scaling laws for both modalities"
  • SD-VAE: Stable Diffusion Variational Autoencoder used as a visual latent encoder/decoder. "For the VAE encoders, we use the off-the-shelf pretrained decoders from SD-VAE~\citep{LDM} and FLUX.1~\citep{flux}."
  • SigLIP 2 So400M: A high-capacity semantic visual encoder used as the default image encoder. "By default, we use SigLIP 2 So400M, unless otherwise specified."
  • Sparsity: The MoE property of activating only a subset of experts per token to increase total capacity at fixed compute. "Sparsity scales multimodal performance."
  • Top-k routing: An MoE mechanism where the router selects the top-k experts per token. "$G{=}1$ corresponds to a baseline of 16 large experts ($d{=}8192$) with Top-1 routing, while $G{=}64$ yields 1024 small experts ($d{=}128$)\footnote{Due to implementation constraints in TorchTitan~\citep{torchtitan}, we use 1008 experts for the $G{=}64$ setting.} with Top-64 routing."
  • Transfusion framework: A unified training setup combining autoregressive language modeling with diffusion-based vision modeling. "We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision,"
  • U-Net: A convolutional architecture often used in diffusion models; replaced here by linear projections for simplicity. "we use simple linear projection layers instead of a U-Net~\citep{unet}."
  • v-pred: Diffusion target parameterization predicting velocity toward clean data, often used with flow matching. "For RAE (SigLIP 2), $x$-pred (solid) outperforms $v$-pred (dashed) on image generation; for VAE (FLUX.1), $v$-pred is better."
  • Variational Autoencoder (VAE): A generative encoder–decoder model that learns a low-dimensional latent distribution of images. "Variational Autoencoders (VAE) and semantic representations to raw pixels."
  • Velocity field: The vector field predicted in flow matching that guides latents from noise to data. "The model predicts a velocity field $v_\theta(z_t, t, \text{context})$"
  • Visual Question Answering (VQA): A task where models answer questions about images, evaluating visual understanding. "average VQA accuracy on the 16 Cambrian evaluation benchmarks~\citep{tong2024cambrian}."
  • WebSSL-L: A large self-supervised visual encoder used for high-dimensional latents. "DINOv2-L~\citep{Dinov2} and WebSSL-L~\citep{fan2025scaling}."
  • World modeling: Learning predictive models of the environment’s dynamics from multimodal data. "unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training;"
  • x-pred: Diffusion target parameterization predicting the clean sample directly, useful in high-dimensional latents. "predicting $z_0$ directly (``$x$-pred'')~\citep{li2025jit}."
  • Zero-shot planning: Planning without task-specific training, using a learned model to optimize action sequences. "For evaluation, we perform zero-shot planning using the Cross-Entropy Method (CEM)~\citep{rubinstein1997optimization}"
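
To make the flow-matching and Euler-sampler entries above concrete, here is a minimal sketch (toy NumPy code, not the paper's implementation): the training pair uses the linear-interpolation flow, and sampling integrates the velocity field from noise (t=1) back to data (t=0). With an exact, constant oracle velocity, the Euler sampler recovers the clean latent exactly.

```python
import numpy as np

def flow_matching_target(z0, eps, t):
    # Linear-interpolation flow: noisy latent z_t and the constant
    # velocity target (eps - z0) the model regresses at timestep t.
    z_t = (1.0 - t) * z0 + t * eps
    return z_t, eps - z0

def euler_sample(v_theta, z_noise, steps=25):
    # Integrate the learned velocity field from noise (t=1) back to
    # data (t=0) with a fixed-step Euler sampler (25 steps in the paper).
    z, dt = z_noise.copy(), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        z = z - dt * v_theta(z, t)
    return z

z0, eps = np.array([2.0, -1.0]), np.array([0.3, 0.7])
z_half, v_target = flow_matching_target(z0, eps, t=0.5)
# Oracle velocity is constant along this straight path, so Euler is exact.
recovered = euler_sample(lambda z, t: eps - z0, eps)
print(z_half, v_target, np.round(recovered, 6))
```

A trained model replaces the oracle lambda with its own prediction, and the integration error then depends on the step count.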
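
The classifier-free guidance entries above reduce to a one-line mixing rule. A minimal sketch (the `cfg_mix` helper is hypothetical; the paper fixes the scale at 3.0):

```python
import numpy as np

def cfg_mix(v_uncond, v_cond, scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one by the guidance scale.
    return v_uncond + scale * (v_cond - v_uncond)

v_u = np.array([0.0, 1.0])  # toy unconditional velocity prediction
v_c = np.array([1.0, 1.0])  # toy conditional velocity prediction
print(cfg_mix(v_u, v_c, 1.0))  # scale 1 recovers the conditional prediction
print(cfg_mix(v_u, v_c, 3.0))  # scale 3 amplifies the conditioning signal
```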
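
The zero-shot planning loop via the Cross-Entropy Method can be sketched as follows; this is a generic CEM implementation with a toy squared-distance cost standing in for the paper's LPIPS objective (hyperparameters here are illustrative, not the paper's):

```python
import numpy as np

def cem_plan(cost_fn, dim, iters=30, pop=64, n_elite=8, seed=0):
    # Cross-Entropy Method: sample candidate action sequences from a
    # Gaussian, keep the lowest-cost elites, refit the Gaussian, repeat.
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        costs = np.array([cost_fn(s) for s in samples])
        elites = samples[np.argsort(costs)[:n_elite]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Toy stand-in for the LPIPS-based cost: squared distance to a goal state.
goal = np.array([1.0, -2.0, 0.5])
plan = cem_plan(lambda a: float(np.sum((a - goal) ** 2)), dim=3)
print(np.round(plan, 2))
```

In the paper's setting, `cost_fn` would roll out the world model under a candidate action sequence and score the predicted frames against the goal image.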
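
The perplexity entry above corresponds to a short formula; a minimal sketch:

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(mean negative log-likelihood per token); lower is better.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns every token probability 1/4 has perplexity 4.
print(perplexity([math.log(0.25)] * 10))  # ~4.0
```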
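
The Top-k routing and load-balancing entries above can be sketched together. The routing follows the standard top-k-then-renormalize pattern; the auxiliary loss shown is the common Switch-Transformer-style formulation, which may differ from the paper's exact loss:

```python
import numpy as np

def top_k_route(router_logits, k):
    # Keep the k highest-scoring experts per token and renormalize
    # their softmax weights so each token's gates sum to 1.
    top = np.argsort(router_logits, axis=-1)[:, -k:]
    gates = np.take_along_axis(router_logits, top, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    return top, gates / gates.sum(axis=-1, keepdims=True)

def load_balance_loss(gate_probs, assignments, n_experts):
    # Switch-style auxiliary loss: n_experts * sum_i f_i * P_i, where f_i is
    # the fraction of tokens routed to expert i and P_i its mean router
    # probability; it is minimized (= 1) under perfectly uniform routing.
    f = np.bincount(assignments, minlength=n_experts) / len(assignments)
    return n_experts * float(np.sum(f * gate_probs.mean(axis=0)))

# Two tokens routed over four experts with Top-2 routing.
logits = np.array([[2.0, 0.1, 1.0, -1.0],
                   [0.0, 3.0, 0.5, 2.5]])
experts, gates = top_k_route(logits, k=2)
print(experts, gates.sum(axis=-1))  # chosen experts; gates sum to 1 per token

uniform_probs = np.full((4, 4), 0.25)
print(load_balance_loss(uniform_probs, np.arange(4), 4))  # 1.0 at perfect balance
```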
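
The Coefficient of Variation used to measure timestep specialization of experts is a simple normalized dispersion; a minimal sketch with made-up selection rates:

```python
import numpy as np

def coefficient_of_variation(rates):
    # CV = std / mean; a higher CV across the 10 timestep bins means an
    # expert's selection rate depends more strongly on the diffusion timestep.
    rates = np.asarray(rates, dtype=float)
    return rates.std() / rates.mean()

uniform = [0.1] * 10                    # expert used equally in every bin
specialized = [0.5, 0.3] + [0.025] * 8  # expert concentrated on noisy bins
print(coefficient_of_variation(uniform))      # 0.0
print(coefficient_of_variation(specialized))  # > 1: strongly timestep-specialized
```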
