- The paper introduces a novel Y-shaped Transformer that separates shared shallow layers from task-specific deep branches, enabling superior image understanding and generation performance.
- Modality alignment analysis using mutual-kNN metrics shows that understanding benefits from progressively deepening semantic alignment, while generation aligns semantics early and decouples in late layers to recover spatial detail.
- Empirical results across benchmarks confirm that UniFork outperforms fully shared models in both tasks, offering an efficient, scalable solution for unified multimodal AI.
UniFork: Modality Alignment for Unified Multimodal Understanding and Generation
Motivation and Background
The pursuit of unified architectures for multimodal understanding and generation is a central challenge in current vision-language modeling. Recent works have converged on Transformer-based approaches where cross-modal inputs are embedded within a shared representation space. Despite this unified approach, practical deployments face significant trade-offs: image understanding tasks require progressive deep-semantic cross-modal alignment, while generation tasks prioritize initial alignment followed by late-stage decoupling to recover spatial fidelity. This divergent representational demand induces task interference and performance degradation when fully sharing network parameters—a problem accentuated under the next-token prediction (NTP) paradigm.
Prevailing frameworks attempt to bridge these differences either via external diffusion heads, hybrid loss objectives, or dual-path encoders [wu2025janus, januspro, showo, zhou2024transfusion], but these increase system complexity and deviate from the streamlined NTP principle. The intrinsic representational conflict in fully shared Transformers remains underexplored, raising the need for architecture-driven solutions that balance shared learning with specialization.
Modality Alignment Analysis
UniFork conducts a systematic analysis of modality alignment dynamics across Transformer depth, using mutual-kNN metrics to quantify feature alignment between image and text tokens for both generation and understanding tasks. Key experimental findings include:
- Image understanding: Characterized by monotonically increasing modality alignment scores with network depth. Deep layers aggregate and reinforce semantically grounded cross-modal features essential for comprehension tasks.
- Image generation: Exhibits a rise-then-fall alignment trajectory—early layers focus on prompt-image semantic grounding, while late layers decouple to synthesize high-frequency visual details and spatial attributes.
In models with fully shared Transformer backbones (e.g., Emu3-base), alignment curves for both tasks converge to a compromised intermediate pattern that satisfies neither task's requirements. Task-specific fine-tuning recovers the expected alignment trends, confirming that the compromise stems from parameter sharing. These results hold across both short and long prompts, reaffirming the generalizability of the divergent alignment dynamics.
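The analysis above can be sketched as a simple mutual-kNN score between paired image and text features. The following is a minimal NumPy illustration of the metric's idea, not the paper's implementation; the pooling of token features into one vector per sample and the choice of k are assumptions:

```python
import numpy as np

def mutual_knn_alignment(img_feats, txt_feats, k=5):
    """Fraction of shared k-nearest neighbors across two feature spaces.

    img_feats, txt_feats: (N, D) arrays of layer activations for N paired
    samples (e.g., pooled image tokens and pooled text tokens).
    Higher scores mean the two modalities organize samples more similarly.
    """
    def knn_sets(feats):
        # Cosine similarity: normalize rows, then rank neighbors.
        f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = f @ f.T
        np.fill_diagonal(sim, -np.inf)          # exclude self-matches
        idx = np.argsort(-sim, axis=1)[:, :k]   # top-k neighbors per row
        return [set(row) for row in idx]

    img_nn, txt_nn = knn_sets(img_feats), knn_sets(txt_feats)
    overlaps = [len(a & b) / k for a, b in zip(img_nn, txt_nn)]
    return float(np.mean(overlaps))
```

Tracking this score layer by layer is what surfaces the monotone-increase (understanding) versus rise-then-fall (generation) trajectories.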
UniFork Architecture
UniFork adopts a Y-shaped Transformer backbone initialized from Qwen2.5-0.5B, operationalizing the insight that cross-task semantic learning is best confined to shallow layers, with decoupled deep branches enabling task-specific specialization. Architectural features include:
- Shared shallow layers: Facilitate cross-modal semantic representation, supporting both understanding and generation during initial encoding.
- Task-specific deep branches: Structurally identical yet independently parameterized branches optimize for semantic reinforcement (understanding) or spatial detail reconstruction (generation).
- Single visual tokenizer: Employs the VILA-U tokenizer [vilau]—chosen for its balance between reconstruction quality and cross-modal alignment.
- Autoregressive image head: Predicts quantized token codes for images using residual vector quantization [rqvae].
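The residual vector quantization used by the image head can be illustrated with a small sketch: each stage quantizes the residual error left by the previous codebook, so codes progressively refine the reconstruction. This is a generic RVQ encoder under made-up codebooks, not the paper's tokenizer:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: successive codebooks refine the residual.

    x:         (D,) continuous feature to quantize.
    codebooks: list of (K, D) arrays; stage i quantizes what stage i-1 missed.
    Returns the per-stage code indices and the final reconstruction.
    """
    residual, codes, recon = x.copy(), [], np.zeros_like(x)
    for cb in codebooks:
        # Pick the codeword nearest to the current residual.
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]   # next stage sees only the error
    return codes, recon
```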
This shared-then-split paradigm is highly modular: reducing branch length collapses the architecture to fully shared baselines (e.g., Emu3), while eliminating sharing approaches the mixture-of-Transformers configuration in models like BAGEL [bagel].
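The shared-then-split routing can be summarized structurally. In this sketch the layer objects are plain callables standing in for Transformer blocks; the class name and branch labels are illustrative, not taken from the paper's code:

```python
class UniForkBackbone:
    """Minimal structural sketch of a shared-then-split (Y-shaped) stack."""

    def __init__(self, shared_layers, und_layers, gen_layers):
        self.shared = shared_layers          # cross-task semantic layers
        self.branches = {
            "understanding": und_layers,     # semantic reinforcement
            "generation": gen_layers,        # spatial detail reconstruction
        }

    def forward(self, x, task):
        # All tasks pass through the shared shallow layers first...
        for layer in self.shared:
            x = layer(x)
        # ...then route into the task-specific deep branch.
        for layer in self.branches[task]:
            x = layer(x)
        return x
```

Shrinking the branches recovers a fully shared stack, and shrinking the shared prefix recovers a mixture-of-Transformers layout, matching the modularity noted above.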
Training and Optimization
UniFork's training pipeline unfolds in three stages:
- Visual alignment pretraining: Frozen LLM parameters; visual connector and image head trained on dual tasks over ImageNet-1K, Laion-En, and COYO data. Prompts and captions formatted to maintain alignment consistency.
- Joint optimization: Unfreezes all components for multitask pretraining and instruction tuning using broader and more diverse datasets. Format maintains the unified INPUT:MSG-RESPONSE structure for both modalities.
- Task-specific fine-tuning: Isolation of branch parameters, enabling dedicated optimization without shared-layer interference.
The loss is standard cross-entropy over autoregressively modeled tokens, applied separately to image and text tokens depending on the task. No complex task-weighting heuristics are imposed.
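The objective above is plain next-token cross-entropy restricted to the supervised token stream. A minimal NumPy sketch follows; the explicit per-token mask is an assumed convention for illustrating how one loss serves both tasks:

```python
import numpy as np

def ntp_cross_entropy(logits, targets, loss_mask):
    """Next-token-prediction cross-entropy with a per-token mask.

    logits:    (T, V) unnormalized scores for each position.
    targets:   (T,)   ground-truth token ids (already shifted by one).
    loss_mask: (T,)   1.0 where the loss applies (e.g., only image tokens
               for generation, only answer text tokens for understanding).
    """
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()
```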
Experimental Results
Ablation Analysis
Comparative studies involving expert models, fully shared LLMs, and variants of UniFork (with identical parameter budgets) yield:
- Superior trade-off: UniFork consistently outperforms fully shared architectures on both tasks and matches or exceeds task-specific experts. On critical benchmarks, including MME-P, VQAv2, SEED-I, Geneval, and MJHQ, UniFork achieves significant gains in both understanding and generation metrics.
- Scaling efficiency: Moderate parameter scaling yields substantial performance improvements, indicating that the fully shared design, rather than raw capacity, was the bottleneck.
Multimodal Benchmarks
- Image understanding: UniFork achieves 85.8 POPE and 55.2 SEEDv1 scores using only 0.5B active inference parameters, surpassing several larger unified and expert models [showo, mobilevlm, IDEFICS-9B].
- Image generation: The main UniFork variant obtains 0.46 overall accuracy on GenEval (+39% over its ablation baseline, outperforming LDM, SDv1.5, LlamaGen, LWM, Chameleon) and 10.6 FID on MJHQ-30K (a 35% reduction, eclipsing Show-o and LWM).
Qualitative analyses confirm that architectural decoupling improves spatial detail recovery and semantic grounding in both understanding and generation tasks.
Modality Alignment Verification
Final UniFork models reproduce expert-like alignment patterns post-training, with monotonic increases for understanding and rise-fall trajectories for generation. This confirms that the Y-shaped architecture resolves representational conflicts inherent in unified multimodal modeling.
Implications and Future Directions
UniFork establishes that a shared-then-split Transformer backbone is a minimal, effective design for unified multimodal understanding and generation. Its parameter modularity enables scalability and flexible deployment, with empirically validated improvements in both domains at modest cost.
Practically, UniFork stabilizes multitask training by eliminating the need for delicate data balancing or hybrid objectives. Theoretically, the modality alignment methodology provides a robust analytical tool for diagnosing and optimizing cross-modal feature flow, capable of extension to novel modalities.
Future work should systematically explore:
- Parameter ratio optimization between shared and task-specific layers, potentially conditional on data/task complexity.
- Integration of richer visual tokenizers and higher fidelity pretraining corpora to further enhance generation capability.
- Extension to tri-modal or multimedia architectures (audio, video, 3D) building on UniFork’s alignment principles.
- Advanced interleaved data scheduling and large-scale instruction tuning for emergent reasoning abilities and generalization.
Conclusion
UniFork advances the design of unified multimodal models by resolving representational conflicts at the structural level. Through comprehensive alignment analysis and ablation, UniFork achieves strong numerical performance and efficient scalability. Its shared-then-split backbone is an effective baseline for future research into unified multimodal systems and offers a path for extending alignment-driven reasoning to arbitrary modalities (arXiv:2506.17202).