- The paper introduces a novel Y-shaped Transformer that separates shared shallow layers from task-specific deep branches, enabling superior image understanding and generation performance.
- Modality alignment analysis using mutual-kNN metrics shows that understanding benefits from progressively deepening semantic alignment, while generation aligns semantics early and decouples in late layers to recover spatial detail.
- Empirical results across benchmarks confirm that UniFork outperforms fully shared models in both tasks, offering an efficient, scalable solution for unified multimodal AI.
UniFork: Modality Alignment for Unified Multimodal Understanding and Generation
Motivation and Background
The pursuit of unified architectures for multimodal understanding and generation is a central challenge in current vision-language modeling. Recent works have converged on Transformer-based approaches where cross-modal inputs are embedded within a shared representation space. Despite this unified approach, practical deployments face significant trade-offs: image understanding tasks require progressive deep-semantic cross-modal alignment, while generation tasks prioritize initial alignment followed by late-stage decoupling to recover spatial fidelity. This divergent representational demand induces task interference and performance degradation when fully sharing network parameters—a problem accentuated under the next-token prediction (NTP) paradigm.
Prevailing frameworks attempt to bridge these differences either via external diffusion heads, hybrid loss objectives, or dual-path encoders [wu2025janus, januspro, showo, zhou2024transfusion], but these increase system complexity and deviate from the streamlined NTP principle. The intrinsic representational conflict in fully shared Transformers remains underexplored, raising the need for architecture-driven solutions that balance shared learning with specialization.
Modality Alignment Analysis
UniFork conducts a systematic analysis of modality alignment dynamics across Transformer depth, using mutual-kNN metrics to quantify feature alignment between image and text tokens for both generation and understanding tasks. Key experimental findings include:
- Image understanding: Characterized by monotonically increasing modality alignment scores with network depth. Deep layers aggregate and reinforce semantically grounded cross-modal features essential for comprehension tasks.
- Image generation: Exhibits a rise-then-fall alignment trajectory—early layers focus on prompt-image semantic grounding, while late layers decouple to synthesize high-frequency visual details and spatial attributes.
In models with fully shared Transformer backbones (e.g., Emu3-base), alignment curves for both tasks converge to a compromised intermediate pattern that satisfies neither task's requirements. Task-specific fine-tuning recovers the expected alignment trends, confirming that the compromise stems from parameter sharing. These results hold across both short and long prompts, reaffirming the generalizability of the divergent alignment dynamics.
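The analysis above can be sketched as a simple mutual-kNN score between paired image and text features. The following is a minimal NumPy illustration of the metric's idea, not the paper's implementation; the pooling of token features into one vector per sample and the choice of k are assumptions:

```python
import numpy as np

def mutual_knn_alignment(img_feats, txt_feats, k=5):
    """Fraction of shared k-nearest neighbors across two feature spaces.

    img_feats, txt_feats: (N, D) arrays of layer activations for N paired
    samples (e.g., pooled image tokens and pooled text tokens).
    Higher scores mean the two modalities organize samples more similarly.
    """
    def knn_sets(feats):
        # Cosine similarity: normalize rows, then rank neighbors.
        f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = f @ f.T
        np.fill_diagonal(sim, -np.inf)          # exclude self-matches
        idx = np.argsort(-sim, axis=1)[:, :k]   # top-k neighbors per row
        return [set(row) for row in idx]

    img_nn, txt_nn = knn_sets(img_feats), knn_sets(txt_feats)
    overlaps = [len(a & b) / k for a, b in zip(img_nn, txt_nn)]
    return float(np.mean(overlaps))
```

Tracking this score layer by layer is what surfaces the monotone-increase (understanding) versus rise-then-fall (generation) trajectories.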
UniFork Architecture
UniFork adopts a Y-shaped Transformer backbone initialized from Qwen2.5-0.5B, operationalizing the insight that cross-task semantic learning is best confined to shallow layers, with decoupled deep branches enabling task-specific specialization. Architectural features include:
- Shared shallow layers: Facilitate cross-modal semantic representation, supporting both understanding and generation during initial encoding.
- Task-specific deep branches: Structurally identical yet independently parameterized branches optimize for semantic reinforcement (understanding) or spatial detail reconstruction (generation).
- Single visual tokenizer: Employs the VILA-U tokenizer [vilau]—chosen for its balance between reconstruction quality and cross-modal alignment.
- Autoregressive image head: Predicts quantized token codes for images using residual vector quantization [rqvae].
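The residual vector quantization used by the image head can be illustrated with a small sketch: each stage quantizes the residual error left by the previous codebook, so codes progressively refine the reconstruction. This is a generic RVQ encoder under made-up codebooks, not the paper's tokenizer:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: successive codebooks refine the residual.

    x:         (D,) continuous feature to quantize.
    codebooks: list of (K, D) arrays; stage i quantizes what stage i-1 missed.
    Returns the per-stage code indices and the final reconstruction.
    """
    residual, codes, recon = x.copy(), [], np.zeros_like(x)
    for cb in codebooks:
        # Pick the codeword nearest to the current residual.
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]   # next stage sees only the error
    return codes, recon
```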
This shared-then-split paradigm is highly modular: reducing branch length collapses the architecture to fully shared baselines (e.g., Emu3), while eliminating sharing approaches the mixture-of-Transformers configuration in models like BAGEL [bagel].
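The shared-then-split routing can be summarized structurally. In this sketch the layer objects are plain callables standing in for Transformer blocks; the class name and branch labels are illustrative, not taken from the paper's code:

```python
class UniForkBackbone:
    """Minimal structural sketch of a shared-then-split (Y-shaped) stack."""

    def __init__(self, shared_layers, und_layers, gen_layers):
        self.shared = shared_layers          # cross-task semantic layers
        self.branches = {
            "understanding": und_layers,     # semantic reinforcement
            "generation": gen_layers,        # spatial detail reconstruction
        }

    def forward(self, x, task):
        # All tasks pass through the shared shallow layers first...
        for layer in self.shared:
            x = layer(x)
        # ...then route into the task-specific deep branch.
        for layer in self.branches[task]:
            x = layer(x)
        return x
```

Shrinking the branches recovers a fully shared stack, and shrinking the shared prefix recovers a mixture-of-Transformers layout, matching the modularity noted above.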
Training and Optimization
UniFork's training pipeline unfolds in three stages:
- Visual alignment pretraining: Frozen LLM parameters; visual connector and image head trained on dual tasks over ImageNet-1K, Laion-En, and COYO data. Prompts and captions formatted to maintain alignment consistency.
- Joint optimization: Unfreezes all components for multitask pretraining and instruction tuning using broader and more diverse datasets. Format maintains the unified INPUT:MSG-RESPONSE structure for both modalities.
- Task-specific fine-tuning: Isolation of branch parameters, enabling dedicated optimization without shared-layer interference.
The loss is standard cross-entropy over autoregressively modeled tokens, applied separately to image and text tokens depending on the task. No complex task-weighting heuristics are imposed.
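The objective above is plain next-token cross-entropy restricted to the supervised token stream. A minimal NumPy sketch follows; the explicit per-token mask is an assumed convention for illustrating how one loss serves both tasks:

```python
import numpy as np

def ntp_cross_entropy(logits, targets, loss_mask):
    """Next-token-prediction cross-entropy with a per-token mask.

    logits:    (T, V) unnormalized scores for each position.
    targets:   (T,)   ground-truth token ids (already shifted by one).
    loss_mask: (T,)   1.0 where the loss applies (e.g., only image tokens
               for generation, only answer text tokens for understanding).
    """
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()
```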
Experimental Results
Ablation Analysis
Comparative studies involving expert models, fully shared LLMs, and variants of UniFork (with identical parameter budgets) yield:
- Superior trade-off: UniFork consistently outperforms fully shared architectures on both tasks and matches or exceeds task-specific experts. On critical benchmarks, including MME-P, VQAv2, SEED-I, Geneval, and MJHQ, UniFork achieves significant gains in both understanding and generation metrics.
- Scaling efficiency: Moderate parameter scaling yields substantial performance improvements, indicating that the fully shared design, rather than raw capacity, was the bottleneck.
Multimodal Benchmarks
- Image understanding: UniFork achieves 85.8 POPE and 55.2 SEEDv1 scores using only 0.5B active inference parameters, surpassing several larger unified and expert models [showo, mobilevlm, IDEFICS-9B].
- Image generation: The main UniFork variant obtains 0.46 overall accuracy on GenEval (+39% over its ablation baseline, outperforming LDM, SDv1.5, LlamaGen, LWM, Chameleon) and 10.6 FID on MJHQ-30K (a 35% reduction, eclipsing Show-o and LWM).
Qualitative analyses confirm that architectural decoupling improves spatial detail recovery and semantic grounding in both understanding and generation tasks.
Modality Alignment Verification
Final UniFork models reproduce expert-like alignment patterns post-training, with monotonic increases for understanding and rise-fall trajectories for generation. This confirms that the Y-shaped architecture resolves representational conflicts inherent in unified multimodal modeling.
Implications and Future Directions
UniFork establishes that a shared-then-split Transformer backbone is a minimal, effective design for unified multimodal understanding and generation. Its parameter modularity enables scalability and flexible deployment, with empirically validated improvements in both domains at modest cost.
Practically, UniFork stabilizes multitask training by eliminating the need for delicate data balancing or hybrid objectives. Theoretically, the modality alignment methodology provides a robust analytical tool for diagnosing and optimizing cross-modal feature flow, capable of extension to novel modalities.
Future work should systematically explore:
- Parameter ratio optimization between shared and task-specific layers, potentially conditional on data/task complexity.
- Integration of richer visual tokenizers and higher fidelity pretraining corpora to further enhance generation capability.
- Extension to tri-modal or multimedia architectures (audio, video, 3D) building on UniFork’s alignment principles.
- Advanced interleaved data scheduling and large-scale instruction tuning for emergent reasoning abilities and generalization.
Conclusion
UniFork advances the design of unified multimodal models by resolving representational conflicts at the structural level. Through comprehensive alignment analysis and ablation, UniFork achieves strong numerical performance and efficient scalability. Its shared-then-split backbone is an effective baseline for future research into unified multimodal systems and offers a path for extending alignment-driven reasoning to arbitrary modalities (arXiv:2506.17202).