Skywork UniPic 3.0: Unified Multimodal Framework
- Skywork UniPic 3.0 is a unified multimodal framework that supports both single-image editing and multi-image composition, emphasizing Human-Object Interaction (HOI) synthesis.
- It integrates vision-language encoding, image tokenization, and a diffusion-based backbone under a fixed global pixel budget to handle 1–6 input images efficiently.
- The framework achieves state-of-the-art results on editing and composition benchmarks through rigorous data curation, multi-stage training, and accelerated inference requiring only 8 function evaluations.
Skywork UniPic 3.0 is a unified multimodal generative framework designed for both single-image editing and multi-image composition, emphasizing Human-Object Interaction (HOI)-centric synthesis. It supports an arbitrary number (1–6) and resolution of input images under a shared pixel budget, outputs high-fidelity compositions at target resolutions, and achieves state-of-the-art performance on editing and composition benchmarks by recasting composition as a sequence-modeling task with efficient inference (Wei et al., 22 Jan 2026).
1. Architecture and Unified Sequence Construction
Skywork UniPic 3.0 integrates various subsystems to address the complexity of compositional image synthesis across modalities. The framework consists of:
- Vision–Language Encoder: Utilizes Qwen2.5-VL to encode user instructions alongside metadata (e.g., capture resolution, shape descriptors).
- Image Tokenizer: Employs a VAE encoder that maps each input image $I_i$ to a spatial latent tensor $z_i \in \mathbb{R}^{c \times h_i \times w_i}$.
- Patch-Wise Packing: Each latent is flattened into a sequence of patch tokens $x_i = \mathrm{Patchify}(z_i) \in \mathbb{R}^{L_i \times d}$, with $L_i = (h_i/p)(w_i/p)$ for patch size $p$ and token dimension $d = p^2 c$.
- Backbone Diffusion Model: An MMDiT (multimodal latent diffusion transformer) that operates over a single unified sequence of tokens.
For processing, the $N$ reference images and the output image are packed as one concatenated sequence $X = [x_1; \dots; x_N; x_{\mathrm{out}}]$, with a parallel sequence of shape descriptors $S = [s_1, \dots, s_N, s_{\mathrm{out}}]$, where each $s_i = (h_i, w_i)$ records the latent resolution so tokens can be unpacked back into patches.
To preserve computational tractability, the model enforces a global pixel budget $\sum_i h_i w_i \le B$, ensuring that the combined input–output resolution remains bounded.
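The packing logic above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the VAE downsample factor, patch size, and budget value are assumptions chosen for the example.

```python
import math

# Illustrative sketch of patch-wise packing under a global pixel budget. The
# downsample factor, patch size, and budget are assumptions for illustration,
# not values from the paper.
VAE_DOWNSAMPLE = 8              # assumed VAE spatial downsampling factor
PATCH = 2                       # assumed patch size in latent space
PIXEL_BUDGET = 2 * 1024 * 1024  # assumed cap on combined input-output pixels

def pack(image_sizes):
    """image_sizes: (H, W) pairs for the N references plus the output."""
    total = sum(h * w for h, w in image_sizes)
    # Uniformly rescale everything if the combined resolution exceeds the budget.
    scale = min(1.0, math.sqrt(PIXEL_BUDGET / total))
    tokens_per_image, shape_descriptors = [], []
    for h, w in image_sizes:
        # Latent grid after VAE encoding, snapped down to the patch size.
        lh = max(PATCH, int(h * scale / VAE_DOWNSAMPLE) // PATCH * PATCH)
        lw = max(PATCH, int(w * scale / VAE_DOWNSAMPLE) // PATCH * PATCH)
        shape_descriptors.append((lh, lw))   # lets tokens be unpacked to patches
        tokens_per_image.append((lh // PATCH) * (lw // PATCH))
    return tokens_per_image, shape_descriptors

tokens, shapes = pack([(1024, 768), (512, 512), (1024, 1024)])
```

The shape descriptors returned alongside the token counts are what allow the model to restore token-to-patch correspondence at reconstruction time.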
2. Sequence-Modeling Paradigm for Unified Tasks
Both single-image editing and multi-image composition are formulated uniformly as sequence-to-sequence conditional synthesis: context tokens $c$ (the packed reference images plus the encoded instruction) condition the generation of output tokens $x_{\mathrm{out}}$, i.e. the model learns $p_\theta(x_{\mathrm{out}} \mid c)$, trained with the conditional denoising objective detailed in Section 4.
Variable-length reference handling is inherently supported by concatenation, subject to pixel budget and accompanied by explicit shape descriptors to maintain token-to-patch correspondence during generation and reconstruction.
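The conditioning scheme can be illustrated with a toy snippet: reference (context) tokens stay clean, only the output tokens are noised, and a loss mask restricts training to the output segment. All shapes and the interpolation here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Sketch of the unified sequence-to-sequence formulation: reference (context)
# tokens stay clean, only the output tokens are noised, and the loss mask
# restricts training to the output segment. Shapes are illustrative.
rng = np.random.default_rng(0)
d = 16
ref_tokens = rng.standard_normal((4096, d))  # packed reference-image tokens
out_tokens = rng.standard_normal((4096, d))  # target-composition tokens

t = 0.5
eps = rng.standard_normal(out_tokens.shape)
noised_out = (1 - t) * out_tokens + t * eps  # flow-matching interpolation

# The backbone attends over one concatenated sequence...
seq = np.concatenate([ref_tokens, noised_out], axis=0)
# ...but the denoising loss is taken only where the mask is 1 (output tokens).
loss_mask = np.concatenate([np.zeros(len(ref_tokens)), np.ones(len(out_tokens))])
```

Because references of any count simply extend the clean prefix of the sequence, variable-length conditioning needs no architectural change.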
3. Data Curation, Filtering, and Synthesis Pipeline
A stringent and automated three-stage data pipeline is central to Skywork UniPic 3.0’s empirical performance, with an explicit focus on HOI-centric compositions:
- Data Collection:
- Person images: 18,000 from CC12M, captioned with InternVL3.5-38B for fine-grained pose and apparel labeling.
- Object categories: 300 human-interactive classes generated with GPT-4o; 5,000 text prompts per class rendered via Qwen-Image, yielding 150,000 images.
- Data Filtering:
- InternVL3.5-38B: holistic quality scoring on a fixed scale; only instances at or above a quality-score cutoff are retained.
- Face/body detectors: minimum thresholds on face visibility and on subject occupancy within the frame.
- CLIPScore and resolution: prompt–image consistency must exceed a CLIPScore threshold, and object images must meet a minimum resolution.
- Synthesis:
- For each reference count $N$ in the supported range, consider all valid person–object combinations, subject to a conflict matrix (e.g., one person cannot hold multiple guitars).
- Compose prompts using InternVL3.5; composition targets are synthesized by Nano-Banana or Seedream 4.0 depending on the number of references.
- Retain only samples passing stringent aesthetic and identity checks.
This procedure results in 215,000 HOI triplets, with the full dataset comprising approximately 700,000 high-quality sample pairs.
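The multi-criterion filtering stage can be sketched as a simple predicate over scored samples. All threshold values and field names below are placeholders: the paper elides the exact cutoffs.

```python
# Hedged sketch of the multi-criterion filtering stage. The threshold values
# and scoring fields are placeholders; the paper elides the exact cutoffs.
THRESHOLDS = {                 # all values are illustrative assumptions
    "quality_score": 7.0,      # InternVL3.5-38B holistic quality
    "face_visibility": 0.5,    # face-detector visibility score
    "subject_occupancy": 0.2,  # fraction of frame covered by the subject
    "clip_score": 0.28,        # prompt-image CLIPScore consistency
    "min_resolution": 512,     # shorter side, object images
}

def keep(sample):
    """sample: dict with the scored attributes above plus (h, w)."""
    return (
        sample["quality_score"] >= THRESHOLDS["quality_score"]
        and sample["face_visibility"] >= THRESHOLDS["face_visibility"]
        and sample["subject_occupancy"] >= THRESHOLDS["subject_occupancy"]
        and sample["clip_score"] >= THRESHOLDS["clip_score"]
        and min(sample["h"], sample["w"]) >= THRESHOLDS["min_resolution"]
    )

good = {"quality_score": 8.1, "face_visibility": 0.9, "subject_occupancy": 0.4,
        "clip_score": 0.31, "h": 768, "w": 1024}
```

A sample survives only if it clears every gate, which is why the pipeline is described as stringent: each criterion independently removes borderline data.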
4. Training Regimen and Loss Formulations
Skywork UniPic 3.0 adopts a multi-stage, diffusion-based optimization pipeline:
- Pre-Training (Multi-Task Diffusion): Optimizes the MMDiT backbone with the flow-matching loss
  $$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\big[\,\|v_\theta(x_t, t, c) - (\epsilon - x_0)\|_2^2\,\big],$$
  where $x_t = (1-t)\,x_0 + t\,\epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.
- Continuous-Time Consistency Tuning: Applies a consistency loss that constrains points along the same probability-flow trajectory to map to the same endpoint, facilitating a direct jump from the noise prior $x_1$ to the sample $x_0$ for efficient sampling.
- Distribution Matching Distillation: Distills the student model to match a multi-step teacher using a reverse-KL-based loss, whose gradient is estimated from the score difference between the teacher and an auxiliary "fake" score network, the latter implemented as a LoRA-adapter network derived from the teacher parameters.
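The flow-matching objective of the pre-training stage can be demonstrated with a toy example: a predictor regresses the velocity target $\epsilon - x_0$ at the interpolated point $x_t = (1-t)\,x_0 + t\,\epsilon$. The linear map `W` stands in for the MMDiT backbone; all shapes are illustrative assumptions.

```python
import numpy as np

# Toy illustration of the flow-matching pre-training objective: a predictor
# v_theta regresses the velocity target (eps - x0) at the interpolated point
# x_t = (1 - t) * x0 + t * eps. The linear map W is a stand-in for the MMDiT
# backbone; all shapes are illustrative.
rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))                        # toy velocity-predictor parameters

def v_theta(x_t):
    return x_t @ W

def flow_matching_loss(x0):
    t = rng.uniform(size=(x0.shape[0], 1))  # one timestep per sample
    eps = rng.standard_normal(x0.shape)
    x_t = (1 - t) * x0 + t * eps            # linear interpolation path
    target = eps - x0                       # flow-matching velocity target
    return float(np.mean((v_theta(x_t) - target) ** 2))

loss = flow_matching_loss(rng.standard_normal((32, d)))
```

In the real system this regression is performed jointly over editing and composition data, which is what makes the pre-training multi-task.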
5. Accelerated Inference Techniques
After consistency tuning and distribution-matching distillation, inference is substantially accelerated:
- The trajectory mapping mechanism enables the model to move efficiently from the noise prior $x_1 \sim \mathcal{N}(0, I)$ to the sample $x_0$.
- Post-distillation, high-quality generations are achievable in only 8 function evaluations, a substantial sampling speedup over typical ~100-step diffusion pipelines.
- Distribution matching ensures fidelity of the accelerated (student) sampler relative to the full-step (teacher) output distribution, maintaining output quality.
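Few-step inference of this kind amounts to a short Euler integration of the learned velocity field from $t=1$ (noise) to $t=0$ (sample). The sketch below uses a closed-form Gaussian-to-point velocity field as a stand-in for the distilled student; it is an illustration of the 8-evaluation schedule, not the paper's model.

```python
import numpy as np

# Hedged sketch of few-step inference: an 8-evaluation Euler integration of a
# velocity field from the noise prior (t = 1) to the sample (t = 0). The
# closed-form field below stands in for the distilled student.
rng = np.random.default_rng(0)

def v_theta(x, t, mu):
    # Marginal velocity for the linear path x_t = (1 - t) * mu + t * eps,
    # which transports N(0, I) at t = 1 onto the point mu at t = 0.
    return (x - mu) / t

def sample(shape, mu, nfe=8):
    x = rng.standard_normal(shape)          # x_1 ~ N(0, I)
    ts = np.linspace(1.0, 0.0, nfe + 1)     # 8 Euler steps => 8 evaluations
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * v_theta(x, t0, mu)
    return x

out = sample((16, 4), mu=3.0)               # converges onto the target mean
```

Consistency tuning is what lets such coarse step schedules remain accurate: the network is trained so that large jumps along the trajectory land near the true endpoint.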
6. Quantitative and Qualitative Performance
Skywork UniPic 3.0 achieves leading results across multiple benchmarks. Single-image editing (ImgEdit-Bench and GEdit-Bench scores; higher is better):
| Model | ImgEdit-Bench | GEdit-Bench |
|---|---|---|
| Qwen-Image-Edit | 4.25 | 7.88 |
| Nano-Banana | 4.22 | 7.59 |
| Seedream 4.0 | 4.11 | 7.92 |
| UniPic 3.0 | 4.35 | 7.79 |
Multi-image composition (scores by number of reference inputs; higher is better):

| Model | 2–3 Inputs | 4–6 Inputs | Overall |
|---|---|---|---|
| Qwen-Image-Edit | 0.7705 | 0.4793 | 0.6249 |
| Nano-Banana | 0.7982 | 0.6466 | 0.7224 |
| Seedream 4.0 | 0.7997 | 0.6197 | 0.7088 |
| UniPic 3.0 | 0.8214 | 0.6296 | 0.7255 |
Qualitative analysis highlights advances in natural occlusion, facial consistency, and instruction fidelity, particularly for complex HOI tasks.
7. Dataset and Code Availability
The full codebase, trained models, and MultiCom-Bench dataset are publicly released at https://skywork-unipic-v3.github.io. The dataset includes curated HOI triplet compositions, open-domain multi-image composition data, and single-image editing data, supporting further research in unified multimodal image synthesis and compositional generative modeling (Wei et al., 22 Jan 2026).