Skywork UniPic 3.0: Unified Multimodal Framework

Updated 29 January 2026
  • Skywork UniPic 3.0 is a unified multimodal framework that supports both single-image editing and multi-image composition, emphasizing Human-Object Interaction (HOI) synthesis.
  • It integrates vision-language encoding, image tokenization, and a diffusion-based backbone under a fixed global pixel budget to handle 1–6 input images efficiently.
  • The framework achieves state-of-the-art results on editing and composition benchmarks through rigorous data curation, multi-stage training, and accelerated inference requiring only 8 function evaluations.

Skywork UniPic 3.0 is a unified multimodal generative framework designed for both single-image editing and multi-image composition, with an emphasis on Human-Object Interaction (HOI)-centric synthesis. It supports an arbitrary number (1–6) and resolution of input images under a shared pixel budget, outputs high-fidelity compositions at target resolutions, and achieves state-of-the-art performance on benchmarks by recasting composition as a sequence-modeling task with efficient inference (Wei et al., 22 Jan 2026).

1. Architecture and Unified Sequence Construction

Skywork UniPic 3.0 integrates four subsystems to address the complexity of compositional image synthesis across modalities. The framework consists of:

  • Vision–Language Encoder: Utilizes Qwen2.5-VL to encode user instructions alongside metadata (e.g., capture resolution, shape descriptors).
  • Image Tokenizer: Employs a VAE encoder $f_{\mathrm{vae\text{-}enc}}$ mapping each image $I$ to a latent tensor $z \in \mathbb{R}^{1 \times C \times H' \times W'}$.
  • Patch-Wise Packing: Latents are transformed into sequences of patch tokens by $s = \mathrm{pack}(z) \in \mathbb{R}^{N \times D}$, with $N = H'W'/4$ and $D = 4C$.
  • Backbone Diffusion Model: An MMDiT (multimodal diffusion transformer) that operates over a unified sequence of latent tokens.

For processing, $K$ reference images $\{I_k\}_{k=1}^K$ and the output image $O$ are packed as a concatenated sequence $S = [s_O \,\|\, s_1 \,\|\, s_2 \,\|\, \cdots \,\|\, s_K] \in \mathbb{R}^{(N_O + \sum_{k=1}^K N_k) \times D}$, with a parallel sequence of shape descriptors $\mathcal{H} = \{h_O, h_1, \dots, h_K\}$, where each $h_i = (H'_i, W'_i)$.
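Assuming 2×2 spatial patching (the only scheme consistent with $N = H'W'/4$ and $D = 4C$), the packing step can be sketched as follows; the paper's exact patch ordering may differ:

```python
import torch

def pack(z: torch.Tensor) -> torch.Tensor:
    """Turn a VAE latent (1, C, H', W') into patch tokens (N, D).

    Sketch assuming 2x2 patches, which matches N = H'*W'/4 and D = 4*C.
    """
    _, C, Hp, Wp = z.shape
    # Split the spatial grid into 2x2 patches ...
    z = z.reshape(1, C, Hp // 2, 2, Wp // 2, 2)
    # ... then flatten each patch (with its channels) into one D-dim token.
    return z.permute(0, 2, 4, 1, 3, 5).reshape((Hp // 2) * (Wp // 2), 4 * C)
```

Concatenating `pack(z_O)` with each reference's `pack(z_k)` along the token axis then yields the unified sequence $S$ of length $N_O + \sum_k N_k$, with the shape descriptors recording each $(H'_i, W'_i)$ for un-packing.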

To preserve computational tractability, the model enforces a global pixel budget $\sum_{i \in \{O, 1, \dots, K\}} H_i W_i \leq 1024^2$, ensuring that the combined input-output resolution remains bounded.
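The budget can be enforced with a simple pre-resize pass. The helper below is a hypothetical illustration: the paper specifies only the $1024^2$ bound, not how images are rescaled to meet it.

```python
import math

def fit_to_budget(shapes, budget=1024 ** 2):
    """Uniformly downscale (H, W) pairs until their total pixel count
    fits the global budget. Hypothetical helper for illustration."""
    total = sum(h * w for h, w in shapes)
    if total <= budget:
        return list(shapes)
    r = math.sqrt(budget / total)  # area scales with r**2
    return [(max(1, int(h * r)), max(1, int(w * r))) for h, w in shapes]
```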

2. Sequence-Modeling Paradigm for Unified Tasks

Both single-image editing and multi-image composition are formulated uniformly as sequence-to-sequence conditional synthesis. Context tokens $\mathbf{X} = \{s_1, \dots, s_K\}$ (representing references) condition the generation of output tokens $\mathbf{Y} = s_O$: $p_\theta(\mathbf{Y} \mid \mathbf{X}) = \prod_{t=1}^T p_\theta(y_t \mid y_{<t}, \mathbf{X})$, with the cross-entropy loss objective

$$\mathcal{L}_{\mathrm{CE}} = -\mathbb{E}_{(\mathbf{X},\mathbf{Y})} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, \mathbf{X})$$

Variable-length reference handling is supported natively by concatenation, subject to the pixel budget and accompanied by explicit shape descriptors that maintain token-to-patch correspondence during generation and reconstruction.
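Treating the packed tokens as a discrete sequence for illustration, the objective reduces to next-token cross-entropy computed only over output positions, with reference tokens serving purely as context; the shapes below are assumptions:

```python
import torch
import torch.nn.functional as F

def sequence_ce_loss(logits, targets, output_mask):
    """L_CE over a unified sequence: logits (T, V), targets (T,),
    output_mask (T,) True where the token belongs to the output s_O."""
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return per_token[output_mask].mean()  # average only over Y positions
```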

3. Data Curation, Filtering, and Synthesis Pipeline

A stringent and automated three-stage data pipeline is central to Skywork UniPic 3.0’s empirical performance, with an explicit focus on HOI-centric compositions:

  1. Data Collection:
    • Person images: 18,000 from CC12M, captioned with InternVL3.5-38B for fine-grained pose and apparel labeling.
    • Object categories: 300 human-interactive classes generated with GPT-4o; 5,000 text prompts per class rendered via Qwen-Image, yielding 150,000 images.
  2. Data Filtering:
    • InternVL3.5-38B: holistic quality scoring in $[0, 100]$; retain instances with scores $\geq 75$.
    • Face/body detectors: threshold face visibility ($\geq 90\%$) and subject occupancy ($\geq 60\%$).
    • CLIPScore and resolution: prompt-image consistency above a fixed threshold; object images at least $768^2$ pixels.
  3. Synthesis:
    • For $K \in [2, 6]$, consider all valid image combinations, subject to a conflict matrix (e.g., one person cannot hold multiple guitars).
    • Compose prompts using InternVL3.5; for $K \leq 3$, targets are synthesized by NanoBanana; for $K > 3$, by Seedream 4.0.
    • Retain only samples passing stringent aesthetic and identity checks.

This procedure results in 215,000 HOI triplets, with the full dataset comprising approximately 700,000 high-quality sample pairs.
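The stage-2 thresholds compose into a single keep/reject predicate. In the sketch below, only the numeric thresholds come from the pipeline description; the field names and the CLIPScore cutoff value are illustrative assumptions:

```python
def passes_filters(sample: dict, clip_threshold: float = 0.3) -> bool:
    """Stage-2 filtering sketch. Thresholds follow the pipeline description;
    field names and the CLIPScore cutoff are assumptions."""
    return (
        sample["quality_score"] >= 75            # InternVL3.5-38B score in [0, 100]
        and sample["face_visibility"] >= 0.90    # face-detector visibility
        and sample["subject_occupancy"] >= 0.60  # body/subject occupancy
        and sample["clip_score"] >= clip_threshold
        and sample["height"] * sample["width"] >= 768 ** 2
    )
```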

4. Training Regimen and Loss Formulations

Skywork UniPic 3.0 adopts a multi-stage, diffusion-based optimization pipeline:

  • Pre-Training (Multi-Task Diffusion): Optimizes the MMDiT backbone with the flow-matching loss

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{z_0 \sim p_{\mathrm{data}},\, t \sim [0,1]} \left\| F_\theta(z_t, t) - \frac{dz_t}{dt} \right\|_2^2,$$

where $z_t = (1-t)z_0 + t\epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.
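A minimal sketch of this objective, assuming `model(z_t, t)` predicts the velocity $dz_t/dt = \epsilon - z_0$ of the linear interpolation path:

```python
import torch

def flow_matching_loss(model, z0):
    """L_FM sketch: regress the model's velocity prediction onto eps - z0,
    the derivative of z_t = (1 - t) * z0 + t * eps along the linear path."""
    eps = torch.randn_like(z0)
    t = torch.rand(z0.shape[0], *([1] * (z0.dim() - 1)))  # per-sample t in [0, 1)
    z_t = (1 - t) * z0 + t * eps
    target = eps - z0  # d z_t / d t
    return ((model(z_t, t) - target) ** 2).mean()
```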

  • Continuous-Time Consistency Tuning: Applies a consistency loss for trajectory mapping

$$\mathcal{L}_{\mathrm{CM}}(\theta) = \mathbb{E}_{z_t, t} \left\| F_\theta(z_t, t) + z_t + F_{\theta^-}(z_t, t) + t\, \partial_t F_{\theta^-}(z_t, t) \right\|_2^2,$$

facilitating direct mapping from $z_t$ to $z_0$ for efficient sampling.
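Under the interpolation $z_t = (1-t)z_0 + t\epsilon$ with velocity $\epsilon - z_0$, one consistent trajectory mapping is $f(z_t, t) = z_t - t\,F(z_t, t)$, which recovers $z_0$ exactly when $F$ is the true velocity. This is a hedged sketch of the idea, not necessarily the paper's exact parameterization:

```python
import torch

def to_clean(z_t, t, F):
    """Map a noisy latent z_t straight to an estimate of z0.
    With F(z_t, t) = eps - z0 (the exact velocity), z_t - t * F = z0."""
    return z_t - t * F(z_t, t)
```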

  • Distribution Matching Distillation: Distills the student model to match a multi-step teacher, using a reverse-KL-based loss:

$$\mathcal{L}_{\mathrm{DMD}}(\theta) = \frac{1}{2} \mathbb{E}_{z_t, t} \left\| F_\theta(z_t, t) - \left[ F_{\theta^-}(z_t, t) - \frac{t}{1-t} \left( \nabla_{z_t} \log p_{\mathrm{teacher}}(z_t) - \nabla_{z_t} \log p_\phi(z_t) \right) \right] \right\|_2^2,$$

with $\phi$ parameterizing a LoRA-adapter network derived from the teacher parameters.
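The bracketed regression target in $\mathcal{L}_{\mathrm{DMD}}$ can be written directly from the formula; the callables below (EMA model, teacher score, fake score) stand in for the networks the text describes:

```python
def dmd_target(F_ema, score_teacher, score_fake, z_t, t):
    """Regression target of the DMD loss: the EMA prediction corrected by
    the (teacher - fake) score difference, weighted by t / (1 - t)."""
    return F_ema(z_t, t) - t / (1 - t) * (score_teacher(z_t) - score_fake(z_t))
```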

5. Accelerated Inference Techniques

After consistency tuning and distribution-matching distillation, inference is substantially accelerated:

  • The trajectory-mapping mechanism enables the model to move efficiently from the noise prior ($t = 1$) to sample space ($t = 0$).
  • Post-distillation, high-quality generations are achievable in only 8 function evaluations, a measured $12.5\times$ sampling speedup over typical ~100-step diffusion pipelines.
  • Distribution matching ensures fidelity of the accelerated (student) sampler relative to the full-step (teacher) output distribution, maintaining output quality.
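An 8-NFE sampler can be sketched as plain Euler integration of the learned velocity field from $t = 1$ (noise) down to $t = 0$ (sample); the consistency-tuned model may instead take larger or direct jumps in practice:

```python
import torch

@torch.no_grad()
def sample(model, shape, nfe=8):
    """Few-step Euler sampler sketch: integrate dz/dt = model(z, t)
    from the noise prior at t = 1 down to t = 0 in `nfe` steps."""
    z = torch.randn(shape)
    ts = torch.linspace(1.0, 0.0, nfe + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        z = z + (t_next - t_cur) * model(z, t_cur)
    return z
```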

6. Quantitative and Qualitative Performance

Skywork UniPic 3.0 achieves leading results across multiple benchmarks:

Single-image editing:

Model             ImgEdit-Bench   GEdit-Bench
Qwen-Image-Edit   4.25            7.88
Nano-Banana       4.22            7.59
Seedream 4.0      4.11            7.92
UniPic 3.0        4.35            7.79

Multi-image composition (by number of reference inputs):

Model             2–3 Inputs   4–6 Inputs   Overall
Qwen-Image-Edit   0.7705       0.4793       0.6249
Nano-Banana       0.7982       0.6466       0.7224
Seedream 4.0      0.7997       0.6197       0.7088
UniPic 3.0        0.8214       0.6296       0.7255

Qualitative analysis highlights advances in natural occlusion, facial consistency, and instruction fidelity, particularly for complex HOI tasks.

7. Dataset and Code Availability

The full codebase, trained models, and the MultiCom-Bench dataset are publicly released at https://skywork-unipic-v3.github.io. The dataset includes curated HOI triplet compositions, open-domain multi-image composition data, and single-image editing data, supporting further research in unified multimodal image synthesis and compositional generative modeling (Wei et al., 22 Jan 2026).
