Skywork UniPic 3.0: Unified Multimodal Framework

Updated 29 January 2026
  • Skywork UniPic 3.0 is a unified multimodal framework that supports both single-image editing and multi-image composition, emphasizing Human-Object Interaction (HOI) synthesis.
  • It integrates vision-language encoding, image tokenization, and a diffusion-based backbone under a fixed global pixel budget to handle 1–6 input images efficiently.
  • The framework achieves state-of-the-art results on editing and composition benchmarks through rigorous data curation, multi-stage training, and accelerated inference requiring only 8 function evaluations.

Skywork UniPic 3.0 is a unified multimodal generative framework designed for both single-image editing and multi-image composition, with an emphasis on Human-Object Interaction (HOI)-centric synthesis. It supports an arbitrary number (1–6) and resolution of input images under a shared pixel budget, outputs high-fidelity compositions at target resolutions, and achieves state-of-the-art performance on benchmarks by recasting composition as a sequence-modeling task with efficient inference (Wei et al., 22 Jan 2026).

1. Architecture and Unified Sequence Construction

Skywork UniPic 3.0 integrates four subsystems to address the complexity of compositional image synthesis across modalities. The framework consists of:

  • Vision–Language Encoder: Utilizes Qwen2.5-VL to encode user instructions alongside metadata (e.g., capture resolution, shape descriptors).
  • Image Tokenizer: Employs a VAE encoder $f_{\mathrm{vae\text{-}enc}}$ mapping each image $I$ to a latent tensor $z \in \mathbb{R}^{1 \times C \times H' \times W'}$.
  • Patch-Wise Packing: Latents are transformed into sequences of patch tokens by $s = \mathrm{pack}(z) \in \mathbb{R}^{N \times D}$, with $N = H'W'/4$ and $D = 4C$.
  • Backbone Diffusion Model: An MMDiT (multimodal diffusion transformer) that operates over a unified sequence of latent tokens.

For processing, $K$ reference images $\{I_k\}_{k=1}^K$ and the output image $O$ are packed as a concatenated sequence $S = [s_O \,\|\, s_1 \,\|\, s_2 \,\|\, \cdots \,\|\, s_K] \in \mathbb{R}^{(N_O + \sum_{k=1}^K N_k) \times D}$, with a parallel sequence of shape descriptors $\mathcal{H} = \{h_O, h_1, \dots, h_K\}$, where each $h_i = (H'_i, W'_i)$.
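Assuming 2×2 spatial patching (the only scheme consistent with $N = H'W'/4$ and $D = 4C$), the packing step can be sketched as follows; the paper's exact patch ordering may differ:

```python
import torch

def pack(z: torch.Tensor) -> torch.Tensor:
    """Turn a VAE latent (1, C, H', W') into patch tokens (N, D).

    Sketch assuming 2x2 patches, which matches N = H'*W'/4 and D = 4*C.
    """
    _, C, Hp, Wp = z.shape
    # Split the spatial grid into 2x2 patches ...
    z = z.reshape(1, C, Hp // 2, 2, Wp // 2, 2)
    # ... then flatten each patch (with its channels) into one D-dim token.
    return z.permute(0, 2, 4, 1, 3, 5).reshape((Hp // 2) * (Wp // 2), 4 * C)
```

Concatenating `pack(z_O)` with each reference's `pack(z_k)` along the token axis then yields the unified sequence $S$ of length $N_O + \sum_k N_k$, with the shape descriptors recording each $(H'_i, W'_i)$ for un-packing.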

To preserve computational tractability, the model enforces a global pixel budget $\sum_{i \in \{O, 1, \dots, K\}} H_i W_i \leq 1024^2$, ensuring that the combined input-output resolution remains bounded.
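The budget can be enforced with a simple pre-resize pass. The helper below is a hypothetical illustration: the paper specifies only the $1024^2$ bound, not how images are rescaled to meet it.

```python
import math

def fit_to_budget(shapes, budget=1024 ** 2):
    """Uniformly downscale (H, W) pairs until their total pixel count
    fits the global budget. Hypothetical helper for illustration."""
    total = sum(h * w for h, w in shapes)
    if total <= budget:
        return list(shapes)
    r = math.sqrt(budget / total)  # area scales with r**2
    return [(max(1, int(h * r)), max(1, int(w * r))) for h, w in shapes]
```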

2. Sequence-Modeling Paradigm for Unified Tasks

Both single-image editing and multi-image composition are formulated uniformly as sequence-to-sequence conditional synthesis. Context tokens $\mathbf{X} = \{s_1, \dots, s_K\}$ (representing references) condition the generation of output tokens $\mathbf{Y} = s_O$: $p_\theta(\mathbf{Y} \mid \mathbf{X}) = \prod_{t=1}^T p_\theta(y_t \mid y_{<t}, \mathbf{X})$, with the cross-entropy loss objective

$$\mathcal{L}_{\mathrm{CE}} = -\mathbb{E}_{(\mathbf{X},\mathbf{Y})} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, \mathbf{X})$$

Variable-length reference handling is supported natively by concatenation, subject to the pixel budget and accompanied by explicit shape descriptors that maintain token-to-patch correspondence during generation and reconstruction.
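Treating the packed tokens as a discrete sequence for illustration, the objective reduces to next-token cross-entropy computed only over output positions, with reference tokens serving purely as context; the shapes below are assumptions:

```python
import torch
import torch.nn.functional as F

def sequence_ce_loss(logits, targets, output_mask):
    """L_CE over a unified sequence: logits (T, V), targets (T,),
    output_mask (T,) True where the token belongs to the output s_O."""
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return per_token[output_mask].mean()  # average only over Y positions
```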

3. Data Curation, Filtering, and Synthesis Pipeline

A stringent and automated three-stage data pipeline is central to Skywork UniPic 3.0’s empirical performance, with an explicit focus on HOI-centric compositions:

  1. Data Collection:
    • Person images: 18,000 from CC12M, captioned with InternVL3.5-38B for fine-grained pose and apparel labeling.
    • Object categories: 300 human-interactive classes generated with GPT-4o; 5,000 text prompts per class rendered via Qwen-Image, yielding 150,000 images.
  2. Data Filtering:
    • InternVL3.5-38B: holistic quality scoring in $[0, 100]$; retain instances with scores $\geq 75$.
    • Face/body detectors: threshold face visibility ($\geq 90\%$) and subject occupancy ($\geq 60\%$).
    • CLIPScore and resolution: prompt-image consistency above a fixed threshold; object images at least $768^2$ pixels.
  3. Synthesis:
    • For $K \in [2, 6]$, consider all valid image combinations, subject to a conflict matrix (e.g., one person cannot hold multiple guitars).
    • Compose prompts using InternVL3.5; for $K \leq 3$, targets are synthesized by NanoBanana; for $K > 3$, by Seedream 4.0.
    • Retain only samples passing stringent aesthetic and identity checks.

This procedure results in 215,000 HOI triplets, with the full dataset comprising approximately 700,000 high-quality sample pairs.
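The stage-2 thresholds compose into a single keep/reject predicate. In the sketch below, only the numeric thresholds come from the pipeline description; the field names and the CLIPScore cutoff value are illustrative assumptions:

```python
def passes_filters(sample: dict, clip_threshold: float = 0.3) -> bool:
    """Stage-2 filtering sketch. Thresholds follow the pipeline description;
    field names and the CLIPScore cutoff are assumptions."""
    return (
        sample["quality_score"] >= 75            # InternVL3.5-38B score in [0, 100]
        and sample["face_visibility"] >= 0.90    # face-detector visibility
        and sample["subject_occupancy"] >= 0.60  # body/subject occupancy
        and sample["clip_score"] >= clip_threshold
        and sample["height"] * sample["width"] >= 768 ** 2
    )
```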

4. Training Regimen and Loss Formulations

Skywork UniPic 3.0 adopts a multi-stage, diffusion-based optimization pipeline:

  • Pre-Training (Multi-Task Diffusion): Optimizes the MMDiT backbone with the flow-matching loss

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{z_0 \sim p_{\mathrm{data}},\, t \sim [0,1]} \left\| F_\theta(z_t, t) - \frac{dz_t}{dt} \right\|_2^2,$$

where $z_t = (1-t)z_0 + t\epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.
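A minimal sketch of this objective, assuming `model(z_t, t)` predicts the velocity $dz_t/dt = \epsilon - z_0$ of the linear interpolation path:

```python
import torch

def flow_matching_loss(model, z0):
    """L_FM sketch: regress the model's velocity prediction onto eps - z0,
    the derivative of z_t = (1 - t) * z0 + t * eps along the linear path."""
    eps = torch.randn_like(z0)
    t = torch.rand(z0.shape[0], *([1] * (z0.dim() - 1)))  # per-sample t in [0, 1)
    z_t = (1 - t) * z0 + t * eps
    target = eps - z0  # d z_t / d t
    return ((model(z_t, t) - target) ** 2).mean()
```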

  • Continuous-Time Consistency Tuning: Applies a consistency loss for trajectory mapping

$$\mathcal{L}_{\mathrm{CM}}(\theta) = \mathbb{E}_{z_t, t} \left\| F_\theta(z_t, t) + z_t + F_{\theta^-}(z_t, t) + t\, \partial_t F_{\theta^-}(z_t, t) \right\|_2^2,$$

facilitating direct mapping from $z_t$ to $z_0$ for efficient sampling.
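Under the interpolation $z_t = (1-t)z_0 + t\epsilon$ with velocity $\epsilon - z_0$, one consistent trajectory mapping is $f(z_t, t) = z_t - t\,F(z_t, t)$, which recovers $z_0$ exactly when $F$ is the true velocity. This is a hedged sketch of the idea, not necessarily the paper's exact parameterization:

```python
import torch

def to_clean(z_t, t, F):
    """Map a noisy latent z_t straight to an estimate of z0.
    With F(z_t, t) = eps - z0 (the exact velocity), z_t - t * F = z0."""
    return z_t - t * F(z_t, t)
```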

  • Distribution Matching Distillation: Distills the student model to match a multi-step teacher, using a reverse-KL-based loss:

$$\mathcal{L}_{\mathrm{DMD}}(\theta) = \frac{1}{2} \mathbb{E}_{z_t, t} \left\| F_\theta(z_t, t) - \left[ F_{\theta^-}(z_t, t) - \frac{t}{1-t} \left( \nabla_{z_t} \log p_{\mathrm{teacher}}(z_t) - \nabla_{z_t} \log p_\phi(z_t) \right) \right] \right\|_2^2,$$

with $\phi$ parameterizing a LoRA-adapter network derived from the teacher parameters.
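The bracketed regression target in $\mathcal{L}_{\mathrm{DMD}}$ can be written directly from the formula; the callables below (EMA model, teacher score, fake score) stand in for the networks the text describes:

```python
def dmd_target(F_ema, score_teacher, score_fake, z_t, t):
    """Regression target of the DMD loss: the EMA prediction corrected by
    the (teacher - fake) score difference, weighted by t / (1 - t)."""
    return F_ema(z_t, t) - t / (1 - t) * (score_teacher(z_t) - score_fake(z_t))
```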

5. Accelerated Inference Techniques

After consistency tuning and distribution-matching distillation, inference is substantially accelerated:

  • The trajectory-mapping mechanism enables the model to move efficiently from the noise prior ($t = 1$) to sample space ($t = 0$).
  • Post-distillation, high-quality generations are achievable in only 8 function evaluations, a measured $12.5\times$ sampling speedup over typical ~100-step diffusion pipelines.
  • Distribution matching ensures fidelity of the accelerated (student) sampler relative to the full-step (teacher) output distribution, maintaining output quality.
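An 8-NFE sampler can be sketched as plain Euler integration of the learned velocity field from $t = 1$ (noise) down to $t = 0$ (sample); the consistency-tuned model may instead take larger or direct jumps in practice:

```python
import torch

@torch.no_grad()
def sample(model, shape, nfe=8):
    """Few-step Euler sampler sketch: integrate dz/dt = model(z, t)
    from the noise prior at t = 1 down to t = 0 in `nfe` steps."""
    z = torch.randn(shape)
    ts = torch.linspace(1.0, 0.0, nfe + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        z = z + (t_next - t_cur) * model(z, t_cur)
    return z
```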

6. Quantitative and Qualitative Performance

Skywork UniPic 3.0 achieves leading results across multiple benchmarks:

Single-image editing:

Model             ImgEdit-Bench   GEdit-Bench
Qwen-Image-Edit   4.25            7.88
Nano-Banana       4.22            7.59
Seedream 4.0      4.11            7.92
UniPic 3.0        4.35            7.79

Multi-image composition (by number of reference inputs):

Model             2–3 Inputs   4–6 Inputs   Overall
Qwen-Image-Edit   0.7705       0.4793       0.6249
Nano-Banana       0.7982       0.6466       0.7224
Seedream 4.0      0.7997       0.6197       0.7088
UniPic 3.0        0.8214       0.6296       0.7255

Qualitative analysis highlights advances in natural occlusion, facial consistency, and instruction fidelity, particularly for complex HOI tasks.

7. Dataset and Code Availability

The full codebase, trained models, and the MultiCom-Bench dataset are publicly released at https://skywork-unipic-v3.github.io. The dataset includes curated HOI triplet compositions, open-domain multi-image composition data, and single-image editing data, supporting further research in unified multimodal image synthesis and compositional generative modeling (Wei et al., 22 Jan 2026).
