Instance-Aware Colorization Methods
- Foundational work introduced a multi-stream fusion architecture that overcomes figure-ground confusion and semantic misalignment in multi-object scenes (Su et al., 2020).
- Instance-aware colorization leverages object detection and per-instance processing to boost boundary accuracy and improve metrics like PSNR and SSIM.
- Recent advancements integrate diffusion models, masked attention, and cross-modal guidance to achieve spatially coherent, semantically aligned, and temporally consistent colorization.
Instance-aware colorization refers to methods for automated colorization that explicitly reason over, segment, and differentially process individual object instances in a grayscale or line-art image (or video), thereby overcoming the limitations of global, context-only approaches. This paradigm achieves spatially coherent, semantically meaningful, and artifact-free colorization, particularly in scenes with multiple, overlapping, or complex objects. Recent advances leverage diffusion models, cross-modal encoders, attention modules, and fusion strategies to realize precise, instance-conditioned color transfer under both language and visual (reference) guidance across both images and video sequences (Su et al., 2020, Chang et al., 2023, Chang et al., 2024, Zhang et al., 21 Mar 2025, An et al., 13 May 2025).
1. Problem Motivation and Core Challenges
Traditional automatic colorization methods—whether regression-based or using GANs and diffusion models—operate on entire images, typically inferring global color statistics or leveraging overall learned color priors. While effective for background and large-scale structure, such approaches suffer from two fundamental problems in multi-object scenes:
- Figure-ground confusion: Without explicit separation, the model may blend colors between foreground and background, or across adjacent objects, resulting in color bleeding or color binding errors.
- Semantic misalignment: Per-object color hints (e.g., “the vase is blue, the flowers are red”) cannot be accurately localized without explicit instance-aware mechanisms; generic global approaches ignore both object boundaries and the one-to-many mapping between objects and plausible color assignments.
These limitations are exacerbated in scenarios with ambiguous or partial guidance, multiple small or overlapping objects, or when creative or user-specified coloring is desired. Addressing them requires explicit object instance reasoning, mask-based localization, and advanced attention mechanisms to ensure color is correctly mapped at the instance level (Su et al., 2020, Chang et al., 2023, An et al., 13 May 2025).
2. Classical and Foundational Approaches
The seminal instance-aware colorization method employs a multi-stream network with explicit figure-ground separation (Su et al., 2020). The key components are:
- Object Detection and Instance Cropping: An off-the-shelf object detector such as Mask R-CNN extracts bounding boxes for the detected object instances. Each instance is cropped from the grayscale input and resized to a canonical size.
- Parallel Colorization Networks: Two U-Net backbones are used: one operating on the full image, one on each object crop. These produce scene-level and object-level feature maps, respectively, at every layer.
- Hierarchical Feature Fusion: A fusion module merges per-instance and global features at every decoder layer via spatial softmax weighting. For each location $x$ at layer $j$, the fused feature is
  $$\tilde{f}^j(x) = W_G(x)\, f_G^j(x) + \sum_{i=1}^{N} W_i(x)\, f_i^j(x),$$
  where $f_G^j$ is the full-image feature, $f_i^j$ is the $i$-th (resized) instance feature, and the weight maps $W_G, W_i$ are adaptively learned and softmax-normalized across all $N+1$ sources at each pixel.
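This per-pixel softmax fusion can be sketched in a few lines of NumPy (a minimal illustration; the function name, tensor layout, and the assumption that instance features are already padded back to full-image coordinates are ours, not the paper's implementation):

```python
import numpy as np

def softmax_fuse(global_feat, inst_feats, global_logits, inst_logits):
    """Fuse a global feature map with per-instance feature maps via
    per-pixel softmax weighting over all sources.

    global_feat:   (C, H, W) full-image features
    inst_feats:    list of (C, H, W) instance features, assumed resized
                   and zero-padded back to full-image coordinates
    global_logits: (H, W) unnormalized weight map for the global stream
    inst_logits:   list of (H, W) unnormalized weight maps, one per instance
    """
    logits = np.stack([global_logits] + inst_logits)           # (1+N, H, W)
    weights = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)              # softmax across sources
    feats = np.stack([global_feat] + inst_feats)               # (1+N, C, H, W)
    return (weights[:, None] * feats).sum(axis=0)              # (C, H, W)
```

With equal logits this reduces to a plain average of the streams; training moves the logits so that each pixel favors whichever source (global or a particular instance) best explains it.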
Training proceeds in three sequential stages: the global network, the instance network, and the fusion module. Evaluation demonstrates substantial improvements in PSNR, SSIM, and LPIPS, especially for dense, multi-object scenes and at object boundaries. Subjective user studies confirm increased preference and realism over prior art (Su et al., 2020). This work established the fundamental gains of instance-aware modeling and identified fusion and object detection quality as critical performance factors.
3. Diffusion-Based Instance-Aware Colorization
Diffusion models have rapidly improved colorization quality via richer, multimodal color priors and robust denoising. Several contemporary systems extend this backbone with explicit instance conditioning:
3.1. Masked Attention and Instance Guidance in Images
MT-Color proposes two core modules within a diffusion U-Net framework for instance-level colorization with strong text-mask binding (An et al., 13 May 2025):
- Pixel-level Masked Cross-Attention: When fusing grayscale (ControlNet) features with colorization features in the latent space, attention is masked such that only pixels within the same object instance can attend to each other, thereby eliminating boundary bleeding:
  $$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)V, \qquad M_{pq} = \begin{cases} 0 & \text{if pixels } p, q \text{ lie in the same instance mask,} \\ -\infty & \text{otherwise.} \end{cases}$$
  This explicitly restricts information flow and alignment to within each mask.
- Instance Mask & Text Guidance via Masked Self-Attention: Each object instance $i$ is encoded as a combination of its segmentation mask $m_i$ and textual prompt, producing an instance embedding $e_i$. During self-attention, only latents and instance features belonging to the same mask are allowed to attend to one another, enforcing strict color binding between each instance and its description.
- Multi-Instance Sampling Strategy: Noise denoising proceeds in two steps—first, per-instance regions are processed separately (based on their individual masks and prompts), then the results are fused and global denoising is completed, boosting both boundary localization and textual alignment.
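The mask-restricted attention underlying these modules can be illustrated with a simplified single-head NumPy version (the `masked_attention` function and its signature are our sketch, not MT-Color's code; real implementations operate on batched multi-head latents):

```python
import numpy as np

def masked_attention(q, k, v, inst_ids):
    """Attention over flattened pixels, restricted so a query pixel
    only attends to key pixels inside the same instance mask.

    q, k, v:  (P, d) flattened pixel features
    inst_ids: (P,) integer instance label per pixel
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (P, P) raw attention logits
    same = inst_ids[:, None] == inst_ids[None, :]    # True only within one mask
    scores = np.where(same, scores, -1e9)            # mask out cross-instance pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ v
```

Because cross-instance logits are driven to effectively $-\infty$, each pixel's output is a convex combination of values from its own instance only, which is exactly what prevents color bleeding across mask boundaries.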
MT-Color leverages a novel dataset, GPT-Color, combining high-quality masks (from RAM) and per-instance text descriptions (from GPT-4/BLIP2), to enable these mechanisms. It achieves state-of-the-art results across FID, colorfulness, and perceptual quality. Ablation studies confirm that each module is essential for instance-aware performance (An et al., 13 May 2025).
3.2. Language-Guided Any-level Instance Control
L-CAD utilizes a latent diffusion backbone, a dedicated luminance encoder for grayscale input, and a cross-attention mechanism that fuses CLIP-based text embeddings with segmented latent features (Chang et al., 2023). For instance-aware colorization:
- Preliminary object masks are generated by a referring segmentation model (e.g., SAM) based on the textual description.
- During diffusion sampling, the model refines cross-attention maps at each layer by optimizing for agreement (via binary cross-entropy loss and gradient steps) with the estimated object masks.
- The model can flexibly respond to complete, partial, or scarce linguistic color hints, yielding plausible colorizations with precise per-object control and fallback to automatic mode for unmentioned regions.
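The agreement objective used to refine the cross-attention maps can be sketched as a binary cross-entropy between a token's attention map and the estimated object mask (an illustrative simplification of L-CAD's refinement step; `mask_agreement_loss` is our name, and real usage takes gradient steps on the latent during sampling):

```python
import numpy as np

def mask_agreement_loss(attn_map, mask, eps=1e-6):
    """Binary cross-entropy between a normalized cross-attention map
    for a color phrase and its estimated binary object mask.

    attn_map: (H, W) attention weights in [0, 1] for one text token
    mask:     (H, W) binary object mask from the segmentation model
    """
    a = np.clip(attn_map, eps, 1 - eps)  # avoid log(0)
    return float(-(mask * np.log(a) + (1 - mask) * np.log(1 - a)).mean())
```

The loss is near zero when attention concentrates inside the referred object's mask and grows as attention leaks outside it, so descending on it pulls each color phrase onto its object.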
Quantitative and AMT-based user studies demonstrate that these instance-aware strategies yield both higher textual alignment and subjectively preferred realism compared to both mask-free and generic language-only models (Chang et al., 2023).
4. Video Colorization and Temporal Instance Consistency
Instance-aware colorization of video sequences presents the additional challenge of temporal consistency—colored objects must remain identically colored and free of flicker despite motion, deformation, and scene transitions.
L-C4 addresses these by integrating a highly structured latent diffusion backbone with the following mechanisms (Chang et al., 2024):
- Cross-Modality Pre-Fusion Module (CMPF): Produces instance-aware token embeddings from text by masked cross-attention with the video features, zeroing out color words so noun tokens are visually grounded.
- Temporally Deformable Attention (TDA): Tracks moving objects by learning continuous spatial-temporal offsets, aggregating features along the movement trajectory for each object. This ensures each object’s color and style are preserved even as its position or shape changes between frames.
- Cross-Clip Fusion (CCF): At inference, aggregates feature priors from multiple overlapping temporal clips using weighted sums, propagating instance color over potentially long video intervals. This eliminates boundary artifacts and preserves long-term color stability.
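The cross-clip aggregation idea can be sketched as a weighted average of per-clip features in their overlap regions (a minimal illustration with uniform weights; `fuse_clips` and the flat-per-frame feature layout are our assumptions, whereas L-C4 uses learned weighted sums over richer features):

```python
import numpy as np

def fuse_clips(clip_feats, clip_starts, total_len, clip_len):
    """Blend per-clip feature sequences over a long video by averaging
    wherever overlapping temporal clips cover the same frame.

    clip_feats:  list of (clip_len, C) feature sequences, one per clip
    clip_starts: start frame index of each clip
    """
    C = clip_feats[0].shape[-1]
    acc = np.zeros((total_len, C))   # running feature sum per frame
    cnt = np.zeros((total_len, 1))   # number of clips covering each frame
    for feats, s in zip(clip_feats, clip_starts):
        acc[s:s + clip_len] += feats
        cnt[s:s + clip_len] += 1
    return acc / np.maximum(cnt, 1)  # average over covering clips
```

Because overlapping frames receive contributions from every clip that covers them, the blended features transition smoothly across clip boundaries instead of jumping, which is the mechanism behind the long-term color stability described above.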
No explicit adversarial or temporal loss is needed; temporal coherence emerges from the architecture. L-C4 supports arbitrary user-provided language descriptions and demonstrates superior semantic accuracy, creative expressiveness, and cross-frame consistency compared to prior exemplar- or post-processing-based methods (Chang et al., 2024).
5. Specialized Domains: Sketch and Line-Art Multi-Instance Colorization
The demands of colorization in domains such as anime, comics, or illustration echo those of natural images, but with unique constraints on edge precision, style fidelity, and artist workflow.
MagicColor extends latent diffusion models to line-art with multiple object references by (Zhang et al., 21 Mar 2025):
- Instance Guider (ICM): Encodes spatial and global features of each reference instance, aligns them to the corresponding mask in the target sketch, and injects these embeddings as control signals into the diffusion U-Net.
- Self-play Training: Simulates multi-instance training data from single-instance image sources by mask extraction, synthetic fusion, and augmentation.
- Edge-Weighted and Color Matching Losses: Edge re-weighting emphasizes boundary and structural consistency; a fine-grained color matching loss explicitly associates reference and predicted color at the pixel level via cosine-nearest neighbor matches.
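A cosine-nearest-neighbor color matching loss of the kind described above can be sketched as follows (an illustrative simplification; `color_match_loss` and the raw-color feature space are our assumptions, not MagicColor's exact formulation):

```python
import numpy as np

def color_match_loss(pred, ref):
    """For each predicted pixel, find the reference pixel with the
    highest cosine similarity and penalize their color distance.

    pred, ref: (P, 3) pixel color vectors
    """
    def normed(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sim = normed(pred) @ normed(ref).T   # (P_pred, P_ref) cosine similarity
    nn = sim.argmax(axis=-1)             # nearest reference pixel per prediction
    return float(np.abs(pred - ref[nn]).mean())
```

Matching by nearest neighbor rather than by fixed pixel position makes the loss tolerant to the spatial misalignment between a reference image and the target sketch, while still anchoring each predicted color to some reference color.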
Empirical results demonstrate high chromatic precision and instance consistency, with significantly lower FID and higher PSNR/SSIM than previous GAN- or diffusion-based anime colorization methods. Edge and color matching modules are found to be critical via ablation (Zhang et al., 21 Mar 2025).
6. Datasets, Evaluation, and Leading Results
Instance-aware colorization methods rely on datasets with precise instance masks and object-level descriptions:
- COCO-Stuff and GPT-Color: Provide large-scale ground truth for both segmentation and textual tags; GPT-Color improves instance-level textual alignment scores and enables higher colorfulness and perceptual scores (MUSIQ, TOPIQ_NR) relative to other benchmarks (An et al., 13 May 2025).
- Anime/Sketch Datasets: Combine manual curation, SAM-based mask extraction, and synthetic augmentation to provide multi-instance, reference-guided learning for line art (Zhang et al., 21 Mar 2025).
Evaluation is performed using a suite of metrics:
| Metric | Role | Typical Instance-Aware Leaders |
|---|---|---|
| FID (↓) | Distribution realism | MT-Color: 11.39 (An et al., 13 May 2025); MagicColor: 28.44 (Zhang et al., 21 Mar 2025) |
| PSNR (↑) | Pixel accuracy | MT-Color: 23.12 (An et al., 13 May 2025); MagicColor: 23.49 (Zhang et al., 21 Mar 2025); Su et al.: 28.34 (obj) (Su et al., 2020) |
| SSIM (↑) | Structural similarity | MT-Color: 0.8714 (An et al., 13 May 2025); MagicColor: 0.805 (Zhang et al., 21 Mar 2025); Su et al.: 0.929 (obj) (Su et al., 2020) |
| LPIPS (↓) | Perceptual distance | Su et al.: 0.115 (obj) (Su et al., 2020) |
| CLIP-score (↑) | Text-visual alignment | MT-Color: 0.2273 (instance) (An et al., 13 May 2025) |
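Of the metrics in the table, PSNR has a direct closed-form definition; a minimal reference implementation (the standard textbook formula, not any one paper's evaluation script):

```python
import numpy as np

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio between a colorized result and the
    ground-truth image, in decibels; higher is better."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return float(10 * np.log10(peak ** 2 / mse))
```

Reported PSNR values depend on the color space and whether metrics are computed over full images or cropped object regions (the "(obj)" entries above), so numbers are only comparable within one evaluation protocol.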
Ablation results across these works demonstrate that disabling mask-based attention, instance fusion, or reference control degrades boundary adherence, increases color bleeding, and reduces alignment.
7. Current Limitations and Research Directions
Despite robust advances, several challenges and open questions remain:
- Segmentation Reliability: Approaches relying on external segmentation models (SAM, Mask R-CNN, RAM) are sensitive to missed detections, which can degrade color localization (Su et al., 2020, Chang et al., 2023, Zhang et al., 21 Mar 2025).
- Crowded Scenes and Occlusion: Instance guidance may become ambiguous in scenes with heavy object overlap or a large number of small regions, occasionally resulting in misalignment or unconvincing color assignments (Zhang et al., 21 Mar 2025).
- Resolution and Domain Generalization: Many current models are limited to a fixed, modest training resolution; scaling to higher resolutions requires additional architectural or computational innovations (Zhang et al., 21 Mar 2025).
- Interactive Refinement: Most methods operate fully automatically; interactive correction of per-instance color or mask is an active area for user-in-the-loop improvement (Zhang et al., 21 Mar 2025).
Potential future directions include integrating stronger segmentation, supporting interactive guidance, hierarchical or tiled diffusion architectures for high-res outputs, and further expansion into style-consistent automatic colorization for broad content genres (Zhang et al., 21 Mar 2025, Chang et al., 2024, An et al., 13 May 2025).
Instance-aware colorization has become a key paradigm in both image and video processing, providing the technological foundation for semantically-accurate, creative, and boundary-respecting automated coloring. The use of deep fusion modules, attention-masked conditioning, diffusion priors, and cross-modal language or visual references collectively define the state of the art, with ongoing research rapidly expanding capacity for increasingly complex and creative tasks.