
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

Published 22 Sep 2023 in cs.CV, cs.AI, and cs.LG | (2309.13042v2)

Abstract: We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion.

Citations (29)

Summary

  • The paper introduces MosaicFusion, a novel approach that uses diffusion models to synthesize multi-object images and extract accurate segmentation masks.
  • It leverages cross-attention maps and edge-aware filtering to delineate object boundaries without additional label supervision.
  • Experimental results on the LVIS dataset show up to a 5.6% improvement in mask AP for rare categories, underscoring its scalable data augmentation potential.

Analysis of "MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation"

The paper presents a novel approach, MosaicFusion, which utilizes diffusion models for data augmentation in the context of large vocabulary instance segmentation. By leveraging an off-the-shelf text-to-image diffusion model, MosaicFusion aims to generate synthetic labeled datasets without additional training or label supervision, catering specifically to the challenges presented by long-tailed distributions and open-vocabulary tasks.

The approach hinges on two key designs: image generation and mask generation. In the image generation phase, the image canvas is divided into multiple regions, each assigned its own text prompt, so that a single round of the diffusion process synthesizes several object instances simultaneously through a shared noise-prediction model. This division allows efficient synthesis of images containing multiple objects, loosely simulating multi-object real-world scenes. The mask generation phase uses the cross-attention maps produced during diffusion to delineate object boundaries: the maps associated with each object prompt are aggregated across layers and diffusion time steps, thresholded into binary region masks, and then refined with an edge-aware bilateral solver.
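The mask-generation step described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the function name, the nearest-neighbour upsampling, and the fixed threshold are illustrative assumptions, and the edge-aware bilateral-solver refinement the paper applies afterwards is omitted:

```python
import numpy as np

def aggregate_attention_masks(attn_maps, threshold=0.4, out_size=(64, 64)):
    """Illustrative sketch: fuse per-layer, per-timestep cross-attention
    maps for one object's text token into a single binary mask.

    attn_maps: list of 2D arrays (H_l, W_l), one per (layer, timestep),
               holding attention from image patches to the object token.
    """
    resized = []
    for a in attn_maps:
        # Nearest-neighbour upsample each map to a common resolution,
        # since attention maps differ in size across U-Net layers.
        h, w = a.shape
        ys = (np.arange(out_size[0]) * h / out_size[0]).astype(int)
        xs = (np.arange(out_size[1]) * w / out_size[1]).astype(int)
        resized.append(a[np.ix_(ys, xs)])
    # Average across layers and diffusion time steps.
    avg = np.mean(resized, axis=0)
    # Normalise to [0, 1] and apply a simple threshold.
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)
    return avg > threshold
```

In practice the paper further refines the thresholded mask with edge-aware filtering so that mask borders align with image edges, which this sketch does not attempt.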

The experimental results demonstrate the method's effectiveness across several instance segmentation baselines, including Mask R-CNN and CenterNet2, with significant performance improvements on the LVIS dataset, particularly for rare and novel categories. Notably, MosaicFusion achieves gains of up to 5.6% in mask AP for rare categories over the baseline models. Substantial gains are also observed in open-vocabulary detection with F-VLM, suggesting that MosaicFusion complements the representational power of pre-trained vision-language models such as CLIP.

The methodology substantially lowers the cost of manual annotation in instance segmentation, addressing a crucial bottleneck in scaling vocabulary size. By producing large quantities of synthetic labeled data, MosaicFusion offers a scalable solution that could advance the performance of vision models in diverse, real-world scenarios.

The paper's strong numerical results underscore the potential of leveraging generative modeling techniques for augmentative purposes in discriminative tasks. However, it also implicitly highlights the challenges of closing the domain gap between synthetic and real data, a limitation intrinsic to the state-of-the-art generative models employed. Future directions may include refining the fidelity of synthetic data and expanding the capabilities of diffusion models to capture even more complex scene semantics.

In conclusion, MosaicFusion represents a significant stride in data augmentation methodologies for instance segmentation. Its demonstration of simultaneous multi-object generation and direct mask extraction without auxiliary models marks a step towards more autonomous augmentation processes, which can substantially benefit a range of real-world applications in computer vision. The potential integration with more sophisticated diffusion models and broader adoption across segmentation tasks hold promise for future advancements in AI research and practice.
