
SynthSeg Agents: Synthetic Data for ZSWSSS

Updated 20 December 2025
  • SynthSeg Agents is a multi-agent framework that synthesizes high-quality training datasets using coordinated LLM and VLM agents for zero-shot weakly supervised semantic segmentation.
  • It employs iterative self-refinement of prompts with CLIP-based semantic scoring and ViT-driven relabeling to ensure diverse and accurate synthetic annotations.
  • The framework achieves competitive mIoU scores on benchmarks like PASCAL VOC 2012 and MS COCO 2014, narrowing the gap with traditional real-image training approaches.

SynthSeg Agents is a multi-agent framework designed to generate high-quality synthetic data for Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), a task which seeks to train dense prediction models using only synthetic image-level labels with no access to real images. The framework employs coordinated LLM-driven agents for prompt synthesis and image generation, using CLIP-based filtering and Vision Transformer (ViT)-based relabeling, to produce synthetic datasets suitable for weakly supervised semantic segmentation pipelines. SynthSeg Agents demonstrates competitive segmentation performance on benchmarks such as PASCAL VOC 2012 and MS COCO 2014 without using any real images at either the data generation or training stage (Wu et al., 17 Dec 2025).

1. Problem Formulation and Objectives

SynthSeg Agents addresses the problem of Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), which is defined as learning a segmentation model $M_{\mathrm{seg}}(\cdot)$ that predicts per-pixel class probabilities from real images, but is trained solely on a synthetic dataset $\mathcal{D}_{\mathrm{ZSWSSS}} = \{(I_i, L_i)\}_{i=1}^n$ generated without real images. Here, $I_i \in \mathbb{R}^{H\times W\times 3}$ is a synthetic image, and $L_i \in \{0,1\}^{|\mathcal{C}|}$ is a multi-hot image-level label over a class set $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$. The training objective minimizes a multi-label classification loss on globally pooled features:

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{|\mathcal{C}|} \sum_{c\in\mathcal{C}} \left[ y_c \log p_c + (1-y_c) \log (1-p_c) \right],$$

where $p_c$ is the predicted probability for class $c$ and $y_c \in \{0,1\}$ is the corresponding entry of the image-level label.
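As a worked illustration of the classification loss, the following minimal sketch evaluates the multi-label binary cross-entropy for a single image. The function name `multilabel_bce`, the clamping constant `eps`, and the toy probabilities are assumptions made for this example and do not come from the source.

```python
import math

def multilabel_bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over classes for one image."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)

# Toy example with three classes; the image contains classes 0 and 2.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
loss = multilabel_bce(probs, labels)
print(round(loss, 4))
```

The loss shrinks toward zero as the predicted probabilities approach the multi-hot label, which is the behavior a WSSS classifier is trained toward.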

This decouples synthetic data generation from segmentation model training: the dataset $\mathcal{D}_{\mathrm{ZSWSSS}}$ is synthesized entirely by LLM/VLM-driven agents and used as input to any off-the-shelf WSSS model.

2. Self-Refine Prompt Agent

The Self-Refine Prompt Agent generates a bank of diverse, high-quality scene prompts for each class $c \in \mathcal{C}$ via a staged process:

2.1 Template Instantiation

A templating function leverages an LLM (e.g., GPT-4o) to instantiate scene prompts by populating descriptors such as background, pose, and style for each target class, yielding an initial prompt set.

2.2 Iterative Self-Refinement & Diversity Filtering

The agent maintains a memory buffer of accepted prompts and their CLIP text embeddings. During each refinement iteration:

  • For each candidate prompt, compute its CLIP text embedding.
  • Retrieve the nearest-neighbor embedding in the memory buffer using approximate nearest neighbor (ANN) search.
  • If the similarity to that nearest neighbor falls below a diversity threshold, the prompt is considered sufficiently diverse and is added to both the refined prompt set and the memory buffer.
  • Prompts undergo LLM-based quality checks and are refined with specific templates if they score below a quality threshold.

Pseudocode for this refinement loop is provided in Algorithm 1 of the source.
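The diversity check in this loop can be sketched as follows. This is not Algorithm 1 from the source: it is a simplified illustration in which hand-made 3-dimensional vectors stand in for CLIP text embeddings, a brute-force scan replaces ANN search, the LLM quality check is omitted, and the threshold `max_sim=0.9` is a hypothetical value.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def diversity_filter(candidates, embed, max_sim):
    """Keep a prompt only if its nearest accepted neighbor is below max_sim."""
    memory = []  # (prompt, embedding) pairs accepted so far
    for prompt in candidates:
        e = embed(prompt)
        # Brute-force nearest neighbor; an ANN index would replace this scan.
        nearest = max((cosine(e, m) for _, m in memory), default=-1.0)
        if nearest < max_sim:
            memory.append((prompt, e))
    return [p for p, _ in memory]

# Toy embeddings (assumed for illustration, not produced by CLIP).
TOY_EMBEDDINGS = {
    "a dog running on a beach": [1.0, 0.1, 0.0],
    "a dog sprinting on a beach": [0.98, 0.12, 0.0],
    "a dog asleep on a sofa": [0.2, 1.0, 0.1],
}

kept = diversity_filter(list(TOY_EMBEDDINGS), TOY_EMBEDDINGS.get, max_sim=0.9)
print(kept)
```

In this toy run the near-duplicate "sprinting" prompt is rejected because it is almost collinear with an already accepted prompt, while the "sofa" prompt is accepted.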

2.3 CLIP-Based Semantic Scoring

Text–text CLIP similarity is used for semantic scoring. The score is computed from the CLIP embeddings of the two inputs and takes values in $[0,1]$. This metric governs both diversity acceptance during prompt generation (text–text) and selection in image filtering (text–image).

3. Image Generation Agent

The Image Generation Agent receives the filtered prompt bank and synthesizes labeled image samples through a three-stage process:

3.1 Vision–LLM Sampling

Each prompt is supplied to a pretrained Vision-LLM (VLM), such as GPT-Image-1, to generate an image.

3.2 CLIP-Based Dual Filtering

The classes present in each generated image are determined via dual alignment:

  • Text alignment: Compare the CLIP embeddings of the prompt and the class label; retain the class if the similarity exceeds a threshold (e.g., 0.7).
  • Image alignment: Compute the similarity between the generated image embedding and the class label embedding. Only the top-N scoring class–image associations are retained, forming the filtered subset.
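A minimal sketch of the two filtering stages is given below. It assumes the similarity scores have already been computed and stored per (image, class) pair, and it interprets "top-N" as a global ranking over all pairs that pass the text check; both the record layout and that interpretation are assumptions of this example, and all scores are made-up toy numbers.

```python
def dual_filter(records, text_threshold=0.7, top_n=2):
    """records: dicts with 'image_id', 'class', 'text_score', 'image_score'."""
    # Stage 1: text alignment between the prompt and the class label.
    passed = [r for r in records if r["text_score"] > text_threshold]
    # Stage 2: keep the top_n pairs ranked by image-to-label similarity.
    ranked = sorted(passed, key=lambda r: r["image_score"], reverse=True)
    return ranked[:top_n]

def to_labels(kept):
    """Group the retained classes by image to form image-level labels."""
    labels = {}
    for r in kept:
        labels.setdefault(r["image_id"], set()).add(r["class"])
    return labels

# Toy scores, assumed for illustration only.
records = [
    {"image_id": "img0", "class": "dog", "text_score": 0.82, "image_score": 0.31},
    {"image_id": "img0", "class": "cat", "text_score": 0.55, "image_score": 0.29},
    {"image_id": "img1", "class": "horse", "text_score": 0.91, "image_score": 0.35},
    {"image_id": "img2", "class": "airplane", "text_score": 0.75, "image_score": 0.18},
]

labels = to_labels(dual_filter(records, text_threshold=0.7, top_n=2))
print(labels)
```

Here the "cat" association fails the text check, and the "airplane" association passes it but is dropped by the top-N cut on image similarity.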

3.3 ViT-Based Classifier & Relabeling

A ViT-B/32 model is trained on the filtered subset for multi-label classification. Images are patch-embedded and mapped to class logits; aggregation uses global max pooling, and training uses a binary cross-entropy loss.

After convergence, the classifier relabels the entire synthetic dataset, including lower-confidence images, yielding the final training set $\mathcal{D}_{\mathrm{ZSWSSS}}$.
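The relabeling step can be illustrated with a small sketch that pools per-patch class logits with a global max and thresholds the resulting probabilities. The logits below are toy numbers rather than ViT outputs, and the decision threshold of 0.5 and the function name `relabel` are assumptions of this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relabel(patch_logits, class_names, threshold=0.5):
    """patch_logits: list of per-patch rows, one logit per class."""
    # Global max pooling: strongest patch response for each class.
    pooled = [max(column) for column in zip(*patch_logits)]
    probs = [sigmoid(z) for z in pooled]
    return [name for name, p in zip(class_names, probs) if p >= threshold]

class_names = ["dog", "horse", "airplane"]
# Four patches, three classes (toy values assumed for illustration).
patch_logits = [
    [-2.0, -3.0, -4.0],
    [3.1, -2.5, -3.0],
    [0.4, 1.2, -5.0],
    [-1.0, -0.5, -2.2],
]

new_labels = relabel(patch_logits, class_names)
print(new_labels)
```

Because max pooling takes the strongest patch per class, a class can be assigned even when it occupies only a small region of the image.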

4. Integrated Pipeline and Training

SynthSeg Agents operates in two sequential stages:

| Sequence | Module | Input → Output |
|---|---|---|
| 1 | Self-Refine Prompt Agent | class set $\mathcal{C}$ → filtered prompt bank |
| 2 | Image Generation Agent | filtered prompt bank → synthetic dataset $\mathcal{D}_{\mathrm{ZSWSSS}}$ |

Once the synthetic dataset is established, any standard WSSS segmentation architecture (such as SEAM, ToCo, DeepLab) is trained with the classification loss $\mathcal{L}_{\mathrm{cls}}$ and segmentation-specific objectives. Segmentation performance is measured in terms of mean Intersection-over-Union (mIoU):

$$\mathrm{mIoU} = \frac{1}{|\mathcal{C}|} \sum_{c\in\mathcal{C}} \frac{|P_c \cap G_c|}{|P_c \cup G_c|},$$

where $P_c$ and $G_c$ denote predicted and ground-truth masks for class $c$.
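For concreteness, the sketch below computes mIoU on a tiny pair of label maps. It is a toy, single-image illustration: benchmark protocols typically accumulate intersections and unions over the whole evaluation set, and skipping classes that appear in neither map is a convention chosen for this example.

```python
def mean_iou(pred, gt, num_classes):
    """pred, gt: equally sized 2-D lists of integer class ids."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                inter[p] += 1
                union[p] += 1
            else:
                # A mismatched pixel enlarges the union of both classes.
                union[p] += 1
                union[g] += 1
    # Classes absent from both maps are skipped.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

pred = [[0, 0, 1],
        [0, 1, 1]]
gt = [[0, 1, 1],
      [0, 1, 1]]

score = mean_iou(pred, gt, num_classes=2)
print(round(score, 4))
```

In this example class 0 has IoU 2/3 and class 1 has IoU 3/4, so the mean is 17/24.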

5. Experimental Evaluation

SynthSeg Agents is instantiated on two major benchmarks:

5.1 PASCAL VOC 2012

  • 20 classes, 10k synthetic images generated.
  • Baseline WSSS models (ToCo, Seco) trained on real images serve as the mIoU reference.
  • SynthSeg Agents, in zero-shot mode, is evaluated with both ToCo and Seco trained on purely synthetic data.
  • Fine-tuning on real images further improves mIoU (on seen classes).

5.2 MS COCO 2014

  • 80 classes, 80k synthetic images generated.
  • State-of-the-art real-image WSSS methods serve as the mIoU reference.
  • SynthSeg Agents is evaluated without any real images.
  • Mixing synthetic and real data for fine-tuning surpasses the tested baselines in mIoU.

5.3 Ablation Studies

Ablation experiments quantify contributions of agent modules:

| Component | mIoU (%) |
|---|---|
| Prompt Agent (Template Only) | 48.1 |
| + Self-Refine (quality scoring) | 49.9 |
| + CLIP diversity filtering | 52.5 |
| Image Agent (class-label only) | 46.7 |
| + CLIP filter | 50.8 |
| + CLIP + ViT relabel | 52.5 |

This demonstrates that both iterative prompt refinement with semantic diversity filtering and CLIP/ViT-driven relabeling yield substantial performance improvements.
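The per-component gains implied by the ablation table can be read off with a few lines of arithmetic. The snippet below only reuses the mIoU values listed in the table; the helper name `incremental_gains` is an assumption of this example.

```python
# mIoU values (%) copied from the ablation table.
PROMPT_AGENT = [
    ("Prompt Agent (Template Only)", 48.1),
    ("+ Self-Refine (quality scoring)", 49.9),
    ("+ CLIP diversity filtering", 52.5),
]
IMAGE_AGENT = [
    ("Image Agent (class-label only)", 46.7),
    ("+ CLIP filter", 50.8),
    ("+ CLIP + ViT relabel", 52.5),
]

def incremental_gains(rows):
    """Return (component, gain over the previous row) in mIoU points."""
    return [
        (name, round(score - prev_score, 1))
        for (_, prev_score), (name, score) in zip(rows, rows[1:])
    ]

for rows in (PROMPT_AGENT, IMAGE_AGENT):
    for name, gain in incremental_gains(rows):
        print(f"{name}: +{gain} mIoU")
```

Read this way, the largest single step in the table is the CLIP filter in the Image Agent (46.7 to 50.8), followed by CLIP diversity filtering in the Prompt Agent (49.9 to 52.5).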

5.4 Qualitative Analysis

Synthetic images for classes such as “dog,” “horse,” and “airplane” display diverse object poses, backgrounds, and multi-object compositions, outcomes directly attributable to prompt diversity and memory-based filtering.

6. Significance and Implications

SynthSeg Agents shows that high-quality synthetic training datasets, generated entirely by coordinated LLM and VLM agents, can enable WSSS pipelines to achieve competitive segmentation performance in a true zero-shot setting. The modular architecture, comprising separate agents for prompt generation and for image filtering and label refinement, supports semantic diversity and controllable data synthesis. Fine-tuning with real data further narrows the gap with real-image training and, in the reported experiments, surpasses the tested baselines. This framework highlights the potential for scalable WSSS and data-efficient semantic segmentation workflows that are not limited by real-image availability (Wu et al., 17 Dec 2025).

A plausible implication is that LLM-driven synthetic data engines may become central to future semantic segmentation, particularly in domains or tasks where annotated datasets are scarce or unavailable.
