
Scene Graph Generation (SGG)

Updated 2 February 2026
  • Scene Graph Generation is the task of creating a structured graph where nodes represent objects and edges denote semantic relationships between them.
  • It employs diverse methodologies including two-stage, one-stage, and generative approaches to jointly detect objects and predict relations.
  • Advanced techniques like sample-level bias prediction and causal inference address long-tailed predicate distributions, improving both head and tail recall.

Scene Graph Generation (SGG) is the task of inferring a structured, graph-based representation of the relationships between objects in a visual scene. The output is a directed graph where nodes correspond to detected objects (with category labels and often bounding boxes), and edges represent semantic predicates (relations) connecting those objects. SGG has become foundational for downstream tasks in vision-language reasoning, visual question answering, captioning, and robotics due to its explicit, compositional encoding of scene semantics (Zhu et al., 2022). This article surveys the key principles, algorithmic advances, evaluation protocols, and current research frontiers in SGG, highlighting the field’s transition toward fine-grained, unbiased, and open-vocabulary relational modeling.

1. Formal Definition and Fundamental Modeling Paradigm

Let an image I be mapped to a scene graph G = (V, E), where V = {o_1, …, o_N} are object nodes with categorical labels and bounding boxes, and E = {(i, j, p)} is a set of directed edges (subject o_i, object o_j, predicate p). The task is to maximize the conditional probability p(G | I), decomposed as

p(G | I) = p(B | I) · p(O | B, I) · p(R | O, B, I)

with B the object boxes, O the object labels, and R the pairwise predicates (Zhu et al., 2022). Standard pipelines include two-stage approaches, which first detect and label objects and then classify a relation for each object pair, and one-stage (end-to-end) approaches that predict subject-predicate-object triplets jointly.
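The factorization above can be sketched as a two-stage pipeline. The detector and classifier stubs below are illustrative placeholders, not any particular model's API:

```python
# Hypothetical two-stage SGG pipeline mirroring
# p(G|I) = p(B|I) · p(O|B,I) · p(R|O,B,I).
from itertools import permutations

def generate_scene_graph(image, detect_boxes, classify_objects, classify_relation):
    boxes = detect_boxes(image)                       # p(B | I)
    labels = classify_objects(image, boxes)           # p(O | B, I)
    edges = []
    for i, j in permutations(range(len(boxes)), 2):   # ordered subject/object pairs
        pred = classify_relation(image, boxes[i], boxes[j], labels[i], labels[j])
        if pred is not None:                          # prune "no relation" pairs
            edges.append((i, j, pred))                # p(R | O, B, I)
    return labels, edges
```

One-stage models collapse the same factorization into a single set-prediction head over triplets, avoiding the explicit pairwise loop.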

Evaluation is performed under three settings:

  • Predicate Classification (PredCls): Given boxes and object labels, predict predicates.
  • Scene Graph Classification (SGCls): Given boxes, predict object labels and predicates.
  • Scene Graph Generation (SGDet): Predict boxes, labels, and predicates jointly (Han et al., 2021).

The dominant metrics are Recall@K (R@K) and mean Recall@K (mR@K), with mR@K correcting the inherent bias toward frequent predicates (Han et al., 2021).
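In schematic form, with triplets as (subject, predicate, object) tuples and predictions assumed pre-sorted by confidence, the two metrics can be computed as:

```python
def recall_at_k(gt_triplets, pred_triplets, k):
    """Fraction of ground-truth triplets recovered among the top-k predictions."""
    top_k = set(pred_triplets[:k])          # predictions sorted by confidence
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / len(gt_triplets)

def mean_recall_at_k(gt_by_predicate, pred_triplets, k):
    """Average per-predicate Recall@K, so rare predicates count equally."""
    recalls = [recall_at_k(gts, pred_triplets, k)
               for gts in gt_by_predicate.values() if gts]
    return sum(recalls) / len(recalls)
```

Because mR@K averages over predicate classes rather than instances, a model that only predicts "on" and "has" scores high on R@K but poorly on mR@K.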

2. Addressing Dataset Bias and Long-Tailed Prediction

A central challenge in SGG is the extreme long-tailed distribution of predicate classes: high-frequency, coarse predicates (e.g., "on", "has") dominate, obscuring semantically informative, rare (tail) predicates (e.g., "carrying", "parked on"). Standard likelihood-based models are heavily biased, resulting in head predicate overprediction and poor tail recall (Li et al., 2024, Zheng et al., 2022).

Debiasing Strategies:

  • Sample-Level Bias Correction: The Sample-Level Bias Prediction (SBP) framework explicitly models a per-sample logit correction b_s to refine the original, biased predicate logits z_{i,j}:

ẑ_{i,j} = z_{i,j} + b_s

The SBP method uses a lightweight union-region encoder and a Bias-Oriented GAN to learn a distribution over correction vectors, calibrated to correct coarse-to-fine predicate misclassifications at the object-pair level (Li et al., 2024). This outperforms dataset-level debiasing (DLFE, RTPB) in Average@K on VG, GQA, and VG-1800, with gains up to +5.6% for PredCls (Li et al., 2024).

  • Causal Inference (TDE): The Total Direct Effect framework subtracts the contextual bias by counterfactually intervening on visual features, isolating the causal effect of the visual content x on the predicate logits Y:

TDE = Y_x(u) − Y_{x̄,z}(u)

This agnostic approach consistently improves mean recall across strong SGG backbones, nearly doubling mR@100 compared to reweighting and resampling (Tang et al., 2020).
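The counterfactual subtraction can be sketched with any predictor that takes separate visual and context inputs; the zero baseline used here for x̄ is one common choice, and all names are illustrative:

```python
import numpy as np

def tde_logits(predict, visual_feat, context_feat):
    """Total Direct Effect: factual logits minus counterfactual logits
    obtained by blanking the visual input while keeping the context."""
    y_factual = predict(visual_feat, context_feat)                  # Y_x(u)
    y_counter = predict(np.zeros_like(visual_feat), context_feat)   # Y_x̄,z(u)
    return y_factual - y_counter                                    # used for final ranking
```

The subtraction cancels whatever the model predicts from context alone (the frequency prior), leaving the contribution of the actual visual evidence.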

  • Curriculum Reweighting: The SGG-HT framework introduces a curriculum reweighting schedule over predicate classes, initializing training on head classes and linearly annealing to emphasize tail samples. This is paired with semantic consistency constraints to prevent semantic drift (Zheng et al., 2022).
  • Graph Topology and Message Passing: Richer aggregation mechanisms, such as DualMPNNs over both the object and edge-dual graphs (Kim et al., 2023), and explicit co-occurrence/Self-TF-IDF feature scaling (Kim et al., 2024), further mitigate head bias by amplifying context propagation to tail predicates.
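A minimal sketch of the curriculum reweighting idea, assuming a linear head-to-tail annealing of inverse-frequency class weights (the exact SGG-HT schedule may differ):

```python
import numpy as np

def curriculum_weights(class_freq, t, t_total):
    """Linearly anneal per-class loss weights from uniform (head-friendly)
    toward normalized inverse-frequency (tail-emphasizing) over training."""
    alpha = min(t / t_total, 1.0)            # 0 → head phase, 1 → tail phase
    inv = 1.0 / np.asarray(class_freq, float)
    inv = inv / inv.mean()                   # normalized inverse-frequency weights
    uniform = np.ones_like(inv)
    return (1 - alpha) * uniform + alpha * inv
```

Early training thus behaves like standard likelihood training, while later epochs progressively upweight rare predicates without the instability of training on tail samples from the start.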

3. Region-Level Context and Feature Extraction Strategies

The spatial and semantic context at the level of object pairs is critical. Union-region features, which pool over the tightest union of the two objects’ bounding boxes (supplemented by spatial positional embeddings), have been shown to outperform both whole-image and global context for fine-grained predicate disambiguation (Li et al., 2024, Liu et al., 2022). In 3D SGG and aerial urban scenes, interaction-specific 3D/spatial subregions and adaptive bounding-box scaling factors sharply restrict the candidate set of relationships and prevent context overload in dense graphs (Liu et al., 2022, Li et al., 2024). These strategies, often coupled with specialized pruning mechanisms, are essential for scaling SGG to high-object-count or cluttered domains.
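The union-region pooling described above starts from the tightest box covering both objects; a minimal helper, with coordinates in (x1, y1, x2, y2) form:

```python
def union_box(box_a, box_b):
    """Tightest axis-aligned box enclosing both input boxes,
    each given as (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))
```

Features pooled from this region see both objects plus the space between them, which carries most of the spatial evidence for predicates like "riding" versus "next to".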

4. Generative, Open-Vocabulary, and Structured SGG

Recent advances push SGG toward generative and open-vocabulary regimes:

  • Autoregressive and Transformer-Based Scene Graph Generation: Generative models treat SGG as a structured sequence (or set) prediction problem. Structured Sparse R-CNN directly learns sets of triplet queries optimized end-to-end (Teng et al., 2021). Iterative scene graph generation methods use transformer decoders to sample plausible adjacency (structure) graphs, greatly reducing the O(N²) pairwise computations and improving scalability (Kundu et al., 2022). Autoregressive unconditional models decouple the generation of object nodes and edge sequences (SceneGraphGen), capturing the combinatorial semantics of realistic scenes (Garg et al., 2021).
  • Open-Vocabulary and Vision-Language Integration: Recent models (OvSGTR, PGSG, OwSGG) leverage pretrained vision-language models (e.g., CLIP, BLIP) for fully open-vocabulary detection and relation recognition. Techniques include contrastive alignment of visual and text embeddings, relation-aware pretraining on image-caption triplets, and knowledge-distillation strategies to mitigate catastrophic forgetting of rare predicates (Chen et al., 2023, Li et al., 2024, Dutta et al., 9 Jun 2025). These approaches achieve non-zero recall on previously unseen predicates and objects, outperforming closed-set models in open settings.
  • Location-Free and Aerial Scene Graphs: Models such as Pix2SG demonstrate that instance- and relationship-level semantics can be predicted without explicit localization, provided sufficient relational and contextual modeling (Özsoy et al., 2023). In specialized domains like aerial urban imagery, locality-preserving GCNs and adaptive pruning are leveraged to avoid over-smoothing and suppress meaningless object pairs (Li et al., 2024).
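As an illustration of the contrastive alignment used by such open-vocabulary models, a visual pair feature can be scored against text embeddings of arbitrary predicate names; the embeddings here are placeholders for what a CLIP-style text encoder would produce:

```python
import numpy as np

def score_predicates(pair_feat, predicate_embeds):
    """Cosine similarity between one visual pair feature of shape (D,) and a
    bank of predicate text embeddings of shape (P, D); the top score gives an
    open-vocabulary relation label without a fixed classifier head."""
    v = pair_feat / np.linalg.norm(pair_feat)
    t = predicate_embeds / np.linalg.norm(predicate_embeds, axis=1, keepdims=True)
    return t @ v
```

Because the predicate bank is just a list of encoded strings, new relations can be added at inference time by encoding their names, with no retraining.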

5. Robustness, Scalability, and Evaluation Protocols

Benchmarks such as the SGG Benchmark (Han et al., 2021) provide strong baselines, modular evaluation scripts, and standardized splits for rigorous comparisons across pipelines. Object detection quality, pair proposal policy, and relation classification are all isolated as error sources. Robustness to visual domain shift (e.g., weather corruptions) is emerging as a critical research axis: HiKER-SGG employs coarse-to-fine hierarchical reasoning over an external knowledge graph, coupled with adaptive bias-correction, to preserve high recall under 20 classes of image corruptions (Zhang et al., 2024).

Best practices include:

  • Always report Recall@K and mean Recall@K (per-predicate averaged), especially on long-tail and zero-shot splits.
  • Ablate region scope, bias-correction mechanisms, and architectural context modules to determine their impact (Li et al., 2024, Zhang et al., 2024).
  • Decompose error into detection, proposal, and classification components (Han et al., 2021).
  • Where possible, use decoupled architectures to prevent co-adaptation of object and relation modules, yielding more stable and generalizable models (Han et al., 2021).

6. Sample-Level Bias Prediction: The SBP Framework

The Sample-Level Bias Prediction (SBP) method represents a leading approach for fine-grained, unbiased SGG (Li et al., 2024):

  1. Correction Bias Construction: For each object pair, the SBP pipeline computes a correction bias b_s using a transformer-based encoder over the union-region features, together with an explicit global dataset-level bias vector b^{glo}:

b^{init} = φ(f_{uni}) + b^{glo}

Margins are adjusted to ensure the ground-truth predicate becomes the logit maximum.
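A minimal sketch of the margin adjustment described above, assuming a simple max-margin target; the paper's exact margin formulation may differ:

```python
import numpy as np

def target_correction(z, gt_idx, margin=1.0):
    """Build a correction vector b so that (z + b) ranks the ground-truth
    predicate highest, with at least `margin` over the best competitor.
    A simplified illustration of the margin adjustment step."""
    b = np.zeros_like(z)
    best_other = np.delete(z, gt_idx).max()   # strongest competing logit
    gap = best_other - z[gt_idx]
    if gap + margin > 0:
        b[gt_idx] = gap + margin
    return b
```

These target corrections serve as the supervision signal the generator is trained to reproduce from union-region features alone.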

  2. Bias-Oriented GAN: A 5-layer 1D-convolutional generator G predicts sample-level corrections, supervised adversarially by a discriminator D and regularized with a classification loss over the corrected logits:

L_G = −E[T_G] + α · CE(softmax(z + b^{pre}), r_{tru})

The overall objective combines the base SGG loss with adversarial/classification terms, with two-phase training (SGG then BGAN).
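Under simplifying assumptions (scalar discriminator scores, standard softmax cross-entropy), the generator objective above can be sketched as follows; names and shapes are illustrative, not the paper's exact formulation:

```python
import numpy as np

def generator_loss(d_scores_fake, corrected_logits, gt_idx, alpha=1.0):
    """Sketch of L_G = -E[T_G] + alpha * CE(softmax(z + b_pre), r_tru):
    an adversarial term rewarding corrections the discriminator accepts,
    plus cross-entropy on the bias-corrected predicate logits."""
    probs = np.exp(corrected_logits - corrected_logits.max())  # stable softmax
    probs = probs / probs.sum()
    ce = -np.log(probs[gt_idx])                 # classification term
    return -np.mean(d_scores_fake) + alpha * ce
```

The α-weighted cross-entropy keeps the generated corrections anchored to predicate accuracy rather than only to fooling the discriminator.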

  3. Inference and Correction: At test time, each candidate pair is logit-corrected as ẑ_{i,j} = z_{i,j} + b_s (with b_s predicted by G).
  4. Experimental Results: On Visual Genome, GQA, and VG-1800, SBP achieves a state-of-the-art balance between head and tail recall in PredCls, SGCls, and SGDet (e.g., +5.6% A@K for PredCls over prior dataset-level correction). Ablations confirm the necessity of the union-region scope, global bias, transformer encoder φ, and adversarial loss; training also benefits from the two-phase strategy (Li et al., 2024).

7. Current Challenges and Future Directions

Major open problems in SGG research include:

  • Persistent biases and catastrophic forgetting in long-tailed predicates, especially under incremental/continual learning (Khandelwal et al., 2023).
  • Scaling to open-world, zero-shot relational inference without labeled training data, leveraging pretrained VLMs and grounding (Dutta et al., 9 Jun 2025).
  • Robustness to domain shift under corruptions, dense clutter, or non-iconic imagery (Zhang et al., 2024, Li et al., 2024).
  • Integrating 3D/temporal reasoning, attributes, and external knowledge at scale (Liu et al., 2022).
  • Efficient, real-time inference suitable for embodied agents and edge deployment (Neau et al., 2024).
  • Unified benchmarks and metrics, especially for open-vocab, location-free, and high-object-count scenes (Han et al., 2021).

Promising further directions include task-adaptive, end-to-end training of object and relation modules, more powerful data-driven curriculum and bias-correction schedules, and tight integration of scene graphs with downstream vision-language architectures for full-stack semantic reasoning (Chen et al., 2023, Zheng et al., 2022).


References:

  • (Li et al., 2024) Zhu et al., Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
  • (Tang et al., 2020) Tang et al., Unbiased Scene Graph Generation from Biased Training
  • (Chen et al., 2023) Duan et al., Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation
  • (Kim et al., 2024) Lee et al., Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency
  • (Kundu et al., 2022) Li et al., Iterative Scene Graph Generation with Generative Transformers
  • (Han et al., 2021) Han et al., Image Scene Graph Generation (SGG) Benchmark
  • (Zhang et al., 2024) Zhang et al., HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
  • (Li et al., 2024) Han et al., AUG: A New Dataset and An Efficient Model for Aerial Image Urban Scene Graph Generation
  • (Liu et al., 2022) Liu et al., Explore Contextual Information for 3D Scene Graph Generation
  • (Kim et al., 2023) Kim et al., Semantic Scene Graph Generation Based on an Edge Dual Scene Graph and Message Passing Neural Network
  • (Teng et al., 2021) Liu et al., Structured Sparse R-CNN for Direct Scene Graph Generation
  • (Zheng et al., 2022) Yu et al., Learning To Generate Scene Graph from Head to Tail
  • (Zhu et al., 2022) Zhu et al., Scene Graph Generation: A Comprehensive Survey