
Scene Graph Feature Banks

Updated 22 January 2026
  • Scene graph feature banks are structured representations that aggregate feature-level encodings of objects and their relationships for efficient local and global reasoning.
  • They integrate diverse modalities, leveraging techniques like CLIP-based extraction, temporal pooling, and graph autoencoding to enhance 3D scene and action understanding.
  • Design strategies such as entropy-based view selection and masked autoencoding improve performance, computational efficiency, and interpretability across applications.

A scene graph feature bank is a structured memory or representation that accumulates, stores, and exposes feature-level encodings of objects and their relationships as defined by scene graphs. These feature banks are central to recent approaches in open-vocabulary 3D scene understanding, compositional action recognition from video, and context-aware visual analytics. The core idea is to abstract scene elements and their interactions into a compact, queryable database of features—derived from raw data, structured as graphs—that can be indexed, fused, and analyzed for downstream reasoning.

1. Conceptual Foundations and Definitions

A scene graph is a graph-based abstraction in which nodes represent semantically coherent entities (objects, stuff categories, or other primitives) and edges represent relationships or predicates between them. A scene graph feature bank (SGFB) is a data structure that aggregates, stores, and exposes these object- and relation-level features, typically as sets, tensors, or dictionaries. SGFBs enable both local (object or segment) and global (scene-level or temporal) reasoning, and facilitate integration with learned visual and linguistic features.

Distinct paradigms exist for building SGFBs, as evident from three major recent works:

  • In the open-vocabulary 3D setting, Kassab et al. define their feature bank as a mapping from 3D object segments to CLIP-derived per-object descriptors—optimized for minimal computation and storage without loss of open-vocabulary generalization (Kassab et al., 2024).
  • Ji et al., for action recognition with Action Genome, build a temporal SGFB by flattening and storing per-frame confidence matrices over object–relation pairs, enabling efficient retrieval and fusion with 3D-CNN backbones (Ji et al., 2019).
  • Liu et al. construct the SGFB as a collection of compact scene embeddings, each derived from a heterogeneous panoptic scene graph via masked graph autoencoding, supporting scene-level perceptual prediction and interpretability (Liu et al., 22 Dec 2025).

2. Construction and Representation of Scene Graph Feature Banks

2.1. Object-centric 3D SGFB (Kassab et al.)

Construction proceeds through a five-stage pipeline:

  1. Input acquisition: Sequence of RGB-D images paired with camera poses.
  2. 3D segmentation: Region-growing geometry-based segmentation generates class-free segments S_1, …, S_M in fused point clouds.
  3. View association: Visible frames for each segment are determined; each segment is projected to 2D and suitably cropped for feature extraction.
  4. Feature extraction: For each crop, a single forward pass of CLIP (ViT-H/14 backbone) yields x_{m,v} ∈ ℝ^{1024} at crop factor α = 1.5.
  5. Feature selection: For each segment, one view is chosen (typically via minimum-entropy selection across predicted label distributions), yielding a final scene graph feature bank mapping segment indices m to features x_m^*.

The internal representation is thus an M × 1024 matrix or a keyed float32 dictionary, readily enabling cosine-similarity-based zero-shot queries or retrieval (Kassab et al., 2024).
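Zero-shot retrieval against such a bank reduces to a matrix product over L2-normalized features. A minimal NumPy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def zero_shot_query(feature_bank: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Assign each segment the label of its most similar text prompt.

    feature_bank: (M, D) per-segment CLIP image features (D = 1024 for ViT-H/14).
    text_embeds:  (K, D) CLIP text embeddings for K candidate labels.
    Returns an (M,) array of label indices.
    """
    f = feature_bank / np.linalg.norm(feature_bank, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = f @ t.T  # (M, K) cosine similarities
    return sims.argmax(axis=1)
```

Because the features are stored once and normalized on the fly, new label vocabularies can be queried without touching the 3D data again.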

2.2. Temporal SGFB for Video (Action Genome)

Each frame is parsed into a human-centric scene graph; a confidence matrix C_t ∈ ℝ^{|O|×|R|} (objects × relationships) is populated per frame, flattened to f_t ∈ ℝ^{|O|·|R|}, and temporally stacked into F_SG ∈ ℝ^{T×(|O|·|R|)} for T frames. This representation supports temporal querying via windowing, pooling (mean/max), or more advanced operators (e.g., non-local attention) (Ji et al., 2019).
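The flatten-and-stack step can be sketched as follows (a hypothetical helper, assuming the per-frame confidence matrices have already been computed by the scene graph predictor):

```python
import numpy as np

def build_temporal_sgfb(conf_matrices) -> np.ndarray:
    """Stack per-frame object-relationship confidence matrices into a
    temporal feature bank F_SG of shape (T, |O| * |R|).

    conf_matrices: iterable of T arrays, each of shape (|O|, |R|).
    """
    return np.stack([C.reshape(-1) for C in conf_matrices], axis=0)
```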

2.3. Scene-level Relational Embeddings (Liu et al.)

SVI scenes are parsed by an open-vocabulary panoptic scene graph model (OpenPSG) into graphs with heterogeneous nodes (objects, stuff) and edges (predicates). SBERT embeddings are computed for all labels. The entire graph structure is encoded by a relational GNN-based masked autoencoder (GraphMAE), outputting a compact 128-dimensional embedding per scene, z_scene, via mean pooling over node embeddings. The collection of z_scene vectors comprises the feature bank (Liu et al., 22 Dec 2025).
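Assuming per-node embeddings are available from the graph encoder, assembling the bank is a simple mean-pool per scene (names and data layout are illustrative, not from the paper):

```python
import numpy as np

def build_scene_feature_bank(scenes: dict) -> dict:
    """Mean-pool node embeddings into one scene vector per entry.

    scenes: dict mapping scene_id -> (N_i, d) array of node embeddings
            produced by the graph encoder (d = 128 in the paper).
    Returns: dict mapping scene_id -> (d,) pooled scene embedding z_scene.
    """
    return {sid: emb.mean(axis=0) for sid, emb in scenes.items()}
```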

3. Methodological Variations and Design Choices

3.1. Image Preprocessing and Feature Extraction

Kassab et al. rigorously ablate the need for multi-scale crops, SAM mask variants, and averaging. Empirical results show minimal accuracy gains with these augmentations (e.g., single-crop at α = 1.5: 11.3% vs. multi-mask/crop: up to 14.6%) but a 3× increase in computation, justifying adoption of single-scale cropping (Kassab et al., 2024).

3.2. View Fusion versus Entropy-based Selection

Simple averaging or mode-voting across multi-view features leads to significant performance degradation (13.0–13.6% accuracy) relative to an oracle upper bound (25–31%) in open-vocabulary 3D instance classification. Entropy-based selection, by contrast, consistently yields superior results (14.1%), especially when computed over a domain-relevant prompt list, with improvements up to 5% when restricting entropy computation to smaller, semantically coherent label sets (Kassab et al., 2024).
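The entropy criterion can be sketched as follows: score each view's softmax label distribution over the prompt list and keep the lowest-entropy (most confident) view. This is a simplified illustration, not the authors' exact implementation:

```python
import numpy as np

def select_min_entropy_view(view_features: np.ndarray,
                            text_embeds: np.ndarray) -> np.ndarray:
    """Return the feature of the view whose predicted label distribution
    has minimum entropy.

    view_features: (V, D) CLIP features, one per visible view of a segment.
    text_embeds:   (K, D) embeddings of the candidate prompt list.
    """
    f = view_features / np.linalg.norm(view_features, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = f @ t.T                             # (V, K) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)  # (V,)
    return view_features[entropy.argmin()]
```

Restricting `text_embeds` to a smaller, domain-relevant label set corresponds to the reported up-to-5% improvement from semantically coherent prompt lists.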

3.3. Temporal Pooling and Attention Mechanisms

In the Action Genome paradigm, the SGFB is operated on via “feature bank operators” such as mean- or max-pooling, or non-local blocks for attention-based fusion. This compresses the T × 875 tensor to a lower-dimensional summary compatible with downstream concatenation to standard video backbone features, supporting both long-term and local action representation (Ji et al., 2019).
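A minimal sketch of the simpler feature bank operators (mean/max pooling over time; the non-local attention variant is omitted, and the function name is illustrative):

```python
import numpy as np

def pool_sgfb(F_SG: np.ndarray, op: str = "max") -> np.ndarray:
    """Compress a temporal scene-graph feature bank of shape (T, |O|*|R|)
    into a single clip-level summary vector for concatenation with
    3D-CNN backbone features."""
    if op == "mean":
        return F_SG.mean(axis=0)
    if op == "max":
        return F_SG.max(axis=0)
    raise ValueError(f"unknown feature bank operator: {op}")
```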

3.4. Masked Autoencoding for Scene Graphs

The urban perception pipeline employs masked node and edge reconstruction within a GraphMAE framework, critical for learning robust, generalizable scene-level representations. The authors recommend moderate masking rates (10–20%) to optimize the balance between context sensitivity and discriminative power (Liu et al., 22 Dec 2025).
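A simplified sketch of the node-masking step, using a zero vector as a stand-in for GraphMAE's learned mask token (the helper and its defaults are illustrative):

```python
import numpy as np

def mask_nodes(node_feats: np.ndarray, mask_rate: float = 0.15, seed: int = 0):
    """Randomly hide a fraction of node features for masked reconstruction.

    Returns the corrupted feature matrix and the boolean mask of hidden
    nodes; the decoder is trained to reconstruct the original features
    at the masked positions.
    """
    rng = np.random.default_rng(seed)
    n = node_feats.shape[0]
    n_mask = max(1, int(round(mask_rate * n)))
    idx = rng.choice(n, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    corrupted = node_feats.copy()
    corrupted[mask] = 0.0  # stand-in for the learned [MASK] token
    return corrupted, mask
```

With the recommended 10–20% masking rates, most of the graph context stays visible, which is the balance between context sensitivity and discriminative power the authors describe.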

4. Practical Applications and Empirical Results

4.1. 3D Open-vocabulary Semantics

Kassab et al.’s minimal SGFB pipeline achieves comparable or superior semantic segmentation accuracy (mIoU = 0.07, F-mIoU = 0.14, mAcc = 0.09 on Replica) and a threefold improvement in compute efficiency over prior state of the art (0.51 FPS vs. 0.16–0.03 FPS; 5 GB VRAM) (Kassab et al., 2024). The feature bank enables efficient, zero-shot category retrieval by direct cosine similarity against text prompts.

4.2. Compositional Video and Action Understanding

In Action Genome, integrating an SGFB into a standard I3D backbone for action recognition yields higher mAP relative to image- or video-only “long-term feature bank” methods (44.3% vs. 42.5% on Charades validation). Oracle ablations suggest that improved object and relationship ground truth quality could further raise performance (60.3% mAP). Few-shot action recognition is bolstered, with 42.7% mAP in the 10-shot regime compared to 39.6% for non-graph LFB approaches (Ji et al., 2019).

4.3. Human-centric Perceptual Modeling

Liu et al. demonstrate that scene-graph feature banks built via masked heterogeneous graph autoencoding deliver a substantial ≈26% mean accuracy gain (87% vs. 71% for ViT-B/16) in urban perception prediction on Place Pulse 2.0. Cross-city generalization remains robust, with only minor drops in accuracy (5.6%) and AUC (3.5%) when transferring to Amsterdam and Tokyo subsets. Qualitative analysis confirms the interpretability benefit: explicit predicates (e.g., (graffiti)–[on]–(wall)) correlate with lower safety, while positive relations boost beauty and liveliness (Liu et al., 22 Dec 2025).

5. Interpretability, Generalization, and Design Insights

Scene graph feature banks retain structured, explicit information about object–predicate–object triples or segment-level descriptors, supporting post hoc introspection and diagnosis. Relational structure and predicate abstraction drive generalization across domains and reduce overfitting to pixel-level biases inherent in image-only models (Liu et al., 22 Dec 2025). Querying SGFBs enables direct analysis of graph statistics and their influence on model outputs.

Key design recommendations include open-vocabulary parsing to preserve rare or location-specific elements, employing moderate masking rates in graph encoding, and storing node-level as well as pooled scene embeddings for multi-scale interpretability. Hierarchical pooling, dynamic predicate weighting, and domain-adaptive frontend tuning are prospective strategies for further advancement (Liu et al., 22 Dec 2025).

6. Limitations, Open Problems, and Future Directions

Current SGFB frameworks confront challenges of feature view-dependence, inadequate multi-view fusion, and dependency on detector/relationship predictor quality. For example, CLIP feature selection and per-segment descriptors in open-vocabulary 3D scenes remain sensitive to viewpoint, with cross-view averaging yielding only modest gains. In the temporal case, oracle experiments indicate that better detection and relationship prediction could significantly raise upper bounds for action recognition (Ji et al., 2019).

A plausible implication is that progress in detector robustness, semantic relationship extraction, and efficient graph neural encoding will further enhance both the expressivity and efficiency of scene-graph feature banks. Emerging avenues include hierarchical subgraph pooling, attention-based predicate reweighting, and domain-aware fine-tuning. The field is also moving toward integrated, end-to-end differentiable pipelines in which SGFBs are trained not only for accuracy, but also for interpretability and sample efficiency across diverse, open-world settings.
