Hierarchical Retrieval-Augmented Generation (KeySG)
- KeySG is a hierarchical multi-modal framework that integrates scene graph modeling, keyframe selection, and retrieval-augmented generation for efficient 3D reasoning.
- It employs DBSCAN clustering for keyframe selection and leverages VLMs for multi-modal annotations, effectively encoding geometry and language cues.
- The framework achieves state-of-the-art performance in semantic segmentation, object retrieval, and complex query answering on diverse 3D benchmarks.
Hierarchical Retrieval-Augmented Generation (KeySG) is a multi-modal retrieval-generation framework designed for large-scale, world-centric reasoning in 3D environments. The KeySG approach combines a hierarchical scene graph representation, keyframe selection, multi-modal annotation using vision-language models (VLMs), and a top-down, retrieval-augmented generation (RAG) pipeline that scales to environments whose serialized state would otherwise vastly exceed LLM context limits. It achieves state-of-the-art results across semantic segmentation, object retrieval, and complex query answering benchmarks by jointly leveraging 3D geometry, visual embeddings, and hierarchical, multi-modal context management (Werby et al., 1 Oct 2025).
1. Hierarchical Keyframe Scene Graph Construction
KeySG models a 3D environment as a hierarchical graph $G = (V, E)$, partitioning nodes into five levels:
- $V_B$: Building
- $V_F$: Floors
- $V_R$: Rooms
- $V_O$: Objects
- $V_E$: Functional Elements
Each node $v$ contains:
- a 3D point cloud $P_v$,
- a visual embedding $f_v$ (for objects),
- a text summary $s_v$ (for floors/rooms),
- child pointers encoding part-of relationships.
This hierarchy encodes: Building → Floor → Room → Object → Functional Element, supporting abstraction, spatial containment, and efficient context pruning.
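The hierarchy above can be sketched as a simple recursive node type. This is a minimal illustration of the part-of structure, not the paper's actual data model; all field and class names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SceneNode:
    """One node of a KeySG-style hierarchy (illustrative sketch)."""
    level: str                 # "building" | "floor" | "room" | "object" | "functional_element"
    name: str
    point_cloud: list = field(default_factory=list)       # 3D points for this node
    visual_embedding: Optional[list] = None               # CLIP-style vector (objects)
    text_summary: Optional[str] = None                    # LLM summary (floors/rooms)
    children: list["SceneNode"] = field(default_factory=list)  # part-of relations

    def add_child(self, child: "SceneNode") -> "SceneNode":
        self.children.append(child)
        return child

# Build a tiny Building -> Floor -> Room -> Object chain.
building = SceneNode("building", "HQ")
floor = building.add_child(SceneNode("floor", "floor_1", text_summary="offices"))
room = floor.add_child(SceneNode("room", "kitchen", text_summary="shared kitchen"))
room.add_child(SceneNode("object", "fridge"))
```

Containment queries then reduce to walking `children`, which is what enables the top-down context pruning described below.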
2. Keyframe Selection and Annotation
Within each room, KeySG first collects the set of camera poses $\{p_i\}_{i=1}^{N}$ observing the room, where each pose $p_i = (t_i, q_i)$ comprises a translation $t_i \in \mathbb{R}^3$ and an orientation $q_i$. Each pose is embedded as a feature vector $x_i = [t_i; \lambda q_i]$, with $q_i$ a unit quaternion and $\lambda$ a rotation-translation scaling parameter.
Standardized pose features are clustered by DBSCAN. The medoid of each cluster yields a compact set of $K$ keyframes that preserves >95% geometric coverage despite $K \ll N$. This step is central to scalability and to maximizing scene visibility with a minimal image set.
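The clustering-and-medoid step can be sketched as follows with scikit-learn. The `eps` and `min_samples` values and the noise-handling policy are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_keyframes(pose_feats: np.ndarray, eps: float = 0.5, min_samples: int = 2) -> list:
    """Cluster standardized pose features; return one medoid index per cluster.

    pose_feats: (N, D) array, e.g. [translation; scaled quaternion] per camera pose.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pose_feats)
    keyframes = []
    for lbl in sorted(set(labels)):
        idx = np.flatnonzero(labels == lbl)
        if lbl == -1:                      # noise poses: keep each as its own keyframe
            keyframes.extend(int(i) for i in idx)
            continue
        # Medoid: the member minimizing summed distance to the rest of the cluster.
        dists = np.linalg.norm(pose_feats[idx][:, None] - pose_feats[idx][None, :], axis=-1)
        keyframes.append(int(idx[dists.sum(axis=1).argmin()]))
    return keyframes

# Two tight pose clusters -> two keyframes, one medoid each.
feats = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [5.0, 0], [5.1, 0], [5.2, 0]])
print(select_keyframes(feats))  # → [1, 4]
```

Taking the medoid (rather than the mean) guarantees the selected keyframe is an actual captured pose, so a real image exists for it.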
Each keyframe is then multi-modally annotated using VLMs (e.g., GPT-4V):
- Visible object and functional element names (projected from 3D to 2D),
- A free-form, geometry-grounded scene description $d_k$,
- A set of object tags $T_O$ and functional tags $T_F$,
- Per-frame open-vocabulary 2D segmentation (MaskCLIP + SAM) guided by $T_O \cup T_F$,
- Best-view selection and CLIP embedding per object mask.
Merging all per-keyframe object and function tags yields a globally robust, open-vocabulary object mask set for each room, bypassing the need for explicit, hard-coded inter-object relationships.
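The tag-merging step is essentially a normalized set union across keyframes. A minimal sketch, in which the dictionary layout and the lowercasing normalization are assumptions for illustration:

```python
def merge_room_tags(per_keyframe_tags: list) -> dict:
    """Union per-keyframe VLM tags into one room-level open-vocabulary tag set."""
    room = {"objects": set(), "functional": set()}
    for tags in per_keyframe_tags:
        room["objects"].update(t.strip().lower() for t in tags.get("objects", []))
        room["functional"].update(t.strip().lower() for t in tags.get("functional", []))
    return room

frames = [
    {"objects": ["Fridge", "sink"], "functional": ["fridge door"]},
    {"objects": ["sink", "cabinet"], "functional": ["faucet handle"]},
]
merged = merge_room_tags(frames)
# merged["objects"] == {"fridge", "sink", "cabinet"}
```

Because the union is over free-form VLM output rather than a fixed label set, the room-level vocabulary stays open: any tag any keyframe produced becomes queryable.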
3. Hierarchical RAG Pipeline
KeySG's retrieval-augmented generation is organized as a sequence of top-down, cosine-similarity-based retrievals that minimize the LLM context footprint:
- Encoding: All text chunks (floor summaries $s_F$, room summaries $s_R$, keyframe descriptions $d_k$, object attribute strings) are embedded and indexed (e.g., via FAISS).
- Hierarchical Routing: Given a query $q$:
- Encode $q$ as an embedding $e_q$.
- Select the floor $F^*$ maximizing $\cos(e_q, e_{s_F})$.
- Select the room $R^*$ maximizing $\cos(e_q, e_{s_R})$ among rooms under $F^*$.
- Retrieve the top-$k$ frame descriptions and object attribute chunks in $R^*$.
- Context Assembly: Package the context as [floor summary; room summary; frame descriptions; object descriptions] for model input.
- Generation: Pass to an LLM with a prompt template. The number of context elements per level is fixed such that the total context never approaches the LLM's token limit.
This approach ensures that, for any query, only a handful of summaries per coarse level and a bounded set of fine-grained annotations are exposed to the generation backbone.
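The top-down routing can be sketched end to end. The nested `floors` structure and the embedding format are illustrative assumptions; in practice the chunks would be served by an ANN index such as FAISS.

```python
import numpy as np

def cos_sim(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(query_emb, floors, top_k=2) -> list:
    """Top-down retrieval sketch: best floor, then best room within it,
    then the top-k chunks inside that room.
    floors: [{"emb": vec, "rooms": [{"emb": vec, "chunks": [(text, vec), ...]}]}]
    """
    floor = max(floors, key=lambda f: cos_sim(query_emb, f["emb"]))
    room = max(floor["rooms"], key=lambda r: cos_sim(query_emb, r["emb"]))
    ranked = sorted(room["chunks"], key=lambda c: cos_sim(query_emb, c[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

floors = [
    {"emb": [1, 0], "rooms": [
        {"emb": [1, 0], "chunks": [("fridge: white, double-door", [1.0, 0.1]),
                                   ("window: north wall", [0.2, 1.0])]},
    ]},
    {"emb": [0, 1], "rooms": []},
]
print(route([1.0, 0.2], floors, top_k=1))  # → ['fridge: white, double-door']
```

Because each stage conditions on the previous winner, the number of similarity comparisons grows with the branching factor per level, not with the total number of chunks in the environment.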
4. Algorithmic and Retrieval Formulations
The core computations in KeySG include:
- Medoid keyframe: $k^* = \arg\min_{i \in C} \sum_{j \in C} \lVert x_i - x_j \rVert$ for each cluster $C$.
- Cosine similarity: $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$.
- Hierarchical retrieval steps: $F^* = \arg\max_{F} \cos(e_q, e_{s_F})$; similar for rooms/objects.
- Context window policy: a fixed maximum number of chunks per hierarchy level keeps the assembled context below the LLM's token threshold.
Effectively, this enables sublinear growth of the retrieved context with scene size and guarantees no context window overflow.
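The context window policy can be made concrete with a small sketch. The per-level quotas, the level ordering, and the whitespace token counter are all illustrative assumptions:

```python
def assemble_context(chunks_by_level: dict, quotas: dict, token_budget: int,
                     count_tokens=lambda s: len(s.split())) -> list:
    """Apply a fixed per-level quota, then a hard token budget, so the
    assembled context can never overflow the model's window."""
    context, used = [], 0
    for level in ("floor", "room", "frame", "object"):   # coarse to fine
        for chunk in chunks_by_level.get(level, [])[: quotas.get(level, 0)]:
            cost = count_tokens(chunk)
            if used + cost > token_budget:
                return context        # hard stop: never exceed the budget
            context.append(chunk)
            used += cost
    return context

chunks = {"floor": ["f sum"], "room": ["r sum"],
          "frame": ["frame one desc", "frame two desc"],
          "object": ["obj a", "obj b"]}
quotas = {"floor": 1, "room": 1, "frame": 2, "object": 1}
ctx = assemble_context(chunks, quotas, token_budget=10)
```

Iterating coarse-to-fine means that when the budget bites, it is the fine-grained object chunks that get dropped first, preserving the high-level summaries the LLM needs for orientation.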
5. Quantitative Benchmarks and Empirical Results
KeySG is evaluated on four canonical 3D scene understanding and retrieval tasks, outperforming prior SOTA on all core metrics:
| Task | KeySG Result | Prior Best |
|---|---|---|
| Open-vocab 3D semantic segmentation | mAcc 45.81%, F-mIoU 46.16% | ~40% |
| Functional element 3D segmentation | Recall@3: 13.33%, @5: 13.64% | FunGraph/OpenFunGraph lower |
| Hier. object retrieval (top-1 Recall) | 30.4% | ~23% |
| Complex 3D object grounding (Nr3D) | 30.4% | ~28% |
Notably, gains are strongest for complex, ambiguous queries (e.g., those lacking explicit spatial/color cues), and the end-to-end RAG pipeline outperforms explicit LLM-parsing pipelines on retrieval-based tasks (Werby et al., 1 Oct 2025).
6. Architectural Advantages and Limitations
KeySG introduces several nontrivial strengths:
- Task-Agnostic Reasoning: Relation types, affordances, and textual descriptions emerge on demand from VLM annotation, not from pre-specified relationship schemas.
- Hard Context Budgeting: By hierarchical retrieval and strict per-level quota, LLM context never overflows, regardless of global environment size.
- Robust Multi-modality: All levels synergize geometry (3D), appearance (CLIP, 2D masks), and language for comprehensive world modeling.
Principal limitations are:
- Offline-Only Construction: VLM/LLM calls are computationally expensive; KeySG is thus a batch, not real-time, system.
- Static World Hypothesis: The framework encodes static scenes; no dynamic-scene or state-change tracking is implemented.
- Model-dependence: Quality of annotations and retrieval is governed by open-vocabulary segmentation and VLM grounding.
Potential extensions identified include incremental online scene graph updates (dynamic keyframe management), continual VLM-based annotation for non-static scenes, and retrieval embedding optimization for specific 3D reasoning/retrieval objectives.
7. Positioning within the Hierarchical RAG Ecosystem
KeySG's hierarchical RAG paradigm builds on and extends prior hierarchical retrieval-generation work, such as Wiki-LLaVA (hierarchical multi-modal retrieval over Wikipedia with CLIP-based entity/title step followed by chunk-level retrieval) (Caffagni et al., 2024), and recent hierarchical community-graph RAG (e.g., T-Retriever’s structure-semantics joint entropy partitions (Wei et al., 8 Jan 2026), ArchRAG's attributed community trees (Wang et al., 14 Feb 2025), and EEG-MedRAG's multi-level n-ary hypergraphs for medical decision QA (Wang et al., 19 Aug 2025)). KeySG's core innovation is the tight coupling of environmental structure and multi-modal, keyframe-centric annotation—enabling more general, model-agnostic, and scalable world representations specifically for 3D scene understanding and robotic reasoning.
In sum, Hierarchical Retrieval-Augmented Generation in KeySG provides the necessary semantic and computational abstraction for complex, large-scale, open-vocabulary reasoning over 3D spaces using LLMs, outperforming prior approaches both in accuracy and efficiency due to its principled, multi-modal, and context-efficient design (Werby et al., 1 Oct 2025).