
Hierarchical Retrieval-Augmented Generation (KeySG)

Updated 19 January 2026
  • KeySG is a hierarchical multi-modal framework that integrates scene graph modeling, keyframe selection, and retrieval-augmented generation for efficient 3D reasoning.
  • It employs DBSCAN clustering for keyframe selection and leverages VLMs for multi-modal annotations, effectively encoding geometry and language cues.
  • The framework achieves state-of-the-art performance in semantic segmentation, object retrieval, and complex query answering on diverse 3D benchmarks.

Hierarchical Retrieval-Augmented Generation (KeySG) is a multi-modal retrieval-and-generation framework for large-scale, world-centric reasoning in 3D environments. KeySG combines a hierarchical scene graph representation, keyframe selection, multi-modal annotation with vision-language models (VLMs), and a top-down retrieval-augmented generation (RAG) pipeline that scales to environments whose serialized state would otherwise vastly exceed LLM context limits. It achieves state-of-the-art results on semantic segmentation, object retrieval, and complex query answering benchmarks by jointly leveraging 3D geometry, visual embeddings, and hierarchical multi-modal context management (Werby et al., 1 Oct 2025).

1. Hierarchical Keyframe Scene Graph Construction

KeySG models a 3D environment as a hierarchical graph G = (V, E), partitioning the nodes V into:

  • V_B: Building
  • V_F: Floors
  • V_R: Rooms
  • V_O: Objects
  • V_E: Functional Elements

Each node v \in V contains:

  • 3D point cloud P_v \subset \mathbb{R}^3,
  • Visual embedding e_v^{\text{clip}} \in \mathbb{R}^d (for objects),
  • Text summary s_v (for floors/rooms),
  • Child pointers encoding part-of relationships.

This hierarchy encodes: Building → Floor → Room → Object → Functional Element, supporting abstraction, spatial containment, and efficient context pruning.
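The five-level part-of hierarchy can be sketched as a recursive node type. The field and function names below are illustrative assumptions, not identifiers from the paper's released code:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class SceneNode:
    """One node v in the KeySG hierarchy (sketch; field names assumed)."""
    level: str                                   # "building" | "floor" | "room" | "object" | "functional_element"
    points: np.ndarray                           # 3D point cloud P_v, shape (N, 3)
    clip_embedding: Optional[np.ndarray] = None  # e_v^clip, objects only
    summary: str = ""                            # text summary s_v, floors/rooms only
    children: List["SceneNode"] = field(default_factory=list)  # part-of pointers

def descend(node: SceneNode, path: List[int]) -> SceneNode:
    """Follow child pointers Building -> Floor -> Room -> Object -> Element."""
    for i in path:
        node = node.children[i]
    return node
```

Containment queries and context pruning then reduce to walks over the child pointers, so a retrieval step never needs to touch subtrees outside the selected branch.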

2. Keyframe Selection and Annotation

Within each room \mathcal{R}_i, KeySG first defines the set of all possible camera poses \mathcal{D}_i = \{P_t\}, with translations t_t \in \text{Vol}(\mathcal{R}_i) and P_t \in SE(3). Each pose is embedded as f_t = (t_t; w \cdot q_t), where q_t \in \mathbb{R}^4 is a unit quaternion and w a rotation-translation scaling parameter.

Standardized pose features \tilde{v}_t = (f_t - \mu)/\sigma are clustered with DBSCAN. The medoid f_k^* of each cluster yields a compact keyframe set S_i that preserves >95% geometric coverage even though |S_i| \ll |\mathcal{D}_i|. This step is central to scalability, maximizing scene visibility with a minimal image set.
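A minimal sketch of this selection step, assuming scikit-learn's DBSCAN; the values of w, eps, and min_samples are illustrative assumptions, not the paper's hyperparameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_keyframes(translations, quaternions, w=0.5, eps=0.6, min_samples=3):
    """Sketch of KeySG keyframe selection (hyperparameters assumed).

    translations: (N, 3) camera positions t_t
    quaternions:  (N, 4) unit quaternions q_t
    Returns the index of one medoid keyframe per DBSCAN cluster.
    """
    # Pose feature f_t = (t_t ; w * q_t)
    feats = np.hstack([translations, w * quaternions])
    # Standardize: v~_t = (f_t - mu) / sigma
    v = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(v)

    keyframes = []
    for c in set(labels) - {-1}:  # -1 marks DBSCAN noise points
        idx = np.flatnonzero(labels == c)
        # Medoid: f_k* = argmin_{f_j in c_k} sum_{f_t in c_k} ||v~_j - v~_t||_2
        d = np.linalg.norm(v[idx, None] - v[None, idx], axis=-1).sum(1)
        keyframes.append(idx[int(np.argmin(d))])
    return keyframes
```

Because DBSCAN needs no preset cluster count, the number of keyframes adapts to how the camera poses actually spread through the room.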

Each keyframe is then multi-modally annotated using VLMs (e.g., GPT-4V):

  • Visible object and functional element names (projected from 3D to 2D),
  • Free-form, geometry-grounded scene description d_k,
  • Sets of object tags O_k and functional tags F_k,
  • Per-frame open-vocabulary 2D segmentation (MaskCLIP + SAM) guided by O_i \cup F_i,
  • Best-view selection and a CLIP embedding e_o^{\text{clip}} per object mask.

Merging all per-keyframe object and function tags yields a robust, room-level open-vocabulary object mask set, bypassing the need for explicit hard-coded inter-object relationships.
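The tag-merging step amounts to a union over keyframes; a minimal sketch, with the dictionary layout and case-normalization as assumptions:

```python
def merge_room_vocabulary(keyframe_annotations):
    """Union per-keyframe object/functional tags into one room-level
    open vocabulary. Dict keys and lower-casing are assumptions."""
    objects, functions = set(), set()
    for ann in keyframe_annotations:
        objects |= {t.lower() for t in ann["object_tags"]}       # O_k
        functions |= {t.lower() for t in ann["functional_tags"]}  # F_k
    return sorted(objects), sorted(functions)
```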

3. Hierarchical RAG Pipeline

KeySG's retrieval-augmented generation is organized as a sequence of top-down, cosine-similarity-based retrievals that minimize the LLM context footprint:

  1. Encoding: All text chunks (floor summaries S_F, room summaries S_R, keyframe descriptions \{d_k\}, object attribute strings) are embedded and indexed (e.g., via FAISS).
  2. Hierarchical Routing: Given a query Q:
    • Encode Q as e_q.
    • Select the floor f^* maximizing \cos(e_q, E_F[f]).
    • Select the room r^* maximizing \cos(e_q, E_R[r]) among rooms under f^*.
    • Retrieve the top K_f frame descriptions and K_o object attribute chunks in r^*.
  3. Context Assembly: Package the context as [S_F[f^*]; S_R[r^*]; frame descriptions; object descriptions] for model input.
  4. Generation: Pass to an LLM with a prompt template. The number of context elements per level is fixed such that the total context never approaches the LLM's token limit.

This approach ensures that, for any query, only O(1) summaries per coarse level and a bounded set of fine-grained annotations are exposed to the generation backbone.
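The routing steps above can be sketched as a single top-down function; the nested-dict scene layout, in-memory scoring (standing in for a FAISS index), and parameter defaults are assumptions:

```python
import numpy as np

def cos_sim(u, v):
    # sim(u, v) = (u . v) / (||u|| ||v||)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def hierarchical_retrieve(e_q, floors, K_f=2, K_o=3):
    """Sketch of KeySG's top-down routing.

    floors: list of {"emb": vec, "rooms": [{"emb": vec,
            "frames": [(vec, text)], "objects": [(vec, text)]}]}
    Returns the bounded (frame, object) context passed to the LLM.
    """
    # Floor f* = argmax_f sim(e_q, E_F[f])
    f = max(floors, key=lambda fl: cos_sim(e_q, fl["emb"]))
    # Room r* among rooms under f*
    r = max(f["rooms"], key=lambda rm: cos_sim(e_q, rm["emb"]))
    # Top-K chunks within r*, capped per level
    def top(items, k):
        ranked = sorted(items, key=lambda it: cos_sim(e_q, it[0]), reverse=True)
        return [text for _, text in ranked[:k]]
    return top(r["frames"], K_f), top(r["objects"], K_o)
```

Only the winning branch of the hierarchy is ever scored at the fine-grained level, which is what keeps the assembled context bounded regardless of scene size.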

4. Algorithmic and Retrieval Formulations

The core computations in KeySG include:

  • Medoid keyframe: f_k^* = \arg\min_{f_j \in c_k} \sum_{f_t \in c_k} \|\tilde{v}_j - \tilde{v}_t\|_2
  • Cosine similarity: \mathrm{sim}(u, v) = (u \cdot v) / (\|u\| \|v\|)
  • Hierarchical retrieval: f^* = \arg\max_f \mathrm{sim}(e_q, e_f); analogously for rooms and objects.
  • Context window policy: a fixed maximum number of chunks per hierarchy level keeps the assembled context well below the LLM's token limit.

In effect, retrieval cost grows sublinearly with scene size, and the context window is guaranteed never to overflow.
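The context window policy reduces to a per-level budget check; in the toy sketch below, all chunk counts and token sizes are illustrative assumptions, not values reported for KeySG:

```python
def context_within_budget(chunks_per_level, tokens_per_chunk, token_limit=8192):
    """Check that fixed per-level chunk quotas keep the assembled
    context under the LLM token limit (all numbers illustrative)."""
    total = sum(n * t for n, t in zip(chunks_per_level, tokens_per_chunk))
    return total, total <= token_limit

# One floor summary, one room summary, K_f=4 frames, K_o=8 objects:
total, ok = context_within_budget([1, 1, 4, 8], [200, 200, 150, 60])
```

Because the quotas are fixed per level, this bound holds no matter how many floors, rooms, or objects the full scene graph contains.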

5. Quantitative Benchmarks and Empirical Results

KeySG is evaluated on four canonical 3D scene understanding and retrieval tasks, outperforming prior SOTA on all core metrics:

Task                                  KeySG Result                        Prior Best
Open-vocab 3D semantic segmentation   mAcc 45.81%, F-mIoU 46.16%          ~40%
Functional element 3D segmentation    Recall@3 13.33%, Recall@5 13.64%    FunGraph/OpenFunGraph (lower)
Hierarchical object retrieval         top-1 Recall 30.4%                  ~23%
Complex 3D object grounding (Nr3D)    30.4%                               ~28%

Notably, gains are strongest for complex, ambiguous queries (e.g., those lacking explicit spatial/color cues), and the end-to-end RAG pipeline outperforms explicit LLM-parsing pipelines on retrieval-based tasks (Werby et al., 1 Oct 2025).

6. Architectural Advantages and Limitations

KeySG introduces several nontrivial strengths:

  • Task-Agnostic Reasoning: Relation types, affordances, and textual descriptions emerge on demand from VLM annotation, not from pre-specified relationship schemas.
  • Hard Context Budgeting: Hierarchical retrieval with a strict per-level quota ensures the LLM context never overflows, regardless of global environment size.
  • Robust Multi-modality: All levels synergize geometry (3D), appearance (CLIP, 2D masks), and language for comprehensive world modeling.

Principal limitations are:

  • Offline-Only Construction: VLM/LLM calls are computationally expensive; KeySG is thus a batch system rather than a real-time one.
  • Static World Hypothesis: The framework encodes static scenes; no dynamic-scene or state-change tracking is implemented.
  • Model Dependence: Annotation and retrieval quality is governed by the underlying open-vocabulary segmentation and VLM grounding models.

Potential extensions identified include incremental online scene graph updates (dynamic keyframe management), continual VLM-based annotation for non-static scenes, and retrieval embedding optimization for specific 3D reasoning/retrieval objectives.

7. Positioning within the Hierarchical RAG Ecosystem

KeySG's hierarchical RAG paradigm builds on and extends prior hierarchical retrieval-generation work, such as Wiki-LLaVA (hierarchical multi-modal retrieval over Wikipedia with CLIP-based entity/title step followed by chunk-level retrieval) (Caffagni et al., 2024), and recent hierarchical community-graph RAG (e.g., T-Retriever’s structure-semantics joint entropy partitions (Wei et al., 8 Jan 2026), ArchRAG's attributed community trees (Wang et al., 14 Feb 2025), and EEG-MedRAG's multi-level n-ary hypergraphs for medical decision QA (Wang et al., 19 Aug 2025)). KeySG's core innovation is the tight coupling of environmental structure and multi-modal, keyframe-centric annotation—enabling more general, model-agnostic, and scalable world representations specifically for 3D scene understanding and robotic reasoning.

In sum, Hierarchical Retrieval-Augmented Generation in KeySG provides the necessary semantic and computational abstraction for complex, large-scale, open-vocabulary reasoning over 3D spaces using LLMs, outperforming prior approaches both in accuracy and efficiency due to its principled, multi-modal, and context-efficient design (Werby et al., 1 Oct 2025).
