LLM-guided Hierarchical Retrieval

Published 15 Oct 2025 in cs.IR and cs.LG | (2510.13217v1)

Abstract: Modern IR systems are increasingly tasked with answering complex, multi-faceted queries that require deep reasoning rather than simple keyword or semantic matching. While LLM-based IR has shown great promise, the prevailing retrieve-then-rerank paradigm inherits the limitations of embedding-based retrieval; parametric generative approaches are difficult to update with new information; and long-context methods that place the entire corpus in context are computationally infeasible for large document collections. To address these challenges, we introduce LATTICE, a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity by imposing a semantic tree structure on the corpus. Our approach consists of two stages: (1) an offline phase that organizes the corpus into a semantic hierarchy via either a bottom-up agglomerative strategy or a top-down divisive strategy using multi-level summaries and (2) an online traversal phase where a search LLM navigates this tree. A central challenge in such LLM-guided search is that the model's relevance judgments are noisy, context-dependent, and unaware of the hierarchy, making cross-branch and cross-level comparisons difficult. To overcome this, we propose a traversal algorithm that estimates calibrated latent relevance scores from local LLM outputs and aggregates them into a global path relevance metric. Our training-free framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark, demonstrating up to 9% improvement in Recall@100 and 5% in nDCG@10 over the next best zero-shot baseline. Furthermore, compared to the fine-tuned SOTA method DIVER-v2, LATTICE attains comparable results on BRIGHT subsets that use a static corpus for evaluation.

Abstract PDF Upgrade to Chat

Summary

The paper introduces LATTICE, a zero-shot hierarchical retrieval framework where LLMs steer semantic tree traversal for efficient, reasoning-driven search.
It employs a two-stage process with offline semantic tree construction using both bottom-up and top-down clustering, and online calibrated best-first LLM-guided traversal.
Experimental results show improved nDCG@10 and Recall@100 metrics over standard methods with logarithmic search complexity, though challenges remain for dynamic corpora.

LLM-guided Hierarchical Retrieval: An Expert Review

Introduction

The work "LLM-guided Hierarchical Retrieval" (2510.13217) introduces LATTICE, a hierarchical retrieval framework where LLMs act as central traversal agents to navigate semantically structured document corpora. LATTICE directly leverages the reasoning capabilities of LLMs to replace standard embedding-based or pipeline IR paradigms, addressing inherent limitations of retrieve-then-rerank and generative approaches. The core contribution is a training-free, zero-shot algorithm that organizes unstructured text corpora into hierarchical semantic trees, enabling logarithmic search complexity with calibrated LLM relevance signal propagation, and resulting in robust, efficient retrieval for reasoning-intensive queries.

Framework Overview

LATTICE's architecture is a two-stage process:

Offline: Corpus Structuring via Semantic Trees The document corpus is organized as a semantic hierarchy. Two LLM-driven strategies are explored: bottom-up agglomerative clustering using embeddings and summaries, and top-down divisive clustering with multi-level LLM-generated summaries.
Online: LLM-guided Tree Traversal At query time, a search LLM executes a best-first traversal of the tree, making decisions at each node by evaluating local candidate clusters with globally calibrated path relevance scores. Score calibration compensates for the inherent noise and context dependency in LLM outputs, ensuring scores are comparable across disparate branches and depths.
Figure 1: LATTICE consists of offline construction of a semantic tree and online best-first LLM-guided traversal with calibrated path relevance.

This approach leads to a traversal cost logarithmic in corpus size, a significant improvement over naive approaches for large-scale corpora, while maintaining high relevance precision due to explicit LLM-guided reasoning at each step.

Methodology

Semantic Tree Construction

Bottom-Up Agglomerative: Recursively clusters documents using vector embeddings (Gecko [lee2024gecko]) and generates LLM-based summaries for each internal node. This exploits local document relations and is particularly effective when datasets have underlying structural semantics (e.g., StackExchange).
Top-Down Divisive: Begins with the entire corpus and recursively partitions using LLMs, informed by hierarchically-increasing summary complexity, until clusters satisfy a maximum branching factor. This enables conceptual clustering when no explicit structure exists.

Tree Traversal and Calibration

The search LLM is prompted with slates of candidate nodes at each expansion step, outputting listwise relevance scores, augmented with cross-branch and prior high-scoring leaves for global calibration.
Latent relevance estimation leverages a linear transformation (scaling and offset via MLE), minimizing error across all observed scores, and enabling smoothing via an exponentially weighted moving average (EWMA) controlled by a tunable momentum hyperparameter ( $\alpha$ ).
A small beam size and moderate search iterations maximize sequential reasoning while maintaining compute efficiency.
Figure 2: LATTICE's traversal visualized on a real BRIGHT benchmark query, highlighting LLM step-by-step reasoning traces and global path relevance updates.

Experimental Results

Evaluation on the BRIGHT benchmark [Su2024-xk], encompassing 12 reasoning-intensive IR tasks across multiple domains and up to 420k documents per corpus, confirms the efficacy of LATTICE:

Zero-Shot Ranking (nDCG@10):

LATTICE achieves average nDCG@10 up to 51.6 on StackExchange subsets, surpassing a strong Gemini-2.5-flash reranking baseline (47.4) and approaching fine-tuned SOTA system DIVER-v2 (52.2). For several domains (Economics, Robotics), it obtains the highest scores among all methods evaluated.

Recall@100:

LATTICE records an average of 74.8, exceeding BM25 (+9.5 points) and advanced embedding-based retrievers such as ReasonIR-8B (+4 points).

Computational Efficiency:

Cost-performance analysis shows that, as search budget increases (in terms of LLM input tokens), LATTICE scales more favorably than rerankers, which plateau early, while LATTICE's scores continue to improve linearly with traversal depth.

Figure 3: Recall@100 (left) demonstrates consistent LATTICE gains across BRIGHT's StackExchange domains over BM25 and ReasonIR-8B; nDCG@10 vs computational cost (right) shows superior cost-effectiveness for LATTICE compared to reranking.

LATTICE's performance degrades on tasks with dynamic, query-dependent corpora (e.g., certain coding/theorem domains), as pre-computed hierarchy summaries may become invalid when the underlying document index changes; hence, the static nature of the semantic tree is a current limitation when faced with runtime corpus mutability.

Ablation and Analysis

Score Calibration: Removing latent score calibration yields a consistent nDCG@10 decrease ( $\sim$ 2 points), confirming the necessity of slate-independent global comparison for reliable traversal.
Path Relevance Smoothing ( $\alpha$ ): Disabling path smoothing leads to largest drops in retrieval quality among all ablations.
Reasoning Budget: Restricting the LLM’s stepwise reasoning (forcing it to output only scores) results in non-trivial nDCG@10 drops, demonstrating the importance of explicit LLM-generated rationales for high-level retrieval.
Construction Strategy Impact: Optimal construction strategy depends on corpus characteristics—bottom-up is preferable for part-whole document hierarchies, while top-down divisive works better for conceptually uniform, unstructured corpora.

Practical and Theoretical Implications

The LATTICE framework reconfigures the role of LLMs in IR by deeply integrating them as the search agent, not merely as rerankers or preprocessors. This design demonstrates that logarithmic retrieval complexity with expert-level, fine-grained reasoning is achievable in zero-shot settings, with zero task-specific fine-tuning.

Practically, LATTICE can be deployed for reasoning-intensive benchmarks and other knowledge-heavy retrieval settings where deep, abstract query-document relationships must be discovered in large corpora, without incurring the cost of task-specific retriever training. The global calibration and path-relevance mechanisms generalize beyond tree traversals, suggesting utility in any structured or semi-structured corpus navigation problem.

On the theoretical front, LATTICE challenges the permanence of the retrieve-then-rerank performance ceiling by shifting the initial search decision mechanism to an LLM agent operating over semantically compositional tree scaffolding, effectively expanding the effective search horizon while introducing noise robustness through latent calibration.

Future Directions

Key open research avenues include:

Dynamic Tree Maintenance: Efficient methods for updating summaries and calibrations in response to runtime corpus updates, adapting the hierarchical structure to evolving document sets.
Probabilistic Score Modeling: More advanced models beyond linear MLE calibration, potentially Bayesian frameworks or structured latent variable approaches, may further improve traversal precision.
Reinforcement Learning Traversal: Jointly learning a traversal policy via RL with path-level rewards tailored to downstream reasoning or question-answering objectives.
Scalable Construction: Hybrid construction algorithms that use traditional unsupervised clustering at scale for lower layers, augmented with LLM-based summarization only at higher, abstract layers for improved computational tractability.

Conclusion

"LLM-guided Hierarchical Retrieval" establishes a rigorous, LLM-native alternative to established IR paradigms, demonstrating competitive or superior zero-shot retrieval in reasoning-intensive benchmarks by directly integrating LLMs as the search controller over semantic trees. Core innovations in latent calibration, path relevance propagation, and LLM-guided tree construction underpin this result. While the static tree structure presents adaptation challenges for dynamic corpora, the framework provides a tractable path toward robust, semantic, and agentic retrieval systems, setting the foundation for further research at the intersection of LLMs, hierarchical structures, and dynamic corpus reasoning.

Markdown Report Issue