HippoRAG 2: Enhanced Memory for LLMs
- The paper introduces HippoRAG 2, a framework that unifies dense and sparse retrieval to boost factual recall, associative reasoning, and sense-making in LLMs.
- It employs a dual-node knowledge graph with passage and phrase nodes, enhanced by Personalized PageRank and LLM-based triple filtering for deep contextual retrieval.
- Benchmark experiments demonstrate a 7-point F1 gain over embedding retrievers on associative tasks while significantly reducing LLM token usage.
HippoRAG 2 is a non-parametric continual learning framework for LLMs that augments retrieval-augmented generation (RAG) with explicit, context-rich memory mechanisms, specifically constructed knowledge graphs (KGs) and Personalized PageRank (PPR), achieving superior performance in factual recall, sense-making, and associative retrieval. Designed to mimic the dynamic, interconnected character of human long-term memory, HippoRAG 2 builds upon its predecessor HippoRAG by incorporating deeper passage integration, more contextualized query–triple linking, and an online LLM loop both for knowledge extraction/filtering and answer generation. The system addresses deficiencies in previous graph-augmented RAG architectures, which often compromised factual recall or sense-making for associativity, by unifying dense and sparse representations, adding passage nodes to the KG, and leveraging LLM-powered recognition filtering. In benchmark experiments, HippoRAG 2 lifts associative QA F1 by 7 points over state-of-the-art embedding retrievers, while also excelling in factual and discourse-oriented tasks (Gutiérrez et al., 20 Feb 2025).
1. Motivation and Design Principles
The core motivation for HippoRAG 2 is to endow LLMs with a non-parametric continual memory system capable of context-sensitive knowledge acquisition, recall, and integration: key features of human memory. Standard RAG workflows, reliant on nearest-neighbor vector retrieval, struggle with catastrophic forgetting in fine-tuning and lack the capacity for multi-hop associations ("associativity") and deep context interpretation ("sense-making"). Structure-augmented approaches with knowledge graphs partly address associativity but have yielded trade-offs, usually reducing performance on basic factual QA. HippoRAG 2 aims to unify memory recall mechanisms to optimize all three memory modalities (factual, associative, and sense-making) through innovations including:
- Dense–sparse integration: Passage nodes and phrase nodes co-exist in the KG.
- Deep contextualization: Queries directly link to full triples.
- Recognition memory: LLM-based filtering of relevant triples.
- Online LLM loop: LLM handles both knowledge graph maintenance and final-answer reading.
2. Memory Graph Construction
The HippoRAG 2 knowledge graph includes both phrase nodes (text spans from OpenIE triple extraction) and passage nodes (full passages or documents from the corpus). This dual-node structure enables dense–sparse integration. Edges fall into three categories:
- Relation edges: Connect the subject and object phrase nodes of each KG triple $(s, r, o)$, with an undirected weight $w_{\text{rel}}$.
- Synonym edges: Between phrase nodes $i$ and $j$ whose embeddings satisfy $\cos(\mathbf{e}_i, \mathbf{e}_j) \ge \tau$ for a similarity threshold $\tau$; the edge weight $w_{\text{syn}}$ is derived from that similarity.
- Context edges: Link every phrase node extracted from a passage to that passage node, with weight $w_{\text{ctx}}$.
The graph is represented as an adjacency matrix $A$ over the full node set $V$ (phrase and passage nodes together), normalized row-wise:

$$\tilde{A} = D^{-1} A, \qquad D_{ii} = \sum_{j} A_{ij},$$

where $D$ is the diagonal degree matrix. The KG is static offline; online, only the personalization vector for PPR changes following LLM-driven triple filtering.
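As an illustration, the dual-node graph and its row-normalized adjacency matrix can be sketched in a few lines of NumPy. The node layout, the example edges, and the convention of using the cosine similarity directly as the synonym-edge weight are toy assumptions for this sketch, not the paper's exact settings:

```python
import numpy as np

# Toy graph: 4 phrase nodes (indices 0-3) + 2 passage nodes (indices 4-5).
n_phrase, n_passage = 4, 2
n = n_phrase + n_passage
A = np.zeros((n, n))

def add_undirected(i, j, w):
    A[i, j] += w
    A[j, i] += w

# Relation edges: subject/object phrase nodes of each OpenIE triple.
add_undirected(0, 1, 1.0)
add_undirected(2, 3, 1.0)

# Synonym edge: a phrase pair whose embedding similarity clears the threshold
# (here the similarity itself is used as the weight -- an assumed convention).
add_undirected(1, 2, 0.95)

# Context edges: each phrase node links to the passage it was extracted from.
for phrase, passage in [(0, 4), (1, 4), (2, 5), (3, 5)]:
    add_undirected(phrase, passage, 1.0)

# Row-wise normalization: A_tilde = D^{-1} A with D_ii = sum_j A_ij.
row_sums = A.sum(axis=1, keepdims=True)
A_tilde = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

assert np.allclose(A_tilde.sum(axis=1), 1.0)  # each row is a distribution
```

Each row of `A_tilde` is then a transition distribution over neighbors, which is exactly the form PPR consumes in the next section.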
3. Personalized PageRank and Retrieval
HippoRAG 2 applies PPR over the normalized adjacency matrix $\tilde{A}$, producing a contextually ranked retrieval via the fixed-point iteration

$$\pi^{(t+1)} = (1-\alpha)\, p + \alpha\, \tilde{A}^{\top} \pi^{(t)},$$

where the damping factor $\alpha$ is fixed and $p$ is a personalization vector defined over the phrase and passage seed nodes selected from the query and triple scores:

$$p_v = \begin{cases} s_v, & v \text{ is a seed node} \\ 0, & \text{otherwise}, \end{cases}$$

with $s_v$ being the average retrieval score for triple-generating phrase nodes, or the weighted embedding similarity for passage nodes, normalized so that $\sum_v p_v = 1$. PPR is solved by power iteration. The top-ranked passage nodes select the contextual passages for the downstream LLM reader.
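The iteration above can be sketched as a short power-iteration routine. The default damping value, iteration cap, and tolerance here are illustrative assumptions, not the paper's reported hyperparameters:

```python
import numpy as np

def personalized_pagerank(A_tilde, p, alpha=0.5, iters=100, tol=1e-10):
    """Power iteration for PPR: pi = (1 - alpha) * p + alpha * A_tilde^T @ pi.

    A_tilde : row-normalized adjacency matrix
    p       : personalization vector over seed nodes (sums to 1)
    alpha   : damping factor (0.5 here is an assumed default)
    """
    pi = p.copy()
    for _ in range(iters):
        nxt = (1 - alpha) * p + alpha * A_tilde.T @ pi
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return pi
```

Because each row of `A_tilde` sums to one, the walk term preserves probability mass, so the returned vector remains a distribution; ranking passage-node entries of `pi` yields the retrieved context.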
4. Deep Passage Integration and Prompt Construction
"Deep" passage integration involves encoding both the passage $P$ and the query $Q$ into dense vectors $\mathbf{e}_P$ and $\mathbf{e}_Q$, concatenated to form the final context-aware prompt representation:

$$\mathbf{h} = [\,\mathbf{e}_P \,;\, \mathbf{e}_Q\,].$$

In transformer-based LLMs, this concatenation is consumed by the encoder layers, or the memory bank is injected through cross-attention at each generation step:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad K, V \text{ drawn from the memory bank}.$$
Practically, passages are prepended (delimited) in natural language before the query for answering.
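A minimal sketch of that practical prompt construction; the delimiter and template wording are hypothetical choices, not the paper's exact prompt:

```python
def build_prompt(passages, query, delimiter="\n---\n"):
    """Prepend retrieved passages, delimited, before the question --
    the natural-language realization of 'deep' passage integration."""
    context = delimiter.join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Keeping the integration purely textual means any off-the-shelf instruction-tuned LLM can serve as the reader, with no architectural changes.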
5. Online Retrieval and Generation Workflow
The online operational logic comprises four sequential steps:
a. Query–Triple Linking and Passage Ranking: Compute an embedding for the query and match it against KG triple texts and passage embeddings; retrieve the top-$k$ triples $T_q$ and candidate passages $P_q$.
b. Recognition Memory (Triple Filtering): Feed the query plus candidate triples $T_q$ into a secondary LLM (e.g., Llama-3.3-70B-Instruct) with a prompt designed to filter out triples irrelevant to the query (see paper Appendix A). The resulting filtered set is $T_f$.
c. Seed Node Selection and PPR: Extract up to five phrase nodes from $T_f$, scored by their triple scores, plus all passage nodes with scaled embedding similarity; construct the personalization vector $p$ for PPR, returning the top-ranked passages.
d. Final Generation: Concatenate the retrieved passages as context and prompt the LLM for the answer output.
Optional post-processing allows addition of new high-confidence facts back into the KG via OpenIE and synonym detection, supporting continual learning.
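Putting steps (a)–(d) together, a hypothetical end-to-end sketch might look as follows; `embed`, the `kg` object, and the `llm` interface are illustrative stand-ins, not interfaces specified by the paper:

```python
def answer_query(query, kg, embed, llm, k_triples=20, k_passages=5):
    # (a) Query-triple linking and passage ranking via embedding similarity.
    q = embed(query)
    triples = kg.top_triples(q, k=k_triples)
    # (b) Recognition memory: a secondary LLM drops irrelevant triples.
    kept = llm.filter_triples(query, triples)
    # (c) Seed selection + PPR: phrase nodes from the kept triples plus
    #     similarity-weighted passage nodes form the personalization vector.
    seeds = kg.seed_vector(kept, q)
    passages = kg.ppr_top_passages(seeds, k=k_passages)
    # (d) Final generation: retrieved passages are prepended as context.
    context = "\n---\n".join(passages)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```

The key design point visible here is that the LLM appears twice online (filtering and reading) while the graph itself stays fixed; only the seed vector changes per query.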
6. Experimental Protocol and Evaluation
HippoRAG 2 was empirically validated on three major task types:
| Task Type | Benchmarks | Metrics |
|---|---|---|
| Factual Recall | NaturalQuestions (NQ), PopQA | Recall@5, EM, F1 |
| Associativity | MuSiQue, 2Wiki, HotpotQA, LV-Eval | Recall@5, EM, F1 |
| Sense-making | NarrativeQA | EM, F1 |
Passage Recall@5 measures the percentage of queries for which a supporting passage is retrieved in the top 5. Exact Match (EM) and F1 reflect generation accuracy. Key result: on associative benchmarks, HippoRAG 2 achieves a mean +7 F1 gain over NV-Embed-v2, the embedding-retriever baseline. Factual and sense-making tasks also show modest improvements.
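For concreteness, the two metric families can be sketched as follows; this is a standard token-level F1 as commonly used in open-domain QA, and the paper's exact answer-normalization details are not reproduced here:

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of queries whose supporting passage appears in the top k.

    retrieved : list of ranked passage lists, one per query
    gold      : list of gold supporting-passage lists, one per query
    """
    hits = sum(any(g in r[:k] for g in golds)
               for r, golds in zip(retrieved, gold))
    return hits / len(gold)

def token_f1(pred, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)
```

EM is simply the stricter special case: 1 when the normalized prediction equals the gold answer exactly, else 0.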
7. Comparative Evaluation and Limitations
Compared to state-of-the-art pure-embedding RAG (NV-Embed-v2 + LLM), HippoRAG 2 lifts multi-hop QA F1 (e.g., MuSiQue: 44.8 → 51.9) and Recall@5 (e.g., MuSiQue: 69.7% → 74.7%; 2Wiki: 76.5% → 90.4%). Structure-based approaches (RAPTOR, GraphRAG, LightRAG, HippoRAG) may improve associativity or sense-making, but generally reduce performance on simple QA by 5–10 F1, a trade-off HippoRAG 2 avoids. It also requires significantly fewer LLM tokens for indexing (e.g., 9M versus 115M for MuSiQue). Nevertheless, the LLM triple filter exhibits a 7% miss rate, and sparse seed sets can limit PPR effectiveness.
8. Future Directions
Future work concentrates on:
- Integrating episodic memory for extended dialogue contexts.
- Automatic consolidation/pruning of memory over large document collections.
- Dynamic graph adaptation reflecting ongoing conversation context.
These directions aim to further approximate human-like conversational memory and scalability in continual learning (Gutiérrez et al., 20 Feb 2025).