LoCoMo: Long-Term Conversational Benchmark
- LoCoMo is a benchmark dataset designed to evaluate long-term conversational memory using rich, context-driven dialogues anchored by persona and event graphs.
- It integrates multimodal inputs, including text and images, with explicit memory operations that ensure factual recall, temporal reasoning, and causal comprehension.
- The dataset employs rigorous human verification and diverse evaluation metrics across QA, event summarization, and visual tasks to benchmark both large and on-device models.
LoCoMo is a large-scale, multimodal benchmark specifically designed to evaluate the long-term conversational memory of LLMs and memory-augmented agents. Developed by Maharana et al., it enables systematic measurement of factual recall, temporal and causal reasoning, and multimodal understanding over conversations that extend across weeks or months. The dataset comprises richly annotated, LLM-generated dialogues grounded in agent-specific persona and event graphs, with stringent human verification and multi-task evaluation protocols suited for rigorous agent-memory assessment (Maharana et al., 2024, Latimer et al., 14 Dec 2025, Bini et al., 4 Dec 2025).
1. Dataset Generation and Composition
LoCoMo’s construction uses a machine–human hybrid pipeline to create long-running, coherent conversational threads. Each conversation involves two LLM-based agents, both initialized with the same backbone model (e.g., GPT-3.5-turbo) but distinguished by two critical inputs: a persona (a natural-language profile covering identity, habits, and relationships) and a temporal event graph comprising up to 25 temporally and causally linked life events.
Agents are equipped with a generative-agent architecture featuring reflect-and-respond memory (divided into long-term observation memory and session-level summaries) and multimodal capabilities (image sharing and reaction). Dialogues are organized into multiple sessions: on average, each conversation consists of 19.3 sessions, and each session advances through event-driven prompts reflecting the agents’ event graphs.
Human annotators systematically verify and edit generated content for consistency, correcting approximately 15% of utterances for long-range coherence and substituting roughly 19% of images for semantic alignment. Dialogue content is explicitly aligned to each agent’s event graph, with undiscussed events pruned and timelines adjusted as necessary (Maharana et al., 2024).
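The persona-plus-event-graph grounding above can be sketched in code. The following is a minimal sketch in which the `Event`/`Agent` structures and their field names are illustrative assumptions, not LoCoMo's actual schema; it shows how a session's temporal window selects the event-graph nodes that drive that session's prompts.

```python
from dataclasses import dataclass, field

# Hypothetical structures mirroring LoCoMo's persona + event-graph grounding.
@dataclass
class Event:
    event_id: str
    date: str               # temporal anchor for the session window (ISO date)
    description: str
    caused_by: list = field(default_factory=list)  # causal parent event_ids

@dataclass
class Agent:
    name: str
    persona: str            # natural-language profile
    events: dict = field(default_factory=dict)     # event_id -> Event

def session_events(agent: Agent, start: str, end: str) -> list:
    """Events falling in a session's temporal window (ISO dates compare lexically)."""
    return [e for e in agent.events.values() if start <= e.date <= end]

alice = Agent("Alice", "a nurse who recently took up marathon running")
alice.events["e1"] = Event("e1", "2023-05-01", "signed up for a marathon")
alice.events["e2"] = Event("e2", "2023-05-20", "injured knee while training",
                           caused_by=["e1"])
print([e.event_id for e in session_events(alice, "2023-05-15", "2023-05-31")])  # ['e2']
```

The causal edge (`caused_by`) is what makes causal-comprehension questions answerable from the graph rather than from surface text alone.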
| Property | Value/Range | Notes |
|---|---|---|
| #Conversations | 50 (standard) / 10 (MemLoRA) | Pairwise, LLM-generated, human-verified |
| Avg. Sessions per Conversation | 19.3 (min ≈5, max 35) | Expanded LoCoMo reaches up to 35 sessions |
| Avg. Turns per Conversation | 304.9 (standard) / ~600 (MemLoRA) | Statistics differ by benchmark variant |
| Avg. Tokens per Conversation | 9,209.2 (standard) / ~16,000 (MemLoRA) | Standard split: 50 conversations at ~9K tokens; MemLoRA split: 10 at ~16K |
| Modalities | Text + Images | Embedded image snippets and dialogue text |
2. Annotation, Grounding, and Memory Operations
LoCoMo is distinguished by grounding each agent’s discourse in both a persona description and an explicit event graph, where nodes represent agent-specific events and directed edges encode causal relationships. Session-level grounding ensures that utterances remain consistent with event timelines.
Annotation protocols extend beyond standard transcription—human editors resolve factual, referential, and temporal ambiguities, correct contradictions, and ensure alignment between the dialogue and structured event graphs. For memory-augmented modeling and on-device systems (e.g., MemLoRA), LoCoMo includes explicit annotations for memory-centric operations at each turn:
- Knowledge Extraction: each turn is mapped to a JSON list of extracted facts;
- Memory Update: an action (ADD/UPDATE/DELETE/NONE) applied to memory, formalized in JSON;
- Memory Retrieval: at QA points, the relevant subset of stored facts is retrieved for answering.
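The update operation is the core of this annotation scheme. Below is a minimal sketch of applying ADD/UPDATE/DELETE/NONE actions to a fact store; the JSON field names (`action`, `fact_id`, `fact`) are assumptions for illustration, not the dataset's actual schema.

```python
import json

def apply_memory_op(memory: dict, op: dict) -> dict:
    """Apply one annotated memory operation to a fact store."""
    action = op["action"]
    if action in ("ADD", "UPDATE"):
        memory[op["fact_id"]] = op["fact"]   # insert or overwrite the fact
    elif action == "DELETE":
        memory.pop(op["fact_id"], None)
    # "NONE" leaves memory untouched
    return memory

memory: dict = {}
for raw in [
    '{"action": "ADD", "fact_id": "f1", "fact": "Alice trains for a marathon"}',
    '{"action": "UPDATE", "fact_id": "f1", "fact": "Alice paused training"}',
    '{"action": "NONE"}',
]:
    memory = apply_memory_op(memory, json.loads(raw))
print(memory)  # {'f1': 'Alice paused training'}
```

Treating UPDATE as an overwrite keyed by fact identity is what lets temporal questions distinguish a stale fact from its current value.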
Entity resolution covers PERSON, LOCATION, DATE/TIME, PREFERENCE, and EVENT types, although LoCoMo does not expose a formal schema for entity labels (Latimer et al., 14 Dec 2025, Bini et al., 4 Dec 2025).
3. Benchmark Tasks and Evaluation Metrics
The benchmark comprises multiple task types, each structured to probe different memory and reasoning faculties:
- Long-Context Question Answering (QA): Models must answer queries referencing facts spanning long conversational histories. Question types include:
- Single-hop (fact-local per session)
- Multi-hop (requires synthesis across sessions)
- Temporal (date/inter-event reasoning)
- Open-domain (preferences and commonsense)
- Adversarial (unanswerable or trick)
Metrics vary by protocol:
- Partial-match F1 for extractive QA (Maharana et al., 2024)
- LLM-as-a-judge accuracy: binary (CORRECT/WRONG), used in the Hindsight and MemLoRA studies (Latimer et al., 14 Dec 2025, Bini et al., 4 Dec 2025)
- Recall@k for RAG-system context retrieval
- Event Summarization: Summarization of life events within specified intervals, compared to event graph ground truth. Metrics include ROUGE (n-gram overlap) and FactScore (atomic fact precision/recall).
- Multi-Modal Dialogue Generation: Given dialogue history, generate the next multimodal turn (text + optional image reaction), evaluated by BLEU, ROUGE-L, and MM-Relevance.
- Visual Extension (LoCoMo-VQA): For each image in a subset of dialogues (~100 images in the MemLoRA extension), three auto-generated visual questions (object counting, color identification, unusual-object detection) are posed. The metric is exact-match accuracy on single-word answers.
- Memory Operations for Small/On-Device Models: In the MemLoRA variant, specialized metrics are defined for extraction/update (composite surface+semantic score) and memory-augmented generation (Bini et al., 4 Dec 2025).
| Task | Input | Output | Metrics |
|---|---|---|---|
| QA (text) | q + retrieved text | natural-language answer | F1, accuracy |
| Event Summarization | text (interval) | event summary | ROUGE, FactScore |
| Multi-modal Gen. | text history (+image) | text (+image reaction) | BLEU, MM-R |
| Visual QA (LoCoMo-VQA) | image + question | one-word answer, reason | accuracy |
| Memory Extraction/Update | turn, memory | facts, memory ops (JSON) | composite surface+semantic score |
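The lexical metrics in the table are straightforward to reproduce. The sketch below assumes simple whitespace tokenization and lowercase normalization (the papers' exact normalization may differ) and implements partial-match F1, atomic-fact F1 in the spirit of FactScore (real FactScore verifies each fact with an LLM; exact set overlap stands in here), and exact-match scoring for one-word VQA answers.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style partial-match F1 over lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def fact_f1(pred_facts: set, gold_facts: set) -> float:
    """Atomic-fact F1: precision/recall of predicted facts vs. ground truth."""
    tp = len(pred_facts & gold_facts)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred_facts), tp / len(gold_facts)
    return 2 * p * r / (p + r)

def exact_match(pred: str, gold: str) -> bool:
    """Exact match on normalized single-word VQA answers."""
    return pred.strip().lower() == gold.strip().lower()

print(round(token_f1("in may 2023", "may 2023"), 2))  # 0.8
print(round(fact_f1({"knee injury", "ran marathon"},
                    {"knee injury", "ran marathon", "signed up"}), 2))  # 0.8
print(exact_match("Three ", "three"))  # True
```

Partial-match F1 rewards overlapping content words, which is why verbose but correct answers still score below 1.0 against terse gold answers.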
4. Evaluation Results and Agent Architectures
LoCoMo has become a standard for evaluating advanced LLM memory systems, including long-context models, retrieval-augmented generation (RAG), Hindsight's structured memory, and on-device small/vision LLM hybrids.
Key evaluation findings from (Maharana et al., 2024, Latimer et al., 14 Dec 2025, Bini et al., 4 Dec 2025):
- QA Baselines: Open-access LLMs (Mistral-7B, Llama-2-Chat-70B, GPT-3.5, GPT-4) achieve F1 scores ranging from 13.9 (Mistral-7B) to 32.1 (GPT-4), while the human ceiling is 87.9. Long-context and RAG approaches improve scores by 22–66% but still underperform humans by ~56%.
- Hindsight: Structured, entity-and-temporal aware memory yields accuracy up to 89.61% (Gemini-3 Pro+TEMPR; OSS-120B achieves 85.67%), consistently surpassing open-source baselines (52.9–75.8%).
- MemLoRA: Equips small models with memory and vision adapters, achieving performance on text-only tasks comparable to 60× larger models and VQA accuracy of 81.3 (vs. 23.7 for caption-based methods).
- Multi-hop, temporal, and adversarial questions remain difficult: even long-context models achieve near-chance performance (≈2% F1) on adversarial QA, and event-summarization FactScore F1 remains capped near 46% for strong LLMs.
5. Design Features and Extensions
LoCoMo’s construction embodies several distinctive characteristics:
- Sessional Structure: Conversations are split into sessions, each anchored in discrete temporal windows that map to event graph nodes.
- Memory Modeling: Downstream systems leverage LoCoMo for supervised extraction, update, and retrieval of factual state, enabling explicit measurement of agent memory retention and reasoning.
- Modality Integration: Both text and image snippets are available, supporting strict multimodal reasoning benchmarks. The LoCoMo-VQA extension provides direct evaluation of visual reasoning in conversational context, an essential feature for vision-language memory systems.
- Privacy and On-Device Suitability: The LoCoMo pipeline and MemLoRA extension support development of compact, local agents operating under constrained computation and memory.
A plausible implication is that LoCoMo supports fine-grained ablation of memory system design—enabling direct measurement of the impact of retrieval method, memory structuring, and backbone scale under realistic, multi-turn conversational loads.
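One such ablation dimension is the retrieval method, scored with the Recall@k metric used for RAG context retrieval. A minimal sketch follows; the turn identifiers (`s3-t12`, etc.) are hypothetical, and measuring recall over gold evidence turns is a common convention rather than a protocol LoCoMo fixes.

```python
def recall_at_k(retrieved: list, gold: set, k: int) -> float:
    """Fraction of gold evidence turns present in the top-k retrieved turns."""
    top_k = set(retrieved[:k])
    return len(top_k & gold) / len(gold)

# Comparing two hypothetical retrievers against the same gold evidence:
gold = {"s3-t12", "s7-t4"}
print(recall_at_k(["s3-t12", "s1-t2", "s7-t4"], gold, k=2))  # 0.5
print(recall_at_k(["s7-t4", "s3-t12", "s1-t2"], gold, k=2))  # 1.0
```

Holding the backbone and memory structure fixed while varying only the retriever isolates retrieval quality in exactly this way.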
6. Data Splits, Releases, and Research Impact
The 50-conversation standard allows broad benchmarking—these dialogues are analyzed as a closed test suite in Hindsight and MemLoRA studies with no turn-level interleaving. The MemLoRA variant adapts LoCoMo to 10 multi-session dialogues (train/val/test: 7/1/2 by conversation), enabling reproducible on-device experimentation (Bini et al., 4 Dec 2025).
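A conversation-level 7/1/2 split of this kind can be reproduced as follows. This is a minimal sketch assuming conversations carry string IDs and are assigned by deterministic sort; the actual MemLoRA assignment of specific conversations to splits is not specified here.

```python
def split_conversations(conv_ids, n_train=7, n_val=1, n_test=2):
    """Conversation-level split matching the 7/1/2 MemLoRA protocol.
    Sorting makes the assignment deterministic and reproducible."""
    ids = sorted(conv_ids)
    assert len(ids) == n_train + n_val + n_test, "expected 10 conversations"
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train, val, test = split_conversations([f"conv{i:02d}" for i in range(10)])
print(len(train), len(val), len(test))  # 7 1 2
```

Splitting by conversation rather than by turn prevents memory leakage between splits, since every turn in a dialogue shares state with its neighbors.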
LoCoMo is integral to long-horizon conversational research and memory-augmented architectures, serving as a principal benchmark for open-domain and persistent agent evaluations. The dataset advances state of the art by revealing persistent shortcomings in temporal, causal, and multi-modal memory and reasoning for both commercial and open-source systems.