
LoCoMo: Long-Term Conversational Benchmark

Updated 16 December 2025
  • LoCoMo is a benchmark dataset designed to evaluate long-term conversational memory using rich, context-driven dialogues anchored by persona and event graphs.
  • It integrates multimodal inputs, including text and images, with explicit memory operations that ensure factual recall, temporal reasoning, and causal comprehension.
  • The dataset employs rigorous human verification and diverse evaluation metrics across QA, event summarization, and visual tasks to benchmark both large and on-device models.

LoCoMo is a large-scale, multimodal benchmark specifically designed to evaluate the long-term conversational memory of LLMs and memory-augmented agents. Developed by Maharana et al., it facilitates systematic measurement of factual recall, temporal and causal reasoning, and multi-modal understanding over conversations that extend across weeks or months. The dataset comprises richly annotated, LLM-generated dialogues grounded in agent-specific persona and event graphs, with stringent human verification and multi-task evaluation protocols suited for rigorous agent memory assessment (Maharana et al., 2024, Latimer et al., 14 Dec 2025, Bini et al., 4 Dec 2025).

1. Dataset Generation and Composition

LoCoMo’s construction utilizes a machine–human hybrid pipeline to create long-running, coherent conversational threads. Each conversation involves two LLM-based agents ($\mathcal{L}_1$ and $\mathcal{L}_2$), both initialized with the same backbone model (e.g., GPT-3.5-turbo) but distinguished by two critical inputs: a persona (a natural-language profile $p$ covering identity, habits, and relationships) and a temporal event graph $\mathcal{G}$ comprising up to 25 temporally and causally linked life events.

Agents are equipped with a generative-agent architecture featuring reflect-respond memory (divided into long-term observation memory $\mathcal{J}_l$ and session-level summaries $\mathcal{H}_s$) and multimodal capabilities (image sharing and reaction). Dialogues are organized into multiple sessions (19.3 per conversation on average), each advanced by prompts drawn from the agents’ event graphs.
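The reflect-respond loop can be illustrated with a minimal sketch. The class and method names below are illustrative assumptions, not from the released pipeline, which uses an LLM to distill observations and summaries:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Reflect-respond memory: long-term observations (J_l) plus
    per-session summaries (H_s). A hypothetical sketch."""
    observations: list = field(default_factory=list)       # J_l
    session_summaries: list = field(default_factory=list)  # H_s

    def reflect(self, session_turns):
        # The real pipeline distills observations with an LLM;
        # here we record raw turns and a naive truncated summary.
        self.observations.extend(session_turns)
        self.session_summaries.append(" | ".join(session_turns)[:200])

    def context(self):
        # Condition the next session on accumulated summaries.
        return "\n".join(self.session_summaries)

memory = AgentMemory()
memory.reflect(["I adopted a dog named Max.", "Training starts next week."])
memory.reflect(["Max finished obedience class."])
print(len(memory.observations))       # 3 long-term observations
print(len(memory.session_summaries))  # 2 session summaries
```

Each session’s turns are folded into long-term memory, and the summaries provide the compressed context that lets a later session stay consistent with earlier ones.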

Human annotators systematically verify and edit generated content for consistency, correcting approximately 15% of utterances for long-range coherence and substituting roughly 19% of images for semantic alignment. Dialogue content is explicitly aligned to each agent’s event graph, with undiscussed events pruned and timelines adjusted as necessary (Maharana et al., 2024).

| Property | Value/Range | Notes |
|---|---|---|
| #Conversations | 50 (standard) / 10 (MemLoRA) | Pairwise, LLM-generated, human-verified |
| Avg. sessions per conversation | 19.3 (min ≈ 5, max 35) | LoCoMo expanded up to 35 sessions |
| Avg. turns per conversation | 304.9 (standard) / ~600 (MemLoRA) | Statistics depend on benchmark variant |
| Avg. tokens per conversation | 9,209.2 (standard) / ~16,000 (MemLoRA) | Standard split: 50 × ~9K tokens; MemLoRA split: 10 × ~16K tokens |
| Modalities | Text + images | Embedded image snippets alongside dialogue text |

2. Annotation, Grounding, and Memory Operations

LoCoMo is distinguished by grounding each agent’s discourse in both a persona description and an explicit event graph, $\mathcal{G}_i = (E_i, L_i)$, where nodes $E_i$ represent agent-specific events and directed edges $L_i$ encode causal relationships. Session-level grounding ensures that utterances remain consistent with event timelines.
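A per-agent event graph of this form can be represented minimally as dated event nodes plus a causal edge list. The events and helper function below are hypothetical, not taken from the dataset:

```python
# Sketch of an event graph G_i = (E_i, L_i): nodes are dated life
# events, directed edges encode cause -> effect relationships.
events = {
    "e1": {"date": "2023-01-10", "text": "starts a pottery class"},
    "e2": {"date": "2023-02-02", "text": "sells a first ceramic bowl"},
}
causal_edges = [("e1", "e2")]  # e1 -> e2: the class leads to the sale

def causes_of(event_id, edges):
    """Return the direct causal ancestors of an event."""
    return [src for src, dst in edges if dst == event_id]

print(causes_of("e2", causal_edges))  # ['e1']
```

Session prompts drawn from such a graph give each dialogue a consistent timeline, and the edges supply ground truth for the causal-reasoning questions.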

Annotation protocols extend beyond standard transcription—human editors resolve factual, referential, and temporal ambiguities, correct contradictions, and ensure alignment between the dialogue and structured event graphs. For memory-augmented modeling and on-device systems (e.g., MemLoRA), LoCoMo includes explicit annotations for memory-centric operations at each turn:

  • Knowledge Extraction: turn → JSON fact list;
  • Memory Update: action (ADD/UPDATE/DELETE/NONE) applied to memory, formalized in JSON;
  • Memory Retrieval: at QA points, extraction of the relevant fact subset for answering.
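The extraction and update operations above can be sketched as a minimal JSON-style memory loop. The field names (`type`, `action`, `key`, `value`) are illustrative assumptions, not the released annotation schema:

```python
import json

# Hypothetical turn-level annotations in the LoCoMo/MemLoRA style.
turn = "I moved from Austin to Seattle last month for a new job."

# Knowledge extraction: turn -> JSON fact list.
extracted = [
    {"type": "LOCATION", "fact": "speaker lives in Seattle"},
    {"type": "EVENT", "fact": "speaker started a new job"},
]

# Memory update: ADD/UPDATE/DELETE/NONE actions applied to memory.
update_ops = [
    {"action": "UPDATE", "key": "home_city", "value": "Seattle"},
    {"action": "ADD", "key": "job_change", "value": "new job last month"},
]

memory = {"home_city": "Austin"}
for op in update_ops:
    if op["action"] in ("ADD", "UPDATE"):
        memory[op["key"]] = op["value"]
    elif op["action"] == "DELETE":
        memory.pop(op["key"], None)
    # "NONE" leaves memory untouched.

print(json.dumps(memory, sort_keys=True))
```

At QA points, retrieval then selects the subset of stored facts relevant to the question, so answer quality can be attributed to extraction, update, or retrieval failures separately.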

Entity resolution covers PERSON, LOCATION, DATE/TIME, PREFERENCE, and EVENT types, although LoCoMo does not expose a formal schema for entity labels (Latimer et al., 14 Dec 2025, Bini et al., 4 Dec 2025).

3. Benchmark Tasks and Evaluation Metrics

The benchmark comprises multiple task types, each structured to probe different memory and reasoning faculties:

  1. Long-Context Question Answering (QA): Models must answer queries referencing facts spanning long conversational histories. Question types include:
    • Single-hop (fact-local per session)
    • Multi-hop (requires synthesis across sessions)
    • Temporal (date/inter-event reasoning)
    • Open-domain (preferences and commonsense)
    • Adversarial (unanswerable or trick)

Metrics vary by protocol:
  • Partial-match F1 for extractive QA (Maharana et al., 2024)
  • LLM-as-a-judge accuracy: binary (CORRECT/WRONG), used in the Hindsight and MemLoRA studies (Latimer et al., 14 Dec 2025, Bini et al., 4 Dec 2025)
  • Recall@k for RAG context retrieval

  2. Event Summarization: Summarization of life events within specified intervals, compared to event-graph ground truth. Metrics include ROUGE (n-gram overlap) and FactScore (atomic fact precision/recall).
  3. Multi-Modal Dialogue Generation: Given dialogue history, generate the next multimodal turn (text + optional image reaction), evaluated by BLEU, ROUGE-L, and MM-Relevance.
  4. Visual Extension (LoCoMo-VQA): For each image in a subset of dialogues (~100 images in the MemLoRA extension), three auto-generated visual questions (object counting, color identification, unusual-object detection) are posed. The metric is exact-match accuracy on single-word answers.
  5. Memory Operations for Small/On-Device Models: In the MemLoRA variant, specialized metrics are defined for extraction/update (a composite surface + semantic $\mathcal{L}$ score) and memory-augmented generation (Bini et al., 4 Dec 2025).

| Task | Input | Output | Metrics |
|---|---|---|---|
| QA (text) | question + retrieved text | natural-language answer | F1, accuracy |
| Event summarization | text (time interval) | event summary | ROUGE, FactScore |
| Multi-modal generation | text history (+ image) | text (+ image reaction) | BLEU, ROUGE-L, MM-Relevance |
| Visual QA (LoCoMo-VQA) | image + question | one-word answer + rationale | exact-match accuracy |
| Memory extraction/update | turn + current memory | facts, memory ops (JSON) | composite $\mathcal{L}$ score |
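Several of these metrics admit compact reference implementations. The sketch below shows a SQuAD-style partial-match token F1, Recall@k for retrieval, and single-word exact match; the normalization choices (lowercasing, whitespace tokenization) are assumptions and may differ from the official scoring scripts:

```python
def token_f1(pred, gold):
    """Partial-match F1 over whitespace tokens (SQuAD-style)."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def recall_at_k(retrieved, relevant, k):
    """Fraction of gold evidence passages among the top-k retrieved."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def exact_match(preds, golds):
    """Exact-match accuracy on single-word answers after lowercasing."""
    return sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds)) / len(golds)

print(round(token_f1("in seattle", "seattle"), 2))       # 0.67
print(recall_at_k(["t1", "t2", "t3"], {"t2", "t4"}, 2))  # 0.5
print(exact_match(["Three", "red"], ["three", "blue"]))  # 0.5
```

The LLM-as-a-judge protocol replaces `token_f1` with a binary CORRECT/WRONG verdict from a judge model, which is why scores across the Maharana et al. and Hindsight/MemLoRA evaluations are not directly comparable.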

4. Evaluation Results and Agent Architectures

LoCoMo has become a standard for evaluating advanced LLM memory systems, including long-context models, retrieval-augmented generation (RAG), Hindsight's structured memory, and on-device small/vision LLM hybrids.

Key evaluation findings from (Maharana et al., 2024, Latimer et al., 14 Dec 2025, Bini et al., 4 Dec 2025):

  • QA Baselines: Open-access LLMs (Mistral-7B, Llama-2-Chat-70B, GPT-3.5, GPT-4) achieve F1 scores ranging from 13.9 (Mistral) to 32.1 (GPT-4), while human ceiling is 87.9. Long-context and RAG approaches improve scores by 22–66% but still underperform humans by ~56%.
  • Hindsight: Structured, entity-and-temporal aware memory yields accuracy up to 89.61% (Gemini-3 Pro+TEMPR; OSS-120B achieves 85.67%), consistently surpassing open-source baselines (52.9–75.8%).
  • MemLoRA: Equips small models with memory and vision adapters, achieving performance on text-only tasks comparable to 60× larger models and VQA accuracy of 81.3 (vs. 23.7 for caption-based methods).
  • Multi-hop, temporal, and adversarial questions remain difficult; even long-context models achieve near-chance accuracy (≈2% F1) on adversarial QA, and event summarization FactScore F1 remains capped near 46% for strong LLMs.

5. Design Features and Extensions

LoCoMo’s construction embodies several distinctive characteristics:

  • Sessional Structure: Conversations are split into sessions, each anchored in discrete temporal windows that map to event graph nodes.
  • Memory Modeling: Downstream systems leverage LoCoMo for supervised extraction, update, and retrieval of factual state, enabling explicit measurement of agent memory retention and reasoning.
  • Modality Integration: Both text and image snippets are available, supporting strict multimodal reasoning benchmarks. The LoCoMo-VQA extension provides direct evaluation of visual reasoning in conversational context, an essential feature for vision-language memory systems.
  • Privacy and On-Device Suitability: The LoCoMo pipeline and MemLoRA extension support development of compact, local agents operating under constrained computation and memory.

A plausible implication is that LoCoMo supports fine-grained ablation of memory system design—enabling direct measurement of the impact of retrieval method, memory structuring, and backbone scale under realistic, multi-turn conversational loads.

6. Data Splits, Releases, and Research Impact

The standard 50-conversation split supports broad benchmarking; these dialogues are analyzed as a closed test suite in the Hindsight and MemLoRA studies with no turn-level interleaving. The MemLoRA variant adapts LoCoMo to 10 multi-session dialogues (train/val/test: 7/1/2 by conversation), enabling reproducible on-device experimentation (Bini et al., 4 Dec 2025).
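Under the stated 7/1/2 conversation-level split, a deterministic partition can be sketched as follows; the conversation identifiers are hypothetical, and the released assignment of conversations to splits may differ:

```python
# Conversation-level split: no session or turn from one conversation
# ever appears in two splits, preventing memory leakage across splits.
conv_ids = [f"conv_{i:02d}" for i in range(10)]
train, val, test = conv_ids[:7], conv_ids[7:8], conv_ids[8:]
print(len(train), len(val), len(test))  # 7 1 2
```

Splitting by conversation rather than by turn is what makes the on-device experiments reproducible: a model never sees held-out dialogue history during training.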

LoCoMo is integral to long-horizon conversational research and memory-augmented architectures, serving as a principal benchmark for open-domain and persistent agent evaluations. The dataset advances the state of the art by revealing persistent shortcomings in temporal, causal, and multi-modal memory and reasoning for both commercial and open-source systems.
