AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory

Published 17 Oct 2025 in cs.AI | (2510.15261v1)

Abstract: Riding on the success of LLMs with retrieval-augmented generation (RAG), there has been a growing interest in augmenting agent systems with external memory databases. However, the existing systems focus on storing text information in their memory, ignoring the importance of multimodal signals. Motivated by the multimodal nature of human memory, we present AUGUSTUS, a multimodal agent system aligned with the ideas of human memory in cognitive science. Technically, our system consists of 4 stages connected in a loop: (i) encode: understanding the inputs; (ii) store in memory: saving important information; (iii) retrieve: searching for relevant context from memory; and (iv) act: perform the task. Unlike existing systems that use vector databases, we propose conceptualizing information into semantic tags and associating the tags with their context to store them in a graph-structured multimodal contextual memory for efficient concept-driven retrieval. Our system outperforms the traditional multimodal RAG approach while being 3.5 times faster for ImageNet classification and outperforming MemGPT on the MSC benchmark.

Summary

The paper introduces AUGUSTUS, which integrates multimodal signals with human-inspired memory processes.
It employs a four-stage methodology—encode, store, retrieve, act—with graph-structured contextual memory for efficient, concept-driven retrieval.
Experimental results show AUGUSTUS outperforming a traditional multimodal RAG approach on ImageNet classification while being 3.5× faster, and outperforming MemGPT on the MSC conversational benchmark.

Introduction

The paper "AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory" (2510.15261) introduces AUGUSTUS, a multimodal agent system whose memory design is inspired by cognitive-science models of human memory. It belongs to the line of work that augments LLM agents with external memory, as in retrieval-augmented generation (RAG), but it stores multimodal signals rather than text alone, which is the main limitation the authors identify in existing systems. AUGUSTUS operates through four stages connected in a loop: encode, store in memory, retrieve, and act. According to the abstract, the system outperforms a traditional multimodal RAG approach on ImageNet classification while being 3.5 times faster, and it outperforms MemGPT on the MSC benchmark.

System Architecture

AUGUSTUS is engineered as a loop of operations grounded in cognitive science, designed to simulate human memory processes. The loop begins with the encoding stage, where diverse modalities (text, images, audio) are transcribed into textual format using powerful models like Video-LLaVA and WhisperX. This text-based transformation facilitates storage and retrieval within the system's memory architecture, which mirrors the constructs of human episodic and semantic memory.
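
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Record` and `ToyAgent` and all method names are hypothetical, the encoder is a stand-in that assumes the input has already been described in text (where the real system would call a model such as Video-LLaVA or WhisperX), and retrieval is naive keyword overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    modality: str  # e.g. "text", "image", "audio"
    text: str      # textual form of the input


@dataclass
class ToyAgent:
    """Hypothetical four-stage loop: encode -> store -> retrieve -> act."""
    memory: list = field(default_factory=list)

    def encode(self, modality, payload):
        # Assumption: payload is already a text description of the input.
        # A real system would run a captioning or speech-to-text model here.
        return Record(modality, payload.strip().lower())

    def store(self, record):
        self.memory.append(record)

    def retrieve(self, query):
        # Naive keyword overlap with earlier records (not the paper's method).
        words = set(query.text.split())
        return [r.text for r in self.memory
                if r is not query and words & set(r.text.split())]

    def act(self, record, context):
        # Stand-in for the LLM call that would produce the final response.
        return {"input": record.text, "context": context}

    def step(self, modality, payload):
        record = self.encode(modality, payload)
        self.store(record)
        context = self.retrieve(record)
        return self.act(record, context)


if __name__ == "__main__":
    agent = ToyAgent()
    agent.step("image", "a corgi playing in the park")
    print(agent.step("text", "what is my corgi doing"))
```

In this sketch each call to `step` runs one pass of the loop, so information stored in one turn becomes retrievable context for the next.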

The storage component is divided into in-context memory (akin to working memory) and two external memory databases: recall memory and contextual memory. Contextual memory is graph-structured: information is conceptualized into semantic tags, and the tags are linked to their context, which the authors motivate by the need for efficient, concept-driven retrieval. Recall memory, by contrast, logs historical interactions and so preserves the chronological context of a conversation.
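
One way to picture the split between in-context memory and recall memory is a bounded buffer sitting in front of a complete log. The sketch below is an assumption-laden illustration, not the system's actual design: `TieredMemory`, `log`, and `search_recall` are hypothetical names, and both the fixed-size window and the substring search are simplifications chosen for brevity.

```python
from collections import deque


class TieredMemory:
    """Hypothetical two-tier store: small in-context window plus full recall log."""

    def __init__(self, context_size=3):
        # Bounded buffer: the oldest turn drops out once the window is full.
        self.in_context = deque(maxlen=context_size)
        # Complete chronological log of every interaction.
        self.recall = []

    def log(self, role, message):
        turn = (len(self.recall), role, message)  # (index, speaker, text)
        self.recall.append(turn)
        self.in_context.append(turn)

    def search_recall(self, keyword):
        # Case-insensitive substring match over the full history.
        keyword = keyword.lower()
        return [turn for turn in self.recall if keyword in turn[2].lower()]


if __name__ == "__main__":
    mem = TieredMemory(context_size=2)
    mem.log("user", "My corgi is named Bo")
    mem.log("agent", "Nice to meet Bo!")
    mem.log("user", "I went hiking today")
    print(list(mem.in_context))
    print(mem.search_recall("corgi"))
```

Turns that have left the in-context window remain reachable through the recall log, which is the property the paragraph above attributes to recall memory.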

Two figures from the paper illustrate the system in operation and the organization of its contextual memory:

Figure 1: Snapshot of a conversation between a user and AUGUSTUS about the user's pet. Our system calls different functions corresponding to the four stages of operation to engage with the user while thinking internally.

Figure 2: Organization of Information in the Contextual Memory. AUGUSTUS conceptualizes information into semantic tags connected with the corresponding context node, creating associations among the tags.
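
The organization in Figure 2 can be approximated with a small bipartite graph between tags and context nodes. The following is a sketch under assumptions: the class `ContextualMemory` and its methods are hypothetical, tags are supplied by the caller rather than extracted by a model, and tag associations are derived only from tags sharing a context node.

```python
from collections import defaultdict


class ContextualMemory:
    """Hypothetical tag <-> context graph stored as two adjacency maps."""

    def __init__(self):
        self.contexts = {}                        # context id -> stored text
        self.context_to_tags = {}                 # context id -> set of tags
        self.tag_to_contexts = defaultdict(set)   # tag -> set of context ids

    def add(self, text, tags):
        cid = len(self.contexts)
        self.contexts[cid] = text
        self.context_to_tags[cid] = set(tags)
        for tag in tags:
            self.tag_to_contexts[tag].add(cid)
        return cid

    def related_tags(self, tag):
        # Tags are associated when they are attached to the same context node.
        related = set()
        for cid in self.tag_to_contexts.get(tag, ()):
            related |= self.context_to_tags[cid]
        related.discard(tag)
        return related


if __name__ == "__main__":
    memory = ContextualMemory()
    memory.add("User's corgi is named Bo", ["pet", "dog", "corgi"])
    memory.add("User adopted a cat", ["pet", "cat"])
    print(sorted(memory.related_tags("pet")))
```

Keeping both directions of the mapping makes it cheap to go from a tag to its contexts and from a context back to the other tags it carries.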

Memory Retrieval and Personalization

The retrieval phase employs the Contextual-Personalized (CoPe) search algorithm, designed for concept-driven retrieval from the contextual memory. By clustering semantic tags, CoPe reduces the search space: a query is first matched to relevant concepts, and only the context attached to those concepts is then retrieved. The paper reports that AUGUSTUS is 3.5× faster than a traditional multimodal RAG approach on ImageNet classification while also outperforming it.

Figure 3: Retrieval with CoPe Search. Given a query, AUGUSTUS retrieves the relevant concepts (pet; dog; corgi), followed by concept-driven context retrieval to send a personalized response to the user.
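
The two-step pattern in Figure 3, concepts first and context second, can be illustrated with a simple tag-voting search. This is not the CoPe algorithm itself, whose details are not given here; it is a stand-in that assumes exact word matches between the query and the tags, and the function `tag_first_search` and its parameters are hypothetical.

```python
import re
from collections import Counter


def tag_first_search(query, tag_index, contexts, top_k=2):
    """Hypothetical concept-first retrieval (a stand-in, not CoPe).

    tag_index maps tag -> set of context ids; contexts maps id -> text.
    """
    # Step 1: find the tags (concepts) mentioned in the query.
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    matched_tags = [tag for tag in tag_index if tag in query_words]

    # Step 2: only contexts attached to matched tags are considered,
    # ranked by how many matched tags point at them.
    votes = Counter()
    for tag in matched_tags:
        for cid in tag_index[tag]:
            votes[cid] += 1
    ranked = sorted(votes, key=lambda cid: (-votes[cid], cid))
    return [contexts[cid] for cid in ranked[:top_k]]


if __name__ == "__main__":
    contexts = {
        0: "User's corgi is named Bo",
        1: "User adopted a cat last year",
        2: "User likes hiking",
    }
    tag_index = {
        "pet": {0, 1}, "dog": {0}, "corgi": {0}, "cat": {1}, "hiking": {2},
    }
    print(tag_first_search("Draw my dog, my favourite pet!", tag_index, contexts))
```

Because the search never scores contexts that share no tag with the query, the candidate set shrinks before any ranking happens, which is the intuition behind reducing the search space.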

AUGUSTUS also personalizes its responses: outputs are conditioned on context retrieved from earlier interactions with the same user.

Figure 4: Personalized response to the user. AUGUSTUS retrieves relevant information from the contextual memory by calling the cope_search function to generate an image aligned with the information about dogs from prior user conversations.

Experimental Evaluation

The empirical evaluation compares AUGUSTUS against existing baselines on two tasks. On ImageNet classification, AUGUSTUS outperforms the traditional multimodal RAG approach while being 3.5 times faster. On the Multi-Session Chat (MSC) benchmark, it outperforms MemGPT at sustaining conversation consistency and retaining user context, as measured by ROUGE-L.
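
For readers unfamiliar with ROUGE-L, the metric scores a candidate response by the longest common subsequence (LCS) of words it shares with a reference. The sketch below implements the standard sentence-level F-measure with whitespace tokenization; the function names are hypothetical, and the exact ROUGE-L configuration used in the paper is not stated here, so treat this as a generic illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-measure (beta=1 gives the F1 variant)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate words in the LCS
    recall = lcs / len(ref)      # share of reference words in the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    print(rouge_l("the dog is a corgi", "my dog is a corgi named bo"))
```

In the usage example the LCS is "dog is a corgi" (4 words), so precision is 4/5, recall is 4/7, and the F1 score comes out to about 0.667.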

Conclusion

AUGUSTUS aligns the memory of a multimodal agent system with principles from cognitive science. Its framework, built around contextual memory and concept-driven retrieval, is a promising direction for AI systems that require multimodal understanding and user-specific personalization. Future research could extend AUGUSTUS's cognitive capabilities, add learning mechanisms for system self-improvement, and support more modalities to better emulate human cognition.