Quantifying personalized performance of memory‑equipped LLM agents in noisy real‑world settings

Develop rigorous, standardized methods to quantify the personalized performance of large language model agents equipped with long-term memory under complex, noisy real-world interaction scenarios, where user preferences evolve over time and interactions contain in-session noise and linguistic variability.

Background

The paper argues that although recent architectures equip agents with structured and persistent memory, there is no widely accepted methodology for measuring how well these systems personalize to users under realistic conditions. Existing evaluations often reduce to static preference recall or needle-in-a-haystack retrieval, neither of which captures event-driven preference emergence or multi-session dynamics.
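To make the contrast concrete, the Python sketch below shows what a static preference-recall check typically reduces to. The `agent.observe` and `agent.respond` calls are a hypothetical interface invented for this example, not an API from the paper: a single stated preference is buried under distractor turns and probed once, with no preference drift, noise, or session structure.

```python
# Illustrative only: a static preference-recall check of the kind the paper
# argues existing evaluations reduce to. A preference is stated once, buried
# under distractor turns, and probed later; nothing about preference drift,
# noise, or multi-session dynamics is exercised.
# agent.observe / agent.respond are a hypothetical agent interface.

def static_preference_recall(agent, preference: str,
                             distractor_turns: list,
                             probe: str, expected: str) -> bool:
    """Return True if the agent still surfaces the stated preference."""
    agent.observe(f"Just so you know: {preference}")  # preference stated once
    for turn in distractor_turns:                     # haystack / filler turns
        agent.observe(turn)
    answer = agent.respond(probe)                     # e.g. "What do I prefer?"
    return expected.lower() in answer.lower()
```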

PERMA is proposed as a step toward this goal, introducing event-driven, temporally ordered interactions with in-session noise and stylistic variation. Nonetheless, the authors explicitly identify the broader task of establishing methods for quantifying personalized performance in complex, noisy real-world scenarios as an open challenge.
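The sketch below illustrates, under assumed interfaces, the kind of event-driven evaluation loop such a methodology would need: preferences change via temporally ordered events, inputs carry in-session noise and linguistic variability, and answers are scored against the preference that holds at probe time. `Event`, `agent.observe`, `agent.respond`, and the `judge` callable are illustrative stand-ins, not constructs defined by PERMA.

```python
# Illustrative sketch only: an event-driven, temporally ordered evaluation
# loop with evolving preferences and in-session noise. Event, agent.observe,
# agent.respond, and the judge callable are assumed interfaces for this
# example, not constructs defined by PERMA.

from dataclasses import dataclass
from typing import Optional
import random


@dataclass
class Event:
    timestamp: int                        # position in the ordered stream
    text: str                             # what the user says or does
    new_preference: Optional[str] = None  # set when this event shifts a preference


def add_session_noise(text: str, rng: random.Random) -> str:
    """Crude stand-in for in-session noise and linguistic variability."""
    fillers = ["btw ", "hmm, ", "so anyway, ", ""]
    return rng.choice(fillers) + text.lower()


def evaluate(agent, events: list, probes: dict, judge, seed: int = 0) -> float:
    """Score how often answers match the preference active at probe time."""
    rng = random.Random(seed)
    active_preference = None
    correct, total = 0, 0
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.new_preference is not None:
            active_preference = event.new_preference        # preference evolves
        agent.observe(add_session_noise(event.text, rng))   # noisy, ordered input
        if event.timestamp in probes and active_preference is not None:
            answer = agent.respond(probes[event.timestamp])
            correct += int(judge(answer, active_preference))  # 1 if consistent
            total += 1
    return correct / max(total, 1)
```

Scoring against the preference active at probe time, rather than against any preference ever stated, is the structural difference from the static recall check above and is what a standardized personalization metric would need to formalize.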

References

"While these architectural innovations endow agents with the potential for long-range memory, the method for quantifying their personalized performance within complex, noisy real-world scenarios remains an open challenge."

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments (2603.23231, Liu et al., 24 Mar 2026), Introduction (Section 1)