
Dual-Scale Memory Evaluation

Updated 16 January 2026
  • The paper outlines a dual-scale framework that distinguishes memory evaluation at low-level (device/kernel) and high-level (system/application) scales to identify critical performance bottlenecks.
  • It employs synthetic benchmarks and real hardware evaluations to measure key metrics such as accuracy, latency, and throughput across different memory regimes.
  • Implications include guiding design trade-offs in both hardware and software systems, with applications ranging from LLM-based agents to SoC memory instrumentation.

A dual-scale memory evaluation framework is an analytical and empirical methodology that systematically probes memory phenomena, mechanisms, or capabilities at two or more levels of abstraction ("scales"), such as low-level device or kernel attributes and high-level system, agent, or application behaviors. Such frameworks enable precise, interpretable measurements, reveal bottlenecks that emerge only through cross-scale dynamics, and guide engineering trade-offs in both hardware and software memory systems. The dual-scale paradigm has been instantiated in diverse domains, including LLM-based agent memory (Tan et al., 20 Jun 2025), memory system instrumentation (Poduval et al., 2024), heterogeneous SoC benchmarking (Ghaemi et al., 1 May 2025), cross-stack technology analysis (Pentecost et al., 2021), in-memory computing (Gao et al., 2019), and generative agent continual learning (Lin et al., 28 Nov 2025).

1. Fundamental Concepts and Taxonomy of Dual Scales

Dual-scale frameworks systematically distinguish between two memory regimes, which can be physical (e.g., fast on-chip vs. slow off-chip), cognitive (explicit fact vs. implicit summary), operational (kernel vs. system), or representational (short-term buffer vs. long-term parameter consolidation).

For instance, in LLM-based agent evaluation, factual memory refers to the storage and recall of atomic, explicit facts:

$$M_f(t) = \{\, a \in A \mid \exists\, i \le t \text{ such that } d_i \text{ asserts } a \,\}$$

whereas reflective memory is defined as a summary or profile synthesized from multiple facts:

$$R : 2^{M_f} \rightarrow S, \quad R(M_f) = s \in S$$

where $S$ is the space of high-level summaries (e.g., user preferences) (Tan et al., 20 Jun 2025).
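As a concrete sketch of these two regimes, the following illustration (not MemBench's actual implementation; the `Fact` type and dialogue layout are assumptions) builds $M_f(t)$ as a set of asserted facts and applies a toy reflection operator:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str

def factual_memory(dialogues, t):
    """M_f(t): every atomic fact asserted in dialogues d_1 .. d_t."""
    facts = set()
    for d in dialogues[:t]:
        facts |= d["asserted_facts"]
    return facts

def reflect(facts):
    """R : 2^{M_f} -> S. Toy stand-in for an LLM-generated profile:
    map each attribute to the set of values asserted for it."""
    summary = {}
    for f in facts:
        summary.setdefault(f.attribute, set()).add(f.value)
    return summary
```

A real reflective memory would synthesize free-text summaries from distributed evidence; the toy operator only groups values by attribute.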

Similarly, in heterogeneous SoC benchmarking, the dual scale is articulated as benchmarking large, high-latency DRAM pools versus small, ultra-low-latency on-chip BRAM scratchpads (Ghaemi et al., 1 May 2025).

For agentic continual learning, the dual scale differentiates between short-term in-context trajectories (e.g., reasoning chains preserved over single prompt threads) and long-term parameter memory (gradually consolidated via on-the-fly fine-tuning of sustainable adapters) (Lin et al., 28 Nov 2025).

This structuring grounds the evaluation protocol, dataset construction, and metrics.

2. Dataset Construction and Benchmark Design

Effective dual-scale evaluation requires datasets and benchmarks that explicitly stress both memory scales, often via synthetic or programmatically generated workloads.

In MemBench, the corpus is constructed over 500 synthetic user relation graphs, each encoding both low-level facts (birthdates, event times) and high-level preferences (summary attributes). For each attribute, dialogues and multiple-choice probes are generated, then interleaved with noisy distractor sessions drawn from news corpora to control long-context difficulty. Four dataset splits arise by crossing factual vs reflective targets with two interaction modes: participation (active agent chat memory) and observation (passive transcript recording), yielding PS-FM, PS-RM, OS-FM, OS-RM (Tan et al., 20 Jun 2025).

In hardware evaluation, MemScope automatically discovers all available pools (off-chip DRAM, on-chip SRAM/BRAM) via device-tree parsing and creates allocator pools, synchronized stress scenarios, and assembly-level benchmarks to target each memory type’s unique access patterns and interference characteristics (Ghaemi et al., 1 May 2025).

Agentic frameworks adopting dual-scale memory, such as SuperIntelliAgent, employ an interleaved replay buffer that accumulates No→Yes refinement trajectories (short-term steps) and pipelines them for periodic adapter updates (long-term) (Lin et al., 28 Nov 2025).

3. Evaluation Protocols and Metrics

Dual-scale frameworks demand multi-faceted effectiveness, efficiency, and robustness metrics computed separately at each scale and under controlled scenario splits.

LLM/Agentic Memory (MemBench)

  • Effectiveness:
    • Memory Accuracy: Fraction of correct answers over all probing questions.
    • Recall@k: Proportion of queries where the ground-truth memory entry is in the top-k retrieved items.
  • Efficiency:
    • Read Time (RT) and Write Time (WT): Average time per memory operation, reflecting cost of memory module access at each scale.
  • Capacity:
    • Empirically, the number of tokens stored $N$ before memory accuracy declines beyond a threshold:

    $$\text{Find } C : \text{acc}(N) - \text{acc}(N+\Delta) > \varepsilon$$

    where the step $\Delta$ and threshold $\varepsilon$ (e.g., 5%) are pre-set.
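These metric definitions can be sketched directly (function names and the discretized accuracy curve are illustrative assumptions, not MemBench's API):

```python
def memory_accuracy(answers, gold):
    """Fraction of probing questions answered correctly."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def recall_at_k(retrieved, gold_entry, k):
    """1 if the ground-truth memory entry is among the top-k retrieved items."""
    return int(gold_entry in retrieved[:k])

def capacity(acc_curve, delta=1, eps=0.05):
    """Smallest index N into a discretized accuracy curve at which
    accuracy drops by more than eps over the next delta points,
    i.e. acc(N) - acc(N + delta) > eps; None if no such drop."""
    for n in range(len(acc_curve) - delta):
        if acc_curve[n] - acc_curve[n + delta] > eps:
            return n
    return None
```

In practice `acc_curve` would be measured at exponentially spaced token counts (e.g., 10k, 20k, 50k, 100k) rather than unit steps.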

Embedded/SoC Memory (MemScope)

  • Bandwidth: Bytes transferred per unit time, derived from precisely timed assembly microbenchmarks run under isolated or multi-core stress.

  • Latency: Average nanoseconds per access, measured via pointer chasing with hardware prefetching defeated and caches systematically invalidated.

  • Noise Resilience: Standard deviation and tail latency over 500+ iterations.

  • Scalability: Runs repeated with $0 \dots (p-1)$ concurrent stressor cores; compositional sensitivity to contention is mapped for each memory pool.
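The pointer-chasing structure behind the latency metric can be sketched as follows (in Python purely for illustration; interpreter overhead dominates here, so a MemScope-style probe would use timed assembly, but the core idea, a random cyclic chain chased dependently so each access address depends on the previous load, is the same):

```python
import random
import time

def make_chain(n, seed=0):
    """Build a random single-cycle permutation: chain[i] gives the next
    index to visit. A dependent chase defeats prefetching because the
    next address is unknown until the current load completes."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    chain = [0] * n
    for a, b in zip(idx, idx[1:] + idx[:1]):
        chain[a] = b
    return chain

def chase(chain, hops):
    """Time `hops` dependent accesses; return (avg ns per access, final index).
    The final index is returned so the loop cannot be optimized away."""
    i = 0
    t0 = time.perf_counter_ns()
    for _ in range(hops):
        i = chain[i]
    t1 = time.perf_counter_ns()
    return (t1 - t0) / hops, i
```

A hardware-level version would additionally flush or invalidate caches between runs and pin the chase to a specific memory pool's physical address range.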

Agentic Replay/Continual Learning (SuperIntelliAgent)

  • Short-term retention: Measured by decrease in required refinement steps (T) per prompt over time.

  • Long-term retention: Improvement in 1st-step accuracy after LoRA adapter updates and replay.

  • Sample Efficiency: Fraction of prompts yielding DPO pairs.

  • Forgetting: Can be traced via re-evaluating previously solved prompts after multiple update cycles.
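The four quantities above can be logged per prompt and per update cycle; a minimal bookkeeping sketch (class and field names are illustrative, not SuperIntelliAgent's implementation):

```python
class RetentionLog:
    """Per-cycle statistics for dual-scale retention analysis."""

    def __init__(self):
        self.records = []  # one dict per prompt attempt

    def log(self, cycle, prompt_id, steps, first_try_ok, made_dpo_pair):
        self.records.append(dict(cycle=cycle, prompt=prompt_id,
                                 steps=steps, first=first_try_ok,
                                 dpo=made_dpo_pair))

    def mean_steps(self, cycle):
        """Short-term retention: average refinement steps T in a cycle."""
        r = [x["steps"] for x in self.records if x["cycle"] == cycle]
        return sum(r) / len(r)

    def first_step_acc(self, cycle):
        """Long-term retention: 1st-step accuracy within a cycle."""
        r = [x["first"] for x in self.records if x["cycle"] == cycle]
        return sum(r) / len(r)

    def forgetting(self, prompt_id):
        """Forgetting: a prompt solved first-try earlier but failed later."""
        hist = [x["first"] for x in self.records if x["prompt"] == prompt_id]
        return any(a and not b for a, b in zip(hist, hist[1:]))
```

Re-running previously solved prompts after each adapter update and checking `forgetting` per prompt gives the trace described above.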

4. Task and Scenario Taxonomy

A rigorous dual-scale evaluation decomposes tasks by both memory scale and scenario.

In MemBench:

  • Factual tasks: single-hop (direct recall), multi-hop (entailment from multiple facts), comparative, aggregative, knowledge-updating, and both single/multi-session assistant recall.

  • Reflective tasks: summary/inference of user preferences, emotional state derivation over distributed evidence.

  • Interaction modes:

    • Participation: Agent actively interacts, requiring recall of self-generated as well as user-generated content.
    • Observation: Agent passively stores and reads, isolating pure read/write path evaluation absent reasoning.

Each (scale × scenario) combination is validated with temporally controlled simulation and fixed-size subdatasets for both "small" (10k tokens) and "large" (100k tokens) contexts (Tan et al., 20 Jun 2025).

In embedded SoCs, scenarios include controlled variation of stressor core count and per-core access pattern, and each memory pool is profiled in isolation and under mixed contention (Ghaemi et al., 1 May 2025).

5. Mechanism Plug-in and Comparative Analysis

Dual-scale memory frameworks support benchmarking of diverse memory mechanisms and facilitate comparative analysis of their scaling properties.

MemBench pipelines multiple mechanisms—FullMemory (retain all turns), RecentMemory (sliding window), RetrievalMemory (vector-store + kNN), GenerativeAgent, MemoryBank, MemGPT, Self-Controlled Memory (SCMemory)—into a unified simulation harness (MemEngine). Each mechanism's trade-off is tracked via memory accuracy, capacity curves, and efficiency statistics, revealing, for example, that retrieval-based memory scales well at the cost of slower reads, while naive recency-based memory is fast but forgets (Tan et al., 20 Jun 2025).
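Such a plug-in harness amounts to a common read/write interface that every mechanism implements; a minimal sketch of two of the mechanisms named above (interface and constructor names are illustrative, not MemEngine's actual API):

```python
from abc import ABC, abstractmethod
from collections import deque

class MemoryMechanism(ABC):
    """Common interface the evaluation harness drives for each mechanism."""

    @abstractmethod
    def write(self, turn: str) -> None: ...

    @abstractmethod
    def read(self, query: str) -> list:
        """Return the memory entries surfaced for this query."""

class FullMemory(MemoryMechanism):
    """Retain all turns; reads return everything (accurate but slow to scan)."""
    def __init__(self):
        self.turns = []
    def write(self, turn):
        self.turns.append(turn)
    def read(self, query):
        return list(self.turns)

class RecentMemory(MemoryMechanism):
    """Sliding window: fast, but forgets anything older than `window` turns."""
    def __init__(self, window=2):
        self.turns = deque(maxlen=window)
    def write(self, turn):
        self.turns.append(turn)
    def read(self, query):
        return list(self.turns)
```

Because every mechanism exposes the same two calls, the harness can time each `read`/`write` and score accuracy without mechanism-specific plumbing.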

MemScope allows direct comparison of DRAM (e.g., PS-DRAM vs PL-DRAM), on-chip SRAM, and BRAM pools under identical stress harnesses. Write, read, and pointer-chase latencies, bandwidth curves, and interference sensitivity are compared side-by-side, establishing, for example, the superior interference resilience of on-chip scratchpad memories for real-time workloads (Ghaemi et al., 1 May 2025).

In generative agent continual learning, ablation studies without replay, without LoRA fine-tuning, or without iterative refinement quantify the specific contribution of each memory layer (Lin et al., 28 Nov 2025).

6. Aggregation, Visualization, and Interpretation

Results are aggregated and visualized by averaging metrics across question types, memory scales, interaction modes, and dataset sizes, and then plotted (e.g., accuracy vs. memory size) to reveal robustness and degradation inflection points.

In MemBench, capacity plots demonstrate critical memory thresholds; efficiency statistics delineate throughput trade-offs, guiding mechanism selection for deployment (Tan et al., 20 Jun 2025). In MemScope, boxplots and timeline views illustrate per-run performance and noise; per-core breakdowns quantitatively capture contention effects (Ghaemi et al., 1 May 2025).

Tables summarizing scenario/task splits or metric definitions provide clarity on evaluation coverage.

| Framework | Lower Scale | Upper Scale |
|---|---|---|
| MemBench | Factual memory | Reflective memory |
| MemScope | BRAM / on-chip SRAM | DRAM / fabric-side memory |
| SuperIntelliAgent | Short-term (context buffer) | Long-term (adapter parameters) |

Interpretation centers on identifying which mechanisms or architectures maintain robustness as memory load, interference, or reasoning demand increases, and diagnosing trade-offs between access speed, retention, and compositional generalization.

7. Limitations and Extensibility

Dual-scale frameworks do not eliminate the need for careful scenario design: e.g., MemBench is presently limited to synthetic user graphs and multiple-choice questions; MemScope, while generalizable via DTB discovery, still only captures memory phenomena detectable at kernel level (Ghaemi et al., 1 May 2025, Tan et al., 20 Jun 2025). Emergent reasoning tasks or multistep agentic workflows may require extension for richer long-term dependencies.

Nevertheless, dual-scale paradigms inherently facilitate extensibility. Examples include augmenting static analysis passes with probabilistic stride or prefetch models (Poduval et al., 2024), generalizing memory pool detection to new device classes, and integrating new memory mechanisms or adapter types. Extending to other modalities (vision, code, etc.) is conceptually straightforward via suitable definition of memory primitives at each scale (Lin et al., 28 Nov 2025).

References

  • MemBench: "MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents" (Tan et al., 20 Jun 2025)
  • MemScope: "Heterogeneous Memory Benchmarking Toolkit" (Ghaemi et al., 1 May 2025)
  • Examem: "Examem: Low-Overhead Memory Instrumentation for Intelligent Memory Systems" (Poduval et al., 2024)
  • SuperIntelliAgent: "Towards Continuous Intelligence Growth: Self-Training, Continual Learning, and Dual-Scale Memory in SuperIntelliAgent" (Lin et al., 28 Nov 2025)
  • NVMExplorer: "NVMExplorer: A Framework for Cross-Stack Comparisons of Embedded Non-Volatile Memories" (Pentecost et al., 2021)
  • Eva-CiM: "Eva-CiM: A System-Level Performance and Energy Evaluation Framework for Computing-in-Memory Architectures" (Gao et al., 2019)
