
Layered LRU (LLRU) Cache Policy

Updated 1 February 2026
  • Layered-LRU (LLRU) is a cache replacement policy designed for layered systems that enforce hierarchical constraints during admission and eviction.
  • In mixture-of-experts LLMs, LLRU minimizes cache misses by using recency measures that account for layer progression and future access patterns.
  • For layered data objects, LLRU enhances hit rates by promoting popular base layers and ensuring higher layers are cached only with their required lower layers.

Layered-LRU (LLRU) is a family of cache replacement policies designed for settings in which data objects, model parameters, or expert weights are naturally organized into layers or hierarchical versions. Unlike classic LRU, LLRU maintains and exploits explicit layer-awareness in both admission and eviction, and enforces hierarchical constraints where higher layers cannot be cached without their lower counterparts. LLRU has emerged in two independent lines: (1) for cache management in mixture-of-experts (MoE) LLMs, where it mitigates cache miss overhead arising from layered model architectures (Angelopoulos et al., 2 Sep 2025), and (2) in the caching of layered data objects in cloud and edge systems, such as multi-resolution media or neural model compression hierarchies (Bari et al., 1 Apr 2025). Across these domains, LLRU consistently achieves improved cache hit rates over classic paging policies, particularly when cache capacity is moderate and access patterns exhibit cross-layer correlations.

1. Layered-LRU in Mixture-of-Experts LLMs

Mixture-of-Experts LLMs contain $\ell$ transformer layers, each with $n$ available “experts” (dense subnetworks). At inference, exactly one expert per layer is selected per token, producing a deterministic and cyclical request pattern: the input sequence $\sigma = (p_1, \dots, p_T)$ cycles through the $\ell$ layers repeatedly, where each $p_i \in L_{((i-1)\bmod \ell)+1}$ references an expert in the corresponding layer. Cache capacity is limited ($k$ experts in fast memory); cache misses incur the overhead of loading from slow storage. The core problem is to store the most useful experts in cache so as to minimize miss-induced latency across rounds, with the strict constraint that each layer must always have a valid expert present per token (Angelopoulos et al., 2 Sep 2025).
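The cyclic request model above can be sketched in a few lines of Python. This is an illustrative stand-in only: the function name and the random router are hypothetical placeholders for the learned gating network, not anything from the cited paper.

```python
import random

def moe_request_sequence(ell, n, T, seed=0):
    """Yield (layer, expert) pairs for T token steps: the layer index
    cycles deterministically through 1..ell, while the expert within
    each layer is picked by a stand-in for the learned router."""
    rng = random.Random(seed)
    for i in range(T):
        layer = (i % ell) + 1          # layer index for step i (1-based)
        expert = rng.randrange(n)      # hypothetical router decision
        yield (layer, expert)

reqs = list(moe_request_sequence(ell=4, n=2, T=8))
# The layer component is deterministic and cyclical:
assert [layer for layer, _ in reqs] == [1, 2, 3, 4, 1, 2, 3, 4]
```

Note that only the layer progression is deterministic; which expert a layer requests depends on the token, which is what makes the eviction decision nontrivial.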

Standard LRU is suboptimal in this context: by failing to account explicitly for the progression through layers, it may evict experts from imminent future layers, incurring predictable cache misses. LLRU rectifies this by integrating a layer-aware measure of recency: it tracks both the number of complete rounds since a page was last used and the relative layer distance between the current request and each cached page. On cache miss, LLRU evicts the page that has been unused for the most rounds, with ties broken in favor of pages from layers farthest in the future.

2. LLRU for Layered Data Objects

In cloud and edge systems, LLRU is applied to data objects stored in layered (multi-quality) representations, such as progressive images or scalable model weights. Each object $i$ can be requested at a specific version $v \in \{1, \dots, V\}$; serving a request for version $v$ requires layers $1, \dots, v$ to be present. Layer sizes may vary ($s_{i,l}$), and per-layer and per-version requests have heterogeneous popularity ($p_{i,l}$, $p_{i,v}$). The LLRU policy maintains a global LRU list of cached layers and enforces the layering property: layer $\ell+1$ cannot be present without layers $1$ through $\ell$. On a miss, each missing layer is fetched, and eviction proceeds by scanning for the least recently used “topmost” layer of any object, ensuring the hierarchy is consistently maintained (Bari et al., 1 Apr 2025).

This design enables effective sharing of capacity across versions and objects by aggressively promoting popular base layers and pruning higher layers of unpopular or large objects. The result is improved overall hit rate, particularly when lower layers are small and frequently requested, and higher layers are incrementally less popular or more costly.
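The layered-object policy can be captured in a minimal Python sketch. It assumes unit-size layers and a capacity counted in layers (the cited work allows heterogeneous byte sizes); the class and method names are illustrative, not from the paper.

```python
from collections import OrderedDict

class LayeredLRU:
    """Global LRU over (object, layer) keys that preserves the layering
    property: layer l+1 is never cached without layers 1..l."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # (obj, layer) -> True; MRU at the end

    def _topmost(self, obj):
        """Highest cached layer index of obj (0 if none cached)."""
        return max((l for o, l in self.cache if o == obj), default=0)

    def request(self, obj, version):
        hit = all((obj, l) in self.cache for l in range(1, version + 1))
        for l in range(1, version + 1):   # fetch missing, promote to MRU
            self.cache[(obj, l)] = True
            self.cache.move_to_end((obj, l))
        while len(self.cache) > self.capacity:
            # Evict the least recently used *topmost* layer of any object,
            # so base-layer hierarchies are never fragmented.
            victim = next(k for k in self.cache
                          if k[1] == self._topmost(k[0]))
            del self.cache[victim]
        return hit

c = LayeredLRU(capacity=3)
c.request("A", 2)   # caches layers A1, A2
c.request("B", 1)   # caches B1; cache is now full
c.request("B", 2)   # adds B2 -> evicts A2 (LRU topmost), never A1 alone
assert ("A", 1) in c.cache and ("A", 2) not in c.cache
```

Because eviction always removes a topmost layer, popular base layers survive under pressure while higher layers of cold objects are pruned first, matching the behavior described above.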

3. Formal Algorithms and Data Structures

LLRU implementations in both settings use similar core data structures:

  • A doubly-linked list or balanced tree for global recency, where each node represents either an individual layer (layered objects) or a cached page (MoE expert).
  • A hash table for fast membership and pointer access, mapping the identity of each cached layer or expert to its position in the recency structure.
  • Auxiliary structures for tracking cache usage (bytes for data layers; page count for MoE).

The eviction process is distinguished by:

  • MoE LLRU: Calculate the last-round index $R(q, t) = \left\lfloor \frac{t-\tau(q)}{\ell} \right\rfloor$ and the relative layer distance $D(q, t) = (\tau(q) - (t \bmod \ell)) \bmod \ell$ for each cached page $q$ at time $t$, where $\tau(q)$ is the time of $q$'s last use. The page with the lexicographically largest $(R, D)$ is evicted. Tie-breaking can be tuned by introducing a weight parameter for $D$ (Angelopoulos et al., 2 Sep 2025).
  • Layered-object LLRU: On insertion/fetch, new and existing layers up to version vv are promoted to MRU. Eviction only removes the least-recent “topmost” (highest-index) layer among all objects, thus never fragmenting required base-layer hierarchies (Bari et al., 1 Apr 2025).

Algorithmic complexity is $O(1)$ per cache hit/miss for most manipulations, and $O(\log k)$ per eviction if a heap or tree is used to maintain eviction order keyed by multi-dimensional recency.

4. Theoretical Analysis and Performance Bounds

For MoE-LLM LLRU:

  • The competitive ratio (CR) for deterministic algorithms is bounded below by $k-\ell+1$ for any strategy, and is tight ($k$) for LRU when $\ell$ divides $k+1$.
  • For randomized algorithms, the CR is at least $\max\{H_n, (\log \ell)/(6n)\}$, where $H_n$ is the $n$-th harmonic number (Angelopoulos et al., 2 Sep 2025).
  • Standard LRU can be pathological in layered settings, whereas LLRU achieves near-optimality by respecting inter-layer progression.

For layered-object LLRU:

  • The working-set approximation provides an exact analytical model (in the limit of large $D, V$): the cache's “characteristic time” $T_C$ is found as the unique solution of $C = \sum_{i,\ell} s_{i,\ell}\left(1-e^{-\lambda_{i,\ell} T_C}\right)$, and the per-layer hit probability is $1-e^{-\lambda_{i,\ell} T_C}$ (Bari et al., 1 Apr 2025).
  • The layering property induces nontrivial trade-offs: additional layers can help when their popularity is sufficiently high and their size sufficiently small, but hurt otherwise (by reducing $T_C$ and crowding out more popular or lightweight layers).
  • The benefit of layering is nonmonotonic in both size and popularity distributions—a phenomenon numerically confirmed in several scenarios.
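The characteristic-time fixed point above is easy to solve numerically, since the occupied-capacity expression is increasing in $T_C$. The sketch below uses bisection on made-up illustrative sizes and rates; the function name and the specific values are assumptions, not from the cited analysis.

```python
import math

def characteristic_time(sizes, rates, C, hi=1e6, iters=100):
    """Bisect for T_C solving C = sum_l s_l * (1 - exp(-lam_l * T_C)).
    The right-hand side grows monotonically from 0 toward sum(sizes),
    so the root is unique whenever C < sum(sizes)."""
    def occupied(t):
        return sum(s * (1 - math.exp(-lam * t))
                   for s, lam in zip(sizes, rates))
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if occupied(mid) < C:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

sizes = [1.0, 1.0, 2.0, 2.0]   # illustrative layer sizes s_{i,l}
rates = [5.0, 1.0, 0.5, 0.1]   # illustrative request rates lambda_{i,l}
tc = characteristic_time(sizes, rates, C=3.0)
hits = [1 - math.exp(-lam * tc) for lam in rates]   # per-layer hit probs
# The fixed point is satisfied: expected occupancy equals capacity.
assert abs(sum(s * h for s, h in zip(sizes, hits)) - 3.0) < 1e-6
```

With these toy numbers, the small, hot base layers get hit probabilities near 1 while the large, cold top layers absorb most of the misses, which is the qualitative trade-off discussed above.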

5. Empirical Evaluations and Comparisons

Comparative studies using real-world MoE LLM traces (Mixtral 7B with $(\ell=32, n=8)$, Llama-MoE with $(\ell=32, n=16)$), synthetic Zipfian workloads, and layered-object simulations establish robust dominance of LLRU over classic policies (Angelopoulos et al., 2 Sep 2025, Bari et al., 1 Apr 2025):

| Algorithm | Llama-MoE (norm. faults) | Mixtral (norm. faults) |
|-----------|--------------------------|------------------------|
| OPT       | 1.00 ± 0.00              | 1.00 ± 0.00            |
| LRU       | 1.32 ± 0.15              | 1.25 ± 0.10            |
| LRU-Dist  | 1.12 ± 0.05              | 1.08 ± 0.04            |
| MARK      | 1.48 ± 0.20              | 1.35 ± 0.12            |
| LLRU      | 1.07 ± 0.04              | 1.03 ± 0.03            |

LLRU reduces misses by approximately $15\%$ versus LRU and $7\%$ versus distributed LRU (per-layer caches). It smooths the cache miss rate decay and avoids pathological eviction of imminent-layer pages. For layered objects, LLRU outperforms multi-representation LRU by up to $20$–$30\%$ in low-overhead regimes, especially when cache is limited. However, the advantage can be lost if layering overheads are large and popularity is highly skewed.

6. Parameter Sensitivity and Extensions

LLRU efficacy and optimal parameterization rely on system and workload characteristics:

  • Cache size $k$ or capacity $C$: improvement is most pronounced at moderate capacity, where cache pressure is significant but global optimization remains tractable.
  • Layer count $\ell$ and object counts $D$, $V$: the impact of extra layers is subtle. More layers can increase cache use efficiency if their fractional cost and popularity decay fast enough, but may degrade hit rates otherwise (Bari et al., 1 Apr 2025).
  • Tie-breaking and recency-weighting: For MoE LLRU, a generalized tie-breaker can tune eviction aggressiveness toward same-round pages, which is practically useful when layer access orders are not strictly periodic.
  • Extensions to multi-expert-per-layer selection or non-uniform expert sizes are treated through generalization of the round indices and possibly fractional offsets (Angelopoulos et al., 2 Sep 2025).

7. Analytical Modeling and Limit Theorems

The working-set analytical approximation used in layered-object LLRU provides provably asymptotically exact hit-rate predictions under the Independent Reference Model (IRM), both as $D \to \infty$ (fixed $V$) and as $D, V \to \infty$ (continuum limit). The fixed-point equations governing the characteristic time and per-layer hit probabilities also elucidate how hit rate depends jointly on the layer size distribution, popularity profiles, and cache capacity (Bari et al., 1 Apr 2025). Non-monotonic and non-intuitive effects, such as an extra layer decreasing overall hit rate, are rigorously explained by the coupling induced by the layering constraint and the heavy-tailed or flat popularity profiles.


LLRU generalizes LRU for hierarchical, multi-layered caching scenarios—specifically, for both expert-parameter management in mixture-of-experts LLMs and versioned data object caching in distributed storage. It inherits LRU’s competitive guarantees and ensures better empirical performance by encoding inter-layer structure into recency calculations, validated through both theoretical analysis and large-scale empirical studies (Angelopoulos et al., 2 Sep 2025, Bari et al., 1 Apr 2025).
