Attention-head properties underlying the recency effect
Identify the specific properties of transformer attention heads that support the recency effect (elevated output probability for more recent context tokens) in large language models such as Llama-3.1-8B, Mistral-7B, Qwen2.5-7B, and Gemma-2-9B, as observed in the paper's temporal-dependency analyses and ablation experiments.
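One concrete way to begin probing this question is to score each attention head by how much attention mass its final query position places on the most recent context tokens. The sketch below is a minimal illustration, not the paper's method: the "recency mass" metric, the window size `K_RECENT`, the probe sentence, and the stand-in model `gpt2` (used because the paper's models such as Llama-3.1-8B are large and gated) are all illustrative assumptions.

```python
# Minimal sketch: rank attention heads by the attention mass the last query
# position assigns to the K most recent key positions. Heads with high
# "recency mass" are candidates for supporting the recency effect.
# MODEL_NAME, K_RECENT, and the probe text are assumptions for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper studies e.g. Llama-3.1-8B, Mistral-7B
K_RECENT = 8         # size of the "recent context" window (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

text = "The list was: apple, table, river, cloud, stone, lamp, grass, door."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shape [batch, heads, seq, seq]
recency_mass = {}
for layer_idx, attn in enumerate(out.attentions):
    last_row = attn[0, :, -1, :]                  # attention from final query: [heads, seq]
    recent = last_row[:, -K_RECENT:].sum(dim=-1)  # mass on the K most recent keys
    for head_idx, mass in enumerate(recent.tolist()):
        recency_mass[(layer_idx, head_idx)] = mass

# Heads attending most strongly to recent tokens.
top = sorted(recency_mass.items(), key=lambda kv: -kv[1])[:5]
for (layer, head), mass in top:
    print(f"layer {layer:2d} head {head:2d}: recency mass {mass:.3f}")
```

A natural follow-up, mirroring the ablation experiments the question references, would be to zero out the outputs of high-recency-mass heads (e.g., via forward hooks) and check whether the model's elevated probability for recent context tokens diminishes.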
References
We leave for future work to explore what specific properties in attention heads might support the recency effect, commonly observed in humans and LLMs (manifested through high importance of more recent context).
— Temporal Dependencies in In-Context Learning: The Role of Induction Heads
(arXiv:2604.01094, Bajaj et al., 1 Apr 2026), in Discussion