Attention-head properties underlying the recency effect
Identify the specific properties of transformer attention heads that support the recency effect (elevated output probability for more recent context tokens) in large language models such as Llama-3.1-8B, Mistral-7B, Qwen2.5-7B, and Gemma-2-9B, as observed in the paper's temporal-dependency analyses and ablation experiments.
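One concrete way to begin probing this question is to score each attention head by how much attention mass its final query position places on the most recent context tokens. The sketch below is a minimal illustration, not the paper's method: the "recency mass" metric, the window size `K_RECENT`, the probe sentence, and the stand-in model `gpt2` (used because the paper's models such as Llama-3.1-8B are large and gated) are all illustrative assumptions.

```python
# Minimal sketch: rank attention heads by the attention mass the last query
# position assigns to the K most recent key positions. Heads with high
# "recency mass" are candidates for supporting the recency effect.
# MODEL_NAME, K_RECENT, and the probe text are assumptions for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper studies e.g. Llama-3.1-8B, Mistral-7B
K_RECENT = 8         # size of the "recent context" window (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

text = "The list was: apple, table, river, cloud, stone, lamp, grass, door."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shape [batch, heads, seq, seq]
recency_mass = {}
for layer_idx, attn in enumerate(out.attentions):
    last_row = attn[0, :, -1, :]                  # attention from final query: [heads, seq]
    recent = last_row[:, -K_RECENT:].sum(dim=-1)  # mass on the K most recent keys
    for head_idx, mass in enumerate(recent.tolist()):
        recency_mass[(layer_idx, head_idx)] = mass

# Heads attending most strongly to recent tokens.
top = sorted(recency_mass.items(), key=lambda kv: -kv[1])[:5]
for (layer, head), mass in top:
    print(f"layer {layer:2d} head {head:2d}: recency mass {mass:.3f}")
```

A natural follow-up, mirroring the ablation experiments the question references, would be to zero out the outputs of high-recency-mass heads (e.g., via forward hooks) and check whether the model's elevated probability for recent context tokens diminishes.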
References
We leave for future work to explore what specific properties in attention heads might support the recency effect, commonly observed in humans and LLMs (manifested through high importance of more recent context).
— Temporal Dependencies in In-Context Learning: The Role of Induction Heads
(arXiv:2604.01094, Bajaj et al., 1 Apr 2026), in Discussion