Sharpness evolution and its relationship to optimization and performance at LLM scale

Determine how loss-landscape sharpness in large language models (LLMs) evolves during large-scale training, and characterize its relationship to optimization behavior and downstream performance across tasks and data distributions.

Background

The paper emphasizes that measuring Hessian sharpness directly is impractical for LLMs due to computational constraints, which has limited prior studies to small-scale models. This limitation has left gaps in understanding how sharpness behaves at realistic LLM scales and how it interacts with optimization processes and downstream performance.

To address these measurement challenges, the authors analyze critical sharpness as a scalable proxy and present empirical evidence at scales up to 7B parameters. However, the general characterization of sharpness evolution and its systematic relationship to optimization and task performance remains framed as an open question in the introduction.
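To make the measurement problem concrete: Hessian sharpness is conventionally defined as the largest eigenvalue of the loss Hessian, which is commonly estimated with power iteration over Hessian-vector products (HVPs) rather than by forming the Hessian itself. The sketch below illustrates this standard estimator on a toy 3×3 symmetric matrix standing in for the Hessian; it is a generic illustration of the conventional sharpness measure, not the paper's critical-sharpness proxy, and the matrix and function names are invented for the example.

```python
import math

# Toy symmetric "Hessian" (3x3) standing in for the loss Hessian
# (invented for illustration; a real LLM Hessian is far too large to form).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 0.0, 1.0]]

def hvp(v):
    # Hessian-vector product; for an LLM this would come from autodiff
    # (two backward passes), never from materializing the full Hessian.
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

def top_eigenvalue(hvp_fn, dim, iters=100):
    # Power iteration: repeatedly apply the HVP and renormalize, so the
    # iterate aligns with the top eigenvector of the Hessian.
    v = [1.0] * dim
    for _ in range(iters):
        w = hvp_fn(v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged vector = sharpness estimate.
    return sum(vi * wi for vi, wi in zip(v, hvp_fn(v)))

sharpness = top_eigenvalue(hvp, 3)
print(round(sharpness, 6))  # → 4.618034, i.e. (7 + sqrt(5)) / 2
```

Each power-iteration step costs one HVP, i.e. roughly two gradient evaluations at LLM scale, which is why even this "cheap" estimator becomes expensive when sharpness must be tracked throughout a long training run, and why scalable proxies are of interest.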

References

As a result, most existing studies are restricted to small-scale experiments (typically ∼ 10M parameters), leaving open questions about how sharpness evolves in LLMs at scale, and how it relates to optimization and downstream performance.

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs  (2601.16979 - Kalra et al., 23 Jan 2026) in Section 1 (Introduction), final paragraph