
LongPPL & LongCE Metrics for LLMs

Updated 17 February 2026
  • The paper introduces LongPPL and LongCE, which assess long-context performance by selectively evaluating context-sensitive tokens rather than averaging over all tokens.
  • It employs a long–short context contrast using LSD and LCL scores to identify and re-weight key tokens, ensuring more accurate evaluation and targeted fine-tuning.
  • Empirical results demonstrate strong negative correlations with standard metrics and significant gains on long-context benchmarks, despite modest increases in computational cost.

LongPPL and LongCE are evaluation metrics and loss re-weighting strategies designed to address limitations of conventional perplexity (PPL) and cross-entropy (CE) in modeling and training LLMs on long-context tasks. Standard PPL and CE metrics are inadequate in long-sequence regimes because they average over all tokens, yielding evaluations dominated by easy, context-agnostic tokens and failing to capture a model’s ability to leverage extended context. LongPPL and LongCE focus explicitly on context-sensitive (key) tokens, providing a more accurate assessment and more effective training protocol for long-context language modeling (Fang et al., 2024, Liu et al., 20 Mar 2025).

1. Motivation and Conceptual Foundation

The impetus for LongPPL and LongCE arises from the observation that standard perplexity (PPL) has almost no correlation with task-specific performance on long-sequence benchmarks such as LongBench, LongEval, and RULER. In natural language, only a small fraction of tokens genuinely require leveraging long-context information, while the majority can be predicted from local context alone. Standard PPL averages log-likelihoods over all tokens, so models may achieve low perplexity even if their long-context understanding is poor. This leads to a decoupling between PPL and real long-context performance, with empirical Pearson correlations often near zero between PPL and benchmark scores (Fang et al., 2024).

LongPPL addresses this by measuring perplexity only over “key tokens” identified as leveraging long-range information. LongCE extends this approach to training by up-weighting loss contributions from context-sensitive positions, directly incentivizing models to learn long-context dependencies.

2. Key Token Identification via Long–Short Context Contrast

The identification of context-sensitive (key) tokens is central to both metrics. This is operationalized using a long–short context contrast:

  • For each token x_i in a sequence (x_1, ..., x_n), compute the long–short difference (LSD) score:

\mathrm{LSD}_\theta(x_i) = \log P_\theta(x_i \mid x_{1:i-1}) - \log P_\theta(x_i \mid x_{i-K:i-1})

where K is the length of the short context window and P_\theta is the model.

A token with a large \mathrm{LSD} score is markedly more predictable when conditioning on the full context than on the short window alone, indicating reliance on long-range evidence.

  • Additionally, the long-context log-likelihood (LCL) is used to filter out tokens that remain difficult even with full context:

\mathrm{LCL}_\theta(x_i) = \log P_\theta(x_i \mid x_{1:i-1})

Tokens are selected using thresholds (\alpha, \beta):

I(x_i;\theta_0) = \begin{cases} 1, & \mathrm{LSD}_{\theta_0}(x_i) > \alpha \ \text{ and } \ \mathrm{LCL}_{\theta_0}(x_i) > \beta \\ 0, & \text{otherwise} \end{cases}

where \theta_0 is the evaluator model.

This binary mask I is constructed by a two-pass process (one long-context and one short-context forward pass) that can be performed offline (Fang et al., 2024).
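The selection rule above can be sketched in a few lines of numpy. This is a minimal illustration, not the released implementation: the log-prob arrays are toy values standing in for evaluator-model outputs, and the threshold values are illustrative, not the paper's settings.

```python
import numpy as np

def key_token_mask(logp_long, logp_short, alpha=2.0, beta=-2.0):
    """Binary key-token mask from the long-short context contrast.

    logp_long[i]  = log P(x_i | x_{1:i-1})     (full context)
    logp_short[i] = log P(x_i | x_{i-K:i-1})   (truncated context)
    alpha, beta are the LSD and LCL thresholds (illustrative values).
    """
    lsd = logp_long - logp_short          # long-short difference score
    lcl = logp_long                       # long-context log-likelihood
    return (lsd > alpha) & (lcl > beta)   # I(x_i) = 1 iff both pass

# Toy per-token log-probs from a hypothetical evaluator model.
logp_long = np.array([-0.1, -0.5, -4.0, -0.2])
logp_short = np.array([-0.2, -3.5, -4.1, -0.3])

mask = key_token_mask(logp_long, logp_short)
print(mask)  # [False  True False False]
```

Only the second token both gains substantially from the long context (large LSD) and remains likely under it (LCL above the floor), so it alone is marked as a key token.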

3. Mathematical Definitions of LongPPL and LongCE

LongPPL

LongPPL evaluates a model P_\theta only over the identified key tokens:

\mathrm{LongPPL}(\mathbf{x};\theta,\theta_0) = \exp\left(-\sum_{i=1}^n \hat{I}(x_i;\theta_0) \log P_\theta(x_i \mid x_{<i})\right)

with

\hat{I}(x_i;\theta_0) = \frac{I(x_i;\theta_0)}{\sum_{j=1}^n I(x_j;\theta_0)}

ensuring uniform weighting among selected positions.
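Because the normalized weights \hat{I} put uniform mass on the selected positions, LongPPL reduces to an exponentiated negative mean of the key-token log-probs. A minimal numpy sketch with toy values (the arrays are illustrative, not model outputs):

```python
import numpy as np

def long_ppl(logp_long, mask):
    """LongPPL: perplexity restricted to the key tokens in `mask`.

    Equivalent to exp(-mean log P over key tokens), since the
    normalized weights put mass 1/|keys| on each selected position.
    """
    key_logp = logp_long[mask]
    if key_logp.size == 0:
        raise ValueError("no key tokens selected")
    return float(np.exp(-key_logp.mean()))

# Toy example: two of four tokens were selected as key tokens.
logp = np.array([-0.1, -2.0, -4.0, -1.0])
mask = np.array([False, True, False, True])
print(long_ppl(logp, mask))  # exp(1.5), about 4.48
```

Note that the exponent averages only over key tokens, so adding easy context-agnostic tokens to the sequence leaves the metric unchanged, which is the point of the construction.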

LongCE

LongCE generalizes the CE loss by re-weighting each token's contribution according to a soft influence function:

\mathrm{LongCE}(\mathbf{x};\theta) = -\frac{1}{n}\sum_{i=1}^n I_{\mathrm{soft}}(x_i;\theta) \log P_\theta(x_i \mid x_{<i})

where

I_{\mathrm{soft}}(x_i;\theta) = \min\left(\exp(\mathrm{LSD}_\theta(x_i)),\, \gamma\right) = \min\left(\frac{P_\theta(x_i \mid x_{<i})}{P_\theta(x_i \mid x_{i-K:i-1})},\, \gamma\right)

and \gamma > 1 is a cap for numerical stability. This requires two forward passes per minibatch: one with the long context and one with the short context. Training alternates between computing I_{\mathrm{soft}} and optimizing the weighted loss (Fang et al., 2024).
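The loss can be sketched directly from the two per-token log-prob vectors. Again a toy numpy illustration, assuming the two forward passes have already produced the log-probs; the cap value here is illustrative, not necessarily the paper's choice:

```python
import numpy as np

def long_ce(logp_long, logp_short, gamma=5.0):
    """LongCE loss: cross-entropy re-weighted by the capped probability ratio.

    I_soft = min(exp(LSD), gamma) = min(P_long / P_short, gamma),
    with gamma > 1 capping the weight for numerical stability.
    """
    i_soft = np.minimum(np.exp(logp_long - logp_short), gamma)
    return float(-(i_soft * logp_long).mean())

# Toy log-probs: token 1 benefits strongly from long context,
# so its weight saturates at gamma and its loss term dominates.
logp_long = np.array([-0.1, -0.5, -4.0])
logp_short = np.array([-0.2, -3.5, -4.1])
print(long_ce(logp_long, logp_short))
```

In a real training loop, `logp_long` would come from the ordinary forward pass (with gradients) and `logp_short` from a second, truncated-context pass whose outputs are detached, so that the weights steer but do not receive gradient.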

4. Empirical Evaluation and Performance Correlation

Empirical studies highlight that, unlike standard PPL, LongPPL yields a strong negative Pearson correlation with performance on long-context benchmarks. For instance, with key tokens selected on the GovReport dataset and performance measured on LongBench, LongEval, and RULER, standard PPL correlates at r = -0.18, +0.24, and +0.27 respectively, whereas LongPPL achieves r = -0.96, -0.90, and -0.90. The metric is robust to the choice of evaluator model (Llama-3.1-8B, Qwen2-72B-Instruct, etc.) and insensitive to moderate changes in the selection thresholds (\alpha, \beta). Omitting the LCL criterion, or using the same model for evaluation and scoring, collapses these correlations toward zero (Fang et al., 2024).

LongCE fine-tuning yields consistent gains on long-context tasks (e.g., LongEval accuracy +22 percentage points for Llama-2-7B, LongBench +1.3pp, RULER +7–9pp) with negligible degradation on short-context tasks (MMLU, ARC, etc.). These improvements are robust across datasets (PG-19, Pile-arxiv), base models (Mistral, Llama-2), and fine-tuning strategies (EABF, PI). Training time increases by 1.5–2×, but the overhead can be mitigated by adjusting the sliding-window stride d and the short context length K (Fang et al., 2024).

| Metric  | Key Feature                             | Correlation with Benchmarks          |
|---------|-----------------------------------------|--------------------------------------|
| PPL     | Average over all tokens                 | Near zero (r from -0.18 to +0.27)    |
| LongPPL | Average over key tokens (LSD-selected)  | Strong negative (r up to -0.96)      |
| LongCE  | Up-weighted loss on key tokens (soft)   | Increased benchmark scores           |

5. Practical Implementation Considerations

LongPPL and LongCE do not introduce architectural changes or require specialized models; they are evaluation and loss re-weighting recipes compatible with any long-context LLM. For LongPPL, key tokens are precomputed offline using a capable long-context model as evaluator, such as Llama-3.1-8B. The sliding-window trick amortizes the computational cost of the short-context forward passes to O((n-K)K^2/d), where d is the stride.
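The amortization can be illustrated by the window schedule alone, with the actual model calls omitted. The helper below is a hypothetical sketch, not taken from the released codebase; it assumes each forward pass covers K + d tokens, so the last d positions in every window see at least K tokens of preceding context and their log-probs approximate the short-context term:

```python
def sliding_windows(n, K, d):
    """Window schedule for the short-context pass over n tokens.

    Returns (window_start, first_scored, last_scored) triples.
    Roughly (n - K) / d forward passes of length K + d replace
    one length-K pass per token, giving the O((n-K) K^2 / d) cost
    (for d not much larger than K).
    """
    schedule = []
    start = 0
    while start + K < n:
        end = min(start + K + d, n)
        schedule.append((start, start + K, end - 1))
        start += d
    return schedule

print(sliding_windows(n=20, K=8, d=4))
# [(0, 8, 11), (4, 12, 15), (8, 16, 19)]
```

Here 3 passes score all positions from 8 to 19, versus 12 separate length-8 passes in the naive scheme; each scored token's effective short context is between K and K + d - 1 tokens, which is the approximation the stride introduces.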

For LongCE, only two forward passes per batch are required, and the soft weights I_{\mathrm{soft}} automatically adapt as the model improves, in a fashion reminiscent of EM-style bootstrapping. Hyperparameters (K, d, \gamma) are stable across reasonable ranges. The codebase and detailed recipes are available at https://github.com/PKU-ML/LongPPL (Fang et al., 2024). There is no canonical implementation of these metrics embedded in widespread LLM frameworks; practical usage involves adapting standard evaluation and training scripts to mask or re-weight tokens according to the criteria above (Liu et al., 20 Mar 2025).

6. Strengths, Limitations, and Diagnostic Value

The principal advantage of LongPPL and LongCE lies in their fidelity as diagnostic tools. By focusing on positions where long-range context yields measurable predictive benefit, they provide a truer gauge of a model’s effective context utilization capacity and avoid misleading reductions in perplexity driven solely by increased window size.

However, the methodology is contingent on appropriate key-token selection, which itself relies on thresholds and sliding-window heuristics. The metrics are diagnostic only; they do not alter a model's inductive bias by themselves. They require no specialized training data or model architectures, and their computational cost is moderate thanks to offline key-token selection and windowing tricks (Liu et al., 20 Mar 2025, Fang et al., 2024).

| Aspect       | LongPPL/LongCE                          | PPL/CE                                   |
|--------------|-----------------------------------------|------------------------------------------|
| Diagnostic   | High fidelity to context use            | Inflated by context-irrelevant positions |
| Complexity   | O(N), offline selection and windowing   | O(N)                                     |
| Modeling fix | No; diagnostic/auxiliary                | No                                       |

7. Impact and Adoption in Long-Context Language Modeling

LongPPL and LongCE have prompted reconsideration of evaluation norms in long-context LLM research. Their empirical effectiveness in correlating with downstream task performance and guiding efficient fine-tuning substantiates the claim that genuine long-range modeling ability is manifested only in a concentrated subset of sequence positions. A plausible implication is that future LCLM evaluation and model design practices may systematically incorporate key-token-aware metrics and training objectives. These approaches are referenced in leading surveys and adopted within codebases supporting long-context benchmarks (e.g., RULER, LongBench), though no single canonical library exists (Liu et al., 20 Mar 2025, Fang et al., 2024).
