
LongPPL & LongCE Metrics for LLMs

Updated 17 February 2026
  • The paper introduces LongPPL and LongCE, which assess long-context performance by selectively evaluating context-sensitive tokens rather than averaging over all tokens.
  • It employs a long–short context contrast using LSD and LCL scores to identify and re-weight key tokens, ensuring more accurate evaluation and targeted fine-tuning.
  • Empirical results demonstrate strong negative correlations with standard metrics and significant gains on long-context benchmarks, despite modest increases in computational cost.

LongPPL and LongCE are evaluation metrics and loss re-weighting strategies designed to address limitations of conventional perplexity (PPL) and cross-entropy (CE) in modeling and training LLMs on long-context tasks. Standard PPL and CE metrics are inadequate in long-sequence regimes because they average over all tokens, yielding evaluations dominated by easy, context-agnostic tokens and failing to capture a model’s ability to leverage extended context. LongPPL and LongCE focus explicitly on context-sensitive (key) tokens, providing a more accurate assessment and more effective training protocol for long-context language modeling (Fang et al., 2024, Liu et al., 20 Mar 2025).

1. Motivation and Conceptual Foundation

The impetus for LongPPL and LongCE arises from the observation that standard perplexity (PPL) has almost no correlation with task-specific performance on long-sequence benchmarks such as LongBench, LongEval, and RULER. In natural language, only a small fraction of tokens genuinely require leveraging long-context information, while the majority can be predicted from local context alone. Standard PPL averages log-likelihoods over all tokens, so models may achieve low perplexity even if their long-context understanding is poor. This leads to a decoupling between PPL and real long-context performance, with empirical Pearson correlations often near zero between PPL and benchmark scores (Fang et al., 2024).

LongPPL addresses this by measuring perplexity only over “key tokens” identified as leveraging long-range information. LongCE extends this approach to training by up-weighting loss contributions from context-sensitive positions, directly incentivizing models to learn long-context dependencies.

2. Key Token Identification via Long–Short Context Contrast

The identification of context-sensitive (key) tokens is central to both metrics. This is operationalized using a long–short context contrast:

  • For each token x_i in a sequence (x_1, ..., x_n), compute the long–short difference (LSD) score:

\mathrm{LSD}_\theta(x_i) = \log P_\theta(x_i \mid x_{1:i-1}) - \log P_\theta(x_i \mid x_{i-K:i-1})

where K is the length of the short context window and P_\theta is the model.

A token with a large \mathrm{LSD} score is markedly more predictable when conditioning on the full context than on the short window alone, indicating reliance on long-range evidence.

  • Additionally, the long-context log-likelihood (LCL) is used to filter out tokens that remain difficult even with full context:

\mathrm{LCL}_\theta(x_i) = \log P_\theta(x_i \mid x_{1:i-1})

Tokens are selected using thresholds (\alpha, \beta):

I(x_i;\theta_0) = \begin{cases} 1, & \mathrm{LSD}_{\theta_0}(x_i) > \alpha \ \text{ and } \ \mathrm{LCL}_{\theta_0}(x_i) > \beta \\ 0, & \text{otherwise} \end{cases}

where \theta_0 is the evaluator model.

This binary mask I is constructed by a two-pass process (one long-context and one short-context forward pass) that can be performed offline (Fang et al., 2024).
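The selection rule above can be sketched in a few lines of numpy. This is a minimal illustration, not the released implementation: the log-prob arrays are toy values standing in for evaluator-model outputs, and the threshold values are illustrative, not the paper's settings.

```python
import numpy as np

def key_token_mask(logp_long, logp_short, alpha=2.0, beta=-2.0):
    """Binary key-token mask from the long-short context contrast.

    logp_long[i]  = log P(x_i | x_{1:i-1})     (full context)
    logp_short[i] = log P(x_i | x_{i-K:i-1})   (truncated context)
    alpha, beta are the LSD and LCL thresholds (illustrative values).
    """
    lsd = logp_long - logp_short          # long-short difference score
    lcl = logp_long                       # long-context log-likelihood
    return (lsd > alpha) & (lcl > beta)   # I(x_i) = 1 iff both pass

# Toy per-token log-probs from a hypothetical evaluator model.
logp_long = np.array([-0.1, -0.5, -4.0, -0.2])
logp_short = np.array([-0.2, -3.5, -4.1, -0.3])

mask = key_token_mask(logp_long, logp_short)
print(mask)  # [False  True False False]
```

Only the second token both gains substantially from the long context (large LSD) and remains likely under it (LCL above the floor), so it alone is marked as a key token.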

3. Mathematical Definitions of LongPPL and LongCE

LongPPL

LongPPL evaluates a model P_\theta only over the identified key tokens:

\mathrm{LongPPL}(\mathbf{x};\theta,\theta_0) = \exp\left(-\sum_{i=1}^n \hat{I}(x_i;\theta_0) \log P_\theta(x_i \mid x_{<i})\right)

with

\hat{I}(x_i;\theta_0) = \frac{I(x_i;\theta_0)}{\sum_{j=1}^n I(x_j;\theta_0)}

ensuring uniform weighting among selected positions.
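Because the normalized weights \hat{I} put uniform mass on the selected positions, LongPPL reduces to an exponentiated negative mean of the key-token log-probs. A minimal numpy sketch with toy values (the arrays are illustrative, not model outputs):

```python
import numpy as np

def long_ppl(logp_long, mask):
    """LongPPL: perplexity restricted to the key tokens in `mask`.

    Equivalent to exp(-mean log P over key tokens), since the
    normalized weights put mass 1/|keys| on each selected position.
    """
    key_logp = logp_long[mask]
    if key_logp.size == 0:
        raise ValueError("no key tokens selected")
    return float(np.exp(-key_logp.mean()))

# Toy example: two of four tokens were selected as key tokens.
logp = np.array([-0.1, -2.0, -4.0, -1.0])
mask = np.array([False, True, False, True])
print(long_ppl(logp, mask))  # exp(1.5), about 4.48
```

Note that the exponent averages only over key tokens, so adding easy context-agnostic tokens to the sequence leaves the metric unchanged, which is the point of the construction.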

LongCE

LongCE generalizes the CE loss by re-weighting each token's contribution according to a soft influence function:

\mathrm{LongCE}(\mathbf{x};\theta) = -\frac{1}{n}\sum_{i=1}^n I_{\mathrm{soft}}(x_i;\theta) \log P_\theta(x_i \mid x_{<i})

where

I_{\mathrm{soft}}(x_i;\theta) = \min\left(\exp(\mathrm{LSD}_\theta(x_i)),\, \gamma\right) = \min\left(\frac{P_\theta(x_i \mid x_{<i})}{P_\theta(x_i \mid x_{i-K:i-1})},\, \gamma\right)

and \gamma > 1 is a cap for numerical stability. This requires two forward passes per minibatch: one with the long context and one with the short context. Training alternates between computing I_{\mathrm{soft}} and optimizing the weighted loss (Fang et al., 2024).
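The loss can be sketched directly from the two per-token log-prob vectors. Again a toy numpy illustration, assuming the two forward passes have already produced the log-probs; the cap value here is illustrative, not necessarily the paper's choice:

```python
import numpy as np

def long_ce(logp_long, logp_short, gamma=5.0):
    """LongCE loss: cross-entropy re-weighted by the capped probability ratio.

    I_soft = min(exp(LSD), gamma) = min(P_long / P_short, gamma),
    with gamma > 1 capping the weight for numerical stability.
    """
    i_soft = np.minimum(np.exp(logp_long - logp_short), gamma)
    return float(-(i_soft * logp_long).mean())

# Toy log-probs: token 1 benefits strongly from long context,
# so its weight saturates at gamma and its loss term dominates.
logp_long = np.array([-0.1, -0.5, -4.0])
logp_short = np.array([-0.2, -3.5, -4.1])
print(long_ce(logp_long, logp_short))
```

In a real training loop, `logp_long` would come from the ordinary forward pass (with gradients) and `logp_short` from a second, truncated-context pass whose outputs are detached, so that the weights steer but do not receive gradient.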

4. Empirical Evaluation and Performance Correlation

Empirical studies highlight that, unlike standard PPL, LongPPL yields a strong negative Pearson correlation with performance on long-context benchmarks. For instance, with key tokens selected on the GovReport dataset and performance measured on LongBench, LongEval, and RULER, standard PPL correlates at r = -0.18, +0.24, and +0.27 respectively, whereas LongPPL achieves r = -0.96, -0.90, and -0.90. The metric is robust to the choice of evaluator model (Llama-3.1-8B, Qwen2-72B-Instruct, etc.) and insensitive to moderate changes in the selection thresholds (\alpha, \beta). Omitting the LCL criterion, or using the same model for evaluation and scoring, collapses these correlations toward zero (Fang et al., 2024).

LongCE fine-tuning yields consistent gains on long-context tasks (e.g., LongEval accuracy +22 percentage points for Llama-2-7B, LongBench +1.3pp, RULER +7–9pp) with negligible degradation on short-context tasks (MMLU, ARC, etc.). These improvements are robust across datasets (PG-19, Pile-arxiv), base models (Mistral, Llama-2), and fine-tuning strategies (EABF, PI). Training time increases by 1.5–2×, but the overhead can be mitigated by adjusting the sliding-window stride d and the short context length K (Fang et al., 2024).

| Metric  | Key Feature                             | Correlation with Benchmarks          |
|---------|-----------------------------------------|--------------------------------------|
| PPL     | Average over all tokens                 | Near zero (r from -0.18 to +0.27)    |
| LongPPL | Average over key tokens (LSD-selected)  | Strong negative (r up to -0.96)      |
| LongCE  | Up-weighted loss on key tokens (soft)   | Increased benchmark scores           |

5. Practical Implementation Considerations

LongPPL and LongCE do not introduce architectural changes or require specialized models; they are evaluation and loss re-weighting recipes compatible with any long-context LLM. For LongPPL, key tokens are precomputed offline using a capable long-context model as evaluator, such as Llama-3.1-8B. The sliding-window trick amortizes the computational cost of the short-context forward passes to O((n-K)K^2/d), where d is the stride.
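The amortization can be illustrated by the window schedule alone, with the actual model calls omitted. The helper below is a hypothetical sketch, not taken from the released codebase; it assumes each forward pass covers K + d tokens, so the last d positions in every window see at least K tokens of preceding context and their log-probs approximate the short-context term:

```python
def sliding_windows(n, K, d):
    """Window schedule for the short-context pass over n tokens.

    Returns (window_start, first_scored, last_scored) triples.
    Roughly (n - K) / d forward passes of length K + d replace
    one length-K pass per token, giving the O((n-K) K^2 / d) cost
    (for d not much larger than K).
    """
    schedule = []
    start = 0
    while start + K < n:
        end = min(start + K + d, n)
        schedule.append((start, start + K, end - 1))
        start += d
    return schedule

print(sliding_windows(n=20, K=8, d=4))
# [(0, 8, 11), (4, 12, 15), (8, 16, 19)]
```

Here 3 passes score all positions from 8 to 19, versus 12 separate length-8 passes in the naive scheme; each scored token's effective short context is between K and K + d - 1 tokens, which is the approximation the stride introduces.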

For LongCE, only two forward passes per batch are required, and the soft weights I_{\mathrm{soft}} automatically adapt as the model improves, in a fashion reminiscent of EM-style bootstrapping. Hyperparameters (K, d, \gamma) are stable across reasonable ranges. The codebase and detailed recipes are available at https://github.com/PKU-ML/LongPPL (Fang et al., 2024). There is no canonical implementation of these metrics embedded in widespread LLM frameworks; practical usage involves adapting standard evaluation and training scripts to mask or re-weight tokens according to the criteria above (Liu et al., 20 Mar 2025).

6. Strengths, Limitations, and Diagnostic Value

The principal advantage of LongPPL and LongCE lies in their fidelity as diagnostic tools. By focusing on positions where long-range context yields measurable predictive benefit, they provide a truer gauge of a model’s effective context utilization capacity and avoid misleading reductions in perplexity driven solely by increased window size.

However, the methodology is contingent on appropriate key-token selection, which itself relies on thresholds and sliding-window heuristics. The metrics are diagnostic only; they do not alter a model's inductive bias by themselves. They require no specialized training data or model architectures, and their computational cost is moderate thanks to offline key-token selection and windowing tricks (Liu et al., 20 Mar 2025, Fang et al., 2024).

| Aspect       | LongPPL/LongCE                          | PPL/CE                                   |
|--------------|-----------------------------------------|------------------------------------------|
| Diagnostic   | High fidelity to context use            | Inflated by context-irrelevant positions |
| Complexity   | O(N), offline selection and windowing   | O(N)                                     |
| Modeling fix | No; diagnostic/auxiliary                | No                                       |

7. Impact and Adoption in Long-Context Language Modeling

LongPPL and LongCE have prompted reconsideration of evaluation norms in long-context LLM research. Their empirical effectiveness in correlating with downstream task performance and guiding efficient fine-tuning substantiates the claim that genuine long-range modeling ability is manifested only in a concentrated subset of sequence positions. A plausible implication is that future LCLM evaluation and model design practices may systematically incorporate key-token-aware metrics and training objectives. These approaches are referenced in leading surveys and adopted within codebases supporting long-context benchmarks (e.g., RULER, LongBench), though no single canonical library exists (Liu et al., 20 Mar 2025, Fang et al., 2024).
