Hilberg’s conjecture on vanishing code length at infinite context

Determine whether the mean next-character code length L(N) for English text, defined as the expected code length for predicting the next token given the previous N characters, tends to zero as N approaches infinity, as conjectured by Hilberg.
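To make the definition concrete, the following Python sketch estimates L(N) for a small synthetic corpus. It uses empirical context counts in place of a language model; the corpus, function name, and in-sample counting scheme are illustrative assumptions here, not the paper's LLM-based measurement procedure.

```python
from collections import Counter
from math import log2

def mean_code_length(text, N):
    """In-sample estimate of L(N): the average of -log2 p(next char | previous
    N chars), with conditional probabilities taken from empirical counts."""
    ctx_counts = Counter()   # counts of each length-N context
    pair_counts = Counter()  # counts of (context, next character) pairs
    for i in range(N, len(text)):
        ctx = text[i - N:i]
        ctx_counts[ctx] += 1
        pair_counts[(ctx, text[i])] += 1
    total_bits = 0.0
    n = 0
    for (ctx, ch), c in pair_counts.items():
        p = c / ctx_counts[ctx]
        total_bits += c * -log2(p)
        n += c
    return total_bits / n  # mean bits per character

# A repetitive toy corpus: L(N) should shrink as the context N grows.
corpus = "the cat sat on the mat. the cat sat on the hat. " * 50
for N in (1, 2, 4, 8):
    print(N, round(mean_code_length(corpus, N), 3))
```

On real text measured with strong models the decrease is far slower, which is exactly the regime the paper probes.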

Background

The paper measures the mean code length L(N) for next-token prediction across very long contexts (up to $\sim 10^4$ characters) using several modern LLMs and multiple corpora. Empirically, L(N) continues to decrease with N without showing a clear plateau, especially for internet-scale corpora such as C4.

This observed long-range decrease is consistent with a long-standing conjecture due to Hilberg: that L(N) might vanish as N→∞, which would imply extremely long-range structure in language and potentially sub-extensive entropy. The authors emphasize that their measurements do not settle the conjecture but are consistent with it at the scales explored.
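In its commonly cited power-law form (an assumption here; the measurements above speak only to qualitative consistency), Hilberg's conjecture posits sub-extensive block entropy $H(N) \approx A N^{\beta}$ with $0 < \beta < 1$ (Hilberg suggested $\beta \approx 0.5$). Since the per-character code length is the entropy increment, this form would force it to vanish:

$L(N) \approx H(N+1) - H(N) \approx A \beta N^{\beta - 1} \rightarrow 0$ as $N \rightarrow \infty$.

The slow decay observed in the paper ($\beta - 1$ close to zero would give a very gradual decrease) is why the data cannot distinguish a vanishing limit from a small positive plateau.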

References

This is consistent with the conjecture that $L(N \rightarrow \infty)$ might actually vanish, though the decay is much slower than one would estimate from data at smaller $N$.

Large language models and the entropy of English  (2512.24969 - Scheibner et al., 31 Dec 2025) in Main text, paragraph discussing Fig. 1 (paragraph beginning “Three of the four models agree almost perfectly…”)