Origin of cross-genre differences in code length and conditional entropy

Ascertain whether the observed cross-genre differences in the dependence of code length $L(N)$ and conditional entropy $s(N)$ on context length $N$ arise from inherent properties of the texts in each genre or from each genre's relationship to the training data of the large language models used for estimation.

Background

Across genres (e.g., an internet corpus, Wikipedia, poetry), the authors find that code lengths and conditional entropies behave differently as functions of context length $N$: poetry shows longer code lengths and an apparent plateau, while the other corpora continue to decline.
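
To make the measured quantities concrete, here is a minimal sketch of how one such curve can be estimated: the code length at context length $N$ is the average of $-\log_2 p(\text{token}_i \mid \text{previous } N \text{ tokens})$ under a causal language model, converted to bits per character. The model name (`gpt2`), the choice to measure $N$ in tokens, and the brute-force one-forward-pass-per-position loop are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch (not the paper's pipeline): estimate code length L(N) in
# bits per character by scoring each token given exactly N tokens of context.
# Assumptions: "gpt2" as the causal LM, and N measured in tokens.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def code_length_bits_per_char(text: str, context_len: int,
                              model_name: str = "gpt2") -> float:
    """Average -log2 p(token_i | previous context_len tokens), per character."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    ids = tok(text, return_tensors="pt").input_ids[0]
    total_bits, n_scored = 0.0, 0
    with torch.no_grad():
        for i in range(context_len, len(ids)):
            ctx = ids[i - context_len:i].unsqueeze(0)   # exactly N context tokens
            logits = model(ctx).logits[0, -1]           # next-token distribution
            logp = torch.log_softmax(logits, dim=-1)[ids[i]].item()
            total_bits += -logp / math.log(2)           # nats -> bits
            n_scored += 1
    chars_per_token = len(text) / len(ids)              # rough per-char conversion
    return total_bits / (n_scored * chars_per_token)
```

Sweeping `context_len` over a range of values for each genre's texts traces out the $L(N)$ curves being compared; since optimal code length upper-bounds entropy, the same per-token log-probabilities also yield an estimate of the conditional entropy $s(N)$.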

The authors explicitly state uncertainty about whether these differences reflect intrinsic properties of the genres themselves or are artifacts of how each genre was represented in the models' training data. This is a key unresolved issue in interpreting LLM-based entropy measurements of language.
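
One way such a question could be probed (a hypothetical experiment, not one proposed in the paper) is to compare curves for texts of a genre the model has plausibly seen in training against texts of the same genre that postdate its training cutoff: if the genre differences were training-set artifacts, the two curves should diverge. The corpora below are placeholders, and the sketch reuses `code_length_bits_per_char` from above.

```python
# Hypothetical probe (not from the paper): compare mean L(N) curves for
# likely-seen vs. guaranteed-unseen texts of the same genre. Reuses
# code_length_bits_per_char from the sketch above.
CONTEXT_LENS = (8, 32, 128, 512)

def mean_curve(texts, context_lens, model_name="gpt2"):
    """Average bits-per-character code length over texts, per context length."""
    return {N: sum(code_length_bits_per_char(t, N, model_name) for t in texts)
               / len(texts)
            for N in context_lens}

# Placeholder corpora: poems published before vs. after the model's
# training-data cutoff (to be replaced with real collections).
pre_cutoff_poems = ["...poems the model has plausibly seen..."]
post_cutoff_poems = ["...poems published after the cutoff..."]

seen = mean_curve(pre_cutoff_poems, CONTEXT_LENS)
unseen = mean_curve(post_cutoff_poems, CONTEXT_LENS)
for N in CONTEXT_LENS:
    print(f"N={N}: seen={seen[N]:.3f}  unseen={unseen[N]:.3f}  bits/char")
```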

References

On the other hand we see significant differences across genres in the dependence of code length $L(N)$ and conditional entropy $s(N)$ on the context length $N$, and it is not clear whether this reflects inherent features of the real text or each genre's relationship to the models' training sets.

Large language models and the entropy of English (2512.24969, Scheibner et al., 31 Dec 2025), in Summary (final paragraph of main text)