Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models

Published 22 May 2024 in cs.CL, cs.AI, cs.IT, and math.IT | arXiv:2405.13798v3

Abstract: We prove a new asymptotic equipartition property for the perplexity of long texts generated by an LLM and present supporting experimental evidence from open-source models. Specifically, we show that the logarithmic perplexity of any long text generated by an LLM must asymptotically converge to the average entropy of its token distributions. This defines a "typical set" to which all long synthetic texts generated by an LLM must belong. We show that this typical set is a vanishingly small subset of all possible grammatically correct outputs. These results suggest possible applications to important practical problems such as (a) detecting synthetic AI-generated text, and (b) testing whether a text was used to train an LLM. We make no simplifying assumptions (such as stationarity) about the statistics of LLM outputs, and therefore our results are directly applicable to practical real-world models without any approximations.
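The central claim can be illustrated numerically: if tokens are sampled from a sequence of (possibly non-stationary) per-step distributions, the average negative log-probability of the sampled tokens (the log-perplexity) converges to the average per-step entropy by the law of large numbers. The sketch below is not the paper's code; the `next_token_distribution` function is a hypothetical stand-in for a real LLM's conditional next-token distributions, deliberately made time-varying to mirror the paper's no-stationarity assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 50       # toy vocabulary size
T = 20_000   # number of generated tokens

def next_token_distribution(t):
    """Toy stand-in for an LLM's step-t conditional distribution.
    Deliberately non-stationary: the logits' scale drifts with t."""
    logits = rng.standard_normal(V) * (1 + 0.5 * np.sin(t / 100.0))
    p = np.exp(logits - logits.max())
    return p / p.sum()

neg_log_probs = []  # -log p_t(x_t) for each sampled token x_t
entropies = []      # H(p_t), the entropy of each step's distribution
for t in range(T):
    p = next_token_distribution(t)
    x = rng.choice(V, p=p)
    neg_log_probs.append(-np.log(p[x]))
    entropies.append(-np.sum(p * np.log(p)))

# Log-perplexity of the generated text vs. average token-distribution entropy:
log_perplexity = np.mean(neg_log_probs)
avg_entropy = np.mean(entropies)
# By the AEP, these two quantities converge as T grows.
print(log_perplexity, avg_entropy)
```

In this toy run the two averages agree to within a few hundredths of a nat, even though each step's distribution is different, which is exactly the concentration the paper's typical-set argument relies on.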
