Intrinsic-property conjecture for human text probability

Establish whether the tendency of human-written text not to maximize per-token probability under left-to-right language models is an intrinsic property of human language.

Background

The authors show that natural text generally has lower per-token probability under the model than text generated by beam search: unlike beam-search output, it tends not to stay in high-probability regions of the model's distribution or fall into repetition loops. They hypothesize that this reflects communicative preferences (e.g., informativeness) rather than model shortcomings.
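
To make the measurement concrete, here is a minimal sketch of the comparison, assuming the Hugging Face transformers library and GPT-2 as the left-to-right model (an illustrative stand-in, not necessarily the paper's exact setup). The prompt, human continuation, and beam width below are hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any left-to-right causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprobs(full_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Per-token log-probabilities the model assigns to the tokens
    that follow the first `prompt_len` tokens of `full_ids`."""
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t+1, so shift logits left by one step.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:]

prompt = "The city council met on Tuesday to"  # hypothetical context
human_continuation = " debate a proposal that drew sharp criticism."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
prompt_len = prompt_ids.shape[1]

# Score the human-written continuation under the model.
human_ids = tokenizer(prompt + human_continuation, return_tensors="pt").input_ids
human_lp = continuation_logprobs(human_ids, prompt_len)

# Generate a beam-search continuation from the same context and score it
# identically (ignoring possible end-of-text padding for brevity).
beam_ids = model.generate(prompt_ids, num_beams=16, max_new_tokens=12,
                          early_stopping=True, pad_token_id=tokenizer.eos_token_id)
beam_lp = continuation_logprobs(beam_ids, prompt_len)

print(f"human mean log-prob per token: {human_lp.mean().item():.3f}")
print(f"beam  mean log-prob per token: {beam_lp.mean().item():.3f}")
```

Consistent with the paper's observation, the beam-search continuation will typically score far higher per token than a natural continuation of the same context.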

They conjecture that the property is intrinsic to human language and note that per-word learning objectives, and left-to-right models that lack any global model of the text, may struggle to capture it, suggesting the need for theoretical clarification.
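
For context, the "per-word learning objective" is the standard token-level maximum-likelihood objective over the left-to-right chain-rule factorization (standard notation, not drawn from the note above):

```latex
p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```

Each summand scores a single token given its prefix, so the objective says nothing directly about global, sequence-level properties such as informativeness; this locality is one reading of why such models might struggle to capture the conjectured property.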

References

"Why is human-written text not the most probable text? We conjecture that this is an intrinsic property of human language."

The Curious Case of Neural Text Degeneration (Holtzman et al., 2019, arXiv:1904.09751), Subsection "Natural Language Does Not Maximize Probability"