Multifractal analysis of sentence lengths in English literary texts

Published 13 Dec 2012 in physics.data-an, cs.CL, and physics.soc-ph | (1212.3171v1)

Abstract: This paper presents an analysis of 30 literary texts written in English by different authors. For each text, a time series representing sentence lengths in words was created, and its fractal properties were analyzed using two methods of multifractal analysis: MFDFA and WTMM. Both methods showed that some texts can be considered multifractal in this representation, but the majority are not multifractal or even not fractal at all. Out of 30 books, only a few have lengths of consecutive sentences correlated strongly enough that the analyzed signals can be interpreted as real multifractals. An interesting direction for future investigation would be to identify the specific features that cause certain texts to be multifractal and others to be monofractal or not fractal at all.

Summary

The paper finds that genuine multifractality in sentence-length time series is rare in English literary texts.
It uses MFDFA and WTMM techniques to robustly measure scaling properties and validate multifractality against surrogate data.
The analysis highlights stylistic consistency among some authors while challenging the notion of universal fractal patterns in language.

Introduction

The statistical analysis of natural language has long drawn on methods from information science, complex systems, and statistical physics. Despite established interest in phenomena such as Zipf's law and long-range correlations in literary texts, the degree to which hierarchical and scaling properties of natural language are multifractal remains underexplored. The paper "Multifractal analysis of sentence lengths in English literary texts" (1212.3171) examines this question on a corpus of 30 English literary texts, focusing specifically on the sequential structure formed by sentence lengths.

Methodology

The investigation centers on 30 works by different English-language authors. Each text is converted into a one-dimensional time series by counting the number of words between consecutive sentence-terminating punctuation marks, so that the $i$-th value of the series is the length in words of the $i$-th sentence. This representation treats the sentence as the basic unit of discourse.
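The conversion from text to series can be sketched as follows. This is a minimal illustration and not the paper's preprocessing: the function name `sentence_lengths`, the choice of `.`, `!` and `?` as sentence terminators, and the word pattern are all assumptions made here.

```python
import re

def sentence_lengths(text: str) -> list[int]:
    """Return the number of words in each sentence of `text`."""
    # Assumption: a sentence ends at one or more of '.', '!' or '?'.
    # Abbreviations such as "Mr." are therefore split incorrectly.
    fragments = re.split(r"[.!?]+", text)
    lengths = []
    for fragment in fragments:
        # Assumption: a word starts with a word character and may contain
        # apostrophes or hyphens (e.g. "don't", "well-known").
        words = re.findall(r"\w[\w'-]*", fragment)
        if words:  # skip empty fragments, e.g. after the final period
            lengths.append(len(words))
    return lengths

sample = "It was late. Nobody spoke, and nobody moved! Why would they?"
print(sentence_lengths(sample))  # → [3, 5, 3]
```

A real corpus would need more careful handling of abbreviations, quotations, and ellipses, since each of these shifts sentence boundaries and therefore changes the series.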

Two independent methodologies are employed to assess multifractality:

Multifractal Detrended Fluctuation Analysis (MFDFA): This approach generalizes Hurst exponent estimation by computing fluctuation functions $F_q(n)$ at multiple scales $n$ and orders $q$, with $F_q(n) \sim n^{h(q)}$. A time series is classified as multifractal if the generalized Hurst exponent $h(q)$ varies with $q$; a constant $h(q)$ indicates a monofractal.
Wavelet Transform Modulus Maxima (WTMM): As an independent cross-check, WTMM analyzes scaling properties in the time-scale plane and extracts the singularity spectrum $f(\alpha)$. Agreement between MFDFA and WTMM strengthens confidence that detected multifractality is genuine.
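To make the MFDFA step concrete, here is a minimal sketch of the standard procedure: integrate the series, detrend it in windows of size $n$, average the $q$-th-order fluctuations, and fit the slope of $\log F_q(n)$ against $\log n$. The function name `mfdfa`, its parameters, and the linear detrending order are assumptions made here, not the paper's implementation.

```python
import numpy as np

def mfdfa(series, scales, qs, order=1):
    """Estimate h(q) as the slope of log F_q(n) versus log n."""
    x = np.asarray(series, dtype=float)
    profile = np.cumsum(x - x.mean())  # integrated, mean-removed signal
    N = len(profile)
    log_F = np.empty((len(qs), len(scales)))
    for j, n in enumerate(scales):
        n_seg = N // n
        # Windows are taken from both ends so the tail is not discarded.
        starts = [k * n for k in range(n_seg)]
        starts += [N - (k + 1) * n for k in range(n_seg)]
        t = np.arange(n)
        var = np.empty(len(starts))
        for i, s in enumerate(starts):
            seg = profile[s:s + n]
            trend = np.polyval(np.polyfit(t, seg, order), t)
            var[i] = np.mean((seg - trend) ** 2)  # detrended variance
        for i, q in enumerate(qs):
            if q == 0:
                # q -> 0 limit: logarithmic average of the variances
                log_F[i, j] = 0.5 * np.mean(np.log(var))
            else:
                log_F[i, j] = np.log(np.mean(var ** (q / 2.0))) / q
    log_n = np.log(scales)
    return np.array([np.polyfit(log_n, row, 1)[0] for row in log_F])

# Usage on uncorrelated noise, where h(q) should stay close to 0.5
rng = np.random.default_rng(0)
noise = rng.normal(size=8192)
qs = [-3, -2, -1, 0, 1, 2, 3]
h = mfdfa(noise, scales=[16, 32, 64, 128, 256], qs=qs)
for q, hq in zip(qs, h):
    print(f"q={q:+d}  h(q)={hq:.2f}")
```

The sketch assumes every scale is much larger than `order + 1`, so that no window is fitted exactly; a zero detrended variance would break the logarithm and the negative-$q$ moments. In practice the scaling range should also be checked visually before trusting the fitted slopes.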

The analysis accounts for signal nonstationarity and guards against spurious multifractality by comparing the results with surrogate (randomized) data.
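One common way to build such surrogates is to shuffle the series, which preserves the distribution of values but destroys their ordering. The sketch below assumes this shuffling approach, which may differ from the paper's surrogate construction. The helper names are hypothetical, and the lag-1 autocorrelation stands in for whichever statistic is actually compared (for instance the spread of $h(q)$).

```python
import numpy as np

def shuffled_surrogates(series, n_surrogates=100, seed=0):
    """Return randomly permuted copies of `series` (same values, new order)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(series)
    return [rng.permutation(x) for _ in range(n_surrogates)]

def lag1_autocorrelation(x):
    """Stand-in test statistic: correlation between consecutive values."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d * d))

# Toy correlated signal: a 10-point moving average of white noise
rng = np.random.default_rng(42)
correlated = np.convolve(rng.normal(size=2000), np.ones(10) / 10, mode="valid")

observed = lag1_autocorrelation(correlated)
null = np.array([lag1_autocorrelation(s) for s in shuffled_surrogates(correlated)])
lo, hi = np.percentile(null, [2.5, 97.5])
print(f"observed = {observed:.2f}, surrogate 95% band = [{lo:.2f}, {hi:.2f}]")
```

If the statistic of the original series falls well outside the surrogate band, the effect depends on the ordering of values and not merely on their distribution. Multifractality that survives shuffling is usually attributed to a heavy-tailed value distribution instead of correlations.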

Results

Empirical application to the 30-text corpus reveals heterogeneity in fractal characteristics:

Majority Non-Multifractal: Most texts do not exhibit genuine multifractal scaling in their sentence-length time series. Many are monofractal or bifractal, or lack fractal properties altogether.
Detection of Genuine Multifractals: Only a small subset of texts shows robust multifractality, operationally defined by a sufficiently broad singularity spectrum $f(\alpha)$ and a clear dependence of $h(q)$ on $q$ sustained over a substantial range of scales. In these cases, the autocorrelation function decays as a power law, consistent with long-range dependence.
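The two diagnostics named in this list, the $q$-dependence of $h(q)$ and the breadth of $f(\alpha)$, are linked by a Legendre transform. The relations below are the standard ones from the MFDFA literature, not formulas quoted from the paper; $\tau(q)$ and $\Delta\alpha$ are symbols introduced here for illustration:

$$
\tau(q) = q\,h(q) - 1, \qquad \alpha = \frac{d\tau}{dq} = h(q) + q\,h'(q), \qquad f(\alpha) = q\,\alpha - \tau(q) = q\,[\alpha - h(q)] + 1 .
$$

For a monofractal, $h(q) = H$ is constant, so $\alpha = H$ and $f(\alpha) = 1$, and the spectrum collapses to a single point. The more $h(q)$ varies with $q$, the larger the width $\Delta\alpha = \alpha_{\max} - \alpha_{\min}$, which is why a broad spectrum and a pronounced $q$-dependence of $h(q)$ are two views of the same property.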

An author-level comparison indicates stylistic invariance in some cases (e.g., works by Twain and Conan Doyle share similar fractal signatures), while other authors (e.g., Austen) show marked variation between their own works.

The outcome underscores that correlation structures in sentence lengths are not universally multifractal and that the origin of observed multifractality, when present, remains unexplained. The results do not indicate a universal linguistic principle but rather point to the complexity and variability inherent in language production and literary style.

Implications

Theoretical

The findings challenge the assumption that written language, when abstracted to sentence-length statistics, universally embodies multifractal structure. The rarity of multifractality in this domain implies that observed fractal characteristics in literary texts are contingent and may relate to deeper cognitive or stylistic generators not captured by simple metrics.

From a complex systems perspective, the evidence suggests that long-range correlations and scaling behavior in language are context- and representation-dependent. Future inquiries should aim at disentangling the contributions of narrative structure, authorial style, and possibly genre or cognitive constraints to the emergence of multifractality.

Practical

These results inform application domains such as stylometric analysis, natural language generation, and authorship attribution. The lack of ubiquitous multifractality diminishes the value of sentence-length multifractal features as universal discriminators but highlights their potential selectivity for certain authors or literary genres. Moreover, the applied methodology offers a template for further studies using more granular linguistic features (e.g., clause length, semantic segment size) or cross-linguistic corpora.

Future Directions

Key open questions involve the identification of text-intrinsic or author-intrinsic factors driving multifractal properties. A promising avenue is combining multifractality analysis with syntactic and semantic feature extraction or leveraging controlled manipulations of text to disentangle the effects of narrative form, editing style, and psychological factors.

Extending such analysis to multilingual or multimodal corpora and integrating these findings into more comprehensive models of linguistic complexity and cognitive representation would bridge gaps between quantitative linguistics and cognitive science.

Conclusion

The study rigorously interrogates fractal properties of sentence-length time series in English literary texts and finds that genuine multifractality is rare. The multifractal character, when present, is text-specific and cannot be ascribed to universal properties of the English language or written discourse at the sentence level. Theoretical and practical implications call for broader, multimodal investigations to uncover the generators of multifractal signatures in natural language (1212.3171).