
Lexical Diversity Metrics Overview

Updated 20 January 2026
  • Lexical Diversity Metrics are measures that quantify the breadth, richness, and evenness of vocabulary in texts for diverse linguistic applications.
  • They encompass type–token ratios, windowed indices like MATTR/MTLD, probabilistic scores such as HD-D, and entropy-based methods to mitigate text-length biases.
  • Applications include corpus comparison, machine translation evaluation, and literary studies, providing robust insights into vocabulary variation across genres.

Lexical diversity metrics quantify the breadth, richness, and distributional properties of the vocabulary within a text or corpus. They play a central role in computational linguistics, applied linguistics, machine translation evaluation, synthetic text quality assessment, literary studies, and large-scale corpus analysis. The landscape of lexical diversity measurement is heterogeneous—spanning simple surface ratios, information-theoretic indices, ecological diversity measures, parametric growth-curve estimators, neural proxy-based metrics, and hybrid approaches that address persistent challenges such as text-length normalization and contextual complexity.

1. Core Families of Lexical Diversity Metrics

The principal approaches to lexical diversity fall into distinct but overlapping families of metrics, each with specific mathematical formalisms, interpretive logics, and practical strengths.

Type–Token Ratios and Variants: The type–token ratio (TTR), defined as \mathrm{TTR} = V / N for V unique word types and N tokens, is foundational but exhibits a strong inverse dependency on text length (Rosillo-Rodes et al., 2024, Luis et al., 21 Nov 2025, Deshpande et al., 20 Jul 2025, Bestgen, 2023, Ploeger et al., 2024). Variants include:

  • Guiraud’s R = V / \sqrt{N}
  • Herdan’s C = \ln V / \ln N
  • Maas’s a = (\ln N - \ln V) / (\ln N)^2

These surface metrics are widely deployed for rapid screening within homogeneous-length corpora but are not considered robust for variable-length or large texts (Luis et al., 21 Nov 2025, Bestgen, 2023).
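The four surface ratios above can be computed directly from a token list. The following sketch is illustrative (function and key names are my own, not from any of the cited packages):

```python
import math

def ttr_variants(tokens):
    """Compute TTR and its classic length-corrected variants."""
    n = len(tokens)          # token count N
    v = len(set(tokens))     # type count V
    return {
        "TTR": v / n,                                            # V / N
        "Guiraud_R": v / math.sqrt(n),                           # V / sqrt(N)
        "Herdan_C": math.log(v) / math.log(n),                   # ln V / ln N
        "Maas_a": (math.log(n) - math.log(v)) / math.log(n) ** 2,
    }

print(ttr_variants("the cat sat on the mat and the dog sat too".split()))
```

Running the same function on truncated prefixes of a long text makes the length bias visible: TTR falls steadily with N, while Herdan's C and Maas's a drift far less.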

Windowed and Segmental Indices: To mitigate length sensitivity, metrics such as Moving-Average TTR (MATTR), Mean Segmental TTR (MSTTR), and the Measure of Textual Lexical Diversity (MTLD) segment the text or apply sliding windows, averaging TTR across fixed-size spans (Deshpande et al., 20 Jul 2025, Luis et al., 21 Nov 2025, Bestgen, 2023, Fu et al., 2021, Ploeger et al., 2024). MTLD is particularly robust: it computes the average span required for the running TTR to fall below a threshold (typ. 0.72):

\mathrm{MTLD} = \frac{N_\text{tokens}}{S}

where S is the number of segments (or “factors”) needed.
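A minimal one-directional MTLD sketch follows. Note this is a simplification: the published measure averages a forward and a backward pass over the text, and implementations differ in how they weight the final partial factor; the function name and details here are assumptions for illustration.

```python
def mtld_onepass(tokens, threshold=0.72):
    """One-directional MTLD sketch: count the 'factors' (segments over which
    the running TTR falls to the threshold), then divide tokens by factors.
    The standard measure averages forward and backward passes."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        types.add(tok)
        count += 1
        if len(types) / count <= threshold:  # running TTR hit the threshold
            factors += 1.0
            types, count = set(), 0          # start a new factor
    if count > 0:  # credit a partial factor for the leftover segment
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors else float("inf")
```

Maximally repetitive text breaks into many short factors (low MTLD), while text whose running TTR never reaches the threshold yields no complete factor.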

Probabilistic and Statistical Indices: Probabilistic reduction indices such as HD-D compute the expected type–token ratio of a random sample of size n drawn without replacement,

\mathrm{HD\text{-}D}(n) = \frac{1}{n} \sum_{i=1}^{V} \left[ 1 - \frac{\binom{N - n_i}{n}}{\binom{N}{n}} \right]

with n_i the frequency of type i (Bestgen, 2023). Vocabulary growth curves (VOCD) estimate the exponent D in V(N) \approx a N^D via regression on log–log plots (Luis et al., 21 Nov 2025). These approaches are favored for fine-grained, sample-size–controlled corpus comparisons.
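The HD-D formula above translates directly into code via the hypergeometric probability that a type is absent from the sample. This sketch (names are my own; the conventional default sample size is 42) assumes the text has at least `sample_size` tokens:

```python
from collections import Counter
from math import comb

def hdd(tokens, sample_size=42):
    """HD-D sketch: each type's expected contribution to the TTR of a
    random sample of `sample_size` tokens drawn without replacement.
    Requires len(tokens) >= sample_size."""
    n = len(tokens)
    total = 0.0
    for freq in Counter(tokens).values():
        # Hypergeometric P(type never appears in the sample); comb() returns
        # 0 when n - freq < sample_size, i.e. the type is certain to appear.
        p_absent = comb(n - freq, sample_size) / comb(n, sample_size)
        total += (1.0 - p_absent) / sample_size
    return total
```

For a text of 50 all-distinct tokens, every sample of 42 tokens contains 42 distinct types, so the expected sample TTR is exactly 1.0; for a single type repeated throughout, it is 1/42.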

Information-Theoretic and Ecological Indices: Word entropy and related Hill numbers generalize diversity by incorporating both richness (number of types) and evenness (frequency distribution):

  • Shannon entropy H = -\sum_{i=1}^{V} p_i \log p_i (p_i the empirical probability) (Rosillo-Rodes et al., 2024)
  • Effective vocabulary size: D^{[1]} = \exp(H) (Shannon or order-1 Hill number)
  • Simpson diversity: D^{[2]} = 1 / \sum_i p_i^2

These measures downweight rare types and provide more stable, interpretable quantities for large-scale and cross-genre/cross-linguistic analysis (Rosillo-Rodes et al., 2024, Carrasco et al., 2023).
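All three quantities derive from the empirical type-frequency distribution, as in this sketch (function name is my own):

```python
from collections import Counter
from math import log, exp

def hill_numbers(tokens):
    """Shannon entropy (nats) plus the order-1 and order-2 Hill numbers."""
    counts = Counter(tokens)
    n = sum(counts.values())
    probs = [c / n for c in counts.values()]  # empirical type probabilities
    h = -sum(p * log(p) for p in probs)       # Shannon entropy H
    d1 = exp(h)                               # effective vocabulary size D^[1]
    d2 = 1.0 / sum(p * p for p in probs)      # Simpson diversity D^[2]
    return h, d1, d2
```

For a perfectly even distribution over V types both Hill numbers equal V, which is what makes them interpretable as "effective" vocabulary sizes; skew toward a few frequent types pulls D^[2] down faster than D^[1].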

Compression and Redundancy-Based Metrics: Metrics such as Compression Ratio (CR) and POS-compression apply lossless compression algorithms to token or part-of-speech sequences, yielding:

\mathrm{CR}(x) = \frac{|C(x)|}{|x|}

where C(x) is the compressed representation in bytes/bits (Kambhatla et al., 23 May 2025, Deshpande et al., 20 Jul 2025). A lower ratio (stronger compression) indicates more redundancy and hence lower diversity, but CR is also length-sensitive.
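Any lossless compressor can instantiate CR; here is a sketch using zlib on raw UTF-8 bytes (the cited papers may use different compressors and tokenizations, so treat the exact values as implementation-dependent):

```python
import zlib

def compression_ratio(text):
    """CR(x) = |C(x)| / |x|: compressed over raw size in bytes.
    Lower values mean more redundancy (lower diversity)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)
```

Repetitive text compresses far more aggressively than varied text, so its CR is markedly lower. Note that because compressors carry fixed header overhead and exploit long-range repeats, CR also shrinks with text length even at constant "true" diversity.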

N-gram and Self-Repetition Scores: Lexical diversity is also probed via n-gram diversity scores (NDS),

\mathrm{NDS}(x) = \frac{1}{N} \sum_{k=1}^{N} \frac{V_k}{T_k}

with V_k the number of unique k-grams and T_k the total number of k-grams, alongside self-repetition rates across multiple outputs (Kambhatla et al., 23 May 2025).
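The NDS formula is the mean, over n-gram orders 1..N, of the distinct-to-total ratio, as in this sketch (the maximum order of 4 is a common choice, not mandated by the formula):

```python
def ngram_diversity(tokens, max_n=4):
    """NDS: mean over k = 1..max_n of (unique k-grams / total k-grams)."""
    ratios = []
    for k in range(1, max_n + 1):
        grams = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
        if not grams:  # text shorter than k tokens
            break
        ratios.append(len(set(grams)) / len(grams))
    return sum(ratios) / len(ratios)
```

A fully repetitive sequence scores low at every order, while a sequence of all-distinct tokens scores exactly 1.0; including higher orders rewards variation in phrasing, not just in word choice.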

Expectation-Adjusted and Penalty-Based Metrics: Expectation-Adjusted Distinct (EAD-n) and Penalty-Adjusted Type-Token Ratio (PATTR) explicitly correct for length-induced bias by normalizing with the expected unique nn-grams (assuming a reference distribution) or penalizing deviation from a target sequence length (Liu et al., 2022, Deshpande et al., 20 Jul 2025):

\mathrm{EAD\text{-}n} = \frac{D_n}{V_n \left[ 1 - (1 - 1/V_n)^{C_n} \right]} \qquad \mathrm{PATTR}(w; L_T) = \frac{|\mathrm{set}(w)|}{|w| + \big| |w| - L_T \big|}

where D_n is the number of distinct n-grams, V_n the vocabulary size, C_n the total n-gram count, and L_T the target length.
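Both corrections are one-liners once the counts are in hand; this sketch uses my own function signatures and takes the n-gram counts as precomputed inputs:

```python
def ead_n(distinct_n, vocab_size, total_n):
    """Expectation-Adjusted Distinct: distinct n-grams normalized by the
    number expected when sampling total_n n-grams from a vocab_size space."""
    expected = vocab_size * (1 - (1 - 1 / vocab_size) ** total_n)
    return distinct_n / expected

def pattr(tokens, target_len):
    """Penalty-Adjusted TTR: type count over token count plus the absolute
    deviation of the length from the target L_T."""
    return len(set(tokens)) / (len(tokens) + abs(len(tokens) - target_len))
```

The EAD denominator approaches the full vocabulary size as the sample grows, so a long output that uses every expected n-gram scores near 1; PATTR reduces to plain TTR exactly when the output hits the target length.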

Semantic and Conceptual Diversity Metrics: Metrics such as METEOR-based 1-diversity, synonym-type token ratio (SynTTR), and conceptual diversity via ontology-augmented entropy extend beyond surface overlap to quantify semantic dispersion and underlying conceptual spread (Jayawardena et al., 2024, Ploeger et al., 2024, Phd et al., 2023).

Neural-Network Capacity Metrics: Recent work employs the minimal capacity of a trained autoencoder required to achieve acceptable reconstruction accuracy on the corpus as a dynamic, context-sensitive diversity proxy (Dang et al., 28 Feb 2025). This approach is sensitive not only to the size of the lexicon but also to its structural and contextual diversity.

2. Mathematical Properties and Length Sensitivity

A foundational methodological concern is the strong inverse coupling of simple metrics such as TTR to sample length. As larger samples minimize the effect of hapax legomena and rare types, TTR converges toward zero at a rate determined by the Heaps’ law exponent (V \propto N^\beta, \beta < 1) (Rosillo-Rodes et al., 2024, Bestgen, 2023, Luis et al., 21 Nov 2025). Probabilistic (HD-D), windowed (MATTR), and thresholded (MTLD) methods effectively solve this “first length problem,” enabling direct cross-document comparison, though all reduction-based indices are sensitive to their own window or threshold parameters (“second length problem”) (Bestgen, 2023, Deshpande et al., 20 Jul 2025).

Table: Core Metrics and Their Length Dependency

Metric Family                      Length Bias              Parameter Sensitivity
TTR, simple ratios                 High                     N/A
Probabilistic reductions (HD-D)    Low                      Moderate (sample size)
Segmental (MATTR/MTLD)             Low                      Moderate (window/threshold)
Entropy/Hill numbers               Lower                    Minor (for large N)
Compression Ratio (CR)             High                     Algorithmic parameters
EAD-n, PATTR                       Very low / controllable  Reference/target length
Neural capacity                    Length-invariant         Accuracy threshold

3. Metric Selection, Implementation, and Best Practices

Metric selection is task- and data-dependent. For document- or corpus-level screening among comparably sized texts, TTR and its lemma/POS-filtered variants (as in PUCP-Metrix) remain rapid indicators (Luis et al., 21 Nov 2025). For longitudinal or cross-length corpus analysis, HD-D, MATTR (with window size n = 50), and MTLD (threshold t = 0.72) are standard, with consistent preprocessing (tokenization, lemmatization, POS-tagging) essential for comparability (Bestgen, 2023, Luis et al., 21 Nov 2025, Deshpande et al., 20 Jul 2025). Hill numbers D^{[k]} and entropy are preferred for large-scale, multi-register, or cross-lingual scenarios (Rosillo-Rodes et al., 2024, Carrasco et al., 2023). For open-domain text generation and synthetic data evaluation, use length- or expectation-adjusted metrics (EAD, PATTR) to avoid selection bias toward short outputs (Liu et al., 2022, Deshpande et al., 20 Jul 2025).

Pairing redundancy (CR) and variety (NDS), and including a cross-sample redundancy measure (self-repetition), is recommended to capture both internal and external diversity facets in multi-generation scenarios (Kambhatla et al., 23 May 2025). For literary and MT applications, multi-dimensional indices—including TTR, MTLD, synonym usage (PTF/CDU/SynTTR), and semantic-embedding similarity—provide robust insights into both surface and content-level diversity loss or recovery (Ploeger et al., 2024).

4. Multidimensional Perspectives and Empirical Correlates

State-of-the-art research recognizes lexical diversity as inherently multidimensional. For example, the six-dimensional schema of volume, abundance, variety-repetition (MATTR), evenness, disparity, and dispersion reveals that “diversity” reflects not only the breadth of types but also their distributional evenness and semantic spread. SVM-based studies demonstrate that these dimensions can reliably distinguish LLM-generated from human-written texts, even when controlling for length and lemmatization (Kendro et al., 31 Jul 2025).

Ecological indices (Hill numbers), entropy-TTR joint analysis, and lexicon growth-curve fitting collectively show that effective diversity encompasses both the introduction of rare vocabulary and the decay in type frequency variance (Rosillo-Rodes et al., 2024, Carrasco et al., 2023, Luis et al., 21 Nov 2025). For applied translation evaluation and generation tasks, semantic and synonym-based metrics complement n-gram-based ones by uncovering surface-level versus conceptual or paraphrastic diversity (Jayawardena et al., 2024, Ploeger et al., 2024).

5. Limitations, Robustness, and Future Directions

No single index provides a complete or context-invariant assessment of lexical diversity. Length normalization eliminates first-order bias but introduces parameter sensitivity. Segmental/windowed and probabilistic reductions must report and, where possible, sweep reduction parameters (e.g., window size for MATTR, threshold for MTLD) to ensure inferential stability (Bestgen, 2023, Deshpande et al., 20 Jul 2025). High-dimensional or conceptually enriched metrics—such as entropy over ontological expansions (conceptual diversity)—offer semantic depth but hinge on resource completeness and domain applicability (Phd et al., 2023).

Neural proxy-based metrics (autoencoder capacity) directly model the minimal representational complexity required for token reconstruction and integrate structural/contextual codependencies, but interpretation in linguistic terms and adaptation to multilingual or domain-specific scenarios remain open challenges (Dang et al., 28 Feb 2025).

Best practices in contemporary research recommend multi-metric triangulation: using TTR/MATTR/MTLD/HD-D for surface evaluation; Hill numbers and entropy for scale-invariance; adjusted metrics (EAD, PATTR) in synthetic/generative tasks; semantic metrics (METEOR, SynTTR) for content-level diversity; and capacity or conceptual indices for structural or ontological variety. Whenever possible, diversity assessments should be supplemented with external proxies (e.g., BLEU, ROUGE, entropy, Wasserstein distance) and contextualized with length, genre, and register information (Jayawardena et al., 2024, Deshpande et al., 20 Jul 2025, Kambhatla et al., 23 May 2025).

6. Applications and Cross-Domain Use Cases

Lexical diversity metrics underpin a range of scientific and applied NLP workflows, including corpus comparison, machine translation evaluation, synthetic and LLM-generated text assessment, and literary analysis.

Emerging directions include direct modeling of diversity using deep contextual encodings, integration of conceptual and structural ontology-based expansions, and the development of length/statistics–agnostic “dynamic” metrics capable of adapting to the demands of multi-lingual and domain-rich corpora (Dang et al., 28 Feb 2025, Rosillo-Rodes et al., 2024).


In summary, lexical diversity metrics constitute a structurally and semantically layered toolset for characterizing, comparing, and controlling the variation present in natural language. Their effective deployment requires awareness of underlying mathematical regularities, robust length normalization, multidimensionality, and complementary measurement paradigms, as well as context-sensitive metric selection based on analytic goals and text properties (Rosillo-Rodes et al., 2024, Deshpande et al., 20 Jul 2025, Bestgen, 2023, Luis et al., 21 Nov 2025, Carrasco et al., 2023, Liu et al., 2022, Dang et al., 28 Feb 2025, Kendro et al., 31 Jul 2025, Ploeger et al., 2024, Kambhatla et al., 23 May 2025, Jayawardena et al., 2024, Phd et al., 2023, Fu et al., 2021, Burchell et al., 2022).
