Glottosets: Cross-Linguistic Lexical Sets

Updated 2 February 2026

Glottosets are structured cross-linguistic lexical collections compiled from multiple languages under shared conceptual or distributional criteria.
They are constructed using curated concept-based methods or automated corpus-driven pipelines, incorporating phonetic normalization and subword segmentation.
Glottosets facilitate empirical analysis of language relationships, lexical divergence, and typological patterns in a scalable and language-agnostic framework.

A glottoset is an abstracted cross-linguistic construct defined as a set or structured collection of linguistic items (words or subwords) compiled from multiple languages under a shared conceptual or distributional criterion. Glottosets are employed as foundational objects for scalable, automated comparative linguistics, facilitating the analysis of language relationships, lexical divergence, and typological patterns. There are two principal realizations in recent research: curated concept-based word lists for tradition-rooted phonetic comparisons, and large-scale, corpus-derived lexicons (augmented by corpus-statistical measures and subword segmentation) for data-driven macro-comparative applications (Mikulyte et al., 2020; Chelombitko et al., 26 Jan 2026).

1. Formal Definition and Rationale

A glottoset for a language $L$ is formally instantiated as either (a) a list of concept words with one entry per language for a fixed concept index (Mikulyte et al., 2020), or (b) a triple $G_L = (V_L, TF_L, DF_L)$, where $V_L$ is the set of normalized word forms for language $L$, $TF_L$ is the term frequency mapping, and $DF_L$ is the document frequency mapping across a specified corpus (Wikipedia) (Chelombitko et al., 26 Jan 2026).

Concept-indexed realization: Each glottoset collects the entries $w_{L,k}$, the word form of language L for concept k. For the “Numbers 1–10” glottoset, for example, English contributes [wun, too, ...] and French [un, de, ...]. This supports tradition-rooted computational comparative linguistics (CCL).
Corpus-derived realization: The glottoset is the full set of lexemes extracted from Wikipedia, with associated frequency statistics, operationalizing large-scale, script-filtered, genre-controlled lexical comparison.

The rationale behind these definitions is the need to model lexical similarity and divergence uniformly across hundreds or thousands of languages, minimizing dependence on manual expert curation and subjective selection biases. Both realizations capture phylogenetic and typological signals relevant for genealogical clustering, morphological analysis, and macro-linguistic exploration. A plausible implication is that glottosets provide a unified analytical interface for both expert-guided and fully automated studies.

2. Construction Protocols

2.1. Manual Concept-Based Glottosets

Word-set selection: Curated lists of concepts (e.g., numbers 1–10, colours, basic vocabulary) are compiled, one item per language per concept (Mikulyte et al., 2020).
Phonetic normalization: Words are mapped to a one-letter ASCII phoneme encoding, reducing graphemic and diacritic variation. Synonyms for a concept are represented as a set of phonetic strings, with minimum distance over all synonym pairs used for inter-lingual comparison.
Alternative encoding: Automatic IPA transcription can be used, but coverage remains limited.
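
The normalization step can be illustrated with a minimal sketch. The one-letter phoneme table of the cited work is not reproduced here, so the `DIGRAPHS` mapping and the function names `normalize` and `synonym_set` are hypothetical, and the sketch operates on spelling as a crude stand-in for a true phonetic encoding.

```python
import unicodedata

# Hypothetical digraph-to-one-letter substitutions (an assumption for
# illustration; not the encoding table of the cited work).
DIGRAPHS = {"sh": "S", "ch": "C", "th": "T"}


def normalize(word: str) -> str:
    """Lowercase, strip diacritics, and collapse digraphs to single symbols."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    # Drop the combining marks (accents, umlauts) exposed by the decomposition.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    for digraph, symbol in DIGRAPHS.items():
        base = base.replace(digraph, symbol)
    # Keep ASCII letters only, so each remaining symbol is one character.
    return "".join(ch for ch in base if ch.isascii() and ch.isalpha())


def synonym_set(words):
    """Represent the synonyms for one concept as a set of encoded strings."""
    return {normalize(w) for w in words}
```

For instance, `normalize("Grün")` gives `"grun"`, and spelling variants that differ only in case or diacritics collapse into a single entry of the synonym set.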

2.2. Corpus-Driven Glottosets (Wikipedia Lexicon Extraction)

Corpus preparation: ZIM dumps for all Wikipedia editions are processed; paragraphs with fewer than 10 words are discarded.
Script filtering: Only Latin or Cyrillic paragraphs are retained. This controls for script-based typological grouping (205 Latin, 37 Cyrillic languages in the cited study) (Chelombitko et al., 26 Jan 2026).
Tokenization and statistics: Lowercase normalization and whitespace tokenization yield $V_L$; term and document frequencies are recorded for each type.
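
The three steps above can be sketched as a small function that returns the triple $(V_L, TF_L, DF_L)$. This is an illustration under stated assumptions, not the pipeline of the cited study: ZIM parsing is omitted, documents are assumed to arrive as lists of paragraph strings, and the script test (every letter in the paragraph must belong to one script) is a hypothetical simplification. The names `script_of` and `build_glottoset` are invented for the example.

```python
import unicodedata
from collections import Counter

MIN_WORDS = 10  # paragraphs with fewer words are discarded, as in the text


def script_of(paragraph: str):
    """Return 'LATIN' or 'CYRILLIC' if all letters share that script, else None.

    Assumption: a paragraph qualifies only if every letter is in one script.
    """
    scripts = set()
    for ch in paragraph:
        if ch.isalpha():
            # Unicode names start with the script: "LATIN SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    if scripts == {"LATIN"}:
        return "LATIN"
    if scripts == {"CYRILLIC"}:
        return "CYRILLIC"
    return None


def build_glottoset(documents):
    """Build (V, TF, DF) from an iterable of documents (lists of paragraphs)."""
    tf, df = Counter(), Counter()
    for paragraphs in documents:
        seen = set()  # types occurring in this document, for DF
        for paragraph in paragraphs:
            tokens = paragraph.lower().split()  # whitespace tokenization
            if len(tokens) < MIN_WORDS or script_of(paragraph) is None:
                continue
            tf.update(tokens)
            seen.update(tokens)
        df.update(seen)
    return set(tf), tf, df
```

Because tokenization is purely whitespace-based, punctuation stays attached to tokens in this sketch; a real pipeline would likely need an extra cleaning step.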

2.3. Subword Segmentation and Ranking

Byte-Pair Encoding (BPE): BPE segmentation is trained on each language or on merged language sets, with the vocabulary size (number of merges) set to $K=4096$ (short-vocab variant). At each iteration the most frequent adjacent symbol pair is merged, until the merge budget $K$ is reached.
Rank-based subword vectors: For cross-lingual studies, subwords extracted from the universal tokenizer are ranked by frequency in each language, yielding high-dimensional representations for subsequent similarity analysis.
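
A toy version of BPE training and rank-based subword vectors may help make the two steps concrete. It is a minimal sketch, not the tokenizer used in the cited study: it works on characters rather than bytes, breaks frequency ties lexicographically (an assumption made for determinism), and the function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair first; ties broken lexicographically.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def rank_vector(word_freqs, merges):
    """Map each subword to its frequency rank (1 = most frequent) in a language."""
    counts = Counter()
    for word, freq in word_freqs.items():
        for sub in segment(word, merges):
            counts[sub] += freq
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {sub: rank for rank, sub in enumerate(ordered, start=1)}
```

With the toy lexicon `{"low": 5, "lower": 2, "newest": 6, "widest": 3}` and three merges, the learned merges are `e+s`, `es+t`, and `l+o`, so `segment("lowest", merges)` yields `["lo", "w", "est"]`. Applying one shared merge list to the word frequencies of each language gives per-language rank vectors over a common subword inventory.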

3. Quantitative Measures and Analytical Frameworks

3.1. Edit Distance and Its Variants

Levenshtein distance: Normalized edit distance between two phonetic strings, with optional substitution weights informed by historical sound laws (e.g., Grimm’s law).
Per-concept distance matrix: For each concept $k$, construct $W^k$ with entries $W^k_{L_1,L_2}$ reflecting normalized edit distances.
Global language distance: Averaging per-concept matrices yields $D_{L_1,L_2}$ for downstream clustering.
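
The chain from edit distance to a global language distance can be sketched as follows. Several choices here are assumptions rather than details confirmed by the source: unit edit costs (no sound-law weights), normalization by the length of the longer string, and nested dictionaries as the matrix representation. The synonym handling follows the minimum-over-pairs rule described in Section 2.1.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def concept_distance(synonyms_a, synonyms_b) -> float:
    """Minimum normalized distance over all synonym pairs."""
    return min(normalized_distance(x, y)
               for x, y in product(synonyms_a, synonyms_b))


def concept_matrix(glottoset):
    """glottoset: {language: set of encoded forms}. Returns W^k as nested dicts."""
    langs = sorted(glottoset)
    return {l1: {l2: concept_distance(glottoset[l1], glottoset[l2])
                 for l2 in langs} for l1 in langs}


def global_distance(matrices):
    """Entrywise average of the per-concept matrices, giving D."""
    langs = sorted(matrices[0])
    n = len(matrices)
    return {l1: {l2: sum(m[l1][l2] for m in matrices) / n for l2 in langs}
            for l1 in langs}
```

Using the two illustrative entries from Section 1, "one" (wun vs. un) has normalized distance 1/3 and "two" (too vs. de) has distance 1, so the averaged English-French distance over these two concepts is 2/3.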

3.2. Subword-Based Lexical Similarity

Jaccard distance: For languages A and B, $d_J(A,B) = 1 - \frac{|T_A \cap T_B|}{|T_A \cup T_B|}$, where $T$ denotes subword vocabularies post-BPE.
Mantel test: Correlation between BPE-based distances and phylogenetic distances defined over Glottolog genealogies.
Homograph segmentation divergence: For homograph set $H$, divergence is computed as $\delta(L_1,L_2) = \frac{|\{h \in H : S_{L_1}(h) \ne S_{L_2}(h)\}|}{|H|}$.
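
The three measures above can be written down compactly. This is a sketch under assumptions: subword vocabularies are Python sets, segmenters are passed in as callables, and the Mantel test is a one-sided permutation test on the Pearson correlation of the upper triangles, which is a common formulation but not necessarily the exact variant used in the cited study.

```python
import math
import random


def jaccard_distance(t_a: set, t_b: set) -> float:
    """d_J(A, B) = 1 - |T_A ∩ T_B| / |T_A ∪ T_B|."""
    union = t_a | t_b
    if not union:
        return 0.0
    return 1.0 - len(t_a & t_b) / len(union)


def homograph_divergence(homographs, seg_1, seg_2) -> float:
    """Share of homographs h whose segmentations S_L1(h) and S_L2(h) differ."""
    differing = sum(1 for h in homographs if seg_1(h) != seg_2(h))
    return differing / len(homographs)


def pearson(xs, ys) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mantel(d1, d2, permutations=999, seed=0):
    """Return (r_M, p) for two square distance matrices given as lists of lists."""
    n = len(d1)
    idx = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = [d1[i][j] for i, j in idx]
    observed = pearson(x, [d2[i][j] for i, j in idx])
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        # Permute rows and columns of d2 together, then re-correlate.
        permuted = [d2[perm[i]][perm[j]] for i, j in idx]
        if pearson(x, permuted) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

Permuting rows and columns jointly, rather than shuffling individual cells, is what preserves the dependence structure of a distance matrix and distinguishes the Mantel test from an ordinary correlation test.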

3.3. Statistical and Cluster Analytics

Descriptive statistics: The per-concept mean ($\mu_k$), standard deviation ($\sigma_k$), and their product $\mu_k \sigma_k$ are used to assess stability and variation.
Density estimation: Kernel-density plots of distance distributions facilitate subgroup detection.
Bhattacharyya coefficient: For two distributions $P$ and $Q$ (e.g., distances for “red” and “blue”), $BC(P,Q) = \sum_i \sqrt{P_i Q_i}$ quantifies overlap.
Hierarchical clustering: Agglomerative procedures (hclust in R), with the number of clusters selected post hoc by maximizing the silhouette score, are used to assess genealogical purity and subgroup structure.
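
The descriptive statistics and the Bhattacharyya coefficient can be sketched with the standard library alone. The sketch assumes population standard deviation and equal-width histogram bins on $[0, 1]$ to discretize the distance distributions; neither choice is stated in the source, and the function names are hypothetical. Hierarchical clustering is left out because it would rely on external libraries.

```python
import math
from statistics import mean, pstdev


def concept_stats(distances):
    """Return (mu, sigma, mu * sigma) for one concept's pairwise distances."""
    mu = mean(distances)
    sigma = pstdev(distances)  # population standard deviation (assumption)
    return mu, sigma, mu * sigma


def histogram(distances, bins=10):
    """Bin distances in [0, 1] into a discrete probability distribution."""
    counts = [0] * bins
    for d in distances:
        # A distance of exactly 1.0 falls into the last bin.
        counts[min(int(d * bins), bins - 1)] += 1
    total = len(distances)
    return [c / total for c in counts]


def bhattacharyya(p, q) -> float:
    """BC(P, Q) = sum_i sqrt(P_i * Q_i); 1 means identical, 0 means disjoint."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
```

Two concepts whose distance histograms occupy entirely different bins give a coefficient of 0, while identical histograms give 1, so the coefficient can be read as a simple overlap score between, say, the "red" and "blue" distributions.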

4. Representative Empirical Results

4.1. Concept-Based Analyses

Numbers (sheep-count glottoset): In 23 English dialects, “10” is most preserved ($\mu_{10}=0.109$, $\sigma_{10}=0.129$), “6” most variable ($\mu_6=0.567$, $\sigma_6=0.234$). Clustering recreates known Brittonic groupings. Geographical correlation is weak ($R^2=0.131$) (Mikulyte et al., 2020).
Colours: Among 42 languages, Germanic “green” ($\mu=0.168$, $\sigma=0.129$) and “blue” ($\mu=0.209$, $\sigma=0.106$) are most preserved; Romance “green” ($\mu=0.296$, $\sigma=0.214$) dominates.
Encoding comparison: The IPA and in-house encodings yield identical clade memberships, which suggests that the in-house scheme is linguistically diagnostic.

4.2. Corpus-BPE Based Analyses

Morphological boundary (E2): BPE segmentation aligns with MorphyNet gold morpheme boundaries with an average F₁ of 0.34 against a random baseline of 0.15, i.e. roughly 2.3 times the baseline (a relative improvement of about 127% on these two figures).
Phylogenetic signal (E3): The Mantel correlation is $r_M = 0.329$ ($p<0.001$). The Romance cluster has a mean distance of 0.51, the Germanic mean is inflated to 0.71 by English borrowings, and the between-family mean is 0.82. The separation ratio is 1.22× (t-test, $p<10^{-13}$).
Homograph divergence (E4): In 26,939 cross-linguistic homographs, 48.7% are segmented differently among related languages; divergence rate scales with genealogical distance.
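
To make the morpheme-boundary evaluation concrete, the sketch below scores predicted segmentations against gold ones by comparing internal boundary positions. It is an assumption-laden illustration: the source reports an average F₁ without specifying the averaging, whereas this version micro-averages over all boundaries, and the function names and example words are hypothetical rather than taken from MorphyNet.

```python
def boundaries(segments):
    """Internal boundary offsets, e.g. ['un', 'help', 'ful'] -> {2, 6}."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_f1(predicted, gold) -> float:
    """Micro-averaged boundary F1 over parallel lists of segmentations."""
    tp = fp = fn = 0
    for pred_segs, gold_segs in zip(predicted, gold):
        p, g = boundaries(pred_segs), boundaries(gold_segs)
        tp += len(p & g)  # boundaries placed where the gold analysis has one
        fp += len(p - g)  # spurious predicted boundaries
        fn += len(g - p)  # gold boundaries the prediction missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting `un|help|ful` and `walk|ing` against gold `un|helpful` and `walk|ing` gives two correct boundaries and one spurious one, hence precision 2/3, recall 1, and F₁ = 0.8.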

5. Workflow Integration and Scalability

Automated pipeline: Concept-based glottosets utilize Prolog for dynamic programming edit distance, Unix shell scripts for automation, and R for statistical analysis (CompLinguistics package: sMatrix, silhouetteV, hcutVisual, etc.). Corpus-based glottosets aggregate Wikipedia dumps and process BPE segmentations.
Scalability: With optimized algorithms, 3880 languages × 10 concepts complete in minutes on modern hardware (the Prolog DP runs in $O(n \cdot m)$ per word pair, for words of lengths $n$ and $m$); R analytics are fast up to thousands of items; shell scripts provide unattended end-to-end execution.
Extensibility: Both workflows can be extended: the concept-based one with concept lists for new families, the corpus-based one through corpus expansion (e.g., Common Crawl extraction).

6. Limitations and Future Research Directions

Structural coverage: Edit distance models are symbol-centric and omit prosody, tone, and affixal structure. BPE primarily targets derivational morphology; inflectional patterns are underexplored.
Substitution heuristics: Phonetic substitution weights and BPE merge frequencies are heuristically set; cluster outcomes are sensitive to these parameters.
Corpus and script constraints: Wikipedia coverage varies, inducing corpus bias; strict script filtering may underrepresent cross-script cognates (e.g., Serbian vs. Croatian).
Cluster purity: “All-to-all” clustering with $K$ concepts produces impure clusters for broad glottosets, which indicates that the approach works best on well-constrained language families.
Suggested expansions: Apply the glottoset+BPE protocol to wider web corpora, refine homograph analyses by token frequency, and integrate structural typology databases (WALS, Grambank) for multidimensional validation.

7. Macro-Linguistic Implications

Glottosets, as operationalized in contemporary computational comparative linguistics, bridge expert-curated cognate set traditions and fully automated corpus-driven macro-comparison. Frequency-driven subword segmentation (BPE) detects productive morpheme-like units, providing an empirical interface for both typological and genealogical inference. The statistical infrastructure underlying glottosets enables language identification for previously unaddressed low-resource languages, with up to 81.5% accuracy in agglutinative isolates. These approaches establish glottosets as scalable, language-agnostic analytical primitives for exploring stability, divergence, and subgroup formation across hundreds of languages within controlled genre and script domains (Mikulyte et al., 2020; Chelombitko et al., 26 Jan 2026).


References

Mikulyte et al. (2020). An efficient automated data analytics approach to large scale computational comparative linguistics.

Chelombitko et al. (2026). Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets.
