Word-Adjacency Networks

Updated 17 January 2026

Word-adjacency networks are graph representations of text where nodes signify words and edges denote their sequential proximity.
They enable stylometric, semantic, and linguistic analyses by applying metrics such as clustering, centrality, and shortest-path lengths.
Their applications span authorship attribution, word sense disambiguation, and cross-linguistic comparison.

A word-adjacency network (WAN) is a graph-theoretic representation of text in which nodes correspond to words (or sometimes a restricted subset of tokens), and edges encode adjacency relationships between words as they appear in running text. This modeling paradigm enables the quantitative analysis of higher-order structural, stylistic, and even semantic properties of language samples by leveraging tools from network science. WANs have been deployed extensively for stylometry, comparative linguistics, language evolution, word sense disambiguation, and semantic visualization; the resulting methodologies draw on a spectrum of complex network statistics, entropy-based motifs, and Markovian transition analyses.

1. Formal Definitions and Construction Variants

The core construction of a word-adjacency network varies according to linguistic objective and preprocessing strategy:

Simple word adjacency (content words): Each distinct lemmatized, non-stopword (content) word in the text constitutes a node. An undirected (or sometimes directed) edge is placed between any two nodes whose corresponding words are immediate neighbors in the cleaned, lemmatized word sequence. Edge weights may count the number of co-occurrences, though unweighted (binary) networks are common (Amancio, 2014, Stanisz et al., 2018).
Function-word adjacency (stylometry): Nodes correspond only to function words (e.g., articles, conjunctions, prepositions, quantifiers), capturing the author's habitual grammatical structuring independently of topic. Edges are directed and encode the likelihood that one function word is followed by another within a short window, often with an exponential decay applied according to token separation; this leads to a weighted adjacency matrix interpretable as a Markov chain transition matrix (Segarra et al., 2014).
Inclusion of punctuation and tokens: Expanding nodes to include punctuation marks allows modeling of their syntactic, prosodic, and emotional contributions and, in certain languages (notably Chinese), can decrease the global distances in the network, highlighting the importance of punctuation as structural "hubs" (Dec et al., 10 Jan 2026).
Directed versus undirected: For capturing word order (e.g., in content- or function-word adjacency), directed edges are employed, whereas undirected edges suffice when the interest is in mere co-occurrence without sequence specificity (Lahiri et al., 2013, Amancio, 2014).

Construction typically begins with tokenization and lowercasing, followed by sentence-boundary handling, stopword removal (or, for function-word networks, stopword selection), and lemmatization. A window of size 2 (or larger in some variants) is then slid over the resulting sequence, and each co-occurring pair is registered as an edge in the adjacency matrix.
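
The construction steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline of any cited study: the stopword list is a tiny placeholder, lemmatization is omitted, and the choice to never link words across sentence boundaries is an assumption. The name `build_wan` is hypothetical.

```python
from collections import Counter
import re

# Tiny illustrative stopword list (an assumption; real studies use fuller lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "is", "was"}

def build_wan(text, stopwords=STOPWORDS, directed=True):
    """Return a Counter mapping (word, next_word) pairs to co-occurrence counts."""
    edges = Counter()
    # Split on sentence-final punctuation so that edges never cross sentences.
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = [t for t in re.findall(r"[a-z']+", sentence) if t not in stopwords]
        # Sliding window of size 2: link each token to its immediate successor.
        for u, v in zip(tokens, tokens[1:]):
            if u == v:
                continue  # ignore self-loops
            key = (u, v) if directed else tuple(sorted((u, v)))
            edges[key] += 1
    return edges

edges = build_wan("The cat chased the mouse. The mouse feared the cat and the dog.")
for (u, v), weight in sorted(edges.items()):
    print(f"{u} -> {v} (weight {weight})")
```

Passing `directed=False` merges both orderings of a pair into one undirected edge, and discarding the counts (keeping only the keys) gives the unweighted variant.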

2. Topological and Local Network Measures

Once constructed, WANs can be characterized using both standard complex network measures and bespoke stylometric variants. These include:

Metric	Description	Canonical Source
Degree (k) & strength (s)	Number of immediate neighbors / sum of edge weights	(Amancio, 2014, Stanisz et al., 2018)
Clustering coefficient (C)	Probability that two neighbors of a node are connected	(Amancio, 2014, Lahiri et al., 2013)
Average shortest-path length (L)	Mean geodesic distance between all node pairs	(Kulig et al., 2014, Dec et al., 10 Jan 2026)
k-core (coreness)	Maximal subgraph in which every node has degree ≥ k within that subgraph	(Lahiri et al., 2013)
Betweenness centrality (B)	Node's share of shortest paths between other node pairs	(Amancio, 2014, Lahiri et al., 2013)
Assortativity (r)	Correlation of degree at either end of an edge	(Lahiri et al., 2013, Stanisz et al., 2018)
Weighted clustering (C_w)	Barrat et al. coefficient incorporating co-occurrence strength	(Stanisz et al., 2018)

For stylometric tasks, local metrics (the degree, local clustering, and coreness of highly frequent or selected function words) outperform global metrics because they relate directly to idiosyncratic authorial structure (Stanisz et al., 2018, Lahiri et al., 2013). Weighted versions are favored when texts are long enough for edge weights to carry a reliable signal.
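
To make a few of the measures in the table concrete, the following sketch computes degree, strength, the local (unweighted) clustering coefficient, and the average shortest-path length on a small undirected toy network. It uses only the standard library. The edge weights are invented for illustration and the function names are hypothetical; self-loops are assumed to be absent.

```python
from collections import defaultdict, deque

def to_adjacency(weighted_edges):
    """Turn {(u, v): weight} into an undirected adjacency map {u: {v: weight}}."""
    adj = defaultdict(dict)
    for (u, v), w in weighted_edges.items():
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w
    return dict(adj)

def degree(adj, node):
    return len(adj[node])            # number of distinct neighbors

def strength(adj, node):
    return sum(adj[node].values())   # sum of incident edge weights

def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def average_shortest_path_length(adj):
    """Mean BFS (hop-count) distance over all pairs of mutually reachable nodes."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Toy weighted WAN (weights are made up for illustration).
adj = to_adjacency({("a", "b"): 2, ("b", "c"): 1, ("a", "c"): 1, ("c", "d"): 3})
print(degree(adj, "c"), strength(adj, "c"))
print(local_clustering(adj, "c"), average_shortest_path_length(adj))
```

In a stylometric setting, values such as these would be collected for a fixed list of frequent or function words and concatenated into a feature vector per text.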

3. Markovian and Information-Theoretic Structure in Function-Word WANs

Function-word WANs can be interpreted as Markov chains: row-normalizing the adjacency matrix of co-occurrence counts yields a transition matrix $P_c(f_i, f_j)$ for each candidate author $c$, with every row summing to one. To attribute an unknown text $u$, one builds its transition matrix $P_u$ in the same way and selects the candidate whose chain minimizes the relative entropy (the Markov-chain analogue of the Kullback-Leibler divergence):

$H(P_u \Vert P_c) = \sum_{i,j} \pi_u(f_i) P_u(f_i, f_j) \log\frac{P_u(f_i, f_j)}{P_c(f_i, f_j)}$

where $\pi_u$ is the stationary distribution of the test text's Markov chain (Segarra et al., 2014). This approach exploits short-range transition structure among function words, capturing stylometric signals orthogonal to pure frequency-based analysis. Reported benchmarks show lower error rates for WANs than for the naïve Bayes and SVM baselines: for example, binary attribution error rates dropped from 2.6% (naïve Bayes) or 2.7% (SVM) to 1.6% (WAN); in ten-way tasks, WAN error was 5.3% compared to 8.1% (Bayes) and 7.9% (SVM) (Segarra et al., 2014).
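
The relative-entropy rule can be illustrated with a small pure-Python sketch. The transition matrices below are invented two-state examples, the stationary distribution is approximated by power iteration (which assumes an irreducible, aperiodic chain), and skipping transitions that are absent from either chain is an assumption made here to keep the logarithm finite. All function names are hypothetical.

```python
import math

def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of a row-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        # One step of the chain: pi <- pi P
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def relative_entropy(P_u, P_c):
    """H(P_u || P_c), weighted by the stationary distribution of P_u."""
    pi = stationary_distribution(P_u)
    h = 0.0
    for i, row in enumerate(P_u):
        for j, p in enumerate(row):
            # Assumption: terms with a zero in either chain are skipped.
            if p > 0 and P_c[i][j] > 0:
                h += pi[i] * p * math.log(p / P_c[i][j])
    return h

def attribute(P_u, candidates):
    """Return the candidate label whose chain is closest to P_u."""
    return min(candidates, key=lambda name: relative_entropy(P_u, candidates[name]))

# Invented transition matrices over two function words.
P_u = [[0.9, 0.1], [0.5, 0.5]]
candidates = {
    "A": [[0.8, 0.2], [0.4, 0.6]],
    "B": [[0.2, 0.8], [0.9, 0.1]],
}
best = attribute(P_u, candidates)
print(best, {name: round(relative_entropy(P_u, P), 3) for name, P in candidates.items()})
```

Because the divergence is weighted by $\pi_u$, transitions out of function words that are common in the test text contribute most to the decision.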

Parameter optimization (window size $D \sim 10$, discount factor $\alpha \sim 0.75$, function-word set size 40–70) is guided by maximizing attribution accuracy over held-out datasets.
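
The roles of the window size $D$ and the discount factor $\alpha$ can be seen in a short sketch of a decayed-window function-word network. It assumes that a pair separated by $d$ tokens contributes $\alpha^{d-1}$ to the edge weight, which is one natural reading of the exponential decay described above; the function-word list is a small illustrative subset, sentence boundaries are ignored, and the names are hypothetical.

```python
from collections import defaultdict

# Illustrative subset only; real studies use a few dozen function words.
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "that", "with"}

def function_word_wan(tokens, function_words=FUNCTION_WORDS, window=10, alpha=0.75):
    """Directed weights: add alpha**(d-1) when a function word appears d tokens after another."""
    weights = defaultdict(float)
    for i, source in enumerate(tokens):
        if source not in function_words:
            continue
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            target = tokens[i + d]
            if target in function_words:
                weights[(source, target)] += alpha ** (d - 1)
    return dict(weights)

def row_normalize(weights):
    """Divide each weight by its source word's total so outgoing weights sum to one."""
    totals = defaultdict(float)
    for (source, _), w in weights.items():
        totals[source] += w
    return {(s, t): w / totals[s] for (s, t), w in weights.items()}

tokens = "the end of the road and the start of a path".split()
P = row_normalize(function_word_wan(tokens, window=2, alpha=0.5))
for (s, t), p in sorted(P.items()):
    print(f"P({s} -> {t}) = {p:.3f}")
```

A larger window lets more distant function-word pairs contribute, while a smaller $\alpha$ concentrates the weight on immediate neighbors.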

4. Statistical Properties and Symmetry Motifs

Empirical WANs exhibit heavy-tailed or power-law-like distributions in both standard topological measures and symmetry-based measures:

Degree, strength, and clustering distributions typically follow right-skewed, often scale-free profiles reflecting Zipfian word frequency and the dense “core” of frequent collocates (Amancio, 2014, Kulig et al., 2014).
Concentric or entropic symmetry: For each node and level $h$, the entropy of a non-backtracking outward random walk quantifies how symmetrically a word’s multi-step neighborhood is reached. The “merged” symmetry metric displays a power-law-like tail, while the “backbone” symmetry is typically bimodal. These symmetry metrics are largely uncorrelated (|ρ| < 0.5) with degree or centrality, revealing independent stylistic dimensions (Amancio et al., 2015).
Small-world characteristics: Both English and Chinese WANs, when constructed with punctuation marks, display small-world properties with asymptotic average shortest-path length $L(N \to \infty) \approx 2.7$–$2.8$ (Dec et al., 10 Jan 2026).

5. Dynamic, Local, and Time-Varying Analyses

WANs exhibit stable statistical properties even when built from short subtexts (a few thousand tokens), which enables time-resolved or local analyses:

Stability under sampling: Most canonical topological descriptors (degree, clustering, path length, etc.) become stable for window sizes $W \gtrsim 1{,}500$ tokens, suggesting that stylometric or semantic analyses need not rely on full books but can operate on much shorter excerpts. This facilitates sliding-window or dynamic applications (Amancio, 2014).
Temporal WANs: By representing subtexts as consecutive WAN snapshots, one can monitor stylistic drift, topic transitions, or the evolution of sentiment within a document (Amancio, 2014).
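
A sliding-window analysis of this kind can be sketched as follows: the token sequence is cut into windows, an undirected unweighted WAN is built for each, and a descriptor (here the mean degree) is tracked across snapshots. The default window of 1,500 tokens follows the stability threshold quoted above; the function names and the toy input are hypothetical.

```python
def window_snapshots(tokens, window=1500, step=None):
    """Yield (start, edges) for successive windows; edges are unordered word pairs."""
    step = step or window  # non-overlapping windows unless a step is given
    for start in range(0, max(len(tokens) - window, 0) + 1, step):
        chunk = tokens[start:start + window]
        edges = {frozenset((u, v)) for u, v in zip(chunk, chunk[1:]) if u != v}
        yield start, edges

def mean_degree(edges):
    """Average number of distinct neighbors per node in one snapshot."""
    nodes = {node for edge in edges for node in edge}
    return 2 * len(edges) / len(nodes) if nodes else 0.0

# Toy sequence whose second half is far more repetitive than the first.
tokens = "a b c a b d".split() + "x y x y x y".split()
series = [(start, mean_degree(edges))
          for start, edges in window_snapshots(tokens, window=6)]
print(series)
```

Setting `step` below `window` produces overlapping snapshots, which smooths the resulting time series at the cost of correlated measurements.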

WAN topology extracted from short segments retains discriminative power: SVM-based authorship recognition achieved 86.7% accuracy on subtexts of 7,130 tokens, even outperforming whole-document WANs in some scenarios (Amancio, 2014).

6. Stylometric, Semantic, and Linguistic Applications

WANs underpin a variety of text-analytic tasks:

Authorship attribution: Leveraging local WAN metrics (on function words or frequent tokens), machine learning classifiers achieve state-of-the-art performance: >90% accuracy in 8-way attribution for both English and Polish texts using only a handful (n=12 for English) of normalized node-specific features (weighted degree and clustering coefficient) (Stanisz et al., 2018, Segarra et al., 2014).
Word sense disambiguation (WSD): Metrics such as hierarchical degree, hierarchical clustering, average shortest-path, and betweenness computed from local WAN structure around ambiguous word occurrences allow effective sense discrimination. In half of the tested polysemy cases, network-based features outperformed shallow neighbor-frequency baselines, with the most discriminative features typically being hierarchical (expanded-neighborhood) connectivity and clustering (Amancio et al., 2013).
Genre, era, and collaboration analysis: WAN-based distances (relative entropies) capture genre subclusters, historical distinctions, and mixed-style phenomena such as collaborative authorship or genre hybridization (Segarra et al., 2014).
Semantic word cloud layout: Geometric algorithms (e.g., Word Rectangle Adjacency Contact) formalize the problem of arranging words in space such that adjacency in the word cloud reflects network-encoded semantic relationships. The problem is NP-hard for general graphs, but polynomial-time algorithms exist for structured subclasses (e.g., quasi-triangulated planar graphs, hierarchical DAGs), and practical approximation algorithms can produce layouts superior to classical heuristics (Barth et al., 2013).

7. Cross-Linguistic Properties and Punctuation Effects

In studies of WANs across Chinese and English, the inclusion of punctuation tokens plays a pivotal role:

Incorporating punctuation produces WANs with nearly identical average shortest-path behavior across languages, homogenizing the ASPL curves and reducing path length in Chinese especially. Omitting punctuation markedly inflates path lengths (increase in $L_{max}$ by 4–6 in Chinese, 0.5–1 in English), with the effect significantly more pronounced in languages where punctuation structurally substitutes for explicit morphological or syntactic markers (Dec et al., 10 Jan 2026).
Punctuation marks function as network hubs or shortcuts, enhancing the “small-world” character and flattening historical or linguistic differences, especially in highly analytic languages (Dec et al., 10 Jan 2026).

A plausible implication is that stylometric or semantic analyses targeting language comparison, translation invariance, or genre identification should treat punctuation marks as structural words to preserve the intrinsic global navigability of language networks.
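
The shortcut role of punctuation can be illustrated on a toy sentence by building the network twice, once keeping punctuation marks as nodes and once discarding them, and comparing the average shortest-path length. The text below is constructed so that the comma recurs and acts as a hub; it demonstrates the mechanism only and is not evidence about real corpora. The tokenization regexes and function names are assumptions of this sketch.

```python
import re
from collections import defaultdict, deque

def adjacency(text, keep_punctuation):
    """Undirected WAN as {node: set(neighbors)}, optionally with punctuation nodes."""
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    tokens = re.findall(pattern, text.lower())
    adj = defaultdict(set)
    for u, v in zip(tokens, tokens[1:]):
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)

def aspl(adj):
    """Mean BFS distance over all pairs of mutually reachable nodes."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

text = "alpha beta, gamma delta, epsilon zeta, eta theta."
with_punct = aspl(adjacency(text, keep_punctuation=True))
without_punct = aspl(adjacency(text, keep_punctuation=False))
print(f"with punctuation: {with_punct:.3f}, without: {without_punct:.3f}")
```

Without punctuation the eight distinct words form a simple chain, whereas the repeated comma links several positions in the sentence and shortens typical paths.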

References

(Segarra et al., 2014) Authorship Attribution through Function Word Adjacency Networks
(Amancio et al., 2015) Concentric network symmetry grasps authors' styles in word adjacency networks
(Amancio, 2014) Probing the topological properties of complex networks modeling short written texts
(Stanisz et al., 2018) Linguistic data mining with complex networks: a stylometric-oriented approach
(Barth et al., 2013) On Semantic Word Cloud Representation
(Kulig et al., 2014) Modeling the average shortest path length in growth of word-adjacency networks
(Lahiri et al., 2013) Authorship Attribution Using Word Network Features
(Dec et al., 10 Jan 2026) Average shortest-path length in word-adjacency networks: Chinese versus English
(Amancio et al., 2013) Unveiling the relationship between complex networks metrics and word senses