Token-Level Graphs for Short Text Classification
- The paper introduces token-level graph methods that leverage GNNs to overcome short text sparsity and boost classification accuracy.
- It details diverse graph construction paradigms—including sliding-window, PMI-based, and syntactic graphs—with tailored token embeddings.
- Empirical results demonstrate significant performance gains over traditional models in tasks like sentiment analysis, NER, and dialogue disfluency detection.
Token-level graph methods for short text classification refer to a family of techniques that construct explicit or implicit graphs at the token (word or subword) level for each short text instance, leveraging graph neural networks (GNNs) or associated architectures to learn representations or augment feature spaces for improved classification accuracy. These methods address the severe sparsity, brevity, and context limitations inherent in short texts by exploiting relationships between tokens both within texts and across the corpus, often surpassing sequential models and classical feature-based approaches on standard benchmarks.
1. Formal Models of Token-Level Graphs
A token-level graph for a short text $d$ is typically defined as $G_d = (V_d, E_d)$, where $V_d$ is the set of tokens (words, subwords, or syllables) identified in $d$ and $E_d$ is a set of edges encoding relationships among those tokens. Core construction paradigms include:
- Sliding-window graphs: Edges connect each token to its immediate neighbors using a window of size $w$, with adjacency $A_{ij} = 1$ if $|i - j| \le w$; self-loops are standard (Donabauer et al., 2024, Wang et al., 2023).
- PMI-based graphs: Edges are weighted by corpus-level or document-level pointwise mutual information, $\mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\,p(j)}$, to capture nonlocal lexical associations (Liu et al., 16 Jan 2025, Li et al., 2019, Wang et al., 2023).
- Fully-connected graphs: Every token is linked to every other, as in the fully-connected graphs in TextGraphFuseGAT (Nguyen, 13 Oct 2025).
- Syntactic/dependency/semantic graphs: Edges are derived from dependency trees or semantic similarity (e.g., cosine in embedding space) (Wang et al., 2023, Wang et al., 2021, Liu et al., 16 Jan 2025).
- Multi-source graphs: Graphs are constructed using various token-level sources—words, POS tags, named entities—each forming its own component subgraph (Liu et al., 16 Jan 2025, Wang et al., 2021).
- Corpus-level or heterogeneous graphs: Include nodes for words, documents, character n-grams, or even label nodes, creating inter- and intra-document edges (Li et al., 2022, Li et al., 2021).
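The sliding-window construction above is simple enough to sketch directly. The following is an illustrative numpy implementation (not any cited paper's code), building a banded 0/1 adjacency matrix with self-loops:

```python
import numpy as np

def sliding_window_adjacency(tokens, w=2):
    """Token-level adjacency: edge (i, j) iff |i - j| <= w.

    The diagonal (|i - i| = 0 <= w) gives the standard self-loops.
    """
    n = len(tokens)
    idx = np.arange(n)
    # Pairwise position differences yield a banded 0/1 matrix.
    return (np.abs(idx[:, None] - idx[None, :]) <= w).astype(float)

A = sliding_window_adjacency(["short", "texts", "are", "sparse"], w=1)
```

With `w=1`, each token is linked only to itself and its immediate left and right neighbors, which matches the window-1 configuration several of the cited models use.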
The node features are derived from static embeddings (GloVe, Word2Vec), contextualized embeddings (BERT, PhoBERT, RoBERTa), or one-hot representations, often tailored to specific datasets or languages (Donabauer et al., 2024, Nguyen, 13 Oct 2025, Li et al., 2022).
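The PMI edge weighting described above is typically estimated from sliding-window co-occurrence counts over the corpus, keeping only positive values. A minimal sketch under those assumptions (illustrative, not a specific paper's implementation):

```python
import numpy as np
from collections import Counter
from itertools import combinations

def pmi_edge_weights(docs, window=5):
    """Corpus-level PMI(i, j) = log p(i, j) / (p(i) p(j)), with
    probabilities estimated from sliding-window counts; non-positive
    PMI values are dropped, as is common for edge weighting."""
    word_counts, pair_counts, n_windows = Counter(), Counter(), 0
    for doc in docs:
        for s in range(max(1, len(doc) - window + 1)):
            win = set(doc[s:s + window])
            n_windows += 1
            word_counts.update(win)
            pair_counts.update(tuple(sorted(p)) for p in combinations(win, 2))
    weights = {}
    for (i, j), c in pair_counts.items():
        pmi = np.log(c * n_windows / (word_counts[i] * word_counts[j]))
        if pmi > 0:
            weights[(i, j)] = pmi
    return weights
```

Tokens that co-occur more often than their marginal frequencies predict receive higher weights, which is what lets PMI edges capture nonlocal lexical associations across the corpus.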
2. Graph Neural Architectures for Token Graphs
The learning component operates on the token graph using GNN layers. Principal GNN variants include:
- Graph Attention Networks (GAT): Use learnable attention weights over neighbors; e.g., in TextGraphFuseGAT the update is
$$h_i' = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j\Big), \qquad \alpha_{ij} = \mathrm{softmax}_j\big(\mathrm{LeakyReLU}(a^\top [W h_i \,\|\, W h_j])\big),$$
with multi-head concatenation to aggregate representations (Nguyen, 13 Oct 2025, Donabauer et al., 2024).
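A single attention head of this update can be written compactly in numpy. The sketch below is a dense, illustrative version (masking non-edges with $-\infty$ before the softmax), not the optimized sparse implementation a real GAT library would use:

```python
import numpy as np

def gat_layer(X, A, W, a, negative_slope=0.2):
    """Single-head GAT update:
    alpha_ij = softmax_j(LeakyReLU(a^T [Wh_i || Wh_j])) over neighbors
    j of i (masked by adjacency A), then h_i' = sum_j alpha_ij * W h_j."""
    H = X @ W                                   # projected features (n, d')
    d = H.shape[1]
    src = H @ a[:d]                             # a^T [Wh_i || .] contribution
    dst = H @ a[d:]                             # a^T [. || Wh_j] contribution
    e = src[:, None] + dst[None, :]             # pairwise attention logits
    e = np.where(e > 0, e, negative_slope * e)  # LeakyReLU
    e = np.where(A > 0, e, -np.inf)             # keep only graph edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # row-wise softmax
    return alpha @ H                            # attention-weighted messages
```

The final nonlinearity $\sigma$ and multi-head concatenation are applied outside this function in full implementations.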
- Graph Convolutional Networks (GCN): Standard updates using normalized adjacency. For two layers:
$$Z = \tilde{A}\,\mathrm{ReLU}\big(\tilde{A} X W^{(0)}\big)\, W^{(1)},$$
where $\tilde{A} = D^{-1/2}(A + I)D^{-1/2}$ (Liu et al., 16 Jan 2025, Wang et al., 2021, Li et al., 2022).
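The two-layer GCN form translates almost line-for-line into numpy; a minimal sketch assuming dense matrices:

```python
import numpy as np

def normalize_adjacency(A):
    """A_tilde = D^{-1/2} (A + I) D^{-1/2}: add self-loops, then
    symmetrically normalize by the resulting degree matrix."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_two_layer(X, A, W0, W1):
    """Z = A_tilde ReLU(A_tilde X W0) W1 -- the standard two-layer GCN."""
    A_t = normalize_adjacency(A)
    H = np.maximum(A_t @ X @ W0, 0.0)  # first layer + ReLU
    return A_t @ H @ W1
```

For an edgeless graph the normalized adjacency reduces to the identity, so the network degenerates to a per-token MLP, which makes the role of the graph structure explicit.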
- Gated or recursive propagation: The ReGNN model uses LSTM-style gates and global graph-level nodes to alleviate over-smoothing and propagate both local and nonlocal information effectively, especially in deeper networks (Li et al., 2019).
- Hybrid architectures: Integrate pretrained transformer encoders (e.g., BERT, PhoBERT) with GNNs for graph-based enhancement of contextual embeddings and further refinements via Transformer layers (Nguyen, 13 Oct 2025, Donabauer et al., 2024).
- Heterogeneous/multigraph GNNs: Separate architectures for word, POS, and entity graphs, pooled and fused at the document level, as in SHINE and MI-DELIGHT (Wang et al., 2021, Liu et al., 16 Jan 2025).
Pooling strategies for graph-level representations range from mean-pooling (node average) and global attention to hierarchical normalization and concatenation of multiple graph types (Donabauer et al., 2024, Wang et al., 2021, Liu et al., 16 Jan 2025).
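The two simplest readouts, mean-pooling and global attention pooling, can be sketched as follows; the scoring vector `q` here is an illustrative stand-in for whatever learned parameters a given model uses:

```python
import numpy as np

def mean_pool(H):
    """Graph-level readout: average all node embeddings."""
    return H.mean(axis=0)

def global_attention_pool(H, q):
    """Weighted readout: score each node with a learned vector q,
    softmax the scores, and take the weighted sum of node embeddings."""
    scores = H @ q
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ H
```

With a zero scoring vector the attention readout reduces to mean-pooling; a trained `q` instead lets the model emphasize the few informative tokens a short text contains.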
3. Applications and Datasets
Token-level graph methods are particularly effective in sequence labeling and short-text classification. Representative applications include:
- Named Entity Recognition (NER), including domain-specific NER such as PhoNER-COVID19 and VietMed-NER for Vietnamese medical ASR (Nguyen, 13 Oct 2025).
- Dialogue disfluency detection on spontaneous speech (e.g., PhoDisfluency) (Nguyen, 13 Oct 2025).
- Sentiment and topic classification in microblogs, movie reviews, search snippets, and news headlines; datasets include Twitter Sentiment, MR, Snippets, TagMyNews (Donabauer et al., 2024, Liu et al., 16 Jan 2025, Wang et al., 2023, Li et al., 2022).
- Zero-/few-shot node classification: Methods like STAG quantize text-attributed graph nodes into discrete tokens compatible with LLMs, enabling both LLM-based and classical learning strategies (Bo et al., 20 Jul 2025).
The typical short-text regime involves sentences of length 6–40 tokens, highly imbalanced classes, and severe sparsity, where token-level graphs provide robustness and sample efficiency (Donabauer et al., 2024, Wang et al., 2021).
4. Empirical Performance and Ablation Insights
Token-level graph techniques consistently exceed or match non-graph and classical PLM baselines under low-resource and class-imbalanced conditions:
| Dataset/Task | Best Token-Graph Model | Accuracy / F1 | Notable Baseline | Accuracy / F1 |
|---|---|---|---|---|
| Twitter Sentiment | Token-Graph (Donabauer et al., 2024) | 0.837 (Acc/F1) | Fine-tuned BERT | 0.780 (Acc/F1) |
| MR (Movie Reviews) | Token-Graph (Donabauer et al., 2024) | 0.702 (Acc/F1) | BertGCN | 0.666 (Acc) |
| Vietnamese NER (word) | TextGraphFuseGAT (Nguyen, 13 Oct 2025) | 0.984 / 0.958 (F1) | PhoBERT_large | 0.945 / 0.931 (F1) |
| Short Texts (various) | MI-DELIGHT (Liu et al., 16 Jan 2025) | up to +4-6% ACC over SOTA | CNN/LSTM/BERT | varies |
| MR (TextGCN) | WCTextGCN (Li et al., 2022) | 77.85 (Acc) | BERT | 77.02 |
Ablations reveal that optimal token-graph GNNs are typically shallow ($2$–$3$ layers), use a small neighborhood window ($w = 1$ or $2$), and leverage contextual or subword token embeddings. Adding dynamic or multi-source graph structures (e.g., PMI, POS, and entity) further improves performance, particularly on rare or minority classes (Nguyen, 13 Oct 2025, Liu et al., 16 Jan 2025, Li et al., 2022). In ablations, removal of graph structure or fusion blocks consistently leads to $2$–$16$ points drop in F1, with GAT/attention yielding largest marginal gains (Nguyen, 13 Oct 2025, Bo et al., 20 Jul 2025).
5. Extensions: Contrastive, Heterogeneous, and Multilingual Graphs
Recent advances explore contrastive learning over token graphs—instance-level and cluster-level contrastive objectives (e.g., MI-DELIGHT) maximize both augmentation invariance and cluster cohesion, further cementing graph methods' robustness in unlabeled and few-labeled settings (Liu et al., 16 Jan 2025, Ghosh et al., 2021). Hierarchical heterogeneous graphs integrate token, POS, entity, and document nodes, enabling better label propagation and knowledge integration (e.g., SHINE, LiGCN) (Wang et al., 2021, Li et al., 2021). Techniques such as quantization of token-level graph structure into discrete tokens make LLM-based downstream inference graph-compatible (e.g., STAG) (Bo et al., 20 Jul 2025).
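Instance-level contrastive objectives of this kind typically take the InfoNCE form: each graph embedding should match its own augmented view and repel every other embedding in the batch. A minimal numpy sketch of that generic loss (not MI-DELIGHT's exact objective):

```python
import numpy as np

def info_nce(Z1, Z2, tau=0.5):
    """InfoNCE over two augmented views of a batch of graph embeddings:
    row i of Z1 and row i of Z2 are positives; all other rows are
    negatives. Returns the mean cross-entropy over the batch."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(2 * Z2 / 2, axis=1, keepdims=True)
    sim = Z1 @ Z2.T / tau                  # temperature-scaled cosine sims
    logits = sim - sim.max(axis=1, keepdims=True)
    # Row-wise log-softmax; the positives sit on the diagonal.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

When the two views agree and are well separated from other instances, the loss approaches zero; mismatched views drive it up, which is what pushes augmentation invariance and cluster cohesion.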
Token graphs have been extended to multilingual and cross-domain settings, for example, TextGraphFuseGAT's integration of PhoBERT supports Vietnamese benchmarks with multi-domain (COVID, speech, medical) characteristics, and future work suggests linguistically-motivated sparse graphs for multilingual adaptation (Nguyen, 13 Oct 2025, Donabauer et al., 2024).
6. Limitations, Challenges, and Open Directions
- Graph construction overhead: PMI/co-occurrence statistics over large corpora are computationally intensive; fully-connected graphs raise memory and compute requirements (Nguyen, 13 Oct 2025, Donabauer et al., 2024).
- Oversmoothing in deep GNNs: Deep propagation can collapse token representations; gating (LSTM/GRU-style) and attention alleviate this, but optimal depth is typically shallow (Li et al., 2019, Wang et al., 2023).
- Sparsity and rare tokens: Short texts yield small graphs; corpus-level graphs, character n-grams, or OOV-aware embeddings can mitigate, but boundary cases remain (Li et al., 2022, Wang et al., 2023).
- Edge semantics: Most methods rely on syntactic or heuristic co-occurrence; dynamic or learned edge types can improve performance but add complexity (Wang et al., 2023, Wang et al., 2021).
- Scalability: Sliding-window token graphs scale linearly with text length (fully-connected variants quadratically), but batch processing and very long texts demand special pooling and batching mechanisms (Donabauer et al., 2024).
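The standard batching mechanism for variable-size per-instance graphs is to stack them into one block-diagonal adjacency with a node-to-graph index, so a single GNN pass covers the whole batch. An illustrative sketch:

```python
import numpy as np

def batch_graphs(adjs, feats):
    """Combine per-text token graphs into one block-diagonal adjacency,
    a stacked feature matrix, and a node-to-graph index for pooling."""
    sizes = [A.shape[0] for A in adjs]
    n = sum(sizes)
    A_batch = np.zeros((n, n))
    offset = 0
    for A in adjs:
        s = A.shape[0]
        A_batch[offset:offset + s, offset:offset + s] = A  # one block per graph
        offset += s
    X_batch = np.vstack(feats)
    batch_idx = np.concatenate([np.full(s, g) for g, s in enumerate(sizes)])
    return A_batch, X_batch, batch_idx
```

Because the off-diagonal blocks are zero, message passing never crosses graph boundaries, and `batch_idx` lets a pooling step scatter node embeddings back to their source texts.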
Open research directions include optimized or adaptive sparsity for token-level graphs, dynamic edge learning, multilingual/domain adaptation, explainability via interpretable token interactions, and tight integration with transformer self-attention (Nguyen, 13 Oct 2025, Bo et al., 20 Jul 2025, Wang et al., 2023).
7. Representative Models and Reproducibility
Prominent models with open-source implementations and strong performance in short-text token-level graph classification include:
| Model/Method | Key Features | arXiv ID |
|---|---|---|
| TextGraphFuseGAT | PhoBERT + fully-connected GAT | (Nguyen, 13 Oct 2025) |
| Token-Graph (BERT+GAT) | PLM embedding, window-1 GAT | (Donabauer et al., 2024) |
| MI-DELIGHT | Multi-source graph + hierarchical CL | (Liu et al., 16 Jan 2025) |
| SHINE | Heterogeneous, hierarchical GNN | (Wang et al., 2021) |
| ReGNN | Gated propagation, global node | (Li et al., 2019) |
| STAG | Graph quantization to tokens | (Bo et al., 20 Jul 2025) |
| WCTextGCN/WCTextGAT | Word and char n-gram graph | (Li et al., 2022) |
| ClassiNet | Feature-predictor graph, propagation | (Bollegala et al., 2018) |
These methods demonstrate the flexibility and robustness of token-level graph construction and modeling in handling the semantic sparsity, few-shot learning, and context limitations of short texts, with statistically significant improvements reported across diverse languages and domains.