
Token-Level Relational Structures

Updated 6 February 2026
  • Token-Level Relational Structures are fine-grained frameworks that model interactions between individual tokens via graph representations.
  • They enhance performance in tasks like visual distillation and token-level classification by aligning local and global dependencies.
  • Empirical studies demonstrate improvements in accuracy and robustness across imbalanced, long-tailed, and low-resource scenarios.

Token-level relational structures are mathematical and computational frameworks that explicitly encode, learn, and exploit pairwise or higher-order relationships between individual tokens within an input sequence or spatial patch-based representation. Unlike coarse instance-level graphs whose nodes correspond to entire input entities, token-level relational structures operate at much finer granularity, enabling models to capture nuanced semantic, syntactic, or visual relationships—both within a single instance (inner-instance) and across multiple instances in a batch. This paradigm has significant impact in tasks such as knowledge distillation for visual models and token-level classification in natural language processing.

1. Definition and Motivation

Token-level relational structures formalize interactions between individual units (tokens, e.g., sub-words, image patches) by representing them as nodes in a graph, with edges modeling potential semantic or contextual dependencies. This finer granularity stands in contrast to methods that focus solely on individual token outputs (e.g., softmax logits) or to instance-level relational graphs that aggregate over entire image or document embeddings.

In knowledge distillation, token-level relational graphs allow the transfer of fine-grained information from a teacher to a student, propagating semantic similarity between local regions or co-occurring patterns (e.g., the "fur" texture shared between cat and dog images) even in heavily imbalanced, long-tail datasets (Zhang et al., 2023). In sequence labeling tasks, such as named entity recognition or disfluency detection, integrating token-level relational graphs enables models to infer both local and global dependencies independent of strict sequential bias (Nguyen, 13 Oct 2025).

2. Construction of Token-Level Graphs

Token-level relational graphs can be constructed according to a range of design choices:

  • Patch/Token Extraction: Inputs (images, sentences) are decomposed into tokens (non-overlapping patches for images, sub-word tokens for text), each represented as a vector $T_i$ or $h_i$.
  • Graph Topology:

    • k-NN Graphs (visual): In the Token-level Relationship Graph (TRG) paradigm, nodes represent randomly sampled patch tokens across images; edges are created only between each token and its $k$ nearest neighbors based on Gaussian-weighted Euclidean distance:

    $$A_{ij} = \begin{cases} \exp\left(-\dfrac{\|T_i - T_j\|^2}{2\sigma^2}\right), & T_i \in k\text{-NN}(T_j) \\ 0, & \text{otherwise} \end{cases}$$

    Two graphs are constructed per batch: one for fixed teacher tokens and one for student tokens (Zhang et al., 2023).

    • Fully Connected Graphs (text): Every token is a node, and every ordered pair including self-loops forms a directed edge, yielding adjacency

    $$A_{ij} = 1, \quad \forall\, i, j \in \{1, \ldots, N\}$$

    This grants maximal flexibility for relational modeling via Graph Attention Networks (GATs) (Nguyen, 13 Oct 2025).
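As a concrete illustration, the Gaussian-weighted k-NN adjacency above can be sketched in NumPy. The function and argument names here are ours, not from the cited papers; this is a minimal reference construction, not the TRG implementation:

```python
import numpy as np

def knn_gaussian_adjacency(tokens, k=2, sigma=1.0):
    """Build a k-NN graph over token vectors with Gaussian edge weights.

    A[i, j] = exp(-||T_i - T_j||^2 / (2 * sigma^2)) if T_i is among the
    k nearest neighbours of T_j, else 0.
    """
    n = tokens.shape[0]
    # Pairwise squared Euclidean distances between all tokens.
    diff = tokens[:, None, :] - tokens[None, :, :]
    sq_dist = (diff ** 2).sum(-1)
    A = np.zeros((n, n))
    for j in range(n):
        # k nearest neighbours of token j (excluding j itself).
        order = np.argsort(sq_dist[:, j])
        neighbours = [i for i in order if i != j][:k]
        for i in neighbours:
            A[i, j] = np.exp(-sq_dist[i, j] / (2 * sigma ** 2))
    return A

tokens = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
A = knn_gaussian_adjacency(tokens, k=1)
```

In the distillation setting this would be run twice per batch, once on teacher tokens and once on student tokens, producing the two graphs described above.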

3. Learning with Token-Level Relational Structures

Token-level relational graphs support a variety of learning objectives targeting different aspects of relational knowledge.

3.1. Token-Wise Contextual Loss (Visual Distillation)

The contextual loss matches intra-image (inner-instance) token similarity patterns between teacher and student via contextual similarity matrices:

$$\mathrm{CS}^{(\cdot)} = \mathrm{Softmax}\left(\frac{\mathbf{F}\,\mathbf{F}^\top}{\sqrt{D}}\right) \in \mathbb{R}^{N \times N}$$

The student is trained to minimize the mean-squared error with the teacher:

$$\mathcal{L}_{\text{inner}} = \frac{1}{N}\,\|\mathrm{CS}^{\mathcal{T}} - \mathrm{CS}^{\mathcal{S}}\|_{\mathrm{F}}^2$$

(Zhang et al., 2023).
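A minimal NumPy sketch of the contextual-similarity matrices and the inner loss, assuming a row-wise softmax over token similarity scores (names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contextual_similarity(F):
    """CS = Softmax(F F^T / sqrt(D)) row-wise, for token features F of shape (N, D)."""
    D = F.shape[1]
    return softmax(F @ F.T / np.sqrt(D), axis=1)

def inner_loss(F_teacher, F_student):
    """Frobenius gap between teacher and student contextual-similarity maps, scaled by 1/N."""
    N = F_teacher.shape[0]
    diff = contextual_similarity(F_teacher) - contextual_similarity(F_student)
    return np.sum(diff ** 2) / N

F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
CS = contextual_similarity(F)
```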

3.2. Local and Global Graph Losses

  • Local Structure Preserving Loss: For each student token, the model aligns its local neighborhood distribution to that of the teacher by minimizing the KL divergence between respective normalized adjacency rows:

$$\mathcal{L}_{\text{local}} = \sum_{i} \mathrm{KL}\left(\mathrm{softmax}_{j}(A_{ij}^{\mathcal{S}}) \,\|\, \mathrm{softmax}_{j}(A_{ij}^{\mathcal{T}})\right)$$

  • Global Contrastive Loss: To align global graph topology, a contrastive InfoNCE-style loss is introduced between paired student-teacher tokens (positives) and all other negatives:

$$s_{ij} = \frac{\mathrm{Proj}(T_i^{\mathcal{S}}) \cdot T_j^{\mathcal{T}}}{\|\mathrm{Proj}(T_i^{\mathcal{S}})\|\,\|T_j^{\mathcal{T}}\|}$$

$$\mathcal{L}_{\text{global}} = -\sum_{i} \log \frac{\exp(s_{ii}/\tau_g)}{\sum_{j} \exp(s_{ij}/\tau_g)}$$

(Zhang et al., 2023).
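The two graph losses above can be sketched as follows. For brevity the projection head $\mathrm{Proj}$ is replaced by the identity, and all names are illustrative rather than taken from the TRG codebase:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_loss(A_student, A_teacher):
    """Sum over rows i of KL(softmax_j(A_s[i,j]) || softmax_j(A_t[i,j]))."""
    P = softmax(A_student, axis=1)
    Q = softmax(A_teacher, axis=1)
    return np.sum(P * (np.log(P) - np.log(Q)))

def global_loss(T_student, T_teacher, tau=0.1):
    """InfoNCE over paired student-teacher tokens; identity projection for brevity."""
    s_norm = T_student / np.linalg.norm(T_student, axis=1, keepdims=True)
    t_norm = T_teacher / np.linalg.norm(T_teacher, axis=1, keepdims=True)
    sim = s_norm @ t_norm.T                     # cosine similarities s_ij
    logp = np.log(softmax(sim / tau, axis=1))   # log p(j | i) over all candidates
    return -np.trace(logp)                      # -sum_i log p(i matched with i)
```

When student and teacher adjacencies coincide the local loss vanishes; the global loss stays strictly positive even for a perfect match, since the softmax spreads some mass over negatives.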

3.3. Graph Attention Networks in Sequence Labeling

For token-level classification, after extracting PhoBERT embeddings $H$, a GAT layer computes pairwise compatibility for every token pair:

$$e_{ij} = \mathrm{LeakyReLU}\left(\mathbf{a}^\top [W h_i \,\|\, W h_j]\right)$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}$$

Node updates aggregate messages over all neighbors, and multiple heads are combined to produce graph-enhanced embeddings (Nguyen, 13 Oct 2025). Further refinement occurs via Transformer-style self-attention before final classification.
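A single attention head on a fully connected token graph might be sketched as below. Splitting the attention vector $\mathbf{a}$ into source and destination halves follows the standard GAT formulation; the function and variable names are ours:

```python
import numpy as np

def gat_attention(H, W, a, slope=0.2):
    """Single-head GAT update over a fully connected token graph.

    e_ij = LeakyReLU(a^T [W h_i || W h_j]); alpha_ij = softmax_j(e_ij);
    returns the attention-weighted aggregation alpha @ (H W^T).
    """
    Wh = H @ W.T                                 # projected tokens, shape (N, D')
    N, Dp = Wh.shape
    # a^T [Wh_i || Wh_j] decomposes into a source term and a destination term.
    a_src, a_dst = a[:Dp], a[Dp:]
    e = Wh @ a_src[:, None] + (Wh @ a_dst[:, None]).T   # e[i, j], shape (N, N)
    e = np.where(e > 0, e, slope * e)            # LeakyReLU
    e = e - e.max(axis=1, keepdims=True)         # stable row-wise softmax
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    return alpha @ Wh                            # aggregated node updates

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 4))   # 5 tokens, 4-dim embeddings
W = rng.normal(size=(3, 4))   # projection to 3 dims
a = rng.normal(size=6)        # attention vector over [Wh_i || Wh_j]
out = gat_attention(H, W, a)
```

In a multi-head setting, several such heads would be computed in parallel and concatenated or averaged to yield the graph-enhanced embeddings.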

4. Integration into Model Architectures

4.1. Knowledge Distillation with Token Graphs

The overall distillation objective in the TRG framework fuses four distinct losses:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda\,\mathrm{KL}(p^{\mathcal{S}}, p^{\mathcal{T}}) + \alpha\,\mathcal{L}_{\text{inner}} + \beta\,\mathcal{L}_{\text{local}} + \gamma\,\mathcal{L}_{\text{global}}$$

The framework adds lightweight modules for token extraction and k-NN graph construction rather than a full graph neural network; the teacher graph stays fixed throughout (Zhang et al., 2023).

4.2. Token-Level Classification with GATs

In TextGraphFuseGAT, PhoBERT-generated embeddings pass through a GAT layer operating on the full token graph, followed by Transformer-style self-attention. The classification head applies a linear map and softmax at each token position:

$$z_i = W^{\text{cls}} h_i^{\mathrm{dec}} + b^{\text{cls}}, \qquad p_i = \mathrm{softmax}(z_i)$$

with the loss masked on padding and continuation sub-word positions (Nguyen, 13 Oct 2025).
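A minimal sketch of such a masked token-classification head, with illustrative names (the paper's exact head may differ in detail):

```python
import numpy as np

def token_classification(H_dec, W_cls, b_cls):
    """Per-token linear map + softmax: z_i = W h_i + b, p_i = softmax(z_i)."""
    z = H_dec @ W_cls.T + b_cls                  # logits, shape (N, C)
    z = z - z.max(axis=1, keepdims=True)         # numerical stability
    return np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

def masked_nll(p, labels, mask):
    """Negative log-likelihood averaged over unmasked (real-token) positions only.

    mask[i] = 1 for real tokens, 0 for padding/continuation sub-words.
    """
    nll = -np.log(p[np.arange(len(labels)), labels])
    return (nll * mask).sum() / mask.sum()

rng = np.random.default_rng(2)
H = rng.normal(size=(4, 3))   # 4 token positions, 3-dim decoder states
W = rng.normal(size=(2, 3))   # 2 output classes
b = np.zeros(2)
p = token_classification(H, W, b)
labels = np.array([0, 1, 0, 1])
mask = np.array([1.0, 1.0, 1.0, 0.0])   # last position is padding
loss = masked_nll(p, labels, mask)
```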

5. Empirical Performance and Analysis

Token-level relational structures demonstrate empirical superiority across tasks and modalities.

  • Image Classification & Distillation: On CIFAR-100, TRG achieves 76.42% Top-1 accuracy (ResNet-32×4→ResNet-8×4), surpassing other graph-based and relational distillation methods (e.g., HKD 76.21%, DKD 76.32%). For heterogeneous architectures, it remains robust (ResNet→ShuffleNet: 76.42% vs. 75.99% for HKD). On ImageNet-LT (long-tailed), TRG Top-1 accuracy (50.32%) exceeds even the teacher (50.11%), confirming enhanced minority class transfer (Zhang et al., 2023).
  • Token-Level NLP Tasks: TextGraphFuseGAT outperforms both transformer-only and hybrid neural models on Vietnamese NER and disfluency detection. For instance, on PhoNER-COVID19, the model achieves Micro-F1 0.984 and Macro-F1 0.958 (PhoBERT-large baseline: Micro-F1 0.945, Macro-F1 0.931), and on VietMed-NER it reaches F1 0.893 (PhoBERT-large: 0.730, XLM-R-large: 0.740). Ablation studies attribute a gain of more than 0.16 F1 to the GAT and decoder layers (Nguyen, 13 Oct 2025).
| Task | Baseline Metric | Token-Level Relational Structure Metric |
|---|---|---|
| CIFAR-100 | 76.21% Top-1 (HKD) | 76.42% Top-1 (TRG) |
| PhoNER-COVID19 | 0.945 Micro-F1 (PhoBERT-large) | 0.984 Micro-F1 (TextGraphFuseGAT) |
| VietMed-NER | 0.730 F1 (PhoBERT-large) | 0.893 F1 (TextGraphFuseGAT) |

Token-level relational structures show marked benefits in robustness to class imbalance, finer class clustering in feature space, and task generalization.

6. Implications for Long-Tailed and Low-Resource Scenarios

Token-level relational distillation mitigates long-tail effects by leveraging the fact that rare-class instances often contain patch tokens semantically similar to those in abundant classes (e.g., "fur," "wheel"). By pooling across all tokens, relational graphs facilitate knowledge transfer to under-represented areas without overfitting to majority classes (Zhang et al., 2023). Similarly, in low-resource NLP settings, fully connected token graphs and attention mechanisms capture long-range dependencies and syntactic phenomena not present in local sequential windows (Nguyen, 13 Oct 2025).

Empirical results confirm that token-level relational structures can produce student models that not only match but sometimes exceed teacher performance in challenging, imbalanced or domain-specific contexts.

7. Perspectives and Research Directions

Token-level relational structures unify local and global relational information in a generalizable framework, applicable to both vision and sequence learning. The introduction of direct graph-based reasoning (via k-NN or complete graphs) augments or complements transformer-style attention with explicit relational inductive biases. A plausible implication is that further exploration of hybrid architectures—combining graph neural modules, transformers, and domain-specific pretraining—will yield enhanced interpretability and cross-domain transfer, especially for minority or rare-event phenomena.

Key open challenges include efficient scaling of graph construction for long sequences or dense patch tokenizations, integration with dynamic or hierarchical graph structures, and systematic evaluation on multilingual or multi-modal corpora.
