
Token-Sentence-Document Hierarchies in NLP

Updated 30 January 2026
  • Token-sentence-document hierarchies are multi-scale representation schemes that decompose text into tokens, sentences, and documents to capture both local details and global context.
  • They utilize diverse architectures such as convolutional layers, recurrent networks, transformer chunking, and graph fusion to efficiently model long-range dependencies.
  • These hierarchies enhance interpretability and scalability by employing multi-level attention, gradient saliency, and advanced pooling strategies for tasks like classification and summarization.

A token-sentence-document hierarchy refers to a multi-scale representation scheme in natural language processing where document-level modeling is performed by progressively composing local units (tokens), aggregating them into higher-order units (sentences), and finally synthesizing full-document representations from aggregated sentence features. This structuring reflects the intrinsic linguistic and discourse hierarchy in text, facilitates scalable modeling of long-range dependencies, and underpins numerous state-of-the-art architectures for document classification, matching, summarization, segmentation, and cross-lingual transfer. The design space encompasses hierarchical convolutions, RNNs, transformers, attention mechanisms, pre-trained sentence encoders, graph-tree fusions, and hybrid pooling strategies, each with distinct approaches to the passage of information and contextualization across levels.

1. Conceptual Foundations and Historical Context

Hierarchical architectures in NLP arise from limitations of flat models and early neural methods that struggled to encode long documents due to computational bottlenecks and vanishing context. The two-stage convolutional model ("Extraction of Salient Sentences from Labelled Documents" (Denil et al., 2014)) explicitly introduced word→sentence→document convolutional hierarchies and demonstrated that a bottleneck at the sentence level enforced semantic localization, enabling interpretable extraction and introspection via backpropagation-based saliency maps. With the emergence of sequence models (GRU/LSTM), hierarchical attention architectures (e.g., HAN [Yang et al., 2016], "A Systematic Comparison of Architectures for Document-Level Sentiment Classification" (Barnes et al., 2020)) used RNN layers to encode tokens into sentences and sentences into global document vectors, leveraging layered attention mechanisms to capture local and global structure.

Later, transformer-based approaches faced context window limits (e.g., BERT's 512-token cap), prompting innovations such as hierarchical chunking ("Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching" (Yang et al., 2020), "Hierarchical Neural Network Approaches for Long Document Classification" (Khandve et al., 2022)) and block-level aggregation, or extending to graph and tree structures to capture syntactic and semantic relationships ("Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification" (Roy et al., 2024)). Pre-trained sentence encoders (LASER, LaBSE, sBERT) and static/learnable pooling schemes further reinforced the value of decomposing documents for efficient, robust multilingual and cross-domain modeling ("Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?" (Sannigrahi et al., 2023)).

2. Architectural Variants and Layered Composition Strategies

Token-sentence-document hierarchies realize compositionality using distinct architectural regimes:

  • Convolutional Hierarchies: Cascaded convolutional layers over token (word) embeddings extract local sentence features, which are then convolved and pooled to yield fixed-size document vectors; sentence-level and document-level filter banks have separate parameters, enforcing a strict information bottleneck between levels (Denil et al., 2014).
  • Hierarchical RNNs/Attention Networks: Bidirectional GRUs/LSTMs encode word sequences into contextual sentence states, followed by another BiRNN layer with attention over sentences to generate document representations. Weighted attention at each level surfaces salient components (Barnes et al., 2020).
  • Transformer-based Hierarchy: Documents are split into chunks or sentences, each chunk mapped to a representation via a base transformer (e.g., BERT or USE). Higher-level neural modules (shallow transformers, CNNs, Bi-LSTMs) aggregate these chunk embeddings with max pooling or projection to synthesize a document vector, yielding near-linear complexity scaling (Khandve et al., 2022, Lu et al., 2021, Yang et al., 2020).
  • Syntax-Tree and Graph Fusion: Sentence encoding leverages dependency/constituency tree transformers, aggregating local syntactic relations into sentence vectors, which are fused via graph attention on a document graph constructed from sentence-document relations. This enables multi-granular dependency capture and supports bidirectional bottom-up and top-down flow of contextualization (Roy et al., 2024).
  • Pre-trained Sentence Embeddings + Pooling: Off-the-shelf sentence encoders (LASER, LaBSE, sBERT) compute embeddings for each sentence; document-level representations are formed by averaging, TF-IDF weighting, static positional windowing (PERT), or learnable attention/fusion schemes. Empirical results indicate that sentence averaging and positional pooling outperform naive token truncation and global feed-through methods (Sannigrahi et al., 2023).
| Model Type | Local Encoder | Aggregation | Attention/Pooling |
| --- | --- | --- | --- |
| Conv Hierarchy | Conv1D | k-max pool | None / gradient saliency |
| HAN / BiRNN Hierarchy | GRU/LSTM | BiRNN | Word/sentence-level attention |
| Transformer Hierarchy | BERT/USE | CNN/LSTM | Mean/max pooling |
| Graph-Tree Fusion | TreeTrans | GAT | Multi-head graph attention |
| Pre-trained Sentence + Pooling | LASER/sBERT | Averaging | Positional / TF-IDF / attention |
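As a minimal illustration of the layered composition these variants share, the following sketch composes token embeddings into sentence vectors and sentence vectors into a document vector via attention pooling, in the spirit of HAN. Plain NumPy is used, with randomly initialized vectors standing in for trained encoders and attention contexts; all names (`attention_pool`, `u_w`, `u_s`) are illustrative, not from any of the cited implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding dimension

def attention_pool(X, context):
    """Pool an (n, d) matrix into a single (d,) vector with softmax attention."""
    scores = X @ context                      # (n,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the n units
    return weights @ X, weights               # weighted sum, plus the weights

# A toy document: 3 sentences with 5, 7, and 4 token embeddings each.
sentences = [rng.normal(size=(n, d)) for n in (5, 7, 4)]

# Level 1: tokens -> sentence vectors (word-level attention context u_w).
u_w = rng.normal(size=d)
sent_vecs = np.stack([attention_pool(S, u_w)[0] for S in sentences])

# Level 2: sentences -> document vector (sentence-level attention context u_s).
u_s = rng.normal(size=d)
doc_vec, sent_weights = attention_pool(sent_vecs, u_s)

print(doc_vec.shape)  # (16,)
```

In a trained model the two pooling stages would sit on top of word- and sentence-level BiRNN (or transformer) encoders, and `u_w`, `u_s` would be learned; the composition pattern is the same.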

3. Information Flow, Attention Mechanisms, and Saliency Extraction

The token-sentence-document hierarchy is not only a compositional structure but an interface for attention and information flow:

  • Multi-Level Attention: Hierarchical Attention Networks (HAN) implement attention weights α_{t,i} for tokens within sentences and β_t for sentences within documents, enabling the attribution and extraction of salient units (Barnes et al., 2020).
  • Self-Attention on Sentences: Transformer-based hierarchies apply multi-head attention and normalization successively over sentence representations, capturing inter-sentence dependency and yielding attention maps that support explainability, as in the sentence-level hierarchical BERT (Lu et al., 2021).
  • Gradient Saliency: Convolutional models utilize gradient-based methods to compute the saliency of tokens or sentences with respect to task loss, supporting sentence extraction for downstream evaluation (Denil et al., 2014).
  • Bidirectional Information Propagation: Graph-tree fusion models propagate context bottom-up (word→sentence→document) via tree transformer and graph attention, and top-down (document→sentence→word), iteratively refining token and sentence representations based on global document context (Roy et al., 2024).

Sentence extraction for explanation or summarization is operationalized by ranking sentences by attention/saliency weights, thresholding, and returning the top-k units. Statistical evaluation of extracted sub-documents against full-text classifiers quantifies the preservation of task-relevant information (Denil et al., 2014).
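The rank-threshold-return procedure can be sketched directly. The saliency scores here are hypothetical stand-ins for sentence-level attention or gradient-saliency weights; the function name and example document are illustrative.

```python
def extract_top_k(sentences, scores, k, threshold=0.0):
    """Return the k highest-scoring sentences, in original document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = [i for i in ranked[:k] if scores[i] >= threshold]
    return [sentences[i] for i in sorted(keep)]

doc = ["Intro filler.", "Key finding one.", "An aside.", "Key finding two."]
saliency = [0.05, 0.45, 0.10, 0.40]   # e.g., sentence-level attention weights
print(extract_top_k(doc, saliency, k=2))
# -> ['Key finding one.', 'Key finding two.']
```

Restoring document order after ranking matters when the extracted sentences are read as a summary rather than scored in isolation.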

4. Computational Scaling and Efficiency

Hierarchical decomposition directly addresses the scaling bottlenecks of flat sequence encoders. Vanilla transformers are limited by O(N^2) self-attention over input length N, prohibiting efficient modeling of long documents. Hierarchical models reduce this cost by encoding small local units (chunks, sentences) of length L with O(L^2) complexity each, then aggregating the M = N/L units, for a total cost of O(N·L), near-linear in N. Empirical benchmarks on hierarchical BERT, USE+LSTM/CNN, and SMITH architectures confirm that document context windows can be increased from 512 tokens (BERT) to 2,048 tokens (SMITH) or more, with competitive or superior accuracy (Yang et al., 2020, Khandve et al., 2022). Sparse-attention models such as Longformer and BigBird offer an alternative by restricting attention to local windows and global tokens, achieving theoretical O(N) complexity (Khandve et al., 2022).
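The scaling argument can be checked with a back-of-the-envelope count of attention-score computations (unit costs, ignoring constants and feed-forward layers; the two helper functions are illustrative):

```python
def flat_cost(N):
    """Full self-attention over N tokens: N^2 score computations."""
    return N * N

def hierarchical_cost(N, L):
    """Encode N/L chunks of length L locally, then attend over the chunk vectors."""
    M = N // L                       # number of chunks
    return M * L * L + M * M         # per-chunk attention + chunk-level attention

N, L = 2048, 128
print(flat_cost(N))                  # 4,194,304
print(hierarchical_cost(N, L))       # 16 chunks: 16*128^2 + 16^2 = 262,400
```

For fixed L the hierarchical term M·L^2 = N·L grows linearly in N, which is why chunked encoders can stretch the context window well past a flat encoder's budget.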

5. Pre-training, Transfer Learning, and Multilingual Hierarchies

Pre-training schemes for hierarchical models extend standard LM objectives to block-level context aggregation. Hierarchical LM and masked LM objectives jointly update local and block encoders, enforcing context integration—pre-trained hierarchical encoders consistently improve segmentation, passage retrieval, and summarization over local-only counterparts (Chang et al., 2019). Sentence-level hierarchical BERT models exploit frozen token encoders to mitigate overfitting in low-resource scenarios (Lu et al., 2021). In multilingual and cross-lingual contexts, document embedding methods rely on sentence encoders fused by static pooling (TK-PERT), TF-IDF weighting, or learnable attention (ATT-PERT), showing that positionally-aware pooling is essential to outperform token-truncation and naive averaging in document alignment and classification (Sannigrahi et al., 2023).

| Pooling/Combination | Training Required | Main Benefit |
| --- | --- | --- |
| Sentence Average | None | Top-line zero-shot accuracy |
| TK-PERT (positional) | None | Semantic retrieval |
| ATT-PERT (learnable) | Yes (downstream) | Low-resource classification |
| Token Truncation | None | Only for short documents |
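The simplest of these pooling schemes can be sketched over pre-computed sentence embeddings. Random vectors stand in for LASER/sBERT outputs, the TF-IDF weights are placeholders, and the exponential positional weighting is a simplified stand-in for PERT-style windowing, not the published formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sent, d = 6, 8
E = rng.normal(size=(n_sent, d))   # pre-computed sentence embeddings

# 1) Plain sentence averaging.
doc_mean = E.mean(axis=0)

# 2) TF-IDF-style weighting (real weights would come from sentence term
#    statistics; these placeholders just sum to 1).
w_tfidf = np.array([0.3, 0.1, 0.2, 0.1, 0.2, 0.1])
doc_tfidf = w_tfidf @ E

# 3) Position-aware weighting: emphasize early sentences, a crude stand-in
#    for PERT-style positional windowing.
pos = np.arange(n_sent)
w_pos = np.exp(-pos / 2.0)
w_pos /= w_pos.sum()
doc_pos = w_pos @ E

print(doc_mean.shape, doc_tfidf.shape, doc_pos.shape)  # all (8,)
```

All three reduce an (n_sent, d) matrix to a single d-dimensional document vector; they differ only in how the mixing weights are chosen, which is exactly the axis the table above compares.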

6. Empirical Performance, Robustness, and Interpretability

Empirical results across document classification tasks (sentiment, news categorization, medical coding) and retrieval/semantic alignment confirm that non-trivial hierarchies outperform flat models and unstructured transfer learning, particularly as document length and structural complexity increase (Barnes et al., 2020, Khandve et al., 2022, Yang et al., 2020). Hierarchical models maintain robustness to sentence order shuffling, exploiting local coherence not captured by flat CNN/LMs. Attention-based and pooling methods support interpretable extraction of salient sentences, with explainability inherent to the hierarchical structure (Lu et al., 2021, Denil et al., 2014). Graph-tree fusion and bidirectional propagation provide further gains in modeling syntax-semantic relations and enable arbitrarily long context processing (Roy et al., 2024).

A plausible implication is that as documents increase in length and structural heterogeneity, the explicit modeling of the token-sentence-document hierarchy (including syntactic parses and graph fusion) becomes increasingly critical for accuracy, robustness, and interpretability.

7. Limitations, Adaptations, and Future Directions

Limitations of current hierarchical approaches include the reliance on explicit sentence segmentation (which may fail in informal or low-resource languages), potential loss of fine-grained positional information in k-max pooling, and computational burdens from stacking multiple neural components. Adaptations are possible—replacing convolutional/pooling layers with RNNs, transformers, or graph/tree attention modules, inserting intermediate levels (paragraph, section), or fine-tuning pooling/attention schemes for domain/task adaptation (Khandve et al., 2022, Denil et al., 2014, Roy et al., 2024). Recent evidence suggests that integrating pre-trained contextualized encoders at each hierarchy level (cross-sentence, cross-paragraph) and enabling end-to-end training with learnable attention can further improve performance. The fusion of syntactic parsing, sentence-level transformers, and document-level graph attention (with bidirectional information flow) represents an active frontier for robust, interpretable, scalable modeling of arbitrarily long and structured documents.

Ongoing work focuses on generalizing these hierarchies to diverse languages, genres, and structures; optimizing hardware efficiency; and aligning interpretability and extraction mechanisms with downstream human-in-the-loop evaluation and multilingual requirements (Sannigrahi et al., 2023, Roy et al., 2024).
