
Transformer-Based Pre-trained Language Models

Updated 7 February 2026
  • Transformer-based PLMs are large neural models that use multi-layer self-attention to generate contextual token representations for diverse NLP tasks.
  • They employ self-supervised objectives such as masked language modeling and next-token prediction on massive unlabeled corpora to learn adaptable features.
  • Their modular architecture, including encoder-only, decoder-only, and encoder–decoder variants, supports fine-tuning and domain-specific applications like biomedical and legal text analysis.

Transformer-based pre-trained language models (PLMs) are large neural architectures obtained by training multi-layer Transformer networks on vast unlabeled text corpora with self-supervised objectives such as masked language modeling (MLM), next-sentence prediction (NSP), or autoregressive next-token prediction. Unlike traditional static embedding approaches, these models produce contextualized token representations, enabling effective adaptation to specific NLP tasks with minimal labeled supervision. PLMs have set the state of the art across domains including general language understanding, biomedical informatics, legal text analysis, and speech recognition.

1. Architectural Principles and Contextualization

Transformer-based PLMs universally rely on the Transformer architecture: stacks of self-attention (SA) and feed-forward (FFN) layers, punctuated with residual connections and layer normalization (Min et al., 2021). Each layer comprises multi-head self-attention (which models contextual relationships between all sequence positions) and position-wise FFNs. Embedding layers combine token, positional, and, occasionally, segment or auxiliary embeddings.

A key distinction of PLMs is their capacity to yield highly context-specific representations. For a token $w$ at position $i$ in a sequence, the model computes a distinct vector $h_{\ell,i}(w)$ at each layer $\ell$, directly dependent on the sentential context. This yields a separation between polysemous senses unattainable with static embeddings. Quantitative analyses of BERT reveal that contextualization primarily arises in the SA sub-layer, especially at middle-to-late depths (layers $\sim$9–10), with context-driven divergence highest in the multi-head self-attention block. The later output sub-layer partially reimposes token identity, making it less suitable for sense disambiguation (Vijayakumar et al., 2023).
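The contextualization point can be made concrete with a minimal NumPy sketch of a single post-LN encoder layer (single-head, with random, purely illustrative weights rather than anything from a released model): the same token embedding, passed through self-attention in two different sentences, comes out as two different vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def encoder_layer(x, Wq, Wk, Wv, Wo, W1, W2):
    """One post-LN Transformer encoder layer: self-attention, then a
    position-wise FFN, each wrapped in a residual connection + layer norm."""
    d = x.shape[-1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d)) @ v   # scaled dot-product self-attention
    x = layer_norm(x + attn @ Wo)              # residual + layer normalization
    ffn = np.maximum(0, x @ W1) @ W2           # position-wise FFN (ReLU)
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
d, d_ff = 16, 64
params = [rng.normal(0, 0.1, s) for s in
          [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]

# Static embeddings for a toy vocabulary; "bank" appears in two contexts.
vocab = {w: rng.normal(0, 1, d) for w in ["the", "bank", "river", "loan"]}
sent1 = np.stack([vocab[w] for w in ["the", "bank", "river"]])
sent2 = np.stack([vocab[w] for w in ["the", "bank", "loan"]])

h1 = encoder_layer(sent1, *params)[1]  # "bank" in the first context
h2 = encoder_layer(sent2, *params)[1]  # "bank" in the second context
cos = h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2))
assert cos < 1.0  # same input token, different contexts, different outputs
```

A real PLM stacks dozens of such layers with multi-head attention; this sketch only illustrates how the self-attention sub-layer injects context into each position's representation.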

2. Pre-training Objectives and Optimization

PLMs are trained on large corpora with self-supervised tasks:

  • Masked Language Modeling (MLM): Randomly mask a subset of input tokens; the model predicts missing tokens, optimizing

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i\in M} \log P(x_i \mid \tilde{x})$$

where $M$ is the set of masked positions and $\tilde{x}$ is the corrupted (masked) input (Min et al., 2021, Dadas et al., 2020).
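As a small illustration (toy logits and token ids standing in for a real model's outputs), the MLM loss above is just cross-entropy summed over the masked positions:

```python
import numpy as np

def mlm_loss(logits, targets, mask_positions):
    """L_MLM = -sum over masked positions i of log P(x_i | x-tilde):
    cross-entropy computed only where tokens were masked."""
    # log-softmax over the vocabulary dimension
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    log_probs = np.log(e / e.sum(axis=-1, keepdims=True))
    return -sum(log_probs[i, targets[i]] for i in mask_positions)

# toy example: sequence of 5 tokens, vocabulary of 10, positions 1 and 3 masked
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))    # model scores for each position
targets = np.array([2, 7, 1, 4, 9])  # original token ids
loss = mlm_loss(logits, targets, mask_positions=[1, 3])
```

Unmasked positions contribute nothing to the loss, which is what lets MLM learn from unlabeled text: the corpus itself supplies the prediction targets.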

Recent works often drop NSP (e.g., RoBERTa, Polish BERT (Dadas et al., 2020)) and employ dynamic masking, whole-word masking, or domain-tuned masking protocols.

3. Scalability, Architecture Variants, and Training Regimes

Transformer-based PLMs exhibit several structural variants:

  • Encoder-only models (BERT, RoBERTa): Bidirectional context modeling via MLM/NSP. Common in NLU tasks (Min et al., 2021, Yang, 2021).
  • Decoder-only models (GPT): Causal language modeling, used for text generation (Shen, 2021, Yang, 2021).
  • Encoder–decoder models (T5, BART): Sequence-to-sequence objectives for summarization, translation, or denoising (Shen, 2021).
  • Architectural depth and width are critical for downstream performance. FPM systematically demonstrates the existence of a task-optimal depth (e.g., 70 layers for QQP), with larger models outperforming smaller ones up to an optimum, beyond which overfitting or training instability occurs (Shen, 2021).
  • Stable training at scale requires careful optimizer scheduling, mixed precision (when possible), gradient clipping, and in some cases, architectural modifications (e.g., sparse attention for long contexts in KoBigBird) (Shen, 2021, Yang, 2021).
  • Monolingual, domain-specific corpora consistently yield superior results for specialized domains/languages compared to multilingual baselines, as observed in Polish, Korean, Lao, and legal-domain PLMs (Dadas et al., 2020, Yang, 2021, Lin et al., 2021, Paul et al., 2022).
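Mechanically, the encoder-only vs. decoder-only split comes down to the self-attention mask; a short sketch (variant names are illustrative) shows the difference. Encoder–decoder models combine a full-attention encoder with a causal decoder plus cross-attention, which is omitted here for brevity.

```python
import numpy as np

def attention_mask(n, variant):
    """Boolean mask: True where a query position may attend to a key position."""
    if variant == "encoder":   # bidirectional (BERT-style): every position sees all others
        return np.ones((n, n), dtype=bool)
    if variant == "decoder":   # causal (GPT-style): no attending to future positions
        return np.tril(np.ones((n, n), dtype=bool))
    raise ValueError(variant)

causal = attention_mask(4, "decoder")
# position 0 sees only itself; the last position sees the whole prefix
assert causal[0].sum() == 1 and causal[3].sum() == 4
assert attention_mask(4, "encoder").all()
```

The mask is why MLM suits encoders (predictions may draw on both directions) while autoregressive generation requires the causal decoder mask.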

4. Adaptation Mechanisms and Task Transfer

PLMs support multiple adaptation paradigms:

  • Full fine-tuning: Entire model updated on labeled task data with a task-specific head (classification/regression/sequence tagging) (Dadas et al., 2020, Min et al., 2021).
  • Parameter-efficient tuning: Methods such as prompt tuning, adapters, and BitFit freeze PLM parameters and train only small auxiliary parameters (Zhu et al., 2022, Wang et al., 2022).
  • Prompting and prefix tuning: Tasks are re-expressed as masked or completion-style prompts, either discrete or continuous (learned vector prefixes) (Zhu et al., 2022, Min et al., 2021). Soft prompts can recover up to 95% of full fine-tuning performance.
  • Knowledge distillation and inheritance: Large PLMs serve as teachers to smaller or shallow students via KL-divergence (soft target) losses; this accelerates convergence and enables efficient scaling (Zhu et al., 2022, Qin et al., 2021).
  • Domain adaptation: Continued pre-training of general or near-domain PLMs on domain-specific corpora enables transfer to new linguistic or technical domains (e.g., Indian Law (Paul et al., 2022), biomedical corpora (Kalyan et al., 2021)).
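The soft-target loss used in knowledge distillation can be sketched as a temperature-scaled KL divergence between teacher and student output distributions (toy logits below; T=2.0 is an arbitrary illustrative choice, and real recipes typically mix this with a hard-label term):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    e = np.exp(z / T - (z / T).max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Soft-target loss: KL(p_teacher || p_student) at temperature T."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

t = np.array([3.0, 1.0, 0.2])        # teacher logits for one example
s_far = np.array([0.1, 2.5, 0.4])    # student far from the teacher
s_close = np.array([2.9, 1.1, 0.1])  # student close to the teacher
# a student matching the teacher's distribution incurs lower loss
assert distill_loss(t, s_close) < distill_loss(t, s_far)
```

Softening with temperature exposes the teacher's relative preferences among wrong answers, which is where much of the transferable signal lives.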

5. Internal Mechanisms and Model Interpretability

PLMs encode “skills”—task-relevant computations—in an unexpectedly sparse subset of their sub-components:

  • Skill neurons: After prompt tuning, a small subset of feed-forward neurons achieves high predictivity for task labels when their activations on prompt tokens are thresholded. Perturbing these neurons degrades accuracy substantially; their predictivity is largely acquired during pre-training rather than during prompt tuning (Wang et al., 2022).
  • Contextualization locus: Context-driven sense disambiguation predominantly arises in multi-head self-attention sub-layers of middle-to-late layers. The activation sub-layer refines this, and the output sub-layer partially restores the identity information via residual connections (Vijayakumar et al., 2023).
  • Attention head specialization: Heads in middle layers encode disambiguating cues essential for sense separation, while some heads (“CLS attenders”) focus on global or sentence-level signals (Vijayakumar et al., 2023).

These findings inform interpretability research and methodology development for pruning (removing non-skill neurons/heads) or transferability assessment.
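The skill-neuron predictivity measure reduces to thresholding a single activation against task labels. A synthetic NumPy sketch (random data standing in for real neuron activations, with the threshold search simplified to observed activation values):

```python
import numpy as np

def neuron_predictivity(activations, labels):
    """Best accuracy obtainable by thresholding one neuron's activation
    to predict a binary task label (trying both polarities)."""
    best = 0.0
    for t in activations:                 # candidate thresholds
        pred = activations > t
        acc = max((pred == labels).mean(), (pred != labels).mean())
        best = max(best, acc)
    return best

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200).astype(bool)
skill = labels + rng.normal(0, 0.2, 200)  # activation correlated with the label
noise = rng.normal(0, 1, 200)             # unrelated "non-skill" neuron
assert neuron_predictivity(skill, labels) > 0.9
assert neuron_predictivity(noise, labels) < neuron_predictivity(skill, labels)
```

In the original setting the activations come from feed-forward neurons evaluated on soft-prompt tokens; high-predictivity neurons are the candidates for pruning-aware analysis and transferability assessment.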

6. Domain-Specific and Multilingual Adaptation

PLMs are extensible to low-resource or domain-specific languages and technical domains:

  • Biomedical: Domain-specific pre-training (PubMedBERT), intermediate fine-tuning, special tokenization, and ontology-enriched embeddings are standard. Robustness to domain shift, noise, and fairness issues are crucial challenges (Kalyan et al., 2021).
  • Legal: Continued pre-training and vocabulary adaptation to local legal text (e.g., Indian Law) yield higher accuracy and improved alignment with expert attention (Paul et al., 2022).
  • Korean, Polish, Lao: Monolingual PLMs (with domain-matched corpora and tokenizers) outperform multilingual models on all downstream tasks, including POS tagging, classification, paraphrase, and NER (Dadas et al., 2020, Yang, 2021, Lin et al., 2021).
  • Cross-modal transfer: Transformer-based PLMs pre-trained on text transfer to speech recognition (ASR): repurposed self-attention blocks reduce word/character error rates when used as encoders, even when initially trained on text only (An et al., 2024).

7. Current Limitations and Prospects

  • Impossible Triangle: No current PLM simultaneously offers (P1) moderate model size ($\leq$ 1B parameters), (P2) state-of-the-art few-shot/zero-shot ability, and (P3) state-of-the-art fully supervised accuracy. Existing models achieve at most two of the three properties (e.g., GPT-3 achieves P2+P3 but not P1). Advanced knowledge distillation, data augmentation, and universal prompt learning are active approaches to bridging these gaps (Zhu et al., 2022).
  • Scalability and Efficiency: Model and memory efficiency (through targeted pruning, adapter layers, and mixture-of-experts) remain open domains (Shen, 2021).
  • Interpretability and Bias: Mechanistic understanding of knowledge storage, debiasing (especially in biomedical/legal), privacy preservation, and representation probing are ongoing challenges (Wang et al., 2022, Kalyan et al., 2021).
  • Task and Modality Transfer: Automatic methods for task-optimal architecture search, cross-lingual/intermodal transfer, and synthetic data generation are under investigation (Min et al., 2021, An et al., 2024, Zhu et al., 2022).

A staged roadmap targets short-term improvements to the missing triangle vertex, medium-term per-task solutions, and eventually universal moderate-sized PLMs that generalize across tasks and modalities via unified architectures and multi-objective pre-training (Zhu et al., 2022).


This entry integrates evidence spanning analyses of model structure, training paradigms, scalability, adaptation, interpretability, and specialization, illustrating the technical landscape and ongoing advances in Transformer-based pre-trained language models (Vijayakumar et al., 2023, Dadas et al., 2020, Zhu et al., 2022, Qin et al., 2021, Shen, 2021, An et al., 2024, Paul et al., 2022, Kalyan et al., 2021, Lin et al., 2021, Wang et al., 2022, Yang, 2021, Min et al., 2021).
