
GatorTronGPT: Clinical Transformer Models

Updated 22 February 2026
  • GatorTronGPT is a family of large, decoder-only Transformer models that use prompt-tuning and LoRA adapters for parameter-efficient clinical NLP.
  • It achieves state-of-the-art performance on tasks such as clinical concept extraction, relation extraction, and summarization, with consistent improvements over encoder-based baselines.
  • The models efficiently process diverse clinical tasks while reducing training time and computational overhead, facilitating rapid adaptation to new clinical applications.

GatorTronGPT is a family of large, decoder-only Transformer LLMs for clinical and biomedical text analytics, developed following the GPT-3 architectural paradigm. Trained on a hybrid corpus of clinical notes and open-domain English, GatorTronGPT models are specifically optimized for parameter-efficient adaptation via prompt-tuning. Their design enables high performance across a spectrum of clinical NLP tasks, including information extraction, relation extraction, summarization, and medical text generation, while minimizing computational overhead and safeguarding source domain knowledge.

1. Model Architecture and Pretraining

GatorTronGPT adopts a GPT-3–style decoder-only Transformer architecture. Two principal model sizes are reported: GatorTronGPT-5B with approximately 5 billion parameters, and GatorTronGPT-20B with around 20 billion parameters. The 20B instance comprises 44 Transformer decoder blocks, each implementing masked self-attention with h = 48 heads (d_model = 6144, d_k = d_v = 128), position-wise feed-forward networks (inner dimension d_ff ≈ 16,384), layer normalization, residual connections, and rotary position embeddings (RoPE). Tokenization uses byte-pair encoding with a vocabulary size near 50,000 subword units (Peng et al., 2023, Lyu et al., 2024, Lyu et al., 2024, Peng et al., 2023, Peng et al., 5 Sep 2025).

Pretraining leverages 277 billion tokens: 82 billion tokens of de-identified clinical notes from the UF Health EHR (2011–2021), complemented by 195 billion tokens from general English sources (the Pile, Wikipedia, web text). The mixture is approximately 30% clinical and 70% open-domain, targeting clinical linguistic competence while preserving robustness in general language understanding. The objective is standard left-to-right causal language modeling, minimizing the negative log-likelihood of next-token prediction:

L(\theta) = -\sum_{t=1}^{N} \log P_\theta(x_t \mid x_{<t}),

with perplexity PP = \exp(L(\theta)/N). Implementation employs Megatron-LM, with both tensor and data parallelism across large-scale GPU clusters (Peng et al., 2023, Lyu et al., 2024, Lyu et al., 2024, Peng et al., 5 Sep 2025).
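As a concrete illustration, the loss and perplexity defined above can be computed directly from per-token probabilities. A minimal NumPy sketch (the probabilities below are toy values, not model output):

```python
import numpy as np

def causal_lm_loss(token_probs):
    """Negative log-likelihood L(theta) of the observed sequence.

    token_probs[t] stands in for the model's next-token probability
    P_theta(x_t | x_<t); a real model derives these from softmax logits.
    """
    return -np.sum(np.log(token_probs))

def perplexity(loss, n_tokens):
    # PP = exp(L(theta) / N), as defined above.
    return np.exp(loss / n_tokens)

# Hypothetical per-token probabilities for a 4-token sequence.
probs = np.array([0.5, 0.25, 0.125, 0.5])
L = causal_lm_loss(probs)
print(round(float(perplexity(L, len(probs))), 4))  # 2**(7/4) ≈ 3.3636
```

Lower perplexity means the model assigns higher probability to the observed text; a uniform guess over a 50,000-token vocabulary would give PP = 50,000.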

2. Prompt-Tuning and Parameter-Efficient Adaptation

GatorTronGPT operationalizes parameter-efficient fine-tuning (PEFT) through prompt-tuning, inserting contiguous “soft prompts”—learnable virtual token embeddings—prepended to the input sequence. All original model weights (θ) remain frozen; only the soft prompt parameters (P) are optimized. Prompts are typically 32–512 tokens in length, with empirical optima around 100–128 for clinical tasks. Initialization strategies include multi-layer perceptrons (MLP), randomly-initialized matrices, or recurrent stacks such as LSTMs (Lyu et al., 2024, Peng et al., 2024, Peng et al., 2023, Lyu et al., 2024). The learning objective remains standard cross-entropy for sequence generation:

P^* = \arg\min_P \mathcal{L}(\mathrm{LLM}([P; X]), Y),

where X is the tokenized input and Y is the target. Prompt-tuning can also be extended to “deep” schemes (P-tuning v2), inserting virtual tokens into every Transformer block (Peng et al., 2024).
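The mechanics of the [P; X] construction can be sketched without a real LLM: freeze the embedding table, learn only a small prompt matrix, and feed the frozen model the concatenated sequence. A minimal NumPy illustration with toy dimensions (the gradient update on the prompt is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, n_prompt = 64, 100, 8  # toy sizes; ~100-128 prompt tokens are reported in practice

embedding = rng.normal(size=(vocab_size, d_model))   # frozen base-model weights (theta)
soft_prompt = rng.normal(size=(n_prompt, d_model))   # the only trainable parameters (P)

def build_input(token_ids):
    """Form the [P; X] sequence the frozen LLM consumes: learnable
    soft-prompt vectors prepended to frozen token embeddings."""
    x = embedding[token_ids]                 # embedding lookup stays frozen
    return np.concatenate([soft_prompt, x], axis=0)

seq = build_input(np.array([3, 14, 15]))
print(seq.shape)  # (n_prompt + |X|, d_model) = (11, 64)
```

Because only `soft_prompt` receives gradients, a new task or institution can be supported by swapping in a different prompt matrix while the base model stays untouched.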

Some experiments leverage LoRA (Low-Rank Adaptation) adapters per attention block for additional PEFT flexibility, where only low-rank matrices are updated:

\Delta W = A B, \quad A \in \mathbb{R}^{d \times r}, \quad B \in \mathbb{R}^{r \times d},

with rank r up to 256 and d the block dimension. LoRA weights are merged into the base parameters at inference (Peng et al., 5 Sep 2025).
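A minimal NumPy sketch of the LoRA update and its inference-time merge; the dimensions, rank, and initialization scale here are illustrative rather than the reported configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 256, 8                        # toy block dimension and rank

W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(d, r)) * 0.01   # trainable low-rank factor
B = rng.normal(size=(r, d)) * 0.01   # in practice zero-initialized; shown "post-training" here

def lora_forward(x):
    # Training-time path: frozen W plus the low-rank update AB.
    return x @ W + (x @ A) @ B

def merged_weight():
    # Inference-time merge: fold Delta W = AB into the base weight once.
    return W + A @ B

x = rng.normal(size=(1, d))
# The merged weight reproduces the adapter path, so inference pays no extra cost.
assert np.allclose(lora_forward(x), x @ merged_weight())
```

The merge step is why LoRA adds no inference latency: after training, the adapted model is just a dense weight matrix again.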

3. Unified Text-to-Text Clinical Task Formulation

GatorTronGPT employs a unified text-to-text architecture: all clinical NLP problems are cast as generative sequence tasks with templated instructions. Seven major clinical tasks have been demonstrated:

  • Clinical concept extraction
  • Clinical relation extraction
  • Concept normalization (e.g., UMLS mapping)
  • Clinical abbreviation disambiguation
  • Natural language inference
  • Medication event/context classification
  • Progress note understanding (e.g., assessment–plan relation)

The task-specific prompt template is prepended to the input, e.g.:

```
|VIRTUAL_PROMPT|> Input: {raw clinical note}
Output:
```

Target outputs are fully formatted, natural-language answers, allowing direct application of generative modeling without task-specific heads or pipelines (Peng et al., 2023, Peng et al., 2024, Peng et al., 5 Sep 2025).
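A small sketch of this text-to-text casting; the instruction wording, field labels, and example note below are illustrative, not the papers' exact templates:

```python
def format_example(instruction, note, answer=None):
    """Cast a clinical NLP task as sequence generation: instruction and
    input go in, a formatted natural-language answer comes out."""
    prompt = f"{instruction}\nInput: {note}\nOutput:"
    if answer is None:
        return prompt               # inference: the model generates the answer
    return f"{prompt} {answer}"     # training: target appended as plain text

p = format_example("Extract all medication concepts from the note.",
                   "Pt started on metformin 500 mg BID for T2DM.")
print(p.splitlines()[-1])  # -> "Output:"
```

Because every task shares this shape, switching from concept extraction to, say, abbreviation disambiguation changes only the instruction string and targets, not the model or decoding code.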

Multi-task instruction tuning further unifies clinical concept extraction (CCE) and clinical relation extraction (CRE) as generative tasks, employing instruction prompts sampled from a diverse set of datasets and tasks (Peng et al., 5 Sep 2025).

4. Empirical Evaluation and Performance

GatorTronGPT demonstrates state-of-the-art or near–state-of-the-art results across diverse clinical NLP benchmarks:

| Task | Metric | Baseline | GatorTronGPT-20B | Improvement |
| --- | --- | --- | --- | --- |
| Concept extraction (2018 n2c2) | F1 | 0.88 (BERT/GT) | 0.9060 | +2.6% |
| Concept extraction (2022 n2c2) | F1 | 0.8318 (BERT) | 0.8615 | +3.0% |
| Relation extraction (2018 n2c2) | F1 | 0.9545 (BERT) | 0.9614 | +0.7% |
| Relation extraction (2022 n2c2) | F1 | 0.7807 (BERT) | 0.8529 | +7.2% |
| Concept normalization | strict F1 | 0.757 (ezDI) | 0.791 | +3.4% |
| Abbreviation disambiguation | F1 | 0.948–0.960 | 0.9836 | +2.4–3.6% |
| NLI (MedNLI) | Accuracy | 0.805 | 0.8946 | +8.9% |

For doctor–patient dialogue summarization (MTS-DIALOG, test set), the GatorTronGPT-20B prompt-tuned model yields ROUGE-1 = 0.3628, ROUGE-2 = 0.1549, ROUGE-L = 0.3472, BLEU = 0.3665, BERTScore = 0.7309—surpassing both prompt-tuned 5B (BLEU = 0.3383, BERTScore = 0.6993) and fine-tuned T5-Large (BLEU = 0.3425, BERTScore = 0.6765). Likewise, for cross-institution and cross-disease SDoH extraction, GatorTronGPT-20B (prompt-tuning) outperforms encoder-only baselines by 8.9–21.8% F1 (Lyu et al., 2024, Peng et al., 2024, Peng et al., 2023).

In the BioNLP 2024 “Discharge Me!” shared task, a hybrid system using concept extraction plus GatorTronGPT generation achieved BLEU-4 = 0.1211, ROUGE-1 = 0.3958, ROUGE-2 = 0.1790, ROUGE-L = 0.2699, BERTScore = 0.3894, and 5th place (overall metric 0.284) (Lyu et al., 2024).

Zero- and few-shot instruction tuning further improves performance: instruction-tuned models average F1 ≈ 0.706–0.727 on concept extraction with as few as 5–20 labeled examples (Peng et al., 5 Sep 2025).

5. Computational Considerations and Efficiency

Prompt-tuning and LoRA-based PEFT restrict updates to a small set of parameters: soft-prompt embeddings (70M for the 5B model, 302M for the 20B model) or LoRA matrices (<1% of model size). No base LLM parameters are altered during adaptation. This leads to substantial reductions in wall-clock training time and GPU memory requirements. On MTS-DIALOG, prompt-tuning GatorTronGPT-20B requires 4 h 23 min (vs. 9 h 34 min for T5-Large full fine-tuning, both on 8 × A100-80GB GPUs), and inference latency is equivalent to standard GPT inference plus negligible prompt handling (Lyu et al., 2024, Peng et al., 2023). Training times for LoRA-based GatorTronGPT-20B (5–8 GPU-hours) are roughly one-sixth those for 9B full-tuned encoder models (Peng et al., 5 Sep 2025).
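The LoRA parameter fraction is easy to sanity-check: each adapted d × d matrix contributes 2·d·r trainable parameters against d² in the base, so the fraction simplifies to 2r/d regardless of layer count. A back-of-envelope sketch (which matrices receive adapters, and the rank chosen, are assumptions for illustration):

```python
def lora_fraction(d_model, rank, n_layers=1, matrices_per_layer=4):
    """Trainable LoRA parameters as a fraction of the adapted base weights,
    assuming square d_model x d_model projections. Illustrative only."""
    base = n_layers * matrices_per_layer * d_model * d_model
    lora = n_layers * matrices_per_layer * 2 * d_model * rank
    return lora / base                  # simplifies to 2 * rank / d_model

# With the 20B model's reported d_model = 6144 and a modest rank of 16:
print(lora_fraction(6144, 16))  # 32/6144 ≈ 0.0052, i.e. about 0.5%
```

At small ranks the fraction lands well under 1% of the adapted weights, consistent with the figure above; larger ranks trade more trainable parameters for capacity.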

Parameter-efficient adaptation also provides scalability for deploying institution- or task-specific prompt “heads” using a single core LLM.

6. Clinical Applications, Human Evaluation, and Implications

GatorTronGPT’s parameter-efficient generative paradigm supports numerous clinical use cases:

  • Automated extraction and normalization of clinical entities and relations for electronic health records (EHRs).
  • Summarization of lengthy, multi-turn doctor–patient dialogues for automated note-taking, discharge summary generation, and SOAP documentation (Lyu et al., 2024, Lyu et al., 2024).
  • Question answering and medical reasoning, as well as synthetic clinical data generation for privacy-preserving model training (Peng et al., 2023).

A clinical “Turing test” using UF Health notes found no significant difference in linguistic readability (6.57 GPT vs. 6.93 human, p = 0.22) or clinical relevance (7.00 vs. 6.97, p = 0.91), with clinicians unable to reliably distinguish AI- from human-generated notes (p < 0.001) (Peng et al., 2023).

Soft prompts encapsulate institutional policies, specialty-specific report formats, and can be swapped for rapid task/domain transfer. Unified, frozen LLM deployment simplifies system integration, reduces catastrophic forgetting, and supports few-shot adaptation to new specialties or note types (Peng et al., 2023, Peng et al., 5 Sep 2025).

7. Limitations, Ongoing Challenges, and Prospective Directions

Key challenges include hallucination control—occasional illogical or contextually inappropriate generations can occur, necessitating robust post-processing, constrained decoding, or reinforcement learning with human feedback (RLHF) (Peng et al., 2023, Peng et al., 2023). Certain complex tasks (e.g., nested/discontinuous span extraction, assessment–plan reasoning) remain difficult, with unified prompt-tuned models sometimes trailing specialized ensembling approaches (Peng et al., 2023). Domain shifts (e.g., NER model transfer across epochs or institutions) can degrade entity coverage and semantic fidelity (Lyu et al., 2024, Peng et al., 2024).

Future work envisions expanding entity schemas for richer NER, augmenting prompt-tuning with light adapter modules for increased flexibility, and generalizing prompt templates for unified multi-section or patient-centered documentation. Improvements in multi-task instruction tuning are expected to further close gaps in zero- and few-shot generalization (Lyu et al., 2024, Peng et al., 5 Sep 2025, Peng et al., 2023).

References

  • “Automatic Summarization of Doctor-Patient Encounter Dialogues Using LLM through Prompt Tuning” (Lyu et al., 2024)
  • “A Study of Generative LLM for Medical Research and Healthcare” (Peng et al., 2023)
  • “UF-HOBI at ‘Discharge Me!’: A Hybrid Solution for Discharge Summary Generation Through Prompt-based Tuning of GatorTronGPT Models” (Lyu et al., 2024)
  • “Improving Generalizability of Extracting Social Determinants of Health Using LLMs through Prompt-tuning” (Peng et al., 2024)
  • “A Study of LLMs for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning” (Peng et al., 5 Sep 2025)
  • “Generative LLMs Are All-purpose Text Analytics Engines: Text-to-text Learning Is All Your Need” (Peng et al., 2023)
