
ERNIE 3.0 Titan

Updated 26 December 2025
  • ERNIE 3.0 Titan is a large-scale, knowledge-enhanced Transformer model that integrates structured knowledge to improve both natural language understanding and generation.
  • It employs a shared-bottom, task-specific-top architecture using Transformer-XL and progressive learning strategies to achieve robust scalability and efficiency.
  • Innovations such as adversarial and controllable generation, online distillation, and green training promote faster convergence and reduced energy consumption.

ERNIE 3.0 Titan is a large-scale, knowledge-enhanced pre-trained LLM developed within the ERNIE 3.0 continual multi-paradigm framework. It advances the integration of structured knowledge into transformer-based LLMs, scaling from 10 billion to over 260 billion parameters. The model achieves strong results across a variety of Chinese natural language understanding (NLU) and generation (NLG) tasks, and incorporates novel architectural, algorithmic, and system-level innovations to improve knowledge incorporation, controllable generation, training efficiency, and scalability (Sun et al., 2021, Wang et al., 2021).

1. Model Architecture and Design

ERNIE 3.0 Titan employs a “shared-bottom, task-specific-top” architecture built on Transformer-XL. The Universal Representation Module (URM) encodes shared lexical and syntactic features through deep, recurrent self-attention. Two Task-specific Representation Modules (TRMs) operate on top of the URM to provide specialized representations for NLU and NLG tasks.

Architecture parameters:

Module                     Layers   Hidden Size   Heads   Seq Len   Mem Len   Parameters
Universal Representation   48       12,288        192     512       128       >260B total (260B config)
Task-specific NLU branch   12       768           12      512       —         —
Task-specific NLG branch   12       768           12      512       128       —

For the 10B-parameter scale (Sun et al., 2021), the URM uses a hidden size of 4096 with 64 heads; at 260B scale (Wang et al., 2021), the URM is expanded to a hidden size of 12,288 with 192 heads. TRMs remain at a manageable 12 layers to facilitate downstream fine-tuning. The NLU branch uses bi-directional attention, while the NLG branch employs uni-directional attention with recurrence memory. This modularization allows for disentangled optimization of NLU and NLG objectives in continual multi-paradigm pre-training.
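The two attention regimes can be illustrated by the masks each branch applies. A minimal pure-Python sketch (sizes are illustrative, not the model's actual dimensions):

```python
def nlu_mask(seq_len):
    # NLU branch: bi-directional attention -- every query position
    # may attend to every key position in the segment.
    return [[True] * seq_len for _ in range(seq_len)]

def nlg_mask(seq_len, mem_len):
    # NLG branch: uni-directional attention with Transformer-XL recurrence --
    # each position sees all cached memory slots plus itself and earlier tokens.
    return [[True] * mem_len + [k <= q for k in range(seq_len)]
            for q in range(seq_len)]

# Token 0 of a 4-token segment with a 2-slot recurrence memory:
print(nlg_mask(4, 2)[0])  # → [True, True, True, False, False, False]
```

The memory columns let the NLG branch condition on context beyond the current segment without recomputing it, which is what enables document-level language modeling.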

2. Knowledge-Enhanced Pre-training Objectives

ERNIE 3.0 Titan interleaves several pre-training tasks, unifying auto-encoding and auto-regressive objectives while introducing knowledge enrichment at scale. The pre-training loss is expressed as:

$L_\text{total}(\theta) = L_\text{MLM} + L_\text{DLM} + L_\text{SR} + L_\text{SD} + L_\text{UKTP}$

Where:

  • $L_\text{MLM}$: Masked Language Modeling over masked tokens, spans, and entities (auto-encoding).
  • $L_\text{DLM}$: Document Language Modeling (auto-regressive), leveraging Transformer-XL's recurrence memory.
  • $L_\text{SR}$: Sentence Reordering, predicting the correct order among the $m!$ permutations of $m$ shuffled segments.
  • $L_\text{SD}$: Sentence Distance, a 3-way classification over inter- and intra-document sentence relationships.
  • $L_\text{UKTP}$: Universal Knowledge–Text Prediction, combining relation classification from knowledge-graph triples with masked word prediction constrained by triple–text alignment.

Special entity markers ([HD], [/HD], [TL], [/TL]) ensure that the model explicitly learns mappings between knowledge graph triples and their textual mentions. This knowledge-infused formulation contrasts with plain-text-only paradigms such as GPT-3 and T5.
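A minimal sketch of how such markers might be inserted around a triple's textual mentions (the helper and its naive first-occurrence matching are illustrative, not ERNIE's actual preprocessing):

```python
def mark_triple(text, head, tail):
    # Wrap the head and tail entity mentions of a knowledge-graph triple
    # with ERNIE-style special markers so the model can align the triple
    # with its textual mention. Naively marks the first occurrence of each.
    text = text.replace(head, f"[HD]{head}[/HD]", 1)
    text = text.replace(tail, f"[TL]{tail}[/TL]", 1)
    return text

print(mark_triple("Paris is the capital of France.", "Paris", "France"))
# → [HD]Paris[/HD] is the capital of [TL]France[/TL].
```

During UKTP training, the model predicts the relation of the marked triple and fills masked words in the surrounding sentence, tying structured and unstructured signals together.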

3. Additional Training Strategies and Infrastructure

Progressive Learning

ERNIE 3.0 Titan employs progressive learning to accelerate convergence for very large-scale training. Critical hyperparameters—sequence length, batch size, learning rate, and dropout—are increased gradually during pre-training. In small/medium variants, this reduces convergence time by up to 65%, enabling practical training at 10B+ scale (Sun et al., 2021).
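The gradual ramp-up can be sketched as a linear warmup schedule applied per hyperparameter (the values and schedule shape here are illustrative, not the paper's exact settings):

```python
def progressive_value(step, warmup_steps, start, end):
    # Linearly interpolate a hyperparameter (e.g., sequence length or
    # batch size) from its starting value to its final value over the
    # warmup phase, then hold it constant for the rest of training.
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

# Example: ramp sequence length from 128 to 512 over 10k steps.
for step in (0, 5000, 10000):
    print(progressive_value(step, 10000, start=128, end=512))
# → 128.0, 320.0, 512
```

Starting with short sequences and small batches keeps early steps cheap; the full-cost configuration is only reached once the model is past its fastest-improving phase.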

Controllable and Credible Generation (260B Scale)

At 260B scale, two auxiliary losses are introduced (Wang et al., 2021):

  • Self-supervised Adversarial Loss ($\mathcal{L}_{\mathrm{adv}}$): The model learns to distinguish original from adversarially generated paragraphs, promoting generation credibility and penalizing low-quality samples.
  • Controllable Language Modeling Loss ($\mathcal{L}_{\mathrm{ctrl}}$): Attribute-conditioned, prompt-based training enables targeted control over output attributes (genre, topic, sentiment, length). Randomly omitting the soft prompts during training preserves the model's ability to generate without conditioning.
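The prompt-omission step can be sketched as follows (the function name, drop probability, and token representation are illustrative):

```python
import random

def build_input(tokens, soft_prompt, drop_prob=0.5, rng=random):
    # During attribute-conditioned pre-training, the soft prompt is
    # randomly dropped so the model also learns unconditioned generation,
    # rather than becoming dependent on the prompt always being present.
    if rng.random() < drop_prob:
        return list(tokens)
    return list(soft_prompt) + list(tokens)

print(build_input(["text"], ["<topic:sports>"], drop_prob=0.0))
# → ['<topic:sports>', 'text']
```

At inference time, supplying or withholding the attribute prompt then steers generation toward, or away from, the conditioned attribute.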

Large-scale Infrastructure

Training and inference leverage PaddlePaddle's 4D hybrid parallelism, combining data-, tensor-, pipeline-, and sharded-parallel strategies. Resource-aware partitioning and static shape conversion exploit both GPU (V100) and NPU (Ascend 910) hardware, achieving 91.7% weak scaling and reduced energy consumption by optimizing memory and compute resources (Wang et al., 2021).
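As a back-of-the-envelope illustration of 4D hybrid parallelism: the total device count is the product of the four parallel degrees, and the non-replicated weight shards shrink accordingly (the degrees below are illustrative, not the reported configuration):

```python
def devices_needed(dp, tp, pp, sharding):
    # Total accelerators = product of the data-, tensor-, pipeline-,
    # and sharded-parallel degrees.
    return dp * tp * pp * sharding

def params_per_device(total_params, tp, pp, sharding):
    # Rough estimate: weights are split across the tensor, pipeline, and
    # sharded dimensions; plain data parallelism replicates them.
    return total_params / (tp * pp * sharding)

print(devices_needed(dp=4, tp=8, pp=8, sharding=2))            # → 512
print(params_per_device(260e9, tp=8, pp=8, sharding=2) / 1e9)  # → 2.03125
```

Even under this rough model, a 260B-parameter network only fits because three of the four dimensions divide the weights; the data-parallel dimension buys throughput, not capacity.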

4. Online Distillation and Green Training

ERNIE 3.0 Titan introduces an online, multi-student knowledge distillation strategy to mitigate the computational and environmental costs of deploying multi-hundred-billion parameter models:

  • On-the-Fly Distillation (OFD): Teacher and student models are updated in tandem, eliminating separate teacher-only inference stages.
  • Teacher Assistant (TA): An intermediate-capacity model (e.g., 24 layers, ∼10B parameters) bridges the gap between Titan and smaller downstream student models.
  • Auxiliary Layer Distillation (ALD): Student models have an auxiliary layer during distillation to ensure full gradient flow, which is later removed during fine-tuning or deployment.
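A generic soft-label distillation objective, of the kind computed between a teacher (or TA) and a student, can be sketched in pure Python (the temperature and logits are illustrative; this is the standard KL formulation, not necessarily ERNIE's exact loss):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-softened softmax over a list of logits.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the softened teacher and student
    # distributions -- the usual soft-label distillation objective.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(distill_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # → 0.0
```

In the on-the-fly setting, this loss is computed against the teacher's current logits during the teacher's own training run, so no separate teacher inference pass is needed.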

This strategy reduces peak GPU/NPU usage time and cumulative carbon emissions. Empirical speedups are observed, with 12L-768H students reaching baseline latency and 6L-768H students running up to 2× faster than baseline (Wang et al., 2021).

5. Training Corpus and Preprocessing

Titan is trained on a ∼4 TB pre-training corpus comprising eleven categories, including Baidu Baike, Chinese Wikipedia, web text, user logs (Zhidao, Tieba, Baijiahao), news, novels, poetry, couplets, and domain-specific sources (medical, law, finance). An additional 0.7B-token Baidu knowledge graph with 50M fact triples provides structured world knowledge. For the adversarial and controllable objectives, specifically curated data pools with synthetic and attribute-annotated samples are utilized (Sun et al., 2021, Wang et al., 2021).

Data is subjected to:

  • Deduplication at character, paragraph, and document levels (using MD5 hashes of top-3 sentences).
  • Filtering (minimum sentence length, word segmentation).
  • Corpus balancing via upsampling for underrepresented domains.
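The document-level deduplication step can be sketched with `hashlib` (the period-based sentence splitter below is a naive stand-in for proper segmentation):

```python
import hashlib

def doc_key(text):
    # Deduplication key: MD5 over the document's first three sentences,
    # mirroring the document-level strategy described above.
    top3 = "".join(text.split(".")[:3])
    return hashlib.md5(top3.encode("utf-8")).hexdigest()

def dedupe(docs):
    # Keep the first document seen for each key; drop later near-duplicates.
    seen, unique = set(), []
    for doc in docs:
        key = doc_key(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Hashing only the leading sentences makes the pass cheap at terabyte scale while still catching documents that are copies with differing tails (boilerplate, comments, timestamps).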

6. Empirical Results and Benchmarking

Performance Highlights:

  • SuperGLUE (English): overall score 90.6, +0.8 over the human baseline (Sun et al., 2021).
  • Chinese news classification (TNEWS): 68.40% zero-shot accuracy vs. 60.26% for PanGu-α-13B.
  • Semantic similarity (AFQMC): 68.99% zero-shot vs. 65.76%.
  • Fine-tuning (Chinese): state of the art across 54 tasks; e.g., 82.75% on NLI (OCNLI), outperforming prior models and the human baseline.
  • Few-shot: surpasses mT5-XXL, Yuan 1.0, and ERNIE 3.0 on FewCLUE and multi-task settings.
  • Zero-shot (Chinese): 22.84% on CBQA (CKBQA-sub) and 86.21% on Cloze (CHID), superior to GPT-3 and CPM-1.

Human evaluations across 467 zero-shot cases assign Titan scores of 1.69/1.53/1.09 (coherence/fluency/accuracy out of 2), exceeding both GPT-3 and state-of-the-art Chinese baselines by 0.3–0.6 (Wang et al., 2021).

7. Innovations and Future Research

ERNIE 3.0 Titan demonstrates the benefit of scaling knowledge-infused dense models for both NLU and NLG, integrating:

  • Structured and unstructured data through knowledge-augmented pre-training objectives.
  • Disentangled representations for dual-mode (auto-encoding/autoregressive) training.
  • Adversarial and controllable generation objectives for output quality and user control.
  • Environmental sustainability through efficient online distillation.

Future research directions target continual pre-training over richer structures (e.g., multi-modal content, code, or tables), advanced sparsity and routing for further efficiency scaling, enhanced controllable and factual generation, and fine-tuning distilled student models for edge and specialized tasks (Wang et al., 2021). These priorities position ERNIE 3.0 Titan as a paradigm for the next generation of foundation models with integrated knowledge and scalable engineering.
