TinyFinBERT: Compact Financial Sentiment Model

Updated 12 February 2026
  • TinyFinBERT is a compact transformer designed for financial sentiment analysis, distilled from FinBERT using a dual-stage process with GPT-driven synthetic data.
  • It employs a two-tiered knowledge distillation strategy that first fine-tunes FinBERT with augmented data and then transfers layer-wise knowledge to a smaller student model.
  • The model achieves nearly 99% of FinBERT’s performance on key financial datasets while offering significant gains in speed and efficiency for edge deployments.

TinyFinBERT is a compact transformer model specialized for financial sentiment analysis, created by distilling knowledge from an augmented FinBERT (a BERT variant fine-tuned for finance) using a structured, two-tiered process involving LLM-generated synthetic data. The approach leverages GPT-4 Omni and GPT-3.5 Turbo for targeted data augmentation and guides layered distillation to produce a model that matches or closely approaches the performance of much larger teacher networks while maintaining a significantly reduced parameter count and computational footprint (Thomas, 2024).

1. Model Hierarchy and Architectural Specifications

Three primary models define the TinyFinBERT workflow:

Model                    | Transformer Layers | Hidden Size | Attention Heads | Parameters (M)
FinBERT (Teacher)        | 12                 | 768         | 12              | 110
TinyBERT General-4L-312D | 4                  | 312         | 12              | 14.5
TinyFinBERT (Distilled)  | 4                  | 312         | 12              | 14.5

FinBERT, based on BERT-base, contains 12 encoder layers and serves as the teacher model. TinyBERT General-4L-312D, with 4 encoder layers and a hidden size of 312, serves as the student architecture. TinyFinBERT, the final distilled model, adopts this architecture, making it approximately 7.6 times smaller than FinBERT (14.5 million vs. 110 million parameters).
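The parameter counts above can be sanity-checked with a back-of-the-envelope estimate. The sketch below is an illustration, not the paper's code: it counts only the embedding table and the main encoder weight matrices (ignoring biases, LayerNorm, and the pooler), and assumes the standard BERT vocabulary of 30,522 tokens and TinyBERT General-4L-312D's feed-forward size of 1,200.

```python
def approx_bert_params(layers, hidden, ffn=None, vocab=30522, max_pos=512):
    """Rough BERT-style parameter count: embeddings plus, per encoder layer,
    four attention projections and two feed-forward matrices.
    Biases, LayerNorm, and the pooler are ignored."""
    ffn = 4 * hidden if ffn is None else ffn
    embeddings = (vocab + max_pos + 2) * hidden       # token + position + segment
    per_layer = 4 * hidden * hidden + 2 * hidden * ffn
    return embeddings + layers * per_layer

teacher = approx_bert_params(12, 768)             # ~108.8M, close to FinBERT's 110M
student = approx_bert_params(4, 312, ffn=1200)    # ~14.2M, close to TinyFinBERT's 14.5M
ratio = teacher / student                         # ~7.6x, matching the reported ratio
```

The small gaps to the published 110M and 14.5M figures come from the omitted bias and LayerNorm terms.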

2. GPT-Augmented Data Generation for Domain Adaptation

Domain-specific data scarcity in finance is addressed through a dual-source LLM augmentation framework:

  • Synthetic Labeled Data with GPT-4 Omni ("gpt-4o-2024-05-13"):
    • 693 boundary-case sentences generated, targeting FinBERT’s “positive ↔ neutral” misclassifications; after filtering by five deterministic labeling passes (temperature = 0) and unanimous agreement, 410 were retained.
    • 382 mislabeled sentences identified in FinBERT’s training set underwent label verification (five GPT-4o passes), of which 124 were retained. Up to 10 paraphrases were then generated per retained sentence, yielding 1,219 paraphrases, of which 1,001 survived relabeling.
    • In total, 1,411 labeled synthetic examples (410+1,001) were used for FinBERT fine-tuning.
  • Unlabeled Data with GPT-3.5 Turbo ("gpt-3.5-turbo-0125"):
    • 3,494 correctly labeled examples produced 17,458 variations.
    • 382 mislabeled sentences generated 13,148 variations.
    • An additional 3,028 GPT-4o-generated examples, left unused in the labeled stage, were included.
    • The aggregate for distillation amounted to 33,634 unlabeled examples.

This LLM-driven data augmentation approach specifically targets ambiguous cases and paraphrase diversity, yielding an enriched training signal for subsequent distillation (Thomas, 2024).
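The unanimity filter described above can be sketched in a few lines. This is a hypothetical illustration: `label_fn` stands in for one deterministic (temperature = 0) GPT-4o labeling pass, and the paper's actual prompts and API calls are not reproduced here.

```python
def consensus_filter(sentences, label_fn, passes=5):
    """Keep a sentence only if all `passes` labeling runs agree unanimously."""
    kept = []
    for sentence in sentences:
        labels = {label_fn(sentence) for _ in range(passes)}
        if len(labels) == 1:               # unanimous across all passes
            kept.append((sentence, next(iter(labels))))
    return kept
```

Applied to the 693 boundary-case sentences, a filter of this shape would retain the 410 on which all five passes agreed.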

3. Two-Tiered Knowledge Distillation Strategy

Distillation proceeds in two sequential tiers:

Tier 1: Enhanced FinBERT Fine-Tuning

FinBERT is fine-tuned on a combined corpus of original and GPT-4o-labelled synthetic data. The process uses:

  • Standard cross-entropy loss on hard labels.
  • Optimization via AdamW with a slanted triangular learning-rate schedule (warm-up = 20%), discriminative fine-tuning (layer-wise learning-rate scaling, discrimination rate = 0.95), gradual unfreezing (one layer group per epoch), dropout of 0.1, a maximum sequence length of 64, and batch size 64 over 6 epochs.
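As a rough illustration, the two learning-rate techniques named above can be expressed as follows. This is a sketch under the stated hyperparameters, not the authors' training code; `lr_max` and the step counts are placeholders.

```python
def slanted_triangular_lr(step, total_steps, lr_max, warmup_frac=0.20):
    """Linear warm-up over the first warmup_frac of steps, then linear decay."""
    cut = max(1, int(total_steps * warmup_frac))
    if step <= cut:
        return lr_max * step / cut
    return lr_max * (total_steps - step) / (total_steps - cut)

def discriminative_lrs(lr_top, n_layers=12, decay=0.95):
    """Each layer below the top trains at decay x the rate of the layer above it."""
    return [lr_top * decay ** (n_layers - 1 - i) for i in range(n_layers)]
```

With a discrimination rate of 0.95 over 12 layers, the bottom layer trains at roughly 0.95^11 ≈ 57% of the top layer's rate, which keeps well-trained lower representations relatively stable.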

Tier 2: TinyFinBERT Distillation from Augmented FinBERT

  • Layer Mapping: Student layer $m$ learns from teacher layer $g(m) = 3m$ (e.g., student layer 1 ← teacher layer 3).
  • Distillation Losses (temperature $\tau = 1$):
    • Prediction layer (soft targets): $\mathcal{L}_{\text{pred}} = \operatorname{CE}\big(\sigma(z_s/\tau), \sigma(z_t/\tau)\big)$
    • Attention heads: $\mathcal{L}_{\text{attn}} = \sum_{i=1}^{h} \operatorname{MSE}(A^i_s, A^i_t)$
    • Hidden states: $\mathcal{L}_{\text{hid}} = \operatorname{MSE}(H_s W_h, H_t)$
    • Embedding: $\mathcal{L}_{\text{emb}} = \operatorname{MSE}(E_s W_e, E_t)$
    • Unified per-layer loss: $\mathcal{L}_{\text{layer},m} = \alpha_m \mathcal{L}_{\text{attn}} + \beta_m \mathcal{L}_{\text{hid}} + \gamma_m \mathcal{L}_{\text{emb}} + \delta_m \mathcal{L}_{\text{pred}}$, where $\delta_m$ is nonzero only at the prediction layer ($m = M+1$)
  • Overall KD loss: $\mathcal{L}_{\text{KD}} = \sum_{m=0}^{M+1} \mathcal{L}_{\text{layer},m}$
  • Distillation Procedure: Phase 1 (intermediate-layer distillation) for 20 epochs, then Phase 2 (prediction-layer distillation) for 3 epochs (AdamW, batch size 32, learning rate $5\times10^{-5}$, warm-up = 10%, max sequence length 64).
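A toy, dependency-free sketch of the loss components follows. Plain Python lists stand in for the real attention and hidden-state tensors, and the learned projection $W_h$ is omitted; this illustrates the form of the losses and the layer mapping, not the paper's implementation.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_ce(student_logits, teacher_logits, tau=1.0):
    """Prediction-layer loss: cross-entropy of the student's softmax
    against the teacher's softened distribution."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    return -sum(t * math.log(s) for t, s in zip(p_t, p_s))

def teacher_layer(m):
    """Layer mapping g(m) = 3m: student layers 1..4 mimic teacher layers 3, 6, 9, 12."""
    return 3 * m
```

The cross-entropy is minimized (down to the teacher distribution's entropy) when the student's logits reproduce the teacher's, which is exactly the behavior the prediction-layer phase trains for.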

4. Training Data and Evaluation Suite

  • Datasets:
    • Financial PhraseBank: 4,846 sentences (≥ 50% annotator agreement), split 80%/20% for train/test.
    • FiQA 2018 Task 1: 1,113 headlines/tweets, continuous sentiment mapped to three-class categories.
    • Forex News Annotated: 2,291 currency-pair headlines, manually labeled by class.

The fine-tuned and distilled models are rigorously evaluated on both in-domain (PhraseBank) and out-of-domain (FiQA 2018, Forex) datasets.
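For FiQA 2018, the continuous sentiment scores must be discretized to match the three-class setup. A minimal sketch is shown below; the ±0.1 neutral band is an illustrative assumption, as the paper's exact cutoffs are not stated here.

```python
def to_three_class(score, eps=0.1):
    """Map a continuous sentiment score in [-1, 1] to a discrete label.
    The +/-eps neutral band is a hypothetical threshold, not the paper's."""
    if score > eps:
        return "positive"
    if score < -eps:
        return "negative"
    return "neutral"
```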

5. Performance Metrics and Efficiency Analysis

Quantitative Results

Dataset              | Model             | Accuracy | F1     | Precision | Recall
PhraseBank (Test)    | FinBERT           | 0.8423   | 0.8439 | 0.8545    | 0.8423
                     | Augmented FinBERT | 0.8742   | 0.8739 | 0.8743    | 0.8742
                     | TinyFinBERT       | 0.8330   | 0.8330 | 0.8333    | 0.8330
FiQA 2018 Task 1     | FinBERT           | 0.5265   | 0.5563 | 0.6642    | 0.5265
                     | Augmented FinBERT | 0.6217   | 0.6385 | 0.6709    | 0.6217
                     | TinyFinBERT       | 0.5660   | 0.5944 | 0.6560    | 0.5660
Forex News Annotated | FinBERT           | 0.4801   | 0.4449 | 0.4988    | 0.4801
                     | Augmented FinBERT | 0.4950   | 0.4797 | 0.5081    | 0.4950
                     | TinyFinBERT       | 0.4775   | 0.4572 | 0.4923    | 0.4775

Key findings: TinyFinBERT retains approximately 98.9% of FinBERT’s accuracy on PhraseBank while being 7.6× smaller and roughly 3× more efficient in inference speed and memory usage. The two-stage pipeline (GPT-4/3.5 data augmentation followed by distillation) also yields notable generalization gains, particularly on the out-of-domain FiQA and Forex benchmarks (Thomas, 2024).
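The headline retention and compression figures follow directly from the numbers reported above:

```python
finbert_acc = 0.8423      # FinBERT accuracy on the PhraseBank test split
tiny_acc = 0.8330         # TinyFinBERT accuracy on the same split

retention = tiny_acc / finbert_acc    # ~0.989, i.e. ~98.9% of teacher accuracy
size_ratio = 110 / 14.5               # ~7.6x parameter reduction
```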

6. Implications and Practical Applications

The integration of LLM-generated synthetic data directly addresses data scarcity and edge-case ambiguity in financial sentiment analysis. Augmented FinBERT exhibits improved accuracy over its unaugmented predecessor, while the TinyFinBERT workflow demonstrates that compactness and efficiency need not come at a substantial cost in accuracy. The combination of layer-wise and logit distillation produces a compact model suitable for deployment in computationally constrained environments, such as mobile devices, real-time trading platforms, and other edge settings. The results support the use of LLM-driven augmentation and student distillation schemes to produce robust, domain-specialized models.

7. Key Insights and Prospects for Extension

  • GPT-4/3.5-facilitated augmentation substantially enhances both teacher and student models by focusing training on challenging and ambiguous cases.
  • Two-stage distillation, involving intermediate and prediction-layer matching, offers a pathway to students that retain ≈ 99% of teacher accuracy for financial sentiment tasks.
  • The methodology balances the size–accuracy trade-off and supports practical model deployment where compute and memory are limited.
  • The data-generative and distillation pipeline demonstrated here is extensible to other domain-specific NLP tasks, contingent upon the availability of a sufficiently specialized teacher and targeted LLM data augmentation (Thomas, 2024).

TinyFinBERT exemplifies how progressive data augmentation via state-of-the-art LLMs and structured distillation enables the realization of high-performing yet compact NLP models for specialized financial analytics.
