
Sentiment Classifier: IMDB & BERT

Updated 4 February 2026
  • The paper demonstrates that fine-tuned BERT significantly improves sentiment prediction accuracy on IMDB compared to lexicon-based, logistic regression, and LSTM models.
  • Preprocessing includes HTML stripping, lemmatization, and tokenization using WordPiece, ensuring robust text normalization for effective classification.
  • Evaluation metrics reveal an F1 score of 0.923 and accuracy improvements ranging from 2.9% to 29.5%, establishing BERT as the state-of-the-art method.

A sentiment classifier with IMDB and BERT refers to a supervised machine learning pipeline in which the Bidirectional Encoder Representations from Transformers (BERT) model is fine-tuned to predict sentiment polarity (positive or negative) on IMDB movie reviews. This classifier leverages the large, manually-labeled IMDB dataset to train, validate, and evaluate highly contextual transformer-based representations. The use of BERT markedly improves upon previous sentiment classification techniques, as measured by accuracy, precision, recall, and F1 score, and has been established as the superior approach when compared to lexicon-based, linear, and recurrent neural network models (Alaparthi et al., 2020).

1. Dataset Description and Preprocessing

The IMDB sentiment classification benchmark consists of 50,000 reviews—25,000 labeled positive and 25,000 labeled negative, with extreme polarities determined by rating threshold (≤4 negative, ≥7 positive). For BERT-based experiments, a typical split is 35% train (17,500), 15% development (7,500), 50% test (25,000) (Alaparthi et al., 2020).
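The 35/15/50 split described above can be sketched in a few lines; this is an illustrative index-shuffling helper, not code from the cited paper:

```python
import random

def split_imdb(n_reviews=50_000, seed=0):
    """Shuffle review indices and split them 35% train / 15% dev / 50% test,
    matching the proportions reported by Alaparthi et al. (2020)."""
    idx = list(range(n_reviews))
    random.Random(seed).shuffle(idx)
    n_train = int(0.35 * n_reviews)  # 17,500
    n_dev = int(0.15 * n_reviews)    # 7,500
    return idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]

train, dev, test = split_imdb()
print(len(train), len(dev), len(test))  # → 17500 7500 25000
```

In practice the split would be stratified so that each partition keeps the 50/50 positive/negative balance.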

Preprocessing for BERT involves the following normalization pipeline:

  1. Strip HTML tags.
  2. Remove accented characters.
  3. Expand contractions.
  4. Remove special characters, URLs, and user mentions.
  5. Segment sentences by punctuation, drop punctuation.
  6. Lemmatize tokens to root forms.
  7. Lowercase all tokens.
  8. Remove rare words (appearing in <1% of documents).
  9. Remove stop-words.
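The regex-amenable steps of the pipeline above can be sketched with the standard library alone; lemmatization, rare-word filtering, and stop-word removal (steps 6, 8, and 9) would typically use a library such as NLTK or spaCy and are omitted here:

```python
import re
import unicodedata

def normalize_review(text: str) -> str:
    """Minimal sketch of normalization steps 1-5 and 7."""
    text = re.sub(r"<[^>]+>", " ", text)                  # 1. strip HTML tags
    text = unicodedata.normalize("NFKD", text)            # 2. decompose accents...
    text = text.encode("ascii", "ignore").decode("ascii") # ...and drop the marks
    text = re.sub(r"\bcan't\b", "cannot", text)           # 3. expand contractions (one example)
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"https?://\S+|@\w+", " ", text)        # 4. drop URLs and user mentions
    text = re.sub(r"[^a-zA-Z\s]", " ", text)              # 5. drop punctuation/special chars
    return re.sub(r"\s+", " ", text).lower().strip()      # 7. lowercase, collapse whitespace

print(normalize_review("I <b>can't</b> wait!! See https://t.co/x @user"))
# → "i cannot wait see"
```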

For BERT, tokenization uses the standard WordPiece algorithm with a vocabulary of ~30,000 sub-word tokens. Each document is converted into a sequence of input IDs, segment IDs, and positional encodings, padded or truncated to a maximum length of 512 tokens (Alaparthi et al., 2020; Gosai et al., 12 Jan 2026).
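At its core, WordPiece segments each word greedily, always taking the longest vocabulary match first. The sketch below illustrates that matching loop with a toy vocabulary standing in for BERT's ~30k entries:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first sub-word segmentation (WordPiece core).
    Continuation pieces carry the '##' prefix, as in BERT's vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the window until a vocabulary entry matches
        if piece is None:
            return [unk]  # no sub-word covers this span
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}
print(wordpiece_tokenize("playing", vocab))       # → ['play', '##ing']
print(wordpiece_tokenize("unbelievable", vocab))  # → ['un', '##believ', '##able']
```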

2. Model Architecture and Fine-Tuning

BERT-Base consists of 12 Transformer encoder layers, each with hidden dimensionality H = 768 and 12 self-attention heads, totaling approximately 110M parameters (Alaparthi et al., 2020; Gosai et al., 12 Jan 2026). The input embedding for sequence position i is:

E_i = W_\text{token}[t_i] + W_\text{segment}[s_i] + W_\text{pos}[i]

where t_i is the token ID, s_i is the segment ID (unused for single-sentence tasks), and W_\text{pos}[i] is the positional embedding.

Fine-tuning is performed using the AdamW optimizer with weight decay. The learning rate is warmed up over the first 10% of steps and then linearly decayed, with peak values of 2 \times 10^{-5} to 5 \times 10^{-5}. Batch sizes of 16–32 and 3–5 epochs are standard, with early stopping on the validation set (Alaparthi et al., 2020; Gosai et al., 12 Jan 2026). The classification head is a single dense layer mapping the [CLS] token’s hidden state to logits for binary cross-entropy loss:

L = -\left[ y \log p + (1-y) \log (1-p) \right]

with y \in \{0, 1\} and p the predicted probability (Alaparthi et al., 2020).
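The loss above can be checked numerically; the sketch below computes it for a single example:

```python
import math

def bce_loss(y: int, p: float) -> float:
    """Binary cross-entropy for one example, per the equation above."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident, correct positive prediction incurs a small loss:
print(round(bce_loss(1, 0.95), 4))  # → 0.0513
# A confident, wrong prediction is heavily penalised:
print(round(bce_loss(1, 0.05), 4))  # → 2.9957
```

In a real fine-tuning run this per-example loss is averaged over the batch and back-propagated through the dense head and all encoder layers.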

3. Baselines and Comparative Experiments

The main baselines include:

  • Unsupervised lexicon-based (SentiWordNet): look up each token’s polarity scores and aggregate them; a non-learned, thresholded decision.
  • Logistic Regression: bag-of-words or TF-IDF vectorization with an L_2-regularized logit, producing sigmoid class probabilities.
  • LSTM: Embedding layer, bidirectional LSTM, dense output; trained by Adam with batch size ≈64 and 10–20 epochs on CPU (Alaparthi et al., 2020).
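The lexicon baseline reduces to summing per-token polarity scores and thresholding; the toy `polarity` dictionary below is a stand-in for real SentiWordNet lookups:

```python
def lexicon_sentiment(tokens, polarity, threshold=0.0):
    """Sketch of the unsupervised lexicon baseline: sum each token's
    polarity score and threshold the total. Unknown tokens score 0."""
    score = sum(polarity.get(t, 0.0) for t in tokens)
    return "positive" if score > threshold else "negative"

# Illustrative polarity scores, not actual SentiWordNet values:
polarity = {"great": 0.8, "boring": -0.7, "not": -0.3, "masterpiece": 0.9}
print(lexicon_sentiment("a great little masterpiece".split(), polarity))  # → positive
print(lexicon_sentiment("boring and not worth it".split(), polarity))     # → negative
```

The table below makes clear why this non-learned approach trails the trained models: it cannot use context, so negation and sarcasm routinely flip its predictions.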

Performance across validation/test sets is tabulated:

Model           Accuracy   Precision   Recall   F1 score
SentiWordNet    0.6308     0.6747      0.6308   0.6064
Logistic Reg.   0.8941     0.8975      0.8941   0.8941
LSTM            0.8675     0.8680      0.8675   0.8675
BERT            0.9231     0.9235      0.9231   0.9231

BERT outperforms all baselines, with relative accuracy improvements of 2.9% to 29.5% over these established models (Alaparthi et al., 2020). The superior performance is attributed to BERT’s contextual, bidirectional encoding, which captures long-range dependencies and disambiguation unachievable with lexicon-based or sequential RNN methods.

4. Evaluation Metrics and Results

Classification is evaluated by accuracy, precision, recall, and F1 score:

  • \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
  • \text{Precision} = \frac{TP}{TP + FP}
  • \text{Recall} = \frac{TP}{TP + FN}
  • \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
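These four formulas translate directly into code; the confusion-matrix counts below are hypothetical, chosen so the result lands near the BERT row of the table above:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts,
    per the formulas above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts on a 1,000-review sample:
acc, prec, rec, f1 = classification_metrics(tp=460, tn=463, fp=37, fn=40)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# accuracy ≈ 0.923, close to the BERT row above
```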

BERT achieves an F1 of 0.923 on IMDB, with accuracy and F1 exceeding those of prior methods (Alaparthi et al., 2020). In larger surveys, BERT-Base classifiers routinely yield ≈94.5% accuracy, and BERT-Large up to 95.8% (Gosai et al., 12 Jan 2026). Ensemble and hybrid transformer models have reported IMDB accuracy of 95.3% (Albladi et al., 14 Apr 2025).

5. Implementation Guidelines and Hardware Considerations

Key hyperparameter settings and runtime considerations are as follows:

  • Hardware: BERT models require at least one contemporary GPU (e.g., NVIDIA V100/T4), in contrast to baselines which can run on CPUs.
  • Training time: Approximately 100 minutes for 5 epochs, batch size 16 on 17,500 reviews.
  • Parameter recommendations: Learning rate 2–5 ×10⁻⁵, warm-up for 10% of steps, batch size 16–32, 3–5 epochs, early stopping based on development set, max sequence length 256 or 512 depending on resource and context requirements, weight decay 0.01, dropout 0.1 applied on attention and feedforward layers (Alaparthi et al., 2020; Gosai et al., 12 Jan 2026).
  • Development practices: Fixed random seeds for reproducibility, mixed-precision training for speed, gradient accumulation for memory-bound scenarios.
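The warm-up-then-linear-decay schedule from the recommendations above can be sketched as a pure function of the step count (the step values below are illustrative):

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warm-up over the first 10% of steps, then linear decay to 0,
    as in the fine-tuning recipe above."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # ramp up to the peak
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total = 1000
print(lr_at_step(50, total))    # halfway through warm-up → 1e-05
print(lr_at_step(100, total))   # warm-up complete, at the peak → 2e-05
print(lr_at_step(1000, total))  # end of training → 0.0
```

Frameworks such as PyTorch expose equivalent ready-made schedulers, but the closed form makes the shape of the schedule easy to verify.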

6. Practical Recommendations and Limitations

BERT’s improvements are realized with minimal engineering beyond standard fine-tuning recipes. Its contextual encoding is particularly effective at handling review-specific challenges, including sarcasm, negation, and complex compositionality. For edge cases (e.g., long documents exceeding 512 tokens, domain shifts, resource constraints), the literature recommends exploring chunking approaches, hierarchical models, or continual/robust training paradigms (Gosai et al., 12 Jan 2026).
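For the long-document case, one common chunking approach splits the token sequence into overlapping windows, classifies each window, and aggregates the logits. The sketch below shows only the windowing step; the 512/256 window and stride values are illustrative:

```python
def chunk_token_ids(token_ids, max_len=512, stride=256):
    """Sliding-window chunking for reviews exceeding BERT's 512-token limit.
    Overlapping windows let each chunk be classified separately, after which
    the per-chunk logits can be averaged or max-pooled."""
    if len(token_ids) <= max_len:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # last window already reaches the end of the document
    return chunks

ids = list(range(1200))  # a hypothetical 1,200-token review
print([len(c) for c in chunk_token_ids(ids)])  # → [512, 512, 512, 432]
```

In a full pipeline, [CLS] and [SEP] tokens would be re-inserted per chunk, reducing the effective window to 510 content tokens.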

Nonetheless, certain limitations persist:

  • LSTM baselines may underperform logistic regression on single-document IMDB reviews, suggesting relatively limited sequential advantage in this task (Alaparthi et al., 2020).
  • Without proper validation splits, reported accuracy may reflect overfitting or data leakage rather than genuine generalization (Yadav et al., 2023).
  • Sentiment signals expressed via subtle figurative language or uncommon constructs may still elude token-level modeling.

7. Outlook and Future Directions

The empirical evidence establishes pre-trained, fine-tuned BERT as the leading approach for IMDB sentiment analysis. Current research priorities include hybrid architectures, data augmentation techniques, and robustness-focused frameworks; as these mature, performance on standard benchmarks continues to improve while maintaining generalization under real-world conditions (Alaparthi et al., 2020; Gosai et al., 12 Jan 2026).
