PromptBERT: Enhanced Sentence Embeddings
- PromptBERT is a prompt-based framework that re-casts sentence encoding as a masked language modeling task to overcome vanilla BERT limitations.
- It addresses BERT’s static-token embedding bias and ineffective deep transformer layers by optimizing prompt templates via manual search and continuous tuning.
- By applying contrastive training with template bias subtraction, PromptBERT achieves significant performance gains in both unsupervised and supervised settings.
PromptBERT is a prompt-based framework for learning improved sentence embeddings from pretrained BERT and RoBERTa models by re-casting sentence encoding as a masked language modeling (MLM) task. PromptBERT addresses major limitations of vanilla BERT sentence representations, introduces dedicated prompt search strategies, and employs a contrastive training objective with template denoising, yielding substantial gains over competitive baselines in both unsupervised and supervised settings (Jiang et al., 2022).
1. Limitations of Vanilla BERT Sentence Embeddings
PromptBERT isolates two primary shortcomings in vanilla BERT for sentence-level semantics:
- Static-token embedding bias: BERT’s token embedding matrix is tied to its MLM prediction head, resulting in high-frequency tokens (such as “the”, “.”), subword units (e.g., “##ing”), uppercase tokens, and punctuation disproportionately dominating the [MASK] predictions. Empirically, removing the 36 most frequent tokens, all subword-prefixed tokens, uppercase tokens, or punctuation before averaging static embeddings lifts Semantic Textual Similarity (STS) performance above both GloVe and BERT post-processing schemes (BERT-flow, whitening).
- Ineffective deep transformer layers: Contrary to expectations, averaging final-layer hidden states in BERT yields poorer semantic similarity measures compared to averaging static embeddings. This suggests that deeper layers degrade, rather than enrich, semantic similarity encoding despite producing more isotropic representations (Jiang et al., 2022).
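The static-embedding analysis above can be illustrated with a toy filter over synthetic embeddings. The vocabulary, frequency set, and vectors below are illustrative stand-ins, not BERT's actual token table:

```python
import numpy as np

# Toy static embedding table; real values would come from BERT's embedding matrix.
rng = np.random.default_rng(0)
vocab = ["the", ".", "##ing", "Cat", "sits", "mat"]
emb = {tok: rng.normal(size=4) for tok in vocab}

FREQUENT = {"the"}  # stand-in for the 36 most frequent tokens

def is_biased(tok):
    # Token classes the paper identifies as biasing averaged embeddings:
    # high-frequency tokens, subwords, uppercase tokens, punctuation.
    return (tok in FREQUENT or tok.startswith("##")
            or tok[0].isupper() or not tok.isalnum())

def avg_static_embedding(tokens, filter_bias=True):
    kept = [t for t in tokens if not (filter_bias and is_biased(t))]
    kept = kept or tokens  # fall back if everything was filtered out
    return np.mean([emb[t] for t in kept], axis=0)

sentence = ["the", "Cat", "sits", ".", "##ing"]
v = avg_static_embedding(sentence)  # only "sits" survives the filter here
```

The sketch shows the mechanism, not the measurement: the paper's finding is that this kind of filtering before averaging lifts STS performance above GloVe and BERT post-processing baselines.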
2. Prompt-Based Sentence Encoding Schemes
PromptBERT encodes a sentence $x$ via a fill-in-the-blank prompt template $T$, where a $[X]$ slot holds the sentence and $[\text{MASK}]$ is the prediction target, generating a prompt string $T(x)$. The model processes $T(x)$, extracting the final-layer hidden vector $\mathbf{h}_{[\text{MASK}]}$ at the $[\text{MASK}]$ position. Two representation strategies are examined:
- Single vector (default): $\mathbf{v} = \mathbf{h}_{[\text{MASK}]}$, leveraging the full transformer stack without static-embedding averaging.
- Weighted top-$k$ tokens: computes a soft average over the $k$ most probable tokens under the MLM softmax, weighting each token's static embedding $\mathbf{W}_t$ by its probability $P(t \mid \mathbf{h}_{[\text{MASK}]})$. Formally,

$$\mathbf{v} = \frac{\sum_{t \in \mathcal{V}_k} P(t \mid \mathbf{h}_{[\text{MASK}]})\,\mathbf{W}_t}{\sum_{t \in \mathcal{V}_k} P(t \mid \mathbf{h}_{[\text{MASK}]})},$$

where $\mathcal{V}_k$ is the set of top-$k$ tokens. This strategy is empirically inferior because it re-introduces static-token bias and is harder to fine-tune; the single-vector representation is preferred (Jiang et al., 2022).
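The default single-vector strategy can be sketched end to end. In this minimal sketch, `toy_encode` is a hypothetical stand-in for a real MLM forward pass (BERT or RoBERTa in the paper), and the 8-dimensional vectors are illustrative only:

```python
import numpy as np

TEMPLATE = 'This sentence : "[X]" means [MASK] .'

def fill_template(sentence):
    # Slot the sentence into the [X] position of the prompt template.
    return TEMPLATE.replace("[X]", sentence)

def toy_encode(tokens):
    # Hypothetical encoder: one deterministic 8-d vector per token position.
    # A real implementation would return final-layer MLM hidden states.
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=(len(tokens), 8))

def prompt_embedding(sentence):
    tokens = fill_template(sentence).split()
    hidden = toy_encode(tokens)          # final-layer hidden states
    mask_pos = tokens.index("[MASK]")
    return hidden[mask_pos]              # single-vector strategy: v = h_[MASK]

v = prompt_embedding("A cat sits on the mat")
```

The key point the sketch illustrates is that the sentence representation is read off at the [MASK] position only, so the full transformer stack contributes without any static-embedding averaging.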
3. Prompt Template Discovery and Optimization
The quality of the prompt template is critical. PromptBERT investigates three template search paradigms:
- Manual (greedy) search: Templates are decomposed into prefix tokens $P$ and relationship tokens $R$ (e.g., $P$ = “This sentence: ”, $R$ = “means”). A greedy procedure first selects the $R$ maximizing Spearman correlation on held-out STS development data, then does likewise for $P$ given the optimal $R$:
```
r_candidates = ["is", "means", "means what", ...]
best_r = argmax_r Spearman(encode_with("[X] r [MASK]."))

p_candidates = ["This [X]", "This sentence of [X]", "This sentence: '[X]'", ...]
best_p = argmax_p Spearman(encode_with("p best_r [MASK]."))

template = best_p + " " + best_r + " [MASK]."
```
The optimal template, ‘This sentence : “[X]” means [MASK].’, improves STS-B dev Spearman from ~39 to ~73.
- Automatic T5-based generation: Templates are generated from definition pairs (DefSent style), e.g., “Also called [MASK]. [X]”, with ~500 candidates evaluated. Best performance reaches ~64 Spearman, underperforming manual search.
- Continuous prompt tuning (OptiPrompt): A sequence of virtual tokens, initialized with an effective manual template, is optimized under the unsupervised contrastive objective. The base model is frozen. Performance increases from ~73 (manual) to ~81 (continuous) Spearman on STS-B dev (Jiang et al., 2022).
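The continuous-tuning idea above — frozen encoder, trainable virtual-token vectors — can be sketched in numpy. The surrogate objective and finite-difference gradients below are illustrative stand-ins for the real contrastive loss and backpropagation; none of the names or sizes come from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.normal(size=(8, 8))   # stand-in for frozen BERT parameters

def surrogate_loss(prompt_emb):
    # Hypothetical objective; the paper optimizes an unsupervised
    # contrastive loss over sentence pairs instead.
    return float(np.sum((W_frozen @ prompt_emb.mean(axis=0)) ** 2))

# Three virtual tokens; the paper initializes them from a manual template.
prompt = rng.normal(size=(3, 8))
init_loss = surrogate_loss(prompt)

lr, eps = 1e-2, 1e-5
for _ in range(50):
    grad = np.zeros_like(prompt)     # finite-difference gradient estimate
    base = surrogate_loss(prompt)
    for i in range(prompt.shape[0]):
        for j in range(prompt.shape[1]):
            p = prompt.copy()
            p[i, j] += eps
            grad[i, j] = (surrogate_loss(p) - base) / eps
    prompt -= lr * grad              # only the prompt vectors are updated
```

The design choice to notice is that `W_frozen` never changes: all learning capacity sits in the small prompt matrix, which is what keeps continuous prompt tuning cheap relative to full fine-tuning.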
4. Contrastive Training with Template Denoising
Motivation and Approach
PromptBERT seeks more semantically robust positive pairs than those obtained by augmenting a single template with dropout. The strategy is to encode the same sentence $x_i$ with two distinct templates $T_a$ and $T_b$, yielding embeddings $\mathbf{h}_i$ and $\mathbf{h}_i'$, respectively. Each template imparts a “bias vector” that can be measured by applying it to an empty input slot.
Template Bias Estimation and Denoised Embedding
For a template $T$, the template bias $\hat{\mathbf{h}}$ is estimated by evaluating BERT on $T$ with the $[X]$ slot left empty (i.e., using only the template tokens). The denoised sentence encoding is

$$\tilde{\mathbf{h}}_i = \mathbf{h}_i - \hat{\mathbf{h}},$$

where $\mathbf{h}_i$ is the $[\text{MASK}]$ hidden state of $T$ filled with sentence $x_i$, and $\hat{\mathbf{h}}$ is the template bias from the empty template.
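The subtraction step can be sketched as follows; `toy_mask_embedding` is a hypothetical stand-in for extracting the [MASK] hidden state from a real MLM:

```python
import numpy as np

TEMPLATE = 'This sentence : "[X]" means [MASK] .'

def toy_mask_embedding(text):
    # Hypothetical encoder: a deterministic vector per distinct input string.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=8)

def denoised_embedding(sentence, template=TEMPLATE):
    h = toy_mask_embedding(template.replace("[X]", sentence))
    h_bias = toy_mask_embedding(template.replace("[X]", ""))  # template-only bias
    return h - h_bias  # subtract the template's contribution
```

By construction, an empty sentence denoises to the zero vector, which is exactly the property the subtraction is meant to enforce: whatever the bare template contributes is removed from every encoding.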
Contrastive Loss
For a batch of $N$ sentences encoded with templates $T_a$ and $T_b$ as $\tilde{\mathbf{h}}_i$ and $\tilde{\mathbf{h}}_i'$, the loss for each $x_i$ is

$$\ell_i = -\log \frac{e^{\cos(\tilde{\mathbf{h}}_i,\, \tilde{\mathbf{h}}_i')/\tau}}{\sum_{j=1}^{N} e^{\cos(\tilde{\mathbf{h}}_i,\, \tilde{\mathbf{h}}_j')/\tau}},$$

where $\cos(\cdot,\cdot)$ denotes cosine similarity and $\tau$ is a temperature hyperparameter.
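A numpy sketch of this in-batch loss follows. The temperature value 0.05 is an illustrative assumption (a common choice in SimCSE-style training), not a figure taken from the source:

```python
import numpy as np

def cosine_matrix(A, B):
    # Row-wise cosine similarity between two batches of vectors.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def contrastive_loss(H1, H2, tau=0.05):  # tau = 0.05 is an assumed default
    # H1[i] and H2[i] are the two template views of sentence i; the other
    # rows of H2 act as in-batch negatives.
    sims = cosine_matrix(H1, H2) / tau
    log_z = np.log(np.exp(sims).sum(axis=1))      # row-wise partition function
    return float(np.mean(log_z - np.diag(sims)))  # -log softmax on diagonal
```

When the two views of each sentence align and different sentences are dissimilar, the diagonal dominates each row's softmax and the loss approaches zero.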
Impact on Supervised-Unsupervised Gap
This denoising procedure, by pairing different templates and subtracting template-specific artifacts, brings unsupervised PromptBERT’s STS average performance within ∼3 points of its supervised counterpart, compared to a ∼7-point gap for baselines (Jiang et al., 2022).
5. Empirical Performance and Training Details
Datasets and Baselines
PromptBERT is evaluated on seven STS tasks (STS-12 to STS-16, STS-B, SICK-R) via SentEval, with additional transfer tasks (MR, CR, SUBJ, MPQA, SST-2, TREC, MRPC). Baselines span non-fine-tuned (GloVe-average, BERT-flow, BERT-whitening), unsupervised fine-tuned (IS-BERT, SimCSE, ConSERT), and supervised fine-tuned (InferSent, USE, SBERT, SimCSE).
Training Hyperparameters
- Unsupervised: batch size 256, learning rate 1e-5, 1 epoch.
- Supervised: batch size 512, learning rate 5e-5, 3 epochs.
- Templates use manual prompts chosen as described in Section 3.
Key Quantitative Results
| Setting | BERT-base (avg. STS Spearman) | RoBERTa-base (avg. STS Spearman) |
|---|---|---|
| SimCSE, unsupervised | 76.25 | 76.57 |
| PromptBERT/PromptRoBERTa, unsup | 78.54 (+2.29) | 79.15 (+2.58) |
| SimCSE, supervised | 87.16 | — |
| PromptBERT, supervised | 87.60 (+0.44) | — |
PromptBERT consistently yields 1–2 point lifts in downstream transfer accuracy (e.g., MR +2.8) (Jiang et al., 2022).
6. Ablations and Analysis
- Training objective ablation (10 random seeds): Same-template dropout (SimCSE style) yields 78.16±0.17 (BERT) and 78.16±0.44 (RoBERTa), different-template without denoising gives 78.19±0.29 (BERT) and 78.17±0.44 (RoBERTa), while PromptBERT with denoising achieves 78.54±0.15 (BERT) and 79.15±0.25 (RoBERTa).
- Stability: SimCSE-BERT base over 10 runs has a mean of 75.42 and a max–min gap of 3.14; PromptBERT is more stable (mean 78.54, gap 0.53).
- Template choice: Manual greedy templates are strongest among discrete prompts; continuous tuning (OptiPrompt) can further enhance performance (to ~80+ Spearman); automatic T5-based templates underperform manual search.
- Prompt denoising: Removes spurious top-5 MLM predictions and enriches semantic content. For weighted-average embedding (see Section 2), denoising lifts STS-B average from 56.19 to 60.39, confirming the utility of denoising for representations (Jiang et al., 2022).
7. Broader Implications and Innovation
PromptBERT’s principal innovation is recasting sentence encoding as a prompt-based MLM problem, mitigating static-token bias and leveraging all transformer layers. The use of different templates plus template bias subtraction in the contrastive objective enables robust, label-free sentence representations. Empirical results demonstrate narrowing of the unsupervised-supervised performance gap and improved tunability and stability relative to existing contrastive methods. A plausible implication is that prompt-orchestrated objectives could generalize to other transfer and adaptation settings for large pretrained LLMs (Jiang et al., 2022).