PromptBERT: Enhanced Sentence Embeddings
- PromptBERT is a prompt-based framework that re-casts sentence encoding as a masked language modeling task to overcome vanilla BERT limitations.
- It addresses BERT’s static-token embedding bias and ineffective deep transformer layers by optimizing prompt templates via manual search and continuous tuning.
- By applying contrastive training with template bias subtraction, PromptBERT achieves significant performance gains in both unsupervised and supervised settings.
PromptBERT is a prompt-based framework for learning improved sentence embeddings from pretrained BERT and RoBERTa models by re-casting sentence encoding as a masked language modeling (MLM) task. PromptBERT addresses major limitations of vanilla BERT sentence representations, introduces dedicated prompt search strategies, and employs a contrastive training objective with template denoising, yielding substantial gains over competitive baselines in both unsupervised and supervised settings (Jiang et al., 2022).
1. Limitations of Vanilla BERT Sentence Embeddings
PromptBERT isolates two primary shortcomings in vanilla BERT for sentence-level semantics:
- Static-token embedding bias: BERT’s token embedding matrix is tied to its MLM prediction head, resulting in high-frequency tokens (such as “the”, “.”), subword units (e.g., “##ing”), uppercase tokens, and punctuation disproportionately dominating the [MASK] predictions. Empirically, removing the 36 most frequent tokens, all subword-prefixed tokens, uppercase tokens, or punctuation before averaging static embeddings lifts Semantic Textual Similarity (STS) performance above both GloVe and BERT post-processing schemes (BERT-flow, whitening).
- Ineffective deep transformer layers: Contrary to expectations, averaging final-layer hidden states in BERT yields poorer semantic similarity measures compared to averaging static embeddings. This suggests that deeper layers degrade, rather than enrich, semantic similarity encoding despite producing more isotropic representations (Jiang et al., 2022).
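The static-embedding analysis above can be illustrated with a toy filter over synthetic embeddings. The vocabulary, frequency set, and vectors below are illustrative stand-ins, not BERT's actual token table:

```python
import numpy as np

# Toy static embedding table; real values would come from BERT's embedding matrix.
rng = np.random.default_rng(0)
vocab = ["the", ".", "##ing", "Cat", "sits", "mat"]
emb = {tok: rng.normal(size=4) for tok in vocab}

FREQUENT = {"the"}  # stand-in for the 36 most frequent tokens

def is_biased(tok):
    # Token classes the paper identifies as biasing averaged embeddings:
    # high-frequency tokens, subwords, uppercase tokens, punctuation.
    return (tok in FREQUENT or tok.startswith("##")
            or tok[0].isupper() or not tok.isalnum())

def avg_static_embedding(tokens, filter_bias=True):
    kept = [t for t in tokens if not (filter_bias and is_biased(t))]
    kept = kept or tokens  # fall back if everything was filtered out
    return np.mean([emb[t] for t in kept], axis=0)

sentence = ["the", "Cat", "sits", ".", "##ing"]
v = avg_static_embedding(sentence)  # only "sits" survives the filter here
```

The sketch shows the mechanism, not the measurement: the paper's finding is that this kind of filtering before averaging lifts STS performance above GloVe and BERT post-processing baselines.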
2. Prompt-Based Sentence Encoding Schemes
PromptBERT encodes a sentence $x$ via a fill-in-the-blank prompt template $T$, where a $[X]$ slot holds the sentence and $[\text{MASK}]$ is the prediction target, generating a prompt string $T(x)$. The model processes $T(x)$, extracting the final-layer hidden vector $\mathbf{h}_{[\text{MASK}]}$ at the $[\text{MASK}]$ position. Two representation strategies are examined:
- Single vector (default): $\mathbf{v} = \mathbf{h}_{[\text{MASK}]}$, leveraging the full transformer stack without static-embedding averaging.
- Weighted top-$k$ tokens: computes a soft average over the $k$ most probable tokens under the MLM softmax, weighting each token's static embedding $\mathbf{W}_t$ by its probability $P(t \mid \mathbf{h}_{[\text{MASK}]})$. Formally,

$$\mathbf{v} = \frac{\sum_{t \in \mathcal{V}_k} P(t \mid \mathbf{h}_{[\text{MASK}]})\,\mathbf{W}_t}{\sum_{t \in \mathcal{V}_k} P(t \mid \mathbf{h}_{[\text{MASK}]})},$$

where $\mathcal{V}_k$ is the set of top-$k$ tokens. This strategy is empirically inferior because it re-introduces static-token bias and is harder to fine-tune; the single-vector representation is preferred (Jiang et al., 2022).
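The default single-vector strategy can be sketched end to end. In this minimal sketch, `toy_encode` is a hypothetical stand-in for a real MLM forward pass (BERT or RoBERTa in the paper), and the 8-dimensional vectors are illustrative only:

```python
import numpy as np

TEMPLATE = 'This sentence : "[X]" means [MASK] .'

def fill_template(sentence):
    # Slot the sentence into the [X] position of the prompt template.
    return TEMPLATE.replace("[X]", sentence)

def toy_encode(tokens):
    # Hypothetical encoder: one deterministic 8-d vector per token position.
    # A real implementation would return final-layer MLM hidden states.
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=(len(tokens), 8))

def prompt_embedding(sentence):
    tokens = fill_template(sentence).split()
    hidden = toy_encode(tokens)          # final-layer hidden states
    mask_pos = tokens.index("[MASK]")
    return hidden[mask_pos]              # single-vector strategy: v = h_[MASK]

v = prompt_embedding("A cat sits on the mat")
```

The key point the sketch illustrates is that the sentence representation is read off at the [MASK] position only, so the full transformer stack contributes without any static-embedding averaging.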
3. Prompt Template Discovery and Optimization
The quality of the prompt template is critical. PromptBERT investigates three template search paradigms:
- Manual (greedy) search: Templates are decomposed into prefix tokens $P$ and relationship tokens $R$ (e.g., $P$ = “This sentence: ”, $R$ = “means”). A greedy procedure first selects the $R$ maximizing Spearman correlation on held-out STS development data, then does likewise for $P$ given the optimal $R$:
```
r_candidates = ["is", "means", "means what", ...]
best_r = argmax_r Spearman(encode_with("[X] r [MASK]."))

p_candidates = ["This [X]", "This sentence of [X]", "This sentence: '[X]'", ...]
best_p = argmax_p Spearman(encode_with("p best_r [MASK]."))

template = best_p + " " + best_r + " [MASK]."
```
The optimal template, ‘This sentence : “[X]” means [MASK].’, improves STS-B dev Spearman from ~39 to ~73.
- Automatic T5-based generation: Templates are generated from definition pairs (DefSent style), e.g., “Also called [MASK]. [X]”, with ~500 candidates evaluated. Best performance reaches ~64 Spearman, underperforming manual search.
- Continuous prompt tuning (OptiPrompt): A sequence of virtual tokens, initialized with an effective manual template, is optimized under the unsupervised contrastive objective. The base model is frozen. Performance increases from ~73 (manual) to ~81 (continuous) Spearman on STS-B dev (Jiang et al., 2022).
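The continuous-tuning idea above — frozen encoder, trainable virtual-token vectors — can be sketched in numpy. The surrogate objective and finite-difference gradients below are illustrative stand-ins for the real contrastive loss and backpropagation; none of the names or sizes come from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.normal(size=(8, 8))   # stand-in for frozen BERT parameters

def surrogate_loss(prompt_emb):
    # Hypothetical objective; the paper optimizes an unsupervised
    # contrastive loss over sentence pairs instead.
    return float(np.sum((W_frozen @ prompt_emb.mean(axis=0)) ** 2))

# Three virtual tokens; the paper initializes them from a manual template.
prompt = rng.normal(size=(3, 8))
init_loss = surrogate_loss(prompt)

lr, eps = 1e-2, 1e-5
for _ in range(50):
    grad = np.zeros_like(prompt)     # finite-difference gradient estimate
    base = surrogate_loss(prompt)
    for i in range(prompt.shape[0]):
        for j in range(prompt.shape[1]):
            p = prompt.copy()
            p[i, j] += eps
            grad[i, j] = (surrogate_loss(p) - base) / eps
    prompt -= lr * grad              # only the prompt vectors are updated
```

The design choice to notice is that `W_frozen` never changes: all learning capacity sits in the small prompt matrix, which is what keeps continuous prompt tuning cheap relative to full fine-tuning.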
4. Contrastive Training with Template Denoising
Motivation and Approach
PromptBERT seeks more semantically robust positive pairs than those obtained by augmenting a single template with dropout. The strategy is to encode the same sentence $x_i$ with two distinct templates $T_a$ and $T_b$, yielding embeddings $\mathbf{h}_i$ and $\mathbf{h}_i'$, respectively. Each template imparts a “bias vector” that can be measured by applying it to an empty input slot.
Template Bias Estimation and Denoised Embedding
For a template $T$, the template bias $\hat{\mathbf{h}}$ is estimated by evaluating BERT on $T$ with the $[X]$ slot left empty (i.e., using only the template tokens). The denoised sentence encoding is

$$\tilde{\mathbf{h}}_i = \mathbf{h}_i - \hat{\mathbf{h}},$$

where $\mathbf{h}_i$ is the $[\text{MASK}]$ hidden state of $T$ filled with sentence $x_i$, and $\hat{\mathbf{h}}$ is the template bias from the empty template.
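The subtraction step can be sketched as follows; `toy_mask_embedding` is a hypothetical stand-in for extracting the [MASK] hidden state from a real MLM:

```python
import numpy as np

TEMPLATE = 'This sentence : "[X]" means [MASK] .'

def toy_mask_embedding(text):
    # Hypothetical encoder: a deterministic vector per distinct input string.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=8)

def denoised_embedding(sentence, template=TEMPLATE):
    h = toy_mask_embedding(template.replace("[X]", sentence))
    h_bias = toy_mask_embedding(template.replace("[X]", ""))  # template-only bias
    return h - h_bias  # subtract the template's contribution
```

By construction, an empty sentence denoises to the zero vector, which is exactly the property the subtraction is meant to enforce: whatever the bare template contributes is removed from every encoding.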
Contrastive Loss
For a batch of $N$ sentences encoded with templates $T_a$ and $T_b$ as $\tilde{\mathbf{h}}_i$ and $\tilde{\mathbf{h}}_i'$, the loss for each $x_i$ is

$$\ell_i = -\log \frac{e^{\cos(\tilde{\mathbf{h}}_i,\, \tilde{\mathbf{h}}_i')/\tau}}{\sum_{j=1}^{N} e^{\cos(\tilde{\mathbf{h}}_i,\, \tilde{\mathbf{h}}_j')/\tau}},$$

where $\cos(\cdot,\cdot)$ denotes cosine similarity and $\tau$ is a temperature hyperparameter.
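A numpy sketch of this in-batch loss follows. The temperature value 0.05 is an illustrative assumption (a common choice in SimCSE-style training), not a figure taken from the source:

```python
import numpy as np

def cosine_matrix(A, B):
    # Row-wise cosine similarity between two batches of vectors.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def contrastive_loss(H1, H2, tau=0.05):  # tau = 0.05 is an assumed default
    # H1[i] and H2[i] are the two template views of sentence i; the other
    # rows of H2 act as in-batch negatives.
    sims = cosine_matrix(H1, H2) / tau
    log_z = np.log(np.exp(sims).sum(axis=1))      # row-wise partition function
    return float(np.mean(log_z - np.diag(sims)))  # -log softmax on diagonal
```

When the two views of each sentence align and different sentences are dissimilar, the diagonal dominates each row's softmax and the loss approaches zero.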
Impact on Supervised-Unsupervised Gap
This denoising procedure, by pairing different templates and subtracting template-specific artifacts, brings unsupervised PromptBERT’s STS average performance within ∼3 points of its supervised counterpart, compared to a ∼7-point gap for baselines (Jiang et al., 2022).
5. Empirical Performance and Training Details
Datasets and Baselines
PromptBERT is evaluated on seven STS tasks (STS-12 to STS-16, STS-B, SICK-R) via SentEval, with additional transfer tasks (MR, CR, SUBJ, MPQA, SST-2, TREC, MRPC). Baselines span non-fine-tuned (GloVe-average, BERT-flow, BERT-whitening), unsupervised fine-tuned (IS-BERT, SimCSE, ConSERT), and supervised fine-tuned (InferSent, USE, SBERT, SimCSE).
Training Hyperparameters
- Unsupervised: batch size 256, learning rate 1e-5, 1 epoch.
- Supervised: batch size 512, learning rate 5e-5, 3 epochs.
- Templates use manual prompts chosen as described in Section 3.
Key Quantitative Results
| Setting | BERT-base (avg. STS Spearman) | RoBERTa-base (avg. STS Spearman) |
|---|---|---|
| SimCSE, unsupervised | 76.25 | 76.57 |
| PromptBERT/PromptRoBERTa, unsup | 78.54 (+2.29) | 79.15 (+2.58) |
| SimCSE, supervised | 87.16 | — |
| PromptBERT, supervised | 87.60 (+0.44) | — |
PromptBERT consistently yields 1–2 point lifts in downstream transfer accuracy (e.g., MR +2.8) (Jiang et al., 2022).
6. Ablations and Analysis
- Training objective ablation (10 random seeds): Same-template dropout (SimCSE style) yields 78.16±0.17 (BERT) and 78.16±0.44 (RoBERTa), different-template without denoising gives 78.19±0.29 (BERT) and 78.17±0.44 (RoBERTa), while PromptBERT with denoising achieves 78.54±0.15 (BERT) and 79.15±0.25 (RoBERTa).
- Stability: SimCSE-BERT base over 10 runs has a mean of 75.42 and a max–min gap of 3.14; PromptBERT is more stable (mean 78.54, gap 0.53).
- Template choice: Manual greedy templates are strongest among discrete prompts; continuous tuning (OptiPrompt) can further enhance performance (to ~80+ Spearman); automatic T5-based templates underperform manual search.
- Prompt denoising: Removes spurious top-5 MLM predictions and enriches semantic content. For weighted-average embedding (see Section 2), denoising lifts STS-B average from 56.19 to 60.39, confirming the utility of denoising for representations (Jiang et al., 2022).
7. Broader Implications and Innovation
PromptBERT’s principal innovation is recasting sentence encoding as a prompt-based MLM problem, mitigating static-token bias and leveraging all transformer layers. The use of different templates plus template bias subtraction in the contrastive objective enables robust, label-free sentence representations. Empirical results demonstrate narrowing of the unsupervised-supervised performance gap and improved tunability and stability relative to existing contrastive methods. A plausible implication is that prompt-orchestrated objectives could generalize to other transfer and adaptation settings for large pretrained LLMs (Jiang et al., 2022).