Zero-Shot NLP Classification with BART-MNLI
- Zero-shot NLP classification is a paradigm that uses NLI to reframe text categorization as an entailment task via templated class hypotheses.
- BART-large-MNLI computes entailment probabilities between input texts and hypotheses, providing robust cross-domain transfer.
- Self-training iteratively refines pseudo-labels, significantly boosting accuracy on diverse benchmarks and reducing prompt sensitivity.
Zero-shot NLP classification refers to the paradigm in which a model predicts categories for examples in the absence of any labeled training data for those target classes. Instead, only class names or short textual descriptions are provided. Among off-the-shelf approaches, models fine-tuned for natural language inference (NLI) such as BART-large-MNLI have become canonical zero-shot classifiers due to their robust cross-domain transfer ability. In these frameworks, classification is reframed as an entailment task between the input text and a templated hypothesis constructed from the class label. Extensions such as self-training further refine these models by iteratively adapting to unlabeled target data. The following sections detail foundational concepts, algorithms, empirical properties, and research developments for zero-shot text classification centered on the BART-MNLI family.
1. Principle of NLI-Based Zero-Shot Classification with BART-MNLI
The BART-large-MNLI model is pretrained on natural language understanding tasks and then fine-tuned on NLI datasets, notably the Multi-Genre Natural Language Inference (MNLI) corpus. In the zero-shot classification workflow, each candidate class is expressed as a natural language hypothesis using a template such as "This example is [c]." Given an input (the premise), BART-MNLI outputs probabilities for three NLI labels: Entailment, Contradiction, and Neutral.
For a premise-hypothesis pair (x, h_c), the entailment score s(x, c) is defined as the softmax-normalized probability that the model assigns to the Entailment label, s(x, c) = P(Entailment | x, h_c). Classification proceeds by selecting the class with the highest s(x, c). This approach enables zero-shot mapping from arbitrary text to arbitrary label sets, provided those labels can be described as natural language statements (Gera et al., 2022).
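The scoring rule above can be sketched with synthetic numbers. The logits below are illustrative values, not real model outputs; the label order [contradiction, neutral, entailment] matches the convention used by facebook/bart-large-mnli, and the function names are our own:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(logits_per_class, labels):
    # Entailment probability (index 2) for each candidate class hypothesis;
    # the predicted class is the one whose hypothesis is most entailed.
    scores = {c: softmax(np.array(l))[2] for c, l in zip(labels, logits_per_class)}
    return max(scores, key=scores.get), scores

labels = ["sports", "politics"]
logits = [[-2.0, 0.5, 3.1],   # NLI logits for "This example is sports."
          [1.8, 0.2, -1.5]]   # NLI logits for "This example is politics."
pred, scores = classify(logits, labels)
print(pred)  # "sports": its entailment logit dominates
```

Note that each hypothesis is scored independently; the per-class entailment probabilities need not sum to one across classes.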
2. Iterative Self-Training for Zero-Shot Models
While BART-MNLI offers strong zero-shot transfer, unfamiliarity with the target distribution can cause unstable predictions and sub-optimal accuracy. To bridge this gap, self-training applies a refinement loop to the zero-shot model as follows:
- The model predicts entailment scores S[u, c] for all pairs (u, c) over the unlabeled target corpus U and class set C.
- For each class c, the n examples with the largest "best-vs-second-best" margins δ[u] = S[u, c] − S[u, c₂] (where c is the top-scoring class for u and c₂ the runner-up) are treated as pseudo-positive examples for c.
- Negative examples are generated by pairing those examples with randomly chosen other classes and labeling the resulting pairs as Contradiction.
- Self-training iterations repeat this process, each time updating the model parameters by fine-tuning on the newly accrued pseudo-labels until performance saturates, typically after two iterations (Gera et al., 2022).
Pseudocode for one self-training cycle:
```
for t in 1..T:
    # Score every (example, class) pair with the current model
    for each (u, c):
        S[u, c] = model.score_entailment(u, "This example is [c]")
    # Select confident pseudo-positives per class
    for each class c:
        U_c = {u : c = argmax_{c'} S[u, c']}
        select top n examples u in U_c by margin δ[u] = S[u, c] - S[u, c_2]
        Pos.add((u, c, "Entail"))
    # Pair pseudo-positives with random other classes as contradictions
    Neg = ContrastRandom(Pos)
    D = Pos + Neg
    model = FineTune(model, D)
```
3. Selection Criteria and Fine-Tuning Objective
To improve label reliability when generating pseudo-labels for self-training:
- Only those examples with a large best-vs-second-best margin are considered, reducing label noise.
- The training loss is the cross-entropy between the model's predicted label distribution (Entailment vs. Contradiction) and the pseudo-labels: L(θ) = −Σ_{(x, h, y) ∈ D} log P_θ(y | x, h), where D is the set of pseudo-labeled premise-hypothesis pairs. The Neutral label is either ignored or grouped with Contradiction for a binary objective.
Token masking (masking the token in the premise most similar to the class name by GloVe cosine similarity) is optionally applied to discourage trivial lexical overlap, further enhancing robustness (Gera et al., 2022).
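The margin-based selection criterion above can be sketched on a synthetic score matrix S[u, c] (rows are examples, columns are classes). This is a minimal sketch with our own function name and made-up scores, not the authors' implementation:

```python
import numpy as np

def select_pseudo_positives(S, n_per_class):
    """Return {class: [example indices]} of the top-margin examples per class."""
    top = S.argmax(axis=1)                                # predicted class per example
    sorted_scores = np.sort(S, axis=1)
    margin = sorted_scores[:, -1] - sorted_scores[:, -2]  # best minus second best
    selected = {}
    for c in range(S.shape[1]):
        idx = np.where(top == c)[0]                       # examples predicted as c
        ranked = idx[np.argsort(-margin[idx])]            # most confident first
        selected[c] = ranked[:n_per_class].tolist()
    return selected

S = np.array([[0.9, 0.1, 0.2],
              [0.6, 0.55, 0.1],   # small margin: likely noisy, gets filtered out
              [0.2, 0.8, 0.3],
              [0.1, 0.2, 0.95]])
print(select_pseudo_positives(S, n_per_class=1))  # {0: [0], 1: [2], 2: [3]}
```

The second row is predicted as class 0 but with a margin of only 0.05, so with a budget of one example per class it loses to the first row; this is exactly the label-noise filtering the margin criterion is meant to provide.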
4. Empirical Performance and Hyperparameter Choices
Evaluation on diverse benchmarks including 20 Newsgroups, AG’s News, DBPedia, Yahoo! Answers, GoEmotions, ISEAR, Amazon Reviews, and IMDB establishes that self-training consistently improves zero-shot accuracy. For BART-MNLI:
- Average accuracy rises from 61.9% (iteration 0, zero-shot) to 71.7% after two self-training cycles.
- Largest gains are observed on tasks with many classes (e.g., DBPedia: 74.7% → 94.1%, +19.4 points) and topical diversity (e.g., 20 Newsgroups: +18.7 points).
- Gains plateau after the second self-training iteration.
- Main hyperparameters: an unlabeled pool of up to 10,000 examples, a fixed per-class budget of pseudo-positives, the AdamW optimizer, and batch size 32 (Gera et al., 2022).
5. Applications and Case Studies
BART-MNLI’s zero-shot capacity supports multiple practical use cases:
- In financial NLP, BART-MNLI is used to classify market news and social media content into "bullish" or "bearish" for cryptocurrency forecasting. Each text is paired with hypothesis candidates such as "This example is bullish for Bitcoin," and corresponding entailment probabilities are used to derive real-valued sentiment indicators. These features significantly improve profit and ROC metrics in downstream trading models, and combining BART-MNLI output with specialized sentiment models gives the best performance (Gurgul et al., 2023).
- Beyond finance, this approach generalizes to any task formulated as a textual entailment statement: e.g., review helpfulness, topic detection, or emotion classification.
A representative code fragment for bullish/bearish classification:
```python
from transformers import pipeline

zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def bullish_score(text, coin="Bitcoin"):
    labels = ["bullish", "bearish"]
    # The literal {} is filled with each candidate label by the pipeline.
    ht = f"This example is {{}} for {coin}."
    out = zero_shot(text, labels, hypothesis_template=ht)
    # Scores are sorted by rank, so look up "bullish" explicitly rather
    # than assuming it is the top label.
    return out["scores"][out["labels"].index("bullish")]
```
6. Limitations and Extensions
NLI-based zero-shot classification with BART-MNLI has several limitations:
- Prompt engineering: Performance hinges on careful selection of natural language templates for hypotheses. Minor template variations can impact accuracy ("prompt brittleness").
- Stability: Without adaptation, the zero-shot classifier may yield unstable results on out-of-domain target data.
- Performance ceiling: In some cases, self-training or prompt-robustness methods are required for optimal results.
Recent research has proposed models such as Placeholding Parallel Prediction (P³), which mitigates prompt sensitivity by aggregating token probabilities across future positions via placeholder tokens. This method yields higher accuracy and near-zero prompt sensitivity compared to standard next-token or entailment-based zero-shot approaches, without further fine-tuning or prompt engineering (Qian et al., 4 Apr 2025). Self-training, as described above, remains a general-purpose, label-efficient tool for model adaptation.
7. Related Methods and Comparative Analysis
Self-training represents a plug-and-play enhancement for NLI zero-shot classifiers, requiring no manual annotation or trial-and-error. Negative sampling strategies (Contrast-random, Contrast-all) and token masking mechanisms have been empirically evaluated, with ablation studies confirming the importance of model-based confidence and informativeness of examples. Cross-task experiments suggest that domain similarity governs transfer efficacy, and that adaptation on highly divergent tasks may degrade zero-shot generalization.
Comparison with alternative zero-shot paradigms (e.g., next-token classification, generative self-consistency) indicates that NLI-based and P³-style models each offer distinct advantages in robustness, computational efficiency, and domain adaptation (Gera et al., 2022, Qian et al., 4 Apr 2025).
In summary, zero-shot NLP classification with BART-MNLI exploits NLI reformulations and model confidence to deliver high-fidelity zero-resource classification across domains, with self-training and prompt-robust methods providing further improvements in performance and reliability.