
Entailment Loss: Neural & Symbolic Methods

Updated 4 February 2026
  • Entailment loss is a learning objective that penalizes outputs lacking logical or semantic entailment, uniting evaluation in both natural language and ILP tasks.
  • In sequence generation, entailment-adjusted rewards such as CIDEnt use pretrained NLI models to penalize candidates that are not entailed by the reference targets.
  • In ILP, replacing binary entailment with example-dependent loss enables best-first search, allowing synthesis of larger and more accurate programs.

Entailment loss refers to any learning paradigm or objective in which loss is tightly coupled to the notion of logical or semantic entailment. Common in both natural language generation and inductive logic programming (ILP), entailment loss can appear as either (i) a reward modification constraining sequence models to produce outputs logically entailed by their targets, or (ii) a mechanism in symbolic rule-learning where entailment acts as a filter or feedback signal. Recent work further generalizes this notion to continuous, graded loss functions assessing degree of entailment. This approach addresses key limitations of strict, binary entailment and supports scalable learning in both neural and symbolic domains.

1. Entailment Loss in Sequence Generation: Motivation and Formalization

Sequence-to-sequence models, such as those found in video and image captioning, are historically trained with token-level cross-entropy but evaluated with corpus-level metrics (e.g., BLEU, METEOR, CIDEr) based on n-gram overlap. Such metrics do not account for logical correctness: generated outputs may receive high reward despite containing semantic errors or contradictions. The CIDEnt reward addresses this by penalizing candidate sequences that, despite textual similarity, are not logically entailed by a reference target. In the CIDEnt approach, a pretrained natural language inference (NLI) model is used to compute an entailment probability for each candidate-reference pair. The overall reward is then the standard metric minus a penalty if the candidate is not entailed by any reference (Pasunuru et al., 2017).

Mathematically, for sampled output $w^{(s)}$ and references $\{r_1, \dots, r_K\}$:

  • Compute the CIDEr score $\mathrm{CIDEr}(w^{(s)})$
  • For each $j$, compute the entailment probability $P(\text{entailment} \mid r_j, w^{(s)})$
  • Define $\mathrm{Ent}(w^{(s)}) = \max_{j} P(\text{entailment} \mid r_j, w^{(s)})$
  • Final reward:

$$\mathrm{CIDEnt}(w^{(s)}) = \begin{cases} \mathrm{CIDEr}(w^{(s)}) - \lambda, & \text{if } \mathrm{Ent}(w^{(s)}) < \beta \\ \mathrm{CIDEr}(w^{(s)}), & \text{otherwise} \end{cases}$$

with $\lambda$ (penalty) and $\beta$ (entailment threshold) as hyperparameters.
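The thresholded reward above can be sketched in a few lines; the `lam` and `beta` defaults below are illustrative placeholders, not the paper's tuned settings:

```python
def cident_reward(cider_score, entail_probs, lam=0.5, beta=0.5):
    """CIDEnt reward sketch: subtract a fixed penalty `lam` when the
    candidate's best entailment probability over all references falls
    below the threshold `beta`. Default values are illustrative only."""
    ent = max(entail_probs)          # Ent(w) = max_j P(entailment | r_j, w)
    if ent < beta:
        return cider_score - lam     # candidate not entailed by any reference
    return cider_score
```

Because the penalty is applied as a hard threshold, the reward is discontinuous in the entailment probability, which is one of the limitations discussed in Section 5.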

2. Entailment Loss in Inductive Logic Programming

Classical ILP relies on the model-theoretic entailment relation: a learned hypothesis $H$ supported by background knowledge $BK$ must entail all positive and none of the negative examples. This approach uses a binary loss:

$$\mathcal{L}_{\mathrm{ent}}(x, y) = \begin{cases} 0 & x = y \\ 1 & x \ne y \end{cases}$$

Such binary feedback provides no gradient signal, making search brittle and limiting scalability, especially for large, compositional programs. The Brute ILP system (Cropper et al., 2020) proposes to replace binary entailment loss with example-dependent, graded loss functions specific to the output domain, such as Levenshtein distance, Hamming distance, or Manhattan distance. This enables a best-first search guided by loss magnitude, allowing search to prefer partial solutions and incrementally improve hypotheses—resulting in the synthesis of programs 20× larger than those obtained by strict entailment-guided methods.
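The contrast between binary entailment loss and an example-dependent graded loss can be illustrated for string outputs (a minimal sketch; Brute's actual Prolog-based implementation differs):

```python
def binary_entailment_loss(pred, target):
    # Classical ILP signal: 0 iff the output is exactly correct, else 1.
    # A nearly-correct hypothesis scores no better than a useless one.
    return 0 if pred == target else 1

def levenshtein(a, b):
    """Example-dependent graded loss for string domains: edit distance.
    A hypothesis producing 'abd' for target 'abc' scores 1, not 1.0/fail,
    so search can prefer it over one producing 'xyz' (distance 3)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

Under the binary loss both `'abd'` and `'xyz'` receive loss 1 against target `'abc'`; under Levenshtein they receive 1 and 3, giving search a gradient-like preference ordering.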

3. Integration Into Optimization Frameworks

In sequence generation, entailment loss is incorporated via policy-gradient reinforcement learning. Specifically, the model defines a policy $p_\theta(w)$ over output sequences, with reward $r(w^{(s)}) = \mathrm{CIDEnt}(w^{(s)})$, and minimizes the negative expected reward:

$$L_{RL}(\theta) = -\mathbb{E}_{w^{(s)} \sim p_\theta}\left[r(w^{(s)})\right]$$

The REINFORCE estimator is used, with a baseline to reduce variance.
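A minimal Monte Carlo sketch of the baselined REINFORCE estimator follows; the one-parameter toy model with scalar score-function gradients (`grad_logp`) is an illustrative simplification, not the paper's setup:

```python
def reinforce_gradient(samples, rewards, grad_logp, baseline=None):
    """Monte Carlo estimate of the policy gradient of
    L_RL = -E[r(w)], i.e.  -E[(r(w) - b) * grad log p(w)].
    `grad_logp` maps a sample to its score-function gradient (a scalar
    here, for a one-parameter toy policy); subtracting the baseline `b`
    leaves the estimator unbiased while reducing its variance."""
    b = baseline if baseline is not None else sum(rewards) / len(rewards)
    n = len(samples)
    return -sum((r - b) * grad_logp(w) for w, r in zip(samples, rewards)) / n
```

With the default baseline set to the batch-mean reward, samples whose reward exceeds the average push the parameters toward higher probability for those outputs, and below-average samples push away.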

To maintain fluency and stability, the policy-gradient loss is combined with standard cross-entropy:

$$L_{\mathrm{MIXED}} = (1-\gamma)\,L_{XE} + \gamma\,L_{RL}$$

where $L_{XE}$ denotes the token-level cross-entropy loss, and $\gamma$ (typically 0.9990–0.9995) is set close to 1 so that the reinforcement objective dominates while the small cross-entropy term preserves fluency (Pasunuru et al., 2017).
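The mixing itself is a one-line convex combination (a sketch; `gamma` is a plain argument here, whereas in practice it may be annealed or tuned per task):

```python
def mixed_loss(xe_loss, rl_loss, gamma=0.999):
    # L_MIXED = (1 - gamma) * L_XE + gamma * L_RL.
    # gamma near 1 lets the reinforcement term dominate while the
    # residual cross-entropy weight regularizes toward fluent outputs.
    return (1 - gamma) * xe_loss + gamma * rl_loss
```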

In ILP with Brute, the loss function directly guides best-first search rather than backpropagation. The system maintains a priority queue ordered by cumulative loss and incrementally constructs programs by composing library predicates, evaluating progress on the aggregate loss across all examples (Cropper et al., 2020).
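A loss-guided best-first loop in the spirit of Brute might look like the following sketch, where `expand` (generate candidate programs by composing predicates) and `total_loss` (aggregate loss over all examples) are problem-specific callables assumed for illustration:

```python
import heapq

def best_first_search(initial, expand, total_loss, max_steps=1000):
    """Best-first search over candidate programs ordered by aggregate
    loss across examples. A zero-loss candidate solves every example
    exactly; otherwise the lowest-loss frontier node is expanded next,
    so partial solutions are preferred and improved incrementally."""
    frontier = [(total_loss(initial), 0, initial)]  # (loss, tiebreak, program)
    tick = 0                                        # tiebreak avoids comparing programs
    while frontier and max_steps > 0:
        loss, _, prog = heapq.heappop(frontier)
        if loss == 0:
            return prog
        for child in expand(prog):
            tick += 1
            heapq.heappush(frontier, (total_loss(child), tick, child))
        max_steps -= 1
    return None
```

With a binary entailment loss the queue ordering degenerates (every incorrect candidate ties at loss 1), which is precisely why the graded, example-dependent losses above make best-first search effective.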

4. Empirical Analyses and Quantitative Impact

In video captioning, integrating CIDEnt rewards yields statistically significant improvements in both automatic metrics and human relevance judgments. On the MSR-VTT dataset, BLEU-4, METEOR, ROUGE-L, and CIDEr scores all increase over both cross-entropy and pure CIDEr-RL baselines: e.g., BLEU-4 rises from 38.6 (XE) to 40.5 (CIDEnt-RL), and CIDEr from 44.6 to 51.7, establishing a new state-of-the-art. Human pairwise judgments further favor CIDEnt-RL over CIDEr-RL, while fluency remains unchanged (Pasunuru et al., 2017).

In symbolic ILP, Brute (example-dependent loss) consistently outperforms strictly entailment-based systems on robot planning, string transformation, and ASCII-art synthesis (size up to 79 clauses, compared to 5 in Metagol). The continuous, graded loss enables Brute to solve more challenging tasks, learn larger programs, and exhibit competitive or superior accuracy and runtime (Cropper et al., 2020).

| Task / Metric | Binary entailment (Metagol) | Example-dependent loss (Brute) |
|---|---|---|
| Max program size (clauses) | 5 | 79 |
| Robot planning, 10×10 (% solved) | 26 ± 8.7 | 67 ± 9.4 |
| String transformation, 1 example (% accuracy) | 69.8 ± 0.8 | 62.7 ± 2.5 |
| ASCII art, $n=5$ (% solved) | 0 | 25 |

5. Limitations and Open Directions

Entailment-based losses depend critically on the scope and accuracy of the entailment model. For natural language, CIDEnt assumes an entailment classifier trained on datasets like SNLI; for nontraditional or out-of-domain applications (e.g., document summarization), this creates domain mismatch and limits generalizability. The CIDEnt reward also employs hard thresholding on entailment score and a fixed penalty, which constrains flexibility. Future work may consider soft weighting or learned thresholds and joint training of the entailment and captioning models. Analogously, in ILP, the choice of loss function and search heuristics strongly influence both sample efficiency and program size. Further research may explore automatic loss design, meta-learning over losses, and joint learning of predicate libraries and program structure to maximize scalability (Pasunuru et al., 2017, Cropper et al., 2020).

6. Comparative Perspective and Thematic Synthesis

Both neural and symbolic domains demonstrate the limitations of pure entailment loss—namely, its inability to grade outputs and guide optimization or search effectively. Recent work in both areas adapts entailment from a binary constraint to a flexible, domain-sensitive loss, unifying classical symbolic induction and modern sequence modeling under a common principle: loss grounded in the semantic relationship between prediction and supervision. This generalization enables more robust, scalable, and human-aligned learning, and suggests a convergence between symbolic and differentiable approaches to logical supervision.

References

  • Pasunuru, R. and Bansal, M. (2017). "Reinforced Video Captioning with Entailment Rewards." EMNLP 2017.
  • Cropper, A. and Dumančić, S. (2020). "Learning Large Logic Programs by Going Beyond Entailment." IJCAI 2020.
