
NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals

Published 12 Feb 2025 in cs.CL (arXiv:2502.08080v2)

Abstract: Decomposition of text into atomic propositions is a flexible framework allowing for the closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model's consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.

Summary

  • The paper introduces a novel atomic hypothesis decomposition method that breaks down NLI tasks into precise sub-problems for detailed logical analysis.
  • The paper demonstrates that large language models, while excelling in overall NLI accuracy, often falter in maintaining inferential consistency across atomic propositions.
  • The paper highlights the need for refined benchmarks and training approaches to enhance models' robust understanding of logical inferences.

Analyzing Atomic Hypothesis Decomposition in Natural Language Inference

This paper presents a detailed study on the application of atomic hypothesis decomposition to natural language inference (NLI), specifically focusing on traditional NLI and defeasible NLI. The authors introduce a novel methodology where hypotheses in these tasks are decomposed into atomic propositions, which form granular sub-problems. This decomposition allows researchers to dissect the logical structure of NLI tasks, assess model consistency, and evaluate dataset diversity.
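The decomposition described above can be sketched with simple data structures: a full NLI example carries a premise, a hypothesis, and a gold label, and each atomic proposition extracted from the hypothesis becomes its own sub-problem against the same premise. The class names, fields, and the example sentences below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AtomicSubProblem:
    premise: str
    atom: str   # one atomic proposition extracted from the hypothesis
    label: str  # entailment / neutral / contradiction

@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str
    atoms: list = field(default_factory=list)

    def decompose(self, atomic_propositions):
        """Attach one sub-problem per atomic proposition of the hypothesis."""
        self.atoms = [
            AtomicSubProblem(self.premise, atom, label)
            for atom, label in atomic_propositions
        ]
        return self.atoms

# Hypothetical example: the hypothesis splits into two atomic propositions,
# each of which can be judged against the premise independently.
example = NLIExample(
    premise="A man in a red shirt is playing guitar on stage.",
    hypothesis="A musician in a red shirt performs.",
    label="entailment",
)
subs = example.decompose([
    ("A musician performs.", "entailment"),
    ("The performer wears a red shirt.", "entailment"),
])
print(len(subs))  # 2
```

Each sub-problem can then be posed to a model on its own, which is what enables the fine-grained consistency analysis the paper performs.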

The primary contribution of this research is the examination of large language models' (LLMs') performance on these atomic sub-problems. Despite the high accuracy of LLMs on some benchmarks, the study highlights that these models often struggle to maintain logical consistency when dealing with atomic propositions. For example, the models demonstrated notable inconsistency between atomic sub-problems and overall NLI predictions, particularly when predictions were incorrect. This inconsistency suggests a gap in models' holistic understanding of inferential reasoning.
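The kind of inconsistency described here can be made concrete with a simple logical compatibility check between a model's full-hypothesis prediction and its predictions on the atoms. The rules below are a simplification I am assuming for illustration (the paper may use a different or richer set of constraints): an entailed hypothesis requires every atom to be entailed, and a contradicted atom forces the full hypothesis to be contradicted.

```python
def consistent(full_label, atom_labels):
    """Check whether predictions on atomic sub-problems are logically
    compatible with the prediction on the full hypothesis.

    Simplified rules assumed here:
      - full 'entailment' requires every atom to be entailed
      - any atom labeled 'contradiction' forces the full label to be
        'contradiction' (the hypothesis as a whole cannot hold)
    """
    if full_label == "entailment":
        return all(a == "entailment" for a in atom_labels)
    if "contradiction" in atom_labels:
        return full_label == "contradiction"
    return True

# A model that entails the full hypothesis but judges one atom neutral
# is being logically inconsistent under these rules.
print(consistent("entailment", ["entailment", "neutral"]))       # False
print(consistent("contradiction", ["neutral", "contradiction"])) # True
```

Counting how often a model's predictions violate such constraints gives a direct, accuracy-independent measure of the consistency gap the authors report.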

In exploring defeasible NLI, the authors introduce the concept of critical atomic sub-problems. Critical atoms represent the primary inferences evaluated in an NLI example and can be used to measure inferential consistency across varying contexts. The paper shows that while some models excel on full NLI tasks, they often perform worse on atomic sub-problems, suggesting that these models rely on contextual cues rather than a genuine understanding of the atomic inferences.

One of the key takeaways is the introduction of inferential consistency as a measure of model robustness. By evaluating how models handle multiple contexts sharing the same critical inference, the paper provides insights into the limitations of current models and suggests areas for further research. This metric is particularly useful for identifying whether a model has internalized certain types of knowledge, reducing its susceptibility to context-specific variations.
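Following the abstract's description, inferential consistency captures whether a model makes uniformly correct or uniformly incorrect predictions about the same critical inference across different contexts. One plausible way to operationalize this, which is my assumption rather than the paper's published formula, is the fraction of critical atoms whose per-context predictions all agree in correctness:

```python
from collections import defaultdict

def inferential_consistency(records):
    """records: iterable of (critical_atom, correct) pairs, one per context.

    Returns the fraction of critical atoms whose predictions are uniformly
    correct or uniformly incorrect across all contexts in which they appear.
    """
    by_atom = defaultdict(list)
    for atom, correct in records:
        by_atom[atom].append(correct)
    uniform = sum(
        1 for outcomes in by_atom.values()
        if all(outcomes) or not any(outcomes)
    )
    return uniform / len(by_atom)

# Hypothetical predictions: one atom judged the same way in both contexts,
# the other flipping between correct and incorrect.
records = [
    ("the person is outdoors", True),
    ("the person is outdoors", True),
    ("the dog is large", True),
    ("the dog is large", False),
]
print(inferential_consistency(records))  # 0.5
```

Under this sketch, a model that has genuinely internalized a fact scores high even if it is wrong, since it is wrong everywhere; only context-dependent flip-flopping lowers the score.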

The implications of this work are both theoretical and practical. Theoretically, the findings challenge the efficacy of existing models in understanding complex reasoning tasks and highlight the need for more robust evaluation metrics. Practically, incorporating atomic decomposition into dataset design can lead to improved benchmarks that better capture the nuances of natural language understanding.

Future research should focus on improving models' ability to consistently handle atomic inferences across varying contexts. This could involve developing more sophisticated training paradigms that emphasize understanding over memorization. Additionally, exploring the generation of more diverse datasets that accurately reflect the wide range of inferential contexts present in human reasoning could significantly advance the field.

In conclusion, this paper paves the way for a more fine-grained evaluation of NLI models, offering a framework that goes beyond traditional accuracy metrics and explores the models' understanding of inferential consistency. This approach has the potential to significantly influence future developments in AI-driven language understanding systems.


Authors (2)
