Logical-CommonSenseQA Benchmark
- LOGICAL-COMMONSENSEQA is a benchmark that recasts commonsense question answering as a compositional plausibility judgment using logical operators.
- It pairs atomic statements using operators like AND, OR, and NEITHER/NOR to systematically probe LLMs' ability to reason beyond surface-level cues.
- Empirical findings highlight strong performance in conjunctive/disjunctive settings but reveal notable challenges in negation and mixed operator scenarios.
LOGICAL-COMMONSENSEQA is a benchmark that recasts commonsense question answering as a compositional plausibility judgment over logical pairs of atomic statements, systematically probing the ability of LLMs to perform logical reasoning beyond surface-level plausibility estimation. Unlike classic benchmarks that rely on single-label or binary (yes/no) evaluation, LOGICAL-COMMONSENSEQA requires models to determine how two independent, human-validated propositions compose under explicit logical operators: AND (conjunction), OR (disjunction), and NEITHER/NOR (joint negation), with plausibility defined quantitatively.
1. Formal Task Definition and Logical Operators
LOGICAL-COMMONSENSEQA is structured around pairs of atomic statements (s1, s2), each representing an independent, plausible or implausible commonsense proposition. The core technical object is the plausibility function P, mapping each atomic statement s to a normalized score P(s) in [0, 1] that encodes the degree to which it aligns with commonsense world knowledge.
Plausibility composition is governed by three logical operators, each with explicit semantics:
- Conjunction (AND):
Both s1 and s2 must be independently plausible; joint plausibility is capped by the less typical statement.
- Disjunction (OR):
Partial plausibility holds if either s1 or s2 is plausible.
- Negation (NEITHER/NOR):
Joint implausibility requires both statements to lack plausibility.
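The three operator semantics above can be sketched as composition functions over atomic plausibility scores. The min/max forms below are illustrative assumptions consistent with the informal descriptions ("capped by the less typical statement", "either is plausible", "both lack plausibility"), not the benchmark's exact quantitative definitions.

```python
# Hedged sketch of operator-level plausibility composition.
# The min/max functional forms are assumptions, not the paper's definitions.

def p_and(p1: float, p2: float) -> float:
    """AND: joint plausibility is capped by the less plausible statement."""
    return min(p1, p2)

def p_or(p1: float, p2: float) -> float:
    """OR: the pair is as plausible as its more plausible member."""
    return max(p1, p2)

def p_nor(p1: float, p2: float) -> float:
    """NEITHER/NOR: high only when both statements lack plausibility."""
    return min(1.0 - p1, 1.0 - p2)
```

Under this reading, an AND pair is penalized by its weaker member, while a NEITHER/NOR pair scores highly only when both atomic scores are low.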
Each instance presents a multiple-choice question with four labeled options (A–D), each an unordered pair under a specified operator. The benchmark preserves the original CommonsenseQA multiple-choice format but reinterprets selection as an operator-conditioned joint inference task (Junias et al., 23 Jan 2026).
2. Dataset Construction and Validation Methodology
LOGICAL-COMMONSENSEQA is derived by automated extension and transformation of the CommonsenseQA dataset:
- Candidate Generation: 5,000 CommonsenseQA items are selected. GPT-4o-mini is used to generate 4–6 diverse atomic answer candidates per question, mixing both plausible and implausible alternatives.
- Refinement and Pruning: Logically incoherent or trivial options are filtered out. Three plausible and four implausible candidate statements per question are retained, ensuring multi-step reasoning is required.
- Deterministic Logical Pairing: All unordered pairs of refined atomic options are formed and programmatically labeled with one of the three operators (AND, OR, NEITHER/NOR), with an additional MIXED condition—per-instance random assignment of operators.
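The deterministic pairing step can be sketched as follows; the function name, the per-pair handling of the MIXED split, and the balancing details are hypothetical, standing in for the paper's unspecified implementation.

```python
import itertools
import random

OPERATORS = ("AND", "OR", "NEITHER/NOR")

def build_pairs(options, seed=0):
    """Form all unordered pairs of refined atomic options; label each pair
    with every fixed operator, plus a MIXED instance whose operator is
    drawn at random per pair. Illustrative sketch only."""
    rng = random.Random(seed)
    instances = []
    for s1, s2 in itertools.combinations(options, 2):
        for op in OPERATORS:                               # fixed-operator splits
            instances.append((s1, s2, op))
        instances.append((s1, s2, rng.choice(OPERATORS)))  # MIXED split
    return instances
```

With the seven retained candidates per question, `itertools.combinations` yields 21 unordered pairs, from which the stratified splits are drawn.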
The benchmark totals 19,996 instances, stratified equally among operators:
| Operator | Instances | Train | Dev | Test |
|---|---|---|---|---|
| AND | 4,999 | 2,999 | 1,500 | 500 |
| OR | 4,999 | 2,999 | 1,500 | 500 |
| NEITHER/NOR | 4,999 | 2,999 | 1,500 | 500 |
| MIXED | 4,999 | 2,999 | 1,500 | 500 |
Human validation is performed on 5% of Stage 2 items (250 questions) using an “awareness–consensus” protocol: each atomic option is separately judged for personal belief and perceived social agreement, with adjudication of annotator disagreement. Accuracy of assigned labels on the human-validated subset:
- AND: 89.2%
- OR: 96.4%
- NEITHER/NOR: 73.6%
- MIXED: 88.4%
Inter-annotator agreement is moderate (Cohen’s κ = 0.49) (Junias et al., 23 Jan 2026).
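Cohen's kappa, the agreement statistic reported above, can be computed from two annotators' label sequences as in this standard-definition sketch (not the paper's evaluation code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

A value of 0.49 sits in the conventional "moderate agreement" band, reflecting the genuine difficulty of judging joint plausibility.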
3. Benchmark Instances and Plausibility Semantics
The benchmark instances exemplify logical composition over human-centric commonsense scenarios:
- Conjunctive Example (AND):
- Q: “Sammy wanted to go where the people were. Where might he go?”
- A: “local events AND social venues” (both independently plausible)
- Disjunctive Example (OR):
- Q: “Where might Sammy go if he just wanted a public space?”
- A: “local events OR empty parks” (at least one plausible)
- Negation Example (NEITHER/NOR):
- Q: “Sammy wanted a quiet spot away from crowds. Which is not plausible?”
- A: “NEITHER quiet retreats NOR empty parks” (both jointly implausible)
The formal semantics enforce operator-conditioned selection: the correct option is the pair whose joint plausibility satisfies the constraint targeted by the specified operator.
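Operator-conditioned selection can be sketched as scoring each labeled option by composing its two atomic plausibilities under the stated operator and taking the argmax. The min/max compositions here are illustrative assumptions, not the benchmark's exact definitions.

```python
# Hypothetical selection rule over options A-D, each an unordered pair
# of atomic statements with (assumed) known plausibility scores.

COMPOSE = {
    "AND": lambda p1, p2: min(p1, p2),               # capped by weaker member
    "OR": lambda p1, p2: max(p1, p2),                # carried by stronger member
    "NEITHER/NOR": lambda p1, p2: min(1.0 - p1, 1.0 - p2),  # both implausible
}

def select_option(options, operator):
    """options: dict mapping label -> (P(s1), P(s2)); returns the label
    whose composed plausibility best matches the operator's constraint."""
    f = COMPOSE[operator]
    return max(options, key=lambda lbl: f(*options[lbl]))
```

In the Sammy examples, a pair of two plausible venues wins under AND, a pair with one strong member wins under OR, and the jointly implausible pair wins under NEITHER/NOR.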
4. Evaluation Paradigms and Empirical Findings
Model evaluations probe both zero-shot and targeted adaptation settings:
- Zero-Shot Prompting: Decoder-only models (LLaMA-3.3-70B, LLaMA-3.1-8B, Qwen2.5-7B) are given only the question and composite options.
- Few-Shot Prompting: Same LLMs, with 1–3 in-context demonstrations instantiated for the operator under test.
- Chain-of-Thought (CoT) Prompting: For LLaMA-3.1-8B, includes explicit reasoning steps.
- Supervised Fine-Tuning: Encoder-only (DeBERTa-v3-base) and encoder–decoder (Flan-T5-base, Entailer-11B) models trained on the complete logical dataset.
All decoding uses temperature 0.0, top-p = 0.9; outputs must be one of {A, B, C, D}.
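The constraint that outputs must be one of {A, B, C, D} implies an extraction/validation step over raw completions; a minimal sketch (the paper does not specify its exact parsing logic):

```python
import re

def parse_answer(output: str):
    """Extract the first standalone option letter (A-D) from a model
    completion; returns None when no valid choice is found."""
    m = re.search(r"\b([ABCD])\b", output.strip())
    return m.group(1) if m else None
```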
Macro-F1 scores on the human-validated test subset (zero-shot):
| Model | AND | OR | NEITHER/NOR | MIXED |
|---|---|---|---|---|
| LLaMA-3.3-70B | 80.9 | 70.9 | 13.4 | 53.0 |
| LLaMA-3.1-8B | 71.9 | 62.2 | 13.1 | 41.8 |
| Qwen2.5-7B | 79.6 | 68.9 | 12.9 | 53.2 |
With supervised fine-tuning:
| Model | AND | OR | NEITHER/NOR | MIXED |
|---|---|---|---|---|
| Flan-T5-base | 92.8 | 92.4 | 89.2 | 89.6 |
| DeBERTa-v3-base | 87.6 | 87.2 | 84.8 | 82.4 |
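The macro-F1 metric used in both tables averages per-class F1 over the four option labels with equal weight; a standard-definition sketch, not the paper's evaluation code:

```python
def macro_f1(gold, pred, labels=("A", "B", "C", "D")):
    """Macro-averaged F1: per-class precision/recall/F1, averaged
    unweighted over the label set."""
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)
```

Because each class contributes equally, a model that collapses onto a single answer letter is penalized heavily, which makes the near-zero NEITHER/NOR scores above especially telling.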
A pronounced asymmetry is observed: LLMs perform well on conjunction and moderately on disjunction, but exhibit sharp degradation on negation (NEITHER/NOR)—even with chain-of-thought prompting. Supervised models close the gap across operators but error analysis reveals persistent single-statement dominance and systematic confusion in negation scenarios (Junias et al., 23 Jan 2026).
5. Error Analysis and Reasoning Pathologies
Error clusters illustrate fundamental model limitations:
- Single-Statement Dominance (AND/OR): The model frequently selects the pair containing the most plausible atomic statement, disregarding required compositionality.
- Negation Inversion (NEITHER/NOR): Under joint negation, models incorrectly pick pairs containing plausible atoms, failing to internalize the operator’s semantics—consistent across prompting regimes.
- Intermediate MIXED Difficulty: When operators are assigned randomly (MIXED condition), performance drops relative to the AND/OR settings but remains well above pure negation, where reasoning quality collapses.
These error patterns highlight the difference between surface plausibility modeling and operator-grounded logical composition.
6. Benchmark Context and Connections
LOGICAL-COMMONSENSEQA builds on the classic CommonsenseQA single-label framework but diverges through its joint, operator-conditioned inference task. In contrast to binary yes/no QA formats as exemplified in CommonsenseQA 2.0 (Talmor et al., 2022), which probe logical skills but lack multi-way compositionality, LOGICAL-COMMONSENSEQA formalizes and quantifies joint plausibility. Benchmark connection points include:
- KEAR’s external attention paradigm (Xu et al., 2021): Incorporates external knowledge in standard QA including CommonsenseQA, but is not designed for compositional operator inference.
- COM² (Fang et al., 2024): Demonstrates training on multi-hop logical queries sampled from knowledge graphs using conjunction, intersection, negation, and projection, with structured natural language verbalization. This approach suggests that extending LOGICAL-COMMONSENSEQA to richer operator sets (e.g. union, set-difference, temporal, multi-hop chains) is tractable via knowledge graph sampling and LLM verbalization.
- Contrast with SCoRE (Zhan et al., 8 Mar 2025): LOGICAL-COMMONSENSEQA emphasizes plausibility-level logical composition rather than multi-hop scenario chains or long-chain reasoning.
7. Implications and Prospective Extensions
LOGICAL-COMMONSENSEQA foregrounds a critical limitation in contemporary LLMs: strong performance in conjunctive/disjunctive settings is driven by pattern-matching over commonsense priors, while true logical composition, especially involving negation, is not reliably internalized. The benchmark's operator-specific performance stratification clarifies this gap at scale.
Recommended technical extensions include:
- Expansion of the operator set to exclusive OR, implication, temporal/causal logic, and universal/existential quantification.
- Integration of explicit operator-centric supervision and auxiliary losses targeting operator semantics.
- Neural-symbolic architectures maintaining operator-specific representations.
- Enabling generative composition: requiring models to generate new atomic statements and compose them logically.
A plausible implication is that training protocols leveraging multi-hop, multi-operator data sampled from structured knowledge graphs—as advocated in COM² (Fang et al., 2024)—could address compositional deficits observed in LOGICAL-COMMONSENSEQA.
LOGICAL-COMMONSENSEQA thus provides a controlled, validated framework for joint-plausibility evaluation, serving both as an empirical diagnostic for model compositionality and a template for advancing the logical expressiveness of future commonsense QA systems (Junias et al., 23 Jan 2026).