Logical-CommonSenseQA Benchmark

Updated 30 January 2026
  • LOGICAL-COMMONSENSEQA is a benchmark that recasts commonsense question answering as a compositional plausibility judgment using logical operators.
  • It pairs atomic statements using operators like AND, OR, and NEITHER/NOR to systematically probe LLMs' ability to reason beyond surface-level cues.
  • Empirical findings highlight strong performance in conjunctive/disjunctive settings but reveal notable challenges in negation and mixed operator scenarios.

LOGICAL-COMMONSENSEQA is a benchmark that recasts commonsense question answering as a compositional plausibility judgment over logical pairs of atomic statements, systematically probing the ability of LLMs to perform logical reasoning beyond surface-level plausibility estimation. Unlike classic benchmarks that rely on single-label or binary (yes/no) evaluation, LOGICAL-COMMONSENSEQA requires models to determine how two independent, human-validated propositions compose under explicit logical operators: AND (conjunction), OR (disjunction), and NEITHER/NOR (joint negation), with plausibility defined quantitatively.

1. Formal Task Definition and Logical Operators

LOGICAL-COMMONSENSEQA is structured around pairs of atomic statements $S_1$, $S_2$, each representing an independent, plausible or implausible commonsense proposition. The core technical object is the plausibility function $P(\cdot)$, mapping each atomic statement $S$ to a normalized score in $[0,1]$ encoding the degree to which it aligns with commonsense world knowledge.

Plausibility composition is governed by three logical operators, each with explicit semantics:

  • Conjunction (AND):

$P(S_1 \,\mathbf{AND}\, S_2) = \min(P(S_1), P(S_2))$

Both $S_1$ and $S_2$ must be independently plausible; joint plausibility is capped by the less plausible statement.

  • Disjunction (OR):

$P(S_1 \,\mathbf{OR}\, S_2) = \max(P(S_1), P(S_2))$

Partial plausibility holds if either $S_1$ or $S_2$ is plausible.

  • Negation (NEITHER/NOR):

$P(\mathbf{NEITHER}\;S_1\;\mathbf{NOR}\;S_2) = 1 - \max(P(S_1), P(S_2))$

Joint implausibility requires both statements to lack plausibility.

Each instance presents a multiple-choice question with four labeled options (A–D), each an unordered pair $(S_1, S_2)$ under a specified operator. The benchmark preserves the original CommonsenseQA multiple-choice format but reinterprets selection as an operator-conditioned joint inference task (Junias et al., 23 Jan 2026).

2. Dataset Construction and Validation Methodology

LOGICAL-COMMONSENSEQA is derived by automated extension and transformation of the CommonsenseQA dataset:

  • Candidate Generation: 5,000 CommonsenseQA items are selected. GPT-4o-mini is used to generate 4–6 diverse atomic answer candidates per question, mixing both plausible and implausible alternatives.
  • Refinement and Pruning: Logically incoherent or trivial options are filtered out. Three plausible and four implausible candidate statements per question are retained, ensuring multi-step reasoning is required.
  • Deterministic Logical Pairing: All unordered pairs of refined atomic options are formed and programmatically labeled with one of the three operators (AND, OR, NEITHER/NOR), with an additional MIXED condition—per-instance random assignment of operators.
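The deterministic pairing step can be sketched as follows. This is a hedged reconstruction of the described procedure, not the authors' pipeline code; the function and field names are assumptions.

```python
import itertools
import random

OPERATORS = ("AND", "OR", "NEITHER/NOR")

def pair_options(options, operator=None, seed=0):
    """Form all unordered pairs of refined atomic options and label each
    with `operator`; if operator is None (the MIXED condition), assign
    one of the three operators at random per pair, reproducibly."""
    rng = random.Random(seed)
    instances = []
    for s1, s2 in itertools.combinations(options, 2):
        op = operator if operator is not None else rng.choice(OPERATORS)
        instances.append({"pair": (s1, s2), "operator": op})
    return instances
```

With the 3 plausible and 4 implausible retained candidates per question, this enumeration yields 21 unordered pairs per question before any downstream sampling.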

The benchmark totals 19,996 instances, stratified equally among operators:

Operator       Instances   Train   Dev     Test
AND            4,999       2,999   1,500   500
OR             4,999       2,999   1,500   500
NEITHER/NOR    4,999       2,999   1,500   500
MIXED          4,999       2,999   1,500   500

Human validation is performed on 5% of Stage 2 items (250 questions) using an “awareness–consensus” protocol: each atomic option is separately judged for personal belief and perceived social agreement, with adjudication of annotator disagreement. Accuracy of assigned labels on the human-validated subset:

  • AND: 89.2%
  • OR: 96.4%
  • NEITHER/NOR: 73.6%
  • MIXED: 88.4%

Inter-annotator agreement is moderate (Cohen’s κ = 0.49) (Junias et al., 23 Jan 2026).
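For reference, Cohen's κ compares observed agreement between two annotators against the agreement expected by chance from their label distributions. A stdlib-only sketch (illustrative, not the paper's evaluation code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    chance = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return (observed - chance) / (1 - chance)
```

A value of 0.49 falls in the conventionally "moderate" band (0.41–0.60), consistent with the subjectivity of plausibility judgments.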

3. Benchmark Instances and Plausibility Semantics

The benchmark instances exemplify logical composition over human-centric commonsense scenarios:

  • Conjunctive Example (AND):
    • Q: “Sammy wanted to go where the people were. Where might he go?”
    • A: “local events AND social venues” (both independently plausible)
  • Disjunctive Example (OR):
    • Q: “Where might Sammy go if he just wanted a public space?”
    • A: “local events OR empty parks” (at least one plausible)
  • Negation Example (NEITHER/NOR):
    • Q: “Sammy wanted a quiet spot away from crowds. Which is not plausible?”
    • A: “NEITHER quiet retreats NOR empty parks” (both jointly implausible)

The formal semantics enforce operator-conditional selection based on the joint plausibility $P(S_1 \oplus S_2)$ matching the targeted constraint, where $\oplus$ denotes the specified operator.
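Under these semantics, answer selection reduces to picking the option whose composed plausibility best satisfies the operator. A minimal sketch, assuming per-atom plausibility scores are available (the dictionary format and function names are illustrative):

```python
def compose(p1, p2, op):
    """Joint plausibility under the benchmark's three operators."""
    if op == "AND":
        return min(p1, p2)
    if op == "OR":
        return max(p1, p2)
    if op == "NEITHER/NOR":
        return 1.0 - max(p1, p2)
    raise ValueError(f"unknown operator: {op}")

def select_answer(options, op):
    """Pick the option label (A-D) whose pair has maximal composed
    plausibility under `op`. `options` maps labels to (P(S1), P(S2))."""
    return max(options, key=lambda label: compose(*options[label], op))
```

Note how the winning option changes with the operator for the same scores: AND rewards two moderately plausible atoms, OR rewards one highly plausible atom, and NEITHER/NOR rewards two implausible atoms.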

4. Evaluation Paradigms and Empirical Findings

Model evaluations probe both zero-shot and targeted adaptation settings:

  • Zero-Shot Prompting: Decoder-only models (LLaMA-3.3-70B, LLaMA-3.1-8B, Qwen2.5-7B) are given only the question and composite options.
  • Few-Shot Prompting: Same LLMs, with 1–3 in-context demonstrations instantiated for the operator under test.
  • Chain-of-Thought (CoT) Prompting: For LLaMA-3.1-8B, includes explicit reasoning steps.
  • Supervised Fine-Tuning: Encoder-only (DeBERTa-v3-base) and encoder–decoder (Flan-T5-base, Entailer-11B) models trained on the complete logical dataset.

All decoding uses temperature 0.0, top-p = 0.9; outputs must be one of {A, B, C, D}.
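Constraining free-form completions to the label set {A, B, C, D} requires a parsing step. A minimal sketch of such a normalizer (an assumption about the harness, not the paper's exact parser):

```python
import re

def parse_choice(text: str):
    """Extract the first standalone A/B/C/D label from a model
    completion; return None when no valid label is found."""
    match = re.search(r"\b([ABCD])\b", text.strip().upper())
    return match.group(1) if match else None
```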

Macro-F1 scores on the human-validated test subset (zero-shot):

Model           AND    OR     NEITHER/NOR   MIXED
LLaMA-3.3-70B   80.9   70.9   13.4          53.0
LLaMA-3.1-8B    71.9   62.2   13.1          41.8
Qwen2.5-7B      79.6   68.9   12.9          53.2

With supervised fine-tuning:

Model             AND    OR     NEITHER/NOR   MIXED
Flan-T5-base      92.8   92.4   89.2          89.6
DeBERTa-v3-base   87.6   87.2   84.8          82.4
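The macro-F1 metric used in these tables weights every class equally, regardless of class frequency. A stdlib-only sketch of the computation (illustrative; evaluation pipelines typically use a library implementation):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1, then average with
    equal weight across all classes present in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)
```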

A pronounced asymmetry is observed: LLMs perform well on conjunction and moderately on disjunction, but exhibit sharp degradation on negation (NEITHER/NOR)—even with chain-of-thought prompting. Supervised models close the gap across operators but error analysis reveals persistent single-statement dominance and systematic confusion in negation scenarios (Junias et al., 23 Jan 2026).

5. Error Analysis and Reasoning Pathologies

Error clusters illustrate fundamental model limitations:

  • Single-Statement Dominance (AND/OR): Models frequently select the pair containing the single most plausible atomic statement, disregarding the required compositionality.
  • Negation Inversion (NEITHER/NOR): Under joint negation, models incorrectly pick pairs containing plausible atoms, failing to internalize the operator’s semantics—consistent across prompting regimes.
  • Intermediate MIXED Difficulty: When operators are assigned randomly (MIXED condition), performance drops below the fixed AND/OR settings but remains well above pure negation, where reasoning quality collapses.

These error patterns highlight the difference between surface plausibility modeling and operator-grounded logical composition.

6. Benchmark Context and Connections

LOGICAL-COMMONSENSEQA builds on the classic CommonsenseQA single-label framework but diverges through its joint, operator-conditioned inference task. In contrast to binary yes/no QA formats as exemplified in CommonsenseQA 2.0 (Talmor et al., 2022), which probe logical skills but lack multi-way compositionality, LOGICAL-COMMONSENSEQA formalizes and quantifies joint plausibility. Benchmark connection points include:

  • KEAR’s external attention paradigm (Xu et al., 2021): Incorporates external knowledge in standard QA including CommonsenseQA, but is not designed for compositional operator inference.
  • COM² (Fang et al., 2024): Demonstrates training on multi-hop logical queries sampled from knowledge graphs using conjunction, intersection, negation, and projection, with structured natural language verbalization. This approach suggests that extending LOGICAL-COMMONSENSEQA to richer operator sets (e.g. union, set-difference, temporal, multi-hop chains) is tractable via knowledge graph sampling and LLM verbalization.
  • Contrast to SCoRE (Zhan et al., 8 Mar 2025): LOGICAL-COMMONSENSEQA emphasizes plausibility-level logical composition rather than multi-hop scenario chains or long-chain reasoning.

7. Implications and Prospective Extensions

LOGICAL-COMMONSENSEQA foregrounds a critical limitation in contemporary LLMs: strong performance in conjunctive/disjunctive settings is driven by pattern-matching over commonsense priors, while true logical composition, especially involving negation, is not reliably internalized. The benchmark's operator-specific performance stratification clarifies this gap at scale.

Recommended technical extensions include:

  • Expansion of the operator set to exclusive OR, implication, temporal/causal logic, and universal/existential quantification.
  • Integration of explicit operator-centric supervision and auxiliary losses targeting operator semantics.
  • Neural-symbolic architectures maintaining operator-specific representations.
  • Enabling generative composition: requiring models to generate new atomic statements and compose them logically.

A plausible implication is that training protocols leveraging multi-hop, multi-operator data sampled from structured knowledge graphs—as advocated in COM² (Fang et al., 2024)—could address compositional deficits observed in LOGICAL-COMMONSENSEQA.

LOGICAL-COMMONSENSEQA thus provides a controlled, validated framework for joint-plausibility evaluation, serving both as an empirical diagnostic for model compositionality and a template for advancing the logical expressiveness of future commonsense QA systems (Junias et al., 23 Jan 2026).
