Logical-CommonSenseQA Benchmark
- LOGICAL-COMMONSENSEQA is a benchmark that recasts commonsense question answering as a compositional plausibility judgment using logical operators.
- It pairs atomic statements using operators like AND, OR, and NEITHER/NOR to systematically probe LLMs' ability to reason beyond surface-level cues.
- Empirical findings highlight strong performance in conjunctive/disjunctive settings but reveal notable challenges in negation and mixed operator scenarios.
LOGICAL-COMMONSENSEQA is a benchmark that recasts commonsense question answering as a compositional plausibility judgment over logical pairs of atomic statements, systematically probing the ability of LLMs to perform logical reasoning beyond surface-level plausibility estimation. Unlike classic benchmarks that rely on single-label or binary (yes/no) evaluation, LOGICAL-COMMONSENSEQA requires models to determine how two independent, human-validated propositions compose under explicit logical operators: AND (conjunction), OR (disjunction), and NEITHER/NOR (joint negation), with plausibility defined quantitatively.
1. Formal Task Definition and Logical Operators
LOGICAL-COMMONSENSEQA is structured around pairs of atomic statements (s1, s2), each representing an independent, plausible or implausible commonsense proposition. The core technical object is the plausibility function P, mapping each atomic statement s to a normalized score P(s) in [0, 1] that encodes the degree to which it aligns with commonsense world knowledge.
Plausibility composition is governed by three logical operators, each with explicit semantics:
- Conjunction (AND):
Both s1 and s2 must be independently plausible; joint plausibility is capped by the less typical statement.
- Disjunction (OR):
Partial plausibility holds if either s1 or s2 is plausible.
- Negation (NEITHER/NOR):
Joint implausibility requires both statements to lack plausibility.
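The three operator semantics above can be sketched as composition functions over atomic plausibility scores. The min/max forms below are illustrative assumptions consistent with the informal descriptions ("capped by the less typical statement", "either is plausible", "both lack plausibility"), not the benchmark's exact quantitative definitions.

```python
# Hedged sketch of operator-level plausibility composition.
# The min/max functional forms are assumptions, not the paper's definitions.

def p_and(p1: float, p2: float) -> float:
    """AND: joint plausibility is capped by the less plausible statement."""
    return min(p1, p2)

def p_or(p1: float, p2: float) -> float:
    """OR: the pair is as plausible as its more plausible member."""
    return max(p1, p2)

def p_nor(p1: float, p2: float) -> float:
    """NEITHER/NOR: high only when both statements lack plausibility."""
    return min(1.0 - p1, 1.0 - p2)
```

Under this reading, an AND pair is penalized by its weaker member, while a NEITHER/NOR pair scores highly only when both atomic scores are low.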
Each instance presents a multiple-choice question with four labeled options (A–D), each an unordered pair under a specified operator. The benchmark preserves the original CommonsenseQA multiple-choice format but reinterprets selection as an operator-conditioned joint inference task (Junias et al., 23 Jan 2026).
2. Dataset Construction and Validation Methodology
LOGICAL-COMMONSENSEQA is derived by automated extension and transformation of the CommonsenseQA dataset:
- Candidate Generation: 5,000 CommonsenseQA items are selected. GPT-4o-mini is used to generate 4–6 diverse atomic answer candidates per question, mixing both plausible and implausible alternatives.
- Refinement and Pruning: Logically incoherent or trivial options are filtered out. Three plausible and four implausible candidate statements per question are retained, ensuring multi-step reasoning is required.
- Deterministic Logical Pairing: All unordered pairs of refined atomic options are formed and programmatically labeled with one of the three operators (AND, OR, NEITHER/NOR), with an additional MIXED condition—per-instance random assignment of operators.
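The deterministic pairing step can be sketched as follows; the function name, the per-pair handling of the MIXED split, and the balancing details are hypothetical, standing in for the paper's unspecified implementation.

```python
import itertools
import random

OPERATORS = ("AND", "OR", "NEITHER/NOR")

def build_pairs(options, seed=0):
    """Form all unordered pairs of refined atomic options; label each pair
    with every fixed operator, plus a MIXED instance whose operator is
    drawn at random per pair. Illustrative sketch only."""
    rng = random.Random(seed)
    instances = []
    for s1, s2 in itertools.combinations(options, 2):
        for op in OPERATORS:                               # fixed-operator splits
            instances.append((s1, s2, op))
        instances.append((s1, s2, rng.choice(OPERATORS)))  # MIXED split
    return instances
```

With the seven retained candidates per question, `itertools.combinations` yields 21 unordered pairs, from which the stratified splits are drawn.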
The benchmark totals 19,996 instances, stratified equally among operators:
| Operator | Instances | Train | Dev | Test |
|---|---|---|---|---|
| AND | 4,999 | 2,999 | 1,500 | 500 |
| OR | 4,999 | 2,999 | 1,500 | 500 |
| NEITHER/NOR | 4,999 | 2,999 | 1,500 | 500 |
| MIXED | 4,999 | 2,999 | 1,500 | 500 |
Human validation is performed on 5% of Stage 2 items (250 questions) using an “awareness–consensus” protocol: each atomic option is separately judged for personal belief and perceived social agreement, with adjudication of annotator disagreement. Accuracy of assigned labels on the human-validated subset:
- AND: 89.2%
- OR: 96.4%
- NEITHER/NOR: 73.6%
- MIXED: 88.4%
Inter-annotator agreement is moderate (Cohen’s κ = 0.49) (Junias et al., 23 Jan 2026).
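Cohen's kappa, the agreement statistic reported above, can be computed from two annotators' label sequences as in this standard-definition sketch (not the paper's evaluation code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

A value of 0.49 sits in the conventional "moderate agreement" band, reflecting the genuine difficulty of judging joint plausibility.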
3. Benchmark Instances and Plausibility Semantics
The benchmark instances exemplify logical composition over human-centric commonsense scenarios:
- Conjunctive Example (AND):
- Q: “Sammy wanted to go where the people were. Where might he go?”
- A: “local events AND social venues” (both independently plausible)
- Disjunctive Example (OR):
- Q: “Where might Sammy go if he just wanted a public space?”
- A: “local events OR empty parks” (at least one plausible)
- Negation Example (NEITHER/NOR):
- Q: “Sammy wanted a quiet spot away from crowds. Which is not plausible?”
- A: “NEITHER quiet retreats NOR empty parks” (both jointly implausible)
The formal semantics enforce operator-conditioned selection: the correct option is the pair whose joint plausibility satisfies the constraint targeted by the specified operator.
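Operator-conditioned selection can be sketched as scoring each labeled option by composing its two atomic plausibilities under the stated operator and taking the argmax. The min/max compositions here are illustrative assumptions, not the benchmark's exact definitions.

```python
# Hypothetical selection rule over options A-D, each an unordered pair
# of atomic statements with (assumed) known plausibility scores.

COMPOSE = {
    "AND": lambda p1, p2: min(p1, p2),               # capped by weaker member
    "OR": lambda p1, p2: max(p1, p2),                # carried by stronger member
    "NEITHER/NOR": lambda p1, p2: min(1.0 - p1, 1.0 - p2),  # both implausible
}

def select_option(options, operator):
    """options: dict mapping label -> (P(s1), P(s2)); returns the label
    whose composed plausibility best matches the operator's constraint."""
    f = COMPOSE[operator]
    return max(options, key=lambda lbl: f(*options[lbl]))
```

In the Sammy examples, a pair of two plausible venues wins under AND, a pair with one strong member wins under OR, and the jointly implausible pair wins under NEITHER/NOR.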
4. Evaluation Paradigms and Empirical Findings
Model evaluations probe both zero-shot and targeted adaptation settings:
- Zero-Shot Prompting: Decoder-only models (LLaMA-3.3-70B, LLaMA-3.1-8B, Qwen2.5-7B) are given only the question and composite options.
- Few-Shot Prompting: Same LLMs, with 1–3 in-context demonstrations instantiated for the operator under test.
- Chain-of-Thought (CoT) Prompting: For LLaMA-3.1-8B, includes explicit reasoning steps.
- Supervised Fine-Tuning: Encoder-only (DeBERTa-v3-base) and encoder–decoder (Flan-T5-base, Entailer-11B) models trained on the complete logical dataset.
All decoding uses temperature 0.0, top-p = 0.9; outputs must be one of {A, B, C, D}.
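The constraint that outputs must be one of {A, B, C, D} implies an extraction/validation step over raw completions; a minimal sketch (the paper does not specify its exact parsing logic):

```python
import re

def parse_answer(output: str):
    """Extract the first standalone option letter (A-D) from a model
    completion; returns None when no valid choice is found."""
    m = re.search(r"\b([ABCD])\b", output.strip())
    return m.group(1) if m else None
```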
Macro-F1 scores on the human-validated test subset (zero-shot):
| Model | AND | OR | NEITHER/NOR | MIXED |
|---|---|---|---|---|
| LLaMA-3.3-70B | 80.9 | 70.9 | 13.4 | 53.0 |
| LLaMA-3.1-8B | 71.9 | 62.2 | 13.1 | 41.8 |
| Qwen2.5-7B | 79.6 | 68.9 | 12.9 | 53.2 |
With supervised fine-tuning:
| Model | AND | OR | NEITHER/NOR | MIXED |
|---|---|---|---|---|
| Flan-T5-base | 92.8 | 92.4 | 89.2 | 89.6 |
| DeBERTa-v3-base | 87.6 | 87.2 | 84.8 | 82.4 |
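The macro-F1 metric used in both tables averages per-class F1 over the four option labels with equal weight; a standard-definition sketch, not the paper's evaluation code:

```python
def macro_f1(gold, pred, labels=("A", "B", "C", "D")):
    """Macro-averaged F1: per-class precision/recall/F1, averaged
    unweighted over the label set."""
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)
```

Because each class contributes equally, a model that collapses onto a single answer letter is penalized heavily, which makes the near-zero NEITHER/NOR scores above especially telling.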
A pronounced asymmetry is observed: LLMs perform well on conjunction and moderately on disjunction, but exhibit sharp degradation on negation (NEITHER/NOR)—even with chain-of-thought prompting. Supervised models close the gap across operators but error analysis reveals persistent single-statement dominance and systematic confusion in negation scenarios (Junias et al., 23 Jan 2026).
5. Error Analysis and Reasoning Pathologies
Error clusters illustrate fundamental model limitations:
- Single-Statement Dominance (AND/OR): The model frequently selects the pair containing the most plausible atomic statement, disregarding required compositionality.
- Negation Inversion (NEITHER/NOR): Under joint negation, models incorrectly pick pairs containing plausible atoms, failing to internalize the operator’s semantics—consistent across prompting regimes.
- Intermediate MIXED Difficulty: When operators are assigned randomly (MIXED condition), performance drops relative to the AND/OR settings but remains well above pure negation, where reasoning quality collapses.
These error patterns highlight the difference between surface plausibility modeling and operator-grounded logical composition.
6. Benchmark Context and Connections
LOGICAL-COMMONSENSEQA builds on the classic CommonsenseQA single-label framework but diverges through its joint, operator-conditioned inference task. In contrast to binary yes/no QA formats as exemplified in CommonsenseQA 2.0 (Talmor et al., 2022), which probe logical skills but lack multi-way compositionality, LOGICAL-COMMONSENSEQA formalizes and quantifies joint plausibility. Benchmark connection points include:
- KEAR’s external attention paradigm (Xu et al., 2021): Incorporates external knowledge in standard QA including CommonsenseQA, but is not designed for compositional operator inference.
- COM² (Fang et al., 2024): Demonstrates training on multi-hop logical queries sampled from knowledge graphs using conjunction, intersection, negation, and projection, with structured natural language verbalization. This approach suggests that extending LOGICAL-COMMONSENSEQA to richer operator sets (e.g. union, set-difference, temporal, multi-hop chains) is tractable via knowledge graph sampling and LLM verbalization.
- Contrast with SCoRE (Zhan et al., 8 Mar 2025): LOGICAL-COMMONSENSEQA emphasizes plausibility-level logical composition rather than multi-hop scenario chains or long-chain reasoning.
7. Implications and Prospective Extensions
LOGICAL-COMMONSENSEQA foregrounds a critical limitation in contemporary LLMs: strong performance in conjunctive/disjunctive settings is driven by pattern-matching over commonsense priors, while true logical composition, especially involving negation, is not reliably internalized. The benchmark's operator-specific performance stratification clarifies this gap at scale.
Recommended technical extensions include:
- Expansion of the operator set to exclusive OR, implication, temporal/causal logic, and universal/existential quantification.
- Integration of explicit operator-centric supervision and auxiliary losses targeting operator semantics.
- Neural-symbolic architectures maintaining operator-specific representations.
- Enabling generative composition: requiring models to generate new atomic statements and compose them logically.
A plausible implication is that training protocols leveraging multi-hop, multi-operator data sampled from structured knowledge graphs—as advocated in COM² (Fang et al., 2024)—could address compositional deficits observed in LOGICAL-COMMONSENSEQA.
LOGICAL-COMMONSENSEQA thus provides a controlled, validated framework for joint-plausibility evaluation, serving both as an empirical diagnostic for model compositionality and a template for advancing the logical expressiveness of future commonsense QA systems (Junias et al., 23 Jan 2026).