Obfus FOL: Logical Entailment Under Obfuscation
- Obfus FOL is a diagnostic benchmark that tests first-order logical entailment by applying equivalence-preserving obfuscations to premises.
- It utilizes the Logifus framework to apply controlled transformations such as De Morgan's laws and contraposition, challenging LLMs beyond pattern matching.
- Empirical results show that state-of-the-art LLMs experience up to a 42-percentage-point accuracy drop under obfuscation, highlighting gaps in genuine deductive reasoning.
Obfus FOL is a diagnostic benchmark for evaluating first-order logical entailment under systematic logical obfuscation, designed to distinguish genuine reasoning capability from superficial pattern matching in LLMs. It isolates the challenge of recognizing entailment when premise statements are rewritten using equivalence-preserving transformations that substantially alter their surface form without changing underlying logical content. Performance on Obfus FOL therefore directly tests the invariance of model reasoning to argument obfuscation and the robustness of internal logical representations (Borah et al., 1 Feb 2026).
1. Formal Definition and Task Structure
Obfus FOL consists of pairs $(P', c)$, where $P'$ is a set of obfuscated first-order logic (FOL) premises and $c$ is a conclusion. Each base problem is defined as $(P, c)$, with premise set $P$ and truth relation $P \models c$. An obfuscation function $O$ is applied to obtain $P' = O(P)$, ensuring

$$P' \equiv P \quad \text{(logical equivalence)}.$$
The model is tasked to predict whether $c$ logically follows from $P'$ (binary True/False). Since $P'$ is guaranteed to be logically equivalent to $P$, a reduction in accuracy signifies the model's sensitivity to syntactic presentation rather than to logical content.
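A minimal sketch of this task interface, assuming the definitions above (the class name, field names, and example instance are illustrative, not drawn from the dataset):

```python
from dataclasses import dataclass

@dataclass
class ObfusFOLInstance:
    premises: list[str]   # obfuscated premises P' = O(P), logically equivalent to P
    conclusion: str       # candidate conclusion c
    label: str            # "True" if P' |= c, else "False"

# Illustrative instance (hypothetical, not taken from the benchmark)
ex = ObfusFOLInstance(
    premises=["¬∃x ¬(¬¬Top10(x) ∨ ¬SellsMillion(x))"],
    conclusion="∀x (SellsMillion(x) → Top10(x))",
    label="True",
)

def score(pred: str, gold: str) -> bool:
    """Binary exact match on the True/False verdict."""
    return pred.strip().lower() == gold.strip().lower()
```

Because the label space is binary, any accuracy loss between the base and obfuscated variants can be attributed to the rewrite rather than to task difficulty.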
2. Logifus Obfuscation Mechanisms
Obfuscation in Obfus FOL is operationalized via Logifus, a framework applying controlled chains of equivalence-preserving rewrites to FOL premises. Core rule schemas include:
- Conditional ↔ Disjunction (Material Implication): $p \to q \equiv \neg p \lor q$
- De Morgan's Laws: $\neg(p \land q) \equiv \neg p \lor \neg q$ and $\neg(p \lor q) \equiv \neg p \land \neg q$
- Double Negation: $\neg\neg p \equiv p$
- Contraposition: $p \to q \equiv \neg q \to \neg p$
- Quantifier Replacement: $\forall x\,\phi(x) \equiv \neg \exists x\, \neg\phi(x)$
- Additional transformations: Distribution, biconditional expansion, introduction of redundant tautologies, absorption, and quantifier commutation.
A typical obfuscation involves chaining at least four such steps. For example, a premise of the form $\forall x\,(P(x) \to Q(x))$ may be rewritten, via nested transformations, into a form such as

$$\neg \exists x\, \neg\big(\neg\neg Q(x) \lor \neg P(x)\big).$$
This process obfuscates the original logical structure while preserving semantic entailment.
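Because every rewrite is equivalence-preserving, an obfuscation chain can be validated mechanically. A minimal propositional sketch of such a chain and its verification (the benchmark itself operates on full FOL and verifies with Prover9; this stand-in encodes formulas as nested tuples and checks equivalence by truth table):

```python
import itertools

# Formulas as nested tuples: ('not', f), ('and', f, g), ('or', f, g),
# ('imp', f, g); variables are plain strings.
def ev(f, env):
    """Evaluate a formula under a truth assignment env."""
    if isinstance(f, str):
        return env[f]
    op = f[0]
    if op == 'not':
        return not ev(f[1], env)
    if op == 'and':
        return ev(f[1], env) and ev(f[2], env)
    if op == 'or':
        return ev(f[1], env) or ev(f[2], env)
    if op == 'imp':
        return (not ev(f[1], env)) or ev(f[2], env)
    raise ValueError(op)

def equivalent(f, g, names):
    """Truth-table equivalence check over the given variable names."""
    return all(
        ev(f, dict(zip(names, vals))) == ev(g, dict(zip(names, vals)))
        for vals in itertools.product([False, True], repeat=len(names))
    )

base = ('imp', 'p', 'q')                             # p -> q
step1 = ('imp', ('not', 'q'), ('not', 'p'))          # contraposition
step2 = ('or', ('not', ('not', 'q')), ('not', 'p'))  # material implication
obfus = ('not', ('and',                              # De Morgan on step2
                 ('not', ('not', ('not', 'q'))),
                 ('not', ('not', 'p'))))

assert equivalent(base, step1, ['p', 'q'])
assert equivalent(step1, step2, ['p', 'q'])
assert equivalent(base, obfus, ['p', 'q'])
```

The final formula is unrecognizable as a conditional at the surface level, yet every truth assignment agrees with the original, which is precisely the property the benchmark exploits.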
3. Dataset Construction and Verification
The Obfus FOL subset comprises 272 questions, derived from FOLIO (Han et al. 2022) and filtered for well-formed True/False entailments. The construction procedure involves:
- Content simplification: Premises are simplified in natural language using GPT-4o; dual human annotation produced high inter-annotator agreement (Cohen’s $\kappa$).
- Formal verification: Equivalence of the premises before and after simplification is confirmed by the Prover9 automated theorem prover.
- Obfuscation and validation: Logifus transforms $P$ into $P'$, with two annotators independently verifying logical equivalence (Cohen’s $\kappa$).
Obfuscation applies nested use of the aforementioned rule set.
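The agreement statistic reported above is Cohen's kappa. A self-contained sketch of the computation on hypothetical binary annotator labels (the label values below are invented for illustration):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(a) == len(b) and a
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Hypothetical equivalence verdicts from two annotators on 10 items
ann1 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
ann2 = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
kappa = cohens_kappa(ann1, ann2)
```

Kappa discounts the agreement expected by chance, which matters here because True/False entailment labels are far from uniformly distributed.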
4. Evaluation Paradigm
A selection of state-of-the-art models—reasoning-focused (GPT-5, o4-mini, Gemini 2.5 Pro, Qwen QwQ-32B), general-purpose (GPT-4o, Claude 3.7 Sonnet), and lower-parameter baselines—is evaluated under three prompt structures: zero-shot, few-shot (4 exemplars), and chain-of-thought (CoT). The key metric is Exact-Match (EM) Accuracy:

$$\text{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big[\text{norm}(\hat{y}_i) = \text{norm}(y_i)\big]$$
Here, norm standardizes case, punctuation, and whitespace.
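A minimal sketch of this metric; the exact normalizer is not specified beyond case, punctuation, and whitespace, so the implementation below is one plausible reading:

```python
import re
import string

def norm(s: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace."""
    s = s.lower()
    s = s.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', s).strip()

def exact_match_accuracy(preds, golds):
    """Fraction of predictions that match gold labels after normalization."""
    assert len(preds) == len(golds) and golds
    return sum(norm(p) == norm(g) for p, g in zip(preds, golds)) / len(golds)

# "True." matches "true"; the other two verdicts are wrong -> accuracy 1/3
acc = exact_match_accuracy(["True.", " FALSE ", "true"], ["true", "True", "False"])
```

On a binary True/False task, EM after normalization is simply verdict accuracy, making the base-vs-obfuscated deltas in the next section directly comparable.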
5. Empirical Results and Error Taxonomy
Observed model performance degrades sharply under obfuscation:
| Model | Base EM | Obfus EM | ∆ (pp) |
|---|---|---|---|
| GPT-5 | 0.98 | 0.56 | –42 |
| GPT-5 (few-shot) | 0.87 | 0.59 | –28 |
| GPT-5 (CoT) | 0.82 | 0.61 | –21 |
| o4-mini | 0.92 | 0.68 | –24 |
| Qwen QwQ-32B | 0.77 | 0.61 | –16 |
| Claude 3.7 | 0.82 | 0.69 | –13 |
Major sources of failure include:
- Nested Negations: LLMs misinterpret constructs such as “it is not the case that if … then …,” failing to track scope and polarity.
- Implication Direction: Confusion between necessary and sufficient conditions, especially under contraposition or material implication rewrites.
- Quantifier Scope Errors: Mismanagement of $\forall$ vs. $\exists$, especially after quantifier replacement transformations.
Illustratively, in a case involving nested obfuscation of the premise “If a game sells > 1 million copies, it is in the Top 10,” GPT-5 answered False on an entailment whose correct answer is True, due to misanalysis of necessary versus sufficient conditions.
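This failure mode amounts to substituting the (non-equivalent) converse for the premise. A two-line counterexample, with variable names chosen to mirror the Top 10 example:

```python
def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b

# sells_million = SellsMillion(g), top10 = Top10(g).
# Counterexample assignment: a game in the Top 10 that never sold a million copies.
sells_million, top10 = False, True

assert implies(sells_million, top10)       # the actual premise holds here...
assert not implies(top10, sells_million)   # ...but its converse fails
```

Any model that implicitly reads the obfuscated premise as its converse can therefore flip the entailment verdict on assignments like this one.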
Mechanistic probes reveal:
- Memorization Sensitivity: Membership inference AUROC increases (from 49–53% base to 56–59% obfuscated), reflecting heavier dependence on learned surface patterns.
- Layer-wise Confidence Collapse: In transformer layers 28–31, next-token log-probabilities drop by over 80% on obfuscated inputs, indicating a failure of deep, multistep reasoning.
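Membership inference scores a model's confidence on seen versus unseen items and summarizes their separability as AUROC. A stdlib-only pairwise-rank sketch, with hypothetical scores (the probe details and score values below are assumptions for illustration):

```python
def auroc(scores_pos, scores_neg):
    """AUROC = P(random member score > random non-member score);
    ties count as 0.5. O(n*m) pairwise rank computation, stdlib only."""
    wins = ties = 0
    for s in scores_pos:
        for t in scores_neg:
            if s > t:
                wins += 1
            elif s == t:
                ties += 1
    return (wins + 0.5 * ties) / (len(scores_pos) * len(scores_neg))

# Hypothetical confidence scores: members (seen in training) vs. non-members
member = [0.9, 0.8, 0.75, 0.6]
nonmember = [0.7, 0.55, 0.5, 0.4]
a = auroc(member, nonmember)
```

An AUROC near 0.5 means the probe cannot distinguish seen from unseen items; the reported rise from 49–53% to 56–59% under obfuscation indicates that obfuscated inputs push models toward whatever surface patterns they memorized.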
6. Insights and Future Directions
Performance on Obfus FOL demonstrates that high LLM accuracy in FOL entailment is largely attributable to surface-form pattern recognition rather than the internal reconstruction of logical proofs. Key recommendations emerging from the analysis include:
- Data Augmentation: Utilize logically obfuscated variants during pretraining and fine-tuning to promote surface-invariance.
- Hybrid Reasoning Architectures: Integrate symbolic provers (e.g., Prover9) or proof-checking modules to enforce correctness over formula rewrites.
- Dynamic Adversarial Benchmarks: Continuously expand datasets with structure-preserving obfuscations to stress-test robustness.
- Model Design: Allocate model capacity for explicit tracking of quantifier scoping and connective nesting.
The performance gap under obfuscation—up to 42 percentage points for GPT-5—signals that current LLMs lack true deductive competence, especially for reasoning tasks that require invariance under nontrivial logical transformation. Building models capable of authentic first-order reasoning remains an open challenge, demanding advances in both data construction and model architecture (Borah et al., 1 Feb 2026).