Logifus: Logical Obfuscation Framework
- Logifus is a logical obfuscation framework that systematically rewrites reasoning tasks while preserving logical equivalence.
- It integrates both syntactic and semantic transformations—including FOL laws, graph rewrites, and pattern induction—to test LLM robustness.
- The framework underlies the LogiQAte benchmark, revealing significant performance drops in LLMs when facing logically equivalent reformulations.
Logifus is a structure-preserving logical obfuscation framework introduced to evaluate the robustness of LLMs against logically equivalent reformulations of standard reasoning tasks. Its core function is to systematically rewrite reasoning problems into diverse surface forms while maintaining logical equivalence, thereby interrogating whether LLMs genuinely understand underlying logic or merely rely on familiar patterns. The framework underpins LogiQAte, a diagnostic benchmark comprising 1,108 systematically obfuscated questions spanning four families of symbolic reasoning tasks, enabling controlled evaluation of semantic invariance in LLMs (Borah et al., 1 Feb 2026).
1. Formal Structure and Definition
Let Q denote a set of base reasoning questions, each q ∈ Q with a unique answer a(q). Logifus defines an obfuscation mapping
O : Q → Q′
such that for each q ∈ Q, its obfuscated image q′ = O(q) satisfies
a(q′) = a(q),
i.e., q′ is logically equivalent to q (q ≡ q′). Each question is internally represented as a formal object (e.g., FOL formula, sequence, graph); Logifus applies a sequence of equivalence-preserving rewrite rules
q′ = t_k ∘ ⋯ ∘ t_1(q),
where each t_i is drawn from a transformation library T guaranteeing t_i(x) ≡ x for every input x. Example FOL transformations include contraposition, De Morgan's laws, double negation, and distributivity/absorption. The obfuscation function ensures answer-space and task-format continuity across the base and obfuscated questions.
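The rewrite pipeline above can be sketched concretely. The following is a minimal illustration, not the authors' implementation: propositional formulas are nested tuples, two rules from the transformation library (contraposition and De Morgan) are applied, and equivalence t_i(x) ≡ x is verified by truth-table enumeration.

```python
from itertools import product

# Formulas as nested tuples:
# ("var", name), ("not", f), ("and", f, g), ("or", f, g), ("imp", f, g)

def evaluate(f, env):
    op = f[0]
    if op == "var":
        return env[f[1]]
    if op == "not":
        return not evaluate(f[1], env)
    if op == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if op == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    if op == "imp":
        return (not evaluate(f[1], env)) or evaluate(f[2], env)
    raise ValueError(op)

def variables(f):
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(g) for g in f[1:]))

def equivalent(f, g):
    """Check logical equivalence by exhaustive truth-table enumeration."""
    vs = sorted(variables(f) | variables(g))
    return all(
        evaluate(f, dict(zip(vs, bits))) == evaluate(g, dict(zip(vs, bits)))
        for bits in product([False, True], repeat=len(vs))
    )

# Two equivalence-preserving rewrite rules from the transformation library
def contrapose(f):   # P -> Q  becomes  ~Q -> ~P
    assert f[0] == "imp"
    return ("imp", ("not", f[2]), ("not", f[1]))

def de_morgan(f):    # ~(P & Q)  becomes  ~P | ~Q
    assert f[0] == "not" and f[1][0] == "and"
    return ("or", ("not", f[1][1]), ("not", f[1][2]))

base = ("imp", ("var", "P"), ("var", "Q"))
obfuscated = contrapose(base)         # ~Q -> ~P
assert equivalent(base, obfuscated)   # the rewrite preserves truth-conditions
```

Composing several such rules (as Logifus does with k nested rewrites) preserves equivalence because each individual step does.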
2. Mechanisms of Structure-Preserving Obfuscation
Logifus operates at both syntactic and semantic levels:
- Syntactic: Alters the natural language surface form (via rephrasing, lexical substitutions, or ordering) without modifying truth-conditions.
- Semantic: Applies formal equivalence transformations on the logical or graphical structure.
A canonical multi-step obfuscation in the kinship (blood relation) domain illustrates the workflow:
- Target an atomic relation (e.g., Brother(A, B)).
- Apply a semantic rewrite (e.g., express Brother(A, B) as a multi-step graph chain such as SisterInLaw(A, C) ∧ Husband(C, B)).
- Inject syntactic complexity (e.g., double negation or nested conditionals).
- Obfuscate other premises in parallel via FOL transformations (e.g., De Morgan's law, ¬(P ∧ Q) ≡ ¬P ∨ ¬Q).
The resulting obfuscated questions present complex surface forms yet preserve the original entailment relation.
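The kinship workflow above can be sketched as follows. This is an illustrative toy, with hypothetical names and a hand-built relation table rather than the framework's actual graph machinery: a direct one-hop relation is replaced by a multi-hop chain, and path-invariance guarantees both resolve to the same entity.

```python
# Hypothetical mini family graph: (person, relation) -> person
relations = {
    ("Anil", "brother"): "Ravi",
    ("Anil", "sister_in_law"): "Meera",  # Meera is Ravi's wife
    ("Meera", "husband"): "Ravi",
}

def resolve(start, path):
    """Follow a chain of relations from `start`, returning the endpoint."""
    person = start
    for rel in path:
        person = relations[(person, rel)]
    return person

direct = resolve("Anil", ["brother"])                       # base 1-hop premise
obfuscated = resolve("Anil", ["sister_in_law", "husband"])  # 2-hop rewrite
assert direct == obfuscated == "Ravi"  # path-invariant: same entailed entity
```

The surface form grows more complex, but the entailed answer is unchanged, which is exactly the invariant Logifus enforces.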
3. Theoretical Underpinnings
Logifus's structure-preserving transformations are anchored in four domains:
- First-Order Logic (FOL) Equivalences: All rewrite rules respect logical equivalence (t(φ) ≡ φ), leveraging established completeness and soundness results of FOL systems.
- Family-Graph Transformations: Kinship reasoning is modeled with undirected graphs. Obfuscation replaces direct edges (e.g., brother) with multi-hop, path-invariant alternatives (e.g., sister-in-law's husband), maintaining graph connectivity and relation invariants.
- Pattern-Induction Substitutions: Number-series tasks employ invertible symbolic encodings (e.g., mapping terms to planet names and then to ASCII sums or MD5 hashes) that conceal the underlying function while permitting answer verification.
- Navigation Reference-Frame Alterations: Spatial reasoning is obfuscated by introducing self-canceling detours (vector pairs summing to zero), preserving net displacement.
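The pattern-induction idea can be made concrete with a small sketch (an illustration of the verification property, not the benchmark's exact encoding): each term of a number series is replaced by its MD5 digest, so the generating rule is hidden from surface inspection, yet any candidate answer remains checkable by re-encoding it.

```python
import hashlib

def encode(n):
    """Symbolic encoding: MD5 digest of the term's decimal string."""
    return hashlib.md5(str(n).encode()).hexdigest()

series = [2, 4, 8, 16]  # hidden rule: each term doubles the previous one
obfuscated = [encode(n) for n in series]  # rule is concealed in digests

def verify(candidate, target_hash):
    """Answer verification: re-encode the candidate and compare digests."""
    return encode(candidate) == target_hash

next_term_hash = encode(32)        # the (hidden) correct continuation
assert verify(32, next_term_hash)  # right answer is verifiable
assert not verify(30, next_term_hash)
```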
No new formal theorems are introduced; instead, transformations rely on well-established logical and graph-theoretic identities. The framework informally states: each obfuscation step preserves logical entailment in both directions, thus guaranteeing equivalence.
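The navigation transformation's invariant (detour vectors summing to zero) is easy to demonstrate directly. The following is a minimal sketch, assuming a grid world with four cardinal moves, not the framework's actual obfuscation code:

```python
import random

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}

def net_displacement(steps):
    """Sum the walking vectors (direction, distance) into a net (x, y)."""
    x = sum(MOVES[d][0] * k for d, k in steps)
    y = sum(MOVES[d][1] * k for d, k in steps)
    return (x, y)

def inject_detour(steps, rng=random.Random(0)):
    """Insert a self-canceling detour: (d, k) followed by (opposite(d), k)."""
    d = rng.choice(list(MOVES))
    k = rng.randint(1, 5)
    i = rng.randrange(len(steps) + 1)
    return steps[:i] + [(d, k), (OPPOSITE[d], k)] + steps[i:]

base = [("north", 3), ("east", 4)]
obf = inject_detour(base)
# The detour pair sums to the zero vector, so net displacement is preserved:
assert net_displacement(obf) == net_displacement(base) == (4, 3)
```

Because each injected pair contributes the zero vector, any number of detours can be stacked without changing the answer, which lets the obfuscation depth be scaled while equivalence is guaranteed by construction.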
4. LogiQAte Benchmark: Obfuscated Reasoning Tasks
Logifus forms the foundation of LogiQAte, a rigorously constructed benchmark containing 1,108 questions divided among four diagnostic tasks. Each base task is paired with systematically obfuscated variants:
| Task | Principle | Obfuscation Approach |
|---|---|---|
| Obfus FOL (272 items) | FOL entailment | ≥4 nested FOL rewrite rules (contrapositive, De Morgan, etc.), validated by prover and annotators |
| Obfus Blood Relation (300) | Family-graph inference | Replace relations with 2-hop or 5-hop kinship chains; manual validation |
| Obfus Number Series (284) | Pattern induction | Rewrite sequences via symbolic substitutions: planet names, ASCII sums, MD5 hashes |
| Obfus Direction Sense (252) | Navigation | Insert self-canceling spatial detours in walking-vector problems |
All base-obfuscation pairs underwent double annotation, with inter-annotator agreement measured by Cohen's κ. Each obfuscation maintains logical equivalence and answer verifiability.
5. Evaluation Methodology
Nine LLMs were evaluated under zero-shot, few-shot, and chain-of-thought (CoT) prompting. Models included GPT-4o, Claude 3.7 Sonnet, GPT-5, o4-mini, Gemini 2.5 Pro, Qwen QwQ-32B, GPT-4o-mini, Gemma 3-27B-IT, and Llama-4-Maverick-17B. Prompts used were:
- Zero-shot: “Answer the following….”
- Few-shot: Three exemplars plus the current question.
- CoT: “Let’s think step by step.”
The key metric was exact-match accuracy, EM = (1/|Q|) Σ_{q∈Q} 1[norm(ŷ_q) = norm(a(q))], where the normalization norm(·) erases undesirable formatting artifacts.
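A normalized exact-match scorer along these lines can be sketched as follows; the specific normalization rules here (casing, punctuation, whitespace collapsing) are illustrative assumptions, since the paper only states that formatting artifacts are erased:

```python
import re

def normalize(ans):
    """Erase formatting artifacts: collapse whitespace/punctuation, lowercase."""
    return re.sub(r"[\s\.\,\:]+", " ", str(ans)).strip().lower()

def exact_match(predictions, golds):
    """Fraction of predictions that match the gold answer after normalization."""
    assert len(predictions) == len(golds)
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)

preds = ["The answer is: 42.", "yes", "B"]
golds = ["the answer is 42", "Yes", "C"]
assert abs(exact_match(preds, golds) - 2 / 3) < 1e-9
```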
The following results were observed (abbreviated for clarity):
| Task | Model | Base EM | Obf. EM | Δ% |
|---|---|---|---|---|
| Obfus FOL | GPT-5 | 0.98 | 0.56 | –42% |
| Obfus FOL | o4-mini | 0.92 | 0.68 | –26% |
| Obfus Blood Relation | GPT-5 | 0.52 | 0.45 | –13% |
| Obfus Blood Relation | o4-mini | 0.69 | 0.59 | –14% |
| Obfus Number Series | GPT-5 | 0.77 | 0.60 | –17% |
| Obfus Number Series | o4-mini | 0.60 | 0.57 | –5% |
| Obfus Direction Sense | GPT-5 | 0.64 | 0.44 | –20% |
| Obfus Direction Sense | o4-mini | 0.62 | 0.45 | –17% |
6. Empirical Analysis and Insights
Logifus-based obfuscation produces a pronounced reduction in LLM accuracy across all tasks and prompting strategies:
- Reasoning-focused models (GPT-5, o4-mini, Qwen) experience average performance drops of 25–30%.
- General-purpose LLMs (GPT-4o, Claude) incur performance reductions approaching 45–50%.
- The worst-case drop is noted for Obfus FOL (GPT-5, from 98% to 56%), while general-purpose models under Obfus Number Series drop by up to 87% (GPT-4o) and 79% (Claude).
- Even chain-of-thought prompting offers only partial robustness (e.g., GPT-5 CoT Δ = –18.7% versus –27.0% zero-shot).
Mechanistic probing reveals increased reliance on memorized patterns: memorization AUROC spikes under deep obfuscation (e.g., Obfus Blood Relation L2 reaches 82% versus 55% in base for LLaMA 3.1 8B). Late transformer layer next-token log-probabilities decrease by 50–80% for obfuscated inputs, implying disrupted representation consolidation. Despite high canonical-form accuracy, leading LLMs exhibit marked sensitivity to logically equivalent reformulations, indicating limited semantic invariance.
7. Limitations and Prospective Directions
Current limitations identified include:
- Monolinguality: LogiQAte is restricted to English; the extension to low-resource and structurally varied languages remains unaddressed.
- Limited Reasoning Diversity: Beyond FOL, kinship, number sequence, and navigation, broader categories (e.g., seating, commonsense) are not covered.
- Static Obfuscation Pipelines: Obfuscation processes are not tailored to individual model weaknesses. Incorporation of model-in-the-loop obfuscation could yield more informative robustness diagnostics.
- Training Regimes: The effect of pre-training/fine-tuning on obfuscated examples or leveraging hybrid neuro-symbolic architectures is unexplored.
A plausible implication is that robustness-oriented dataset construction, such as via Logifus, will be essential for future LLM evaluation and the development of models with deeper, representation-invariant reasoning abilities (Borah et al., 1 Feb 2026).