Logifus: Logical Obfuscation Framework
- Logifus is a logical obfuscation framework that systematically rewrites reasoning tasks while preserving logical equivalence.
- It integrates both syntactic and semantic transformations—including FOL laws, graph rewrites, and pattern induction—to test LLM robustness.
- The framework underlies the LogiQAte benchmark, revealing significant performance drops in LLMs when facing logically equivalent reformulations.
Logifus is a structure-preserving logical obfuscation framework introduced to evaluate the robustness of LLMs against logically equivalent reformulations of standard reasoning tasks. Its core function is to systematically rewrite reasoning problems into diverse surface forms while maintaining logical equivalence, thereby interrogating whether LLMs genuinely understand underlying logic or merely rely on familiar patterns. The framework underpins LogiQAte, a diagnostic benchmark comprising 1,108 systematically obfuscated questions spanning four families of symbolic reasoning tasks, enabling controlled evaluation of semantic invariance in LLMs (Borah et al., 1 Feb 2026).
1. Formal Structure and Definition
Let Q denote a set of base reasoning questions, each q ∈ Q with a unique answer a(q). Logifus defines an obfuscation mapping
O : Q → Q′
such that for each q ∈ Q, its obfuscated image q′ = O(q) satisfies
a(q′) = a(q),
i.e., q′ is logically equivalent to q (q ≡ q′). Each question is internally represented as a formal object (e.g., FOL formula, sequence, graph); Logifus applies a sequence of equivalence-preserving rewrite rules
q′ = t_k ∘ ⋯ ∘ t_1(q),
where each t_i is drawn from a transformation library T guaranteeing t_i(x) ≡ x for every input x. Example FOL transformations include contraposition, De Morgan's laws, double negation, and distributivity/absorption. The obfuscation function ensures answer-space and task-format continuity across the base and obfuscated questions.
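The rewrite pipeline above can be sketched concretely. The following is a minimal illustration, not the authors' implementation: propositional formulas are nested tuples, two rules from the transformation library (contraposition and De Morgan) are applied, and equivalence t_i(x) ≡ x is verified by truth-table enumeration.

```python
from itertools import product

# Formulas as nested tuples:
# ("var", name), ("not", f), ("and", f, g), ("or", f, g), ("imp", f, g)

def evaluate(f, env):
    op = f[0]
    if op == "var":
        return env[f[1]]
    if op == "not":
        return not evaluate(f[1], env)
    if op == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if op == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    if op == "imp":
        return (not evaluate(f[1], env)) or evaluate(f[2], env)
    raise ValueError(op)

def variables(f):
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(g) for g in f[1:]))

def equivalent(f, g):
    """Check logical equivalence by exhaustive truth-table enumeration."""
    vs = sorted(variables(f) | variables(g))
    return all(
        evaluate(f, dict(zip(vs, bits))) == evaluate(g, dict(zip(vs, bits)))
        for bits in product([False, True], repeat=len(vs))
    )

# Two equivalence-preserving rewrite rules from the transformation library
def contrapose(f):   # P -> Q  becomes  ~Q -> ~P
    assert f[0] == "imp"
    return ("imp", ("not", f[2]), ("not", f[1]))

def de_morgan(f):    # ~(P & Q)  becomes  ~P | ~Q
    assert f[0] == "not" and f[1][0] == "and"
    return ("or", ("not", f[1][1]), ("not", f[1][2]))

base = ("imp", ("var", "P"), ("var", "Q"))
obfuscated = contrapose(base)         # ~Q -> ~P
assert equivalent(base, obfuscated)   # the rewrite preserves truth-conditions
```

Composing several such rules (as Logifus does with k nested rewrites) preserves equivalence because each individual step does.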
2. Mechanisms of Structure-Preserving Obfuscation
Logifus operates at both syntactic and semantic levels:
- Syntactic: Alters the natural language surface form (via rephrasing, lexical substitutions, or ordering) without modifying truth-conditions.
- Semantic: Applies formal equivalence transformations on the logical or graphical structure.
A canonical multi-step obfuscation in the kinship (blood relation) domain illustrates the workflow:
- Target an atomic relation (e.g., Brother(A, B)).
- Apply a semantic rewrite (e.g., express Brother(A, B) as a multi-step graph chain such as SisterInLaw(A, C) ∧ Husband(C, B)).
- Inject syntactic complexity (e.g., double negation or nested conditionals).
- Obfuscate other premises in parallel via FOL transformations (e.g., De Morgan's law, ¬(P ∧ Q) ≡ ¬P ∨ ¬Q).
The resulting obfuscated questions present complex surface forms yet preserve the original entailment relation.
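The kinship workflow above can be sketched as follows. This is an illustrative toy, with hypothetical names and a hand-built relation table rather than the framework's actual graph machinery: a direct one-hop relation is replaced by a multi-hop chain, and path-invariance guarantees both resolve to the same entity.

```python
# Hypothetical mini family graph: (person, relation) -> person
relations = {
    ("Anil", "brother"): "Ravi",
    ("Anil", "sister_in_law"): "Meera",  # Meera is Ravi's wife
    ("Meera", "husband"): "Ravi",
}

def resolve(start, path):
    """Follow a chain of relations from `start`, returning the endpoint."""
    person = start
    for rel in path:
        person = relations[(person, rel)]
    return person

direct = resolve("Anil", ["brother"])                       # base 1-hop premise
obfuscated = resolve("Anil", ["sister_in_law", "husband"])  # 2-hop rewrite
assert direct == obfuscated == "Ravi"  # path-invariant: same entailed entity
```

The surface form grows more complex, but the entailed answer is unchanged, which is exactly the invariant Logifus enforces.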
3. Theoretical Underpinnings
Logifus's structure-preserving transformations are anchored in four domains:
- First-Order Logic (FOL) Equivalences: All rewrite rules respect logical equivalence (t(φ) ≡ φ), leveraging established completeness and soundness results of FOL systems.
- Family-Graph Transformations: Kinship reasoning is modeled with undirected graphs. Obfuscation replaces direct edges (e.g., brother) with multi-hop, path-invariant alternatives (e.g., sister-in-law's husband), maintaining graph connectivity and relation invariants.
- Pattern-Induction Substitutions: Number-series tasks employ invertible symbolic encodings (e.g., mapping terms to planet names and then to ASCII sums or MD5 hashes) that conceal the underlying function while permitting answer verification.
- Navigation Reference-Frame Alterations: Spatial reasoning is obfuscated by introducing self-canceling detours (vector pairs summing to zero), preserving net displacement.
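The pattern-induction idea can be made concrete with a small sketch (an illustration of the verification property, not the benchmark's exact encoding): each term of a number series is replaced by its MD5 digest, so the generating rule is hidden from surface inspection, yet any candidate answer remains checkable by re-encoding it.

```python
import hashlib

def encode(n):
    """Symbolic encoding: MD5 digest of the term's decimal string."""
    return hashlib.md5(str(n).encode()).hexdigest()

series = [2, 4, 8, 16]  # hidden rule: each term doubles the previous one
obfuscated = [encode(n) for n in series]  # rule is concealed in digests

def verify(candidate, target_hash):
    """Answer verification: re-encode the candidate and compare digests."""
    return encode(candidate) == target_hash

next_term_hash = encode(32)        # the (hidden) correct continuation
assert verify(32, next_term_hash)  # right answer is verifiable
assert not verify(30, next_term_hash)
```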
No new formal theorems are introduced; instead, transformations rely on well-established logical and graph-theoretic identities. The framework informally states: each obfuscation step preserves logical entailment in both directions, thus guaranteeing equivalence.
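The navigation transformation's invariant (detour vectors summing to zero) is easy to demonstrate directly. The following is a minimal sketch, assuming a grid world with four cardinal moves, not the framework's actual obfuscation code:

```python
import random

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}

def net_displacement(steps):
    """Sum the walking vectors (direction, distance) into a net (x, y)."""
    x = sum(MOVES[d][0] * k for d, k in steps)
    y = sum(MOVES[d][1] * k for d, k in steps)
    return (x, y)

def inject_detour(steps, rng=random.Random(0)):
    """Insert a self-canceling detour: (d, k) followed by (opposite(d), k)."""
    d = rng.choice(list(MOVES))
    k = rng.randint(1, 5)
    i = rng.randrange(len(steps) + 1)
    return steps[:i] + [(d, k), (OPPOSITE[d], k)] + steps[i:]

base = [("north", 3), ("east", 4)]
obf = inject_detour(base)
# The detour pair sums to the zero vector, so net displacement is preserved:
assert net_displacement(obf) == net_displacement(base) == (4, 3)
```

Because each injected pair contributes the zero vector, any number of detours can be stacked without changing the answer, which lets the obfuscation depth be scaled while equivalence is guaranteed by construction.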
4. LogiQAte Benchmark: Obfuscated Reasoning Tasks
Logifus forms the foundation of LogiQAte, a rigorously constructed benchmark containing 1,108 questions divided among four diagnostic tasks. Each base task is paired with systematically obfuscated variants:
| Task | Principle | Obfuscation Approach |
|---|---|---|
| Obfus FOL (272 items) | FOL entailment | ≥4 nested FOL rewrite rules (contrapositive, De Morgan, etc.), validated by prover and annotators |
| Obfus Blood Relation (300) | Family-graph inference | Replace relations with 2-hop or 5-hop kinship chains; manual validation |
| Obfus Number Series (284) | Pattern induction | Rewrite sequences via symbolic substitutions: planet names, ASCII sums, MD5 hashes |
| Obfus Direction Sense (252) | Navigation | Insert self-canceling spatial detours in walking-vector problems |
All base-obfuscation pairs underwent double annotation, with inter-annotator agreement measured by Cohen's κ. Each obfuscation maintains logical equivalence and answer verifiability.
5. Evaluation Methodology
Nine LLMs were evaluated under zero-shot, few-shot, and chain-of-thought (CoT) prompting. Models included GPT-4o, Claude 3.7 Sonnet, GPT-5, o4-mini, Gemini 2.5 Pro, Qwen QwQ-32B, GPT-4o-mini, Gemma 3-27B-IT, and Llama-4-Maverick-17B. Prompts used were:
- Zero-shot: “Answer the following….”
- Few-shot: Three exemplars plus the current question.
- CoT: “Let’s think step by step.”
The key metric was exact-match accuracy, EM = (1/|Q|) Σ_{q∈Q} 1[norm(ŷ_q) = norm(a(q))], where the normalization norm(·) erases undesirable formatting artifacts.
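A normalized exact-match scorer along these lines can be sketched as follows; the specific normalization rules here (casing, punctuation, whitespace collapsing) are illustrative assumptions, since the paper only states that formatting artifacts are erased:

```python
import re

def normalize(ans):
    """Erase formatting artifacts: collapse whitespace/punctuation, lowercase."""
    return re.sub(r"[\s\.\,\:]+", " ", str(ans)).strip().lower()

def exact_match(predictions, golds):
    """Fraction of predictions that match the gold answer after normalization."""
    assert len(predictions) == len(golds)
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)

preds = ["The answer is: 42.", "yes", "B"]
golds = ["the answer is 42", "Yes", "C"]
assert abs(exact_match(preds, golds) - 2 / 3) < 1e-9
```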
The following results were observed (abbreviated for clarity):
| Task | Model | Base EM | Obf. EM | Δ% |
|---|---|---|---|---|
| Obfus FOL | GPT-5 | 0.98 | 0.56 | –42% |
| Obfus FOL | o4-mini | 0.92 | 0.68 | –26% |
| Obfus Blood Relation | GPT-5 | 0.52 | 0.45 | –13% |
| Obfus Blood Relation | o4-mini | 0.69 | 0.59 | –14% |
| Obfus Number Series | GPT-5 | 0.77 | 0.60 | –17% |
| Obfus Number Series | o4-mini | 0.60 | 0.57 | –5% |
| Obfus Direction Sense | GPT-5 | 0.64 | 0.44 | –20% |
| Obfus Direction Sense | o4-mini | 0.62 | 0.45 | –17% |
6. Empirical Analysis and Insights
Logifus-based obfuscation produces a pronounced reduction in LLM accuracy across all tasks and prompting strategies:
- Reasoning-focused models (GPT-5, o4-mini, Qwen) experience average performance drops of 25–30%.
- General-purpose LLMs (GPT-4o, Claude) incur performance reductions approaching 45–50%.
- The worst-case drop is noted for Obfus FOL (GPT-5, from 98% to 56%), while general-purpose models under Obfus Number Series drop by up to 87% (GPT-4o) and 79% (Claude).
- Even chain-of-thought prompting offers only partial robustness (e.g., GPT-5 CoT Δ = –18.7% versus –27.0% zero-shot).
Mechanistic probing reveals increased reliance on memorized patterns: memorization AUROC spikes under deep obfuscation (e.g., Obfus Blood Relation L2 reaches 82% versus 55% in base for LLaMA 3.1 8B). Late transformer layer next-token log-probabilities decrease by 50–80% for obfuscated inputs, implying disrupted representation consolidation. Despite high canonical-form accuracy, leading LLMs exhibit marked sensitivity to logically equivalent reformulations, indicating limited semantic invariance.
7. Limitations and Prospective Directions
Current limitations identified include:
- Monolinguality: LogiQAte is restricted to English; the extension to low-resource and structurally varied languages remains unaddressed.
- Limited Reasoning Diversity: Beyond FOL, kinship, number sequence, and navigation, broader categories (e.g., seating, commonsense) are not covered.
- Static Obfuscation Pipelines: Obfuscation processes are not tailored to individual model weaknesses. Incorporation of model-in-the-loop obfuscation could yield more informative robustness diagnostics.
- Training Regimes: The effect of pre-training/fine-tuning on obfuscated examples or leveraging hybrid neuro-symbolic architectures is unexplored.
A plausible implication is that robustness-oriented dataset construction, such as via Logifus, will be essential for future LLM evaluation and the development of models with deeper, representation-invariant reasoning abilities (Borah et al., 1 Feb 2026).