
LogicGraph Perturbation Protocol

Updated 14 January 2026
  • LogicGraph Perturbation Protocol is a structured framework that formalizes reasoning chains as graphs to inject controlled, plausible textual hallucinations.
  • It leverages typed graph representations and probability-weighted perturbation operators to quantify error propagation and self-correction in multimodal inference.
  • Empirical evaluations, including the use of Active Visual-Context Refinement, demonstrate reduced hallucination persistence and improved model accuracy.

The LogicGraph Perturbation Protocol (LPP) is a systematic framework for injecting high-plausibility textual hallucinations into the chain-of-thought reasoning of large multimodal models (LMMs), enabling quantitative analysis of their capacity for self-correction under cross-modal conflicts. Leveraging a structured, typed graph representation—termed "LogicGraph"—LPP formalizes reasoning chains at the granularity of entities, relations, and attributes, and applies precise, probability-weighted perturbations to probe the robustness and flexibility of multimodal inference. This approach establishes new benchmarks for consistency analysis in multimodal video reasoning, focusing on the phenomenon of “textual inertia,” wherein models persist in erroneous textual trajectories even when discordant with visual evidence (Zhu et al., 7 Jan 2026).

1. LogicGraph: Structured Representation of Reasoning Chains

LPP operationalizes reasoning as a directed, typed graph $G = (V, E, \lambda_V, \lambda_E, f_{\text{text}}, f_{\text{vis}})$, where:

  • $V = \{v_i^e, v_i^r, v_i^a : i = 1 \ldots n\}$ denotes nodes corresponding to entities, relations, and attributes in each reasoning step $s_i$.
  • $E \subset V \times V$ is the edge set, capturing intra-step (entity-attribute/relation) and inter-step (sequence-preserving) dependencies.
  • $\lambda_V: V \rightarrow \{\text{Entity}, \text{Relation}, \text{Attribute}\}$ types each node.
  • $\lambda_E: E \rightarrow \Sigma_E$ labels edges by semantic role, e.g., "has-attribute", "precedes".
  • $f_{\text{text}}: V \cup E \rightarrow \mathbb{R}^{d_{\text{text}}}$ and $f_{\text{vis}}: V \cup E \rightarrow \mathbb{R}^{d_{\text{vis}}}$ map nodes and edges to their respective textual and visual embedding spaces (e.g., BERT and pooled frame features).

Parsing with GPT-4o isolates logical atoms per step, facilitating fine-grained manipulation and individual annotation. This explicit segregation amplifies the precision of subsequent perturbations and allows longitudinal tracing of reasoning inertia across steps.
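The typed graph above can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the node/edge field names and ID scheme are inventions for this sketch, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str    # e.g. "s1:e0" = first entity of step 1 (assumed ID scheme)
    step: int       # reasoning step index i
    node_type: str  # lambda_V: "Entity" | "Relation" | "Attribute"
    label: str      # surface text of the logical atom
    f_text: list = field(default_factory=list)  # textual embedding f_text
    f_vis: list = field(default_factory=list)   # visual embedding f_vis

@dataclass
class Edge:
    src: str
    dst: str
    role: str       # lambda_E label, e.g. "has-attribute", "precedes"

@dataclass
class LogicGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)

    def add_node(self, node: Node):
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, dst: str, role: str):
        self.edges.append(Edge(src, dst, role))

# Toy two-step chain: an entity with an attribute, followed by a second entity.
g = LogicGraph()
g.add_node(Node("s1:e0", 1, "Entity", "man"))
g.add_node(Node("s1:a0", 1, "Attribute", "standing"))
g.add_node(Node("s2:e0", 2, "Entity", "cup"))
g.add_edge("s1:e0", "s1:a0", "has-attribute")  # intra-step dependency
g.add_edge("s1:e0", "s2:e0", "precedes")       # inter-step dependency
```

Keeping nodes keyed by a step-scoped ID makes it straightforward to locate the atom $v_i^g$ that a later perturbation targets.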

2. Perturbation Operators: Mathematical Framework

Controlled hallucinations are injected by replacing a logical atom $g \in \{\text{Entity}, \text{Relation}, \text{Attribute}\}$ at step $s_i$:

  • Candidate Generation: GPT-4o produces a candidate set $C = \{c_1, \ldots, c_m\}$, each visually incorrect yet linguistically plausible.
  • Probability-Weighted Selection: For multimodal model $P_M$, scoring is performed:
    • $P_{\text{token}}(c)$: average log-probability of tokens for $c$ given history $H$.
    • $P_{\text{sentence}}(c)$: average log-probability of the sentence with $g \rightarrow c$ given $H$.
    • The selected perturbation is $c^* = \arg\max_{c \in C} \frac{1}{2}\left[P_{\text{token}}(c) + P_{\text{sentence}}(c)\right]$.
  • The textual-perturbation operator is:

$$P_{\text{text}}(G; i, g): \quad \begin{cases} v_i^g.\text{label} \leftarrow c^* \\ \text{update } \lambda_V,\, f_{\text{text}},\, f_{\text{vis}} \end{cases}$$

  • Selection can be probabilistic through

$$P_{\text{inject}}(c \mid g, H) = \frac{\exp\left(\alpha \left[P_{\text{token}}(c) + P_{\text{sentence}}(c)\right]\right)}{\sum_{c' \in C} \exp\left(\alpha \left[P_{\text{token}}(c') + P_{\text{sentence}}(c')\right]\right)}$$

where $\alpha \to \infty$ enforces deterministic selection.

Visual-only perturbation, while not central in the reference work, can be realized by swapping $f_{\text{vis}}(v_i^e)$ with mismatched frame-derived features.
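The selection rule above can be sketched in a few lines. This is a hedged illustration: the score dictionaries stand in for the log-probabilities a real multimodal model $P_M$ would supply, and the toy candidates are invented for the example. Since the $\frac{1}{2}$ factor in $c^*$ is monotone, the argmax over $P_{\text{token}}(c) + P_{\text{sentence}}(c)$ gives the same winner.

```python
import math
import random

def select_perturbation(candidates, p_token, p_sentence, alpha=None, rng=None):
    """Pick a replacement atom c for g.

    alpha=None models the alpha -> infinity limit (deterministic argmax c*);
    a finite alpha samples from the softmax P_inject(c | g, H).
    """
    scores = {c: p_token[c] + p_sentence[c] for c in candidates}
    if alpha is None:
        return max(candidates, key=lambda c: scores[c])  # deterministic c*
    z = sum(math.exp(alpha * scores[c]) for c in candidates)
    weights = [math.exp(alpha * scores[c]) / z for c in candidates]
    return (rng or random).choices(candidates, weights=weights, k=1)[0]

# Toy log-probability scores (assumed, not from a real model).
cands = ["red cup", "blue phone", "green book"]
pt = {"red cup": -1.2, "blue phone": -0.4, "green book": -2.0}
ps = {"red cup": -0.9, "blue phone": -0.6, "green book": -1.5}
c_star = select_perturbation(cands, pt, ps)  # argmax of summed scores
```

With a finite `alpha`, lower temperatures ($\alpha$ small) spread probability mass across candidates, which is useful when probing whether models resist a range of plausible distractors rather than a single adversarial one.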

3. Pipeline: Algorithmic Overview

The LPP evaluation pipeline is formalized as follows:

  1. Graph Construction: Chain-of-thought (CoT) reasoning text $R$ is filtered and segmented; each step is parsed, producing entity, relation, and attribute nodes, with intra- and inter-step edges added.
  2. Perturbation Selection: For various steps and logical atom types, nodes are identified for perturbation. GPT-4o supplies $m \approx 5$ candidates, against which $P_{\text{token}}$ and $P_{\text{sentence}}$ are computed and $c^*$ selected for injection.
  3. Evaluation: Perturbed graphs are serialized back to text. For each, $k = 3$ sampled continuations are drawn from $M(\tilde{H}, V_{\text{raw}})$. Each continuation yields a final answer $\hat{y}$ and is classified behaviorally: contamination (0), passive reflection (1), explicit reflection/self-correction (2), collapse (3).
  4. Aggregation: Majority votes and metric computation complete the analysis cycle.
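The four stages can be summarized as a single loop. The function is a schematic sketch: `parse`, `perturb`, `continue_fn`, and `classify` are placeholder callables standing in for the graph construction, injection, model-sampling, and behavior-scoring stages described above, not a real API.

```python
from collections import Counter

def run_lpp(samples, parse, perturb, continue_fn, classify, k=3):
    """One pass of the LPP pipeline over a list of QA samples."""
    results = []
    for s in samples:
        g = parse(s)                 # 1. build LogicGraph from CoT text
        g_pert = perturb(g)          # 2. inject the selected hallucination c*
        labels = [classify(continue_fn(g_pert)) for _ in range(k)]  # 3. k continuations
        results.append(Counter(labels).most_common(1)[0][0])        # 4. majority vote
    return results

# Trivial stand-ins: every continuation is scored as explicit reflection (2).
out = run_lpp(
    ["sample question"],
    parse=lambda s: {"cot": s},
    perturb=lambda g: g,
    continue_fn=lambda g: "continuation text",
    classify=lambda c: 2,
)
```

Separating the stages as callables mirrors the protocol's modularity: the same aggregation logic applies whether the perturbation targets entities, relations, or attributes.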

4. Quantitative Evaluation and Metrics

Outcomes are quantitatively analyzed using a suite of metrics:

  • Accuracy: $Acc = \frac{1}{N}\sum_{i=1}^N \mathbb{1}(\hat{y}_i = y_i)$, task correctness post-perturbation.
  • Behavior Rates: $R_k = \frac{1}{N}\sum_{i=1}^N \mathbb{1}(b_i = k)$, where $k$ indexes contamination, passive reflection, explicit self-correction, and collapse.
  • Self-Correction Rate: $R_2 = \frac{\#\,\text{Explicit Reflection}}{N}$.
  • Error Propagation: $R_0 = \frac{\#\,\text{Contextual Contamination}}{N}$.
  • Hallucination Amplification:

$$Amp = \frac{1}{N \cdot k} \sum_{i=1}^N \sum_{j=1}^k \frac{H_{i,j}}{|\text{continuation}_{i,j}|}$$

quantifies persistence of injected hallucinations.
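The metric suite above reduces to simple counting. A minimal sketch, assuming each record carries the predicted answer, the gold answer, and a behavior label $b \in \{0,1,2,3\}$ (the record layout is an assumption for illustration):

```python
def lpp_metrics(records):
    """Acc and behavior rates R_0..R_3 over N classified continuations."""
    n = len(records)
    acc = sum(r["pred"] == r["gold"] for r in records) / n
    rates = {k: sum(r["behavior"] == k for r in records) / n for k in range(4)}
    return {"Acc": acc, **{f"R{k}": v for k, v in rates.items()}}

def amplification(halluc_counts, lengths):
    """Amp: mean fraction of hallucinated tokens H_ij / |continuation_ij|."""
    return sum(h / l for h, l in zip(halluc_counts, lengths)) / len(lengths)

# Four toy records: 3 correct answers; behaviors 0, 0, 1, 2.
recs = [
    {"pred": "A", "gold": "A", "behavior": 2},
    {"pred": "B", "gold": "A", "behavior": 0},
    {"pred": "A", "gold": "A", "behavior": 0},
    {"pred": "B", "gold": "B", "behavior": 1},
]
m = lpp_metrics(recs)
```

Note that $R_0 + R_1 + R_2 + R_3 = 1$ by construction, so reporting any three rates determines the fourth.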

5. Experimental Parameters and Protocols

LPP experimentation employs the STAR dataset re-formulated for open-ended QA, with a curated subset of 100 samples (50 feasibility, 50 prediction) and frame rate set at 5 fps. Models evaluated include native reasoning (Keye‐preview‐8B, Keye‐1.5‐8B, LongVILA‐7B) and prompt-driven (InternVL3‐8B, Qwen2.5‐VL‐7B) architectures. Generation uses pass@3 sampling, temperature 0.7, and maximal CoT extension of 256 tokens. Perturbations target the first three reasoning steps and all atom types, with candidate generation and behavior scoring strictly following protocol specifications.
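For reference, the experimental settings above can be collected in one place. The key names in this configuration dictionary are assumptions for illustration; only the values come from the protocol description.

```python
# Settings reported in the LPP experimental protocol (key names assumed).
LPP_CONFIG = {
    "dataset": "STAR (open-ended QA reformulation)",
    "num_samples": 100,          # 50 feasibility + 50 prediction
    "fps": 5,                    # video frame rate
    "models_native": ["Keye-preview-8B", "Keye-1.5-8B", "LongVILA-7B"],
    "models_prompted": ["InternVL3-8B", "Qwen2.5-VL-7B"],
    "sampling": {"pass_at": 3, "temperature": 0.7, "max_cot_tokens": 256},
    "perturb_steps": [1, 2, 3],  # first three reasoning steps
    "atom_types": ["Entity", "Relation", "Attribute"],
}
```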

6. Analysis of Findings: Reflection, Propagation, and Mitigation

Empirical results show explicit reflection/self-correction rates ($R_2$) universally below 10% for baseline LMMs, with error propagation ($R_0$) exceeding 60% on entity perturbations in the initial reasoning step. Passive reflection ($R_1$) constitutes roughly 20–30%. Error propagation abates modestly when perturbations target later steps, yielding marginal improvements in both accuracy and reflection rates.

Ablation results indicate that decreasing the hallucination token count reduces contamination ($\Delta R_0 \approx -5\%$) and elevates passive reflection ($\Delta R_1 \approx +5\%$), with negligible effect on explicit correction. Notably, Active Visual-Context Refinement (AVCR)—a training-free scheme incorporating an uncertainty-driven frame check (<check>) and reasoning-history denoising (<fold>)—substantially increases explicit reflection rates (from 5% to 29% for Keye‐preview‐8B, from 1% to 31% for Qwen2.5‐VL‐7B), while reducing error propagation and boosting accuracy (all $p < 0.01$). Component ablation reveals that both <check> and <fold> are instrumental for optimal self-correction, with respective removal degrading $R_2$ to ~4% (frame check omitted) or ~22% (denoising omitted).

7. Significance and Prospects

LPP establishes a rigorous paradigm to diagnose and quantify “textual inertia” in LMM reasoning, enabling comparative assessment across architectures and prompting strategies. The consistently low rates of self-correction documented suggest robustness deficits in current LMMs when faced with plausibility-optimized, cross-modal reasoning perturbations. The efficacy of AVCR in stifling hallucination propagation and amplifying self-reflection underscores the impact of inference-time, visually-grounded verification. A plausible implication is that systematic graph-based interrogation and multimodal context strategies may be essential for developing future LMMs with resilient reasoning trajectories and reliable cross-modal alignment (Zhu et al., 7 Jan 2026).
