LogicGraph Perturbation Protocol
- LogicGraph Perturbation Protocol is a structured framework that formalizes reasoning chains as graphs to inject controlled, plausible textual hallucinations.
- It leverages typed graph representations and probability-weighted perturbation operators to quantify error propagation and self-correction in multimodal inference.
- Empirical evaluations, including the use of Active Visual-Context Refinement, demonstrate reduced hallucination persistence and improved model accuracy.
The LogicGraph Perturbation Protocol (LPP) is a systematic framework for injecting high-plausibility textual hallucinations into the chain-of-thought reasoning of large multimodal models (LMMs), enabling quantitative analysis of their capacity for self-correction under cross-modal conflicts. Leveraging a structured, typed graph representation—termed "LogicGraph"—LPP formalizes reasoning chains at the granularity of entities, relations, and attributes, and applies precise, probability-weighted perturbations to probe the robustness and flexibility of multimodal inference. This approach establishes new benchmarks for consistency analysis in multimodal video reasoning, focusing on the phenomenon of “textual inertia,” wherein models persist in erroneous textual trajectories even when discordant with visual evidence (Zhu et al., 7 Jan 2026).
1. LogicGraph: Structured Representation of Reasoning Chains
LPP operationalizes reasoning as a directed, typed graph $G = (V, E, \tau, \rho, \phi_t, \phi_v)$, where:
- $V$ denotes nodes corresponding to entities, relations, and attributes in each reasoning step $s_i$.
- $E \subseteq V \times V$ is the edge set, capturing intra-step (entity-attribute/relation) and inter-step (sequence-preserving) dependencies.
- $\tau: V \to \{\text{entity}, \text{relation}, \text{attribute}\}$ types each node.
- $\rho$ labels edges by semantic role, e.g., "has-attribute", "precedes".
- $\phi_t$ and $\phi_v$ map nodes and edges to their respective textual and visual embedding spaces (e.g., BERT and pooled frame features).
Parsing with GPT-4o isolates logical atoms per step, facilitating fine-grained manipulation and individual annotation. This explicit segregation amplifies the precision of subsequent perturbations and allows longitudinal tracing of reasoning inertia across steps.
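The typed-graph representation above can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation; the class and field names (`Node`, `LogicGraph`, `atoms_at`) are our own, and the toy two-step chain is invented for demonstration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    step: int   # reasoning-step index
    kind: str   # node type (tau): "entity" | "relation" | "attribute"
    text: str   # surface form of the logical atom

@dataclass
class LogicGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src, dst, semantic-role label)

    def add_node(self, step, kind, text):
        self.nodes.append(Node(step, kind, text))
        return len(self.nodes) - 1

    def add_edge(self, src, dst, label):
        # label plays the role of rho, e.g. "has-attribute", "precedes"
        self.edges.append((src, dst, label))

    def atoms_at(self, step):
        """Indices of logical atoms belonging to one reasoning step."""
        return [i for i, n in enumerate(self.nodes) if n.step == step]

# Toy two-step chain: step 1 "the person holds a cup", step 2 "the cup is red".
g = LogicGraph()
person = g.add_node(1, "entity", "person")
cup    = g.add_node(1, "entity", "cup")
holds  = g.add_node(1, "relation", "holds")
red    = g.add_node(2, "attribute", "red")
g.add_edge(person, holds, "arg")          # intra-step dependency
g.add_edge(holds, cup, "arg")             # intra-step dependency
g.add_edge(cup, red, "has-attribute")     # inter-step dependency
```

Isolating atoms per step in this way is what makes step- and type-targeted perturbation straightforward: a perturbation only needs a node index.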
2. Perturbation Operators: Mathematical Framework
Controlled hallucinations are injected by replacing a logical atom $a \in V$ at step $s_i$:
- Candidate Generation: GPT-4o produces a candidate set $C(a) = \{a'_1, \dots, a'_K\}$, each visually incorrect yet linguistically plausible.
- Probability-Weighted Selection: For multimodal model $M$, each candidate is scored by:
  - $s_{\text{loc}}(a')$: average log-probability of the tokens of $a'$ given the preceding history $h_{<i}$.
  - $s_{\text{ctx}}(a')$: average log-probability of the step sentence with $a'$ substituted, given $h_{<i}$.
- The selected perturbation is $a^* = \arg\max_{a' \in C(a)} \left[ s_{\text{loc}}(a') + s_{\text{ctx}}(a') \right]$.
- The textual-perturbation operator is $\mathcal{T}_{a \to a^*}(G)$, which substitutes $a^*$ for $a$ and re-serializes the affected step.
- Selection can be made probabilistic through a temperature-controlled softmax, $p(a') \propto \exp\!\left(s(a')/T\right)$, where $T \to 0$ enforces deterministic (argmax) selection.
Visual-only perturbation, while not central in the reference work, can be realized by swapping the visual embeddings $\phi_v(\cdot)$ with mismatched frame-derived features.
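The probability-weighted selection step can be illustrated as a temperature-controlled softmax over candidate plausibility scores. A minimal sketch, assuming `scores` already holds each candidate's combined average log-probability (local plus contextual); the function name and the argmax-at-zero-temperature convention follow the description above, not released code.

```python
import math
import random

def select_perturbation(scores, temperature=1.0):
    """Pick a hallucinated candidate via softmax over plausibility scores.

    scores: dict mapping candidate atom -> combined average log-probability.
    temperature -> 0 recovers deterministic argmax selection.
    """
    if temperature <= 1e-8:
        return max(scores, key=scores.get)
    # Numerically stable softmax: shift by the max score before exponentiating.
    m = max(scores.values())
    weights = {a: math.exp((s - m) / temperature) for a, s in scores.items()}
    z = sum(weights.values())
    r = random.random() * z
    acc = 0.0
    for cand, w in weights.items():
        acc += w
        if r <= acc:
            return cand
    return cand  # guard against floating-point underflow at the boundary
```

At `temperature=0` the most plausible (and hence hardest-to-detect) hallucination is always injected, matching deterministic selection; higher temperatures diversify the injected errors across runs.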
3. Pipeline: Algorithmic Overview
The LPP evaluation pipeline is formalized as follows:
- Graph Construction: Chain-of-thought (CoT) reasoning text is filtered and segmented; each step is parsed, producing entity, relation, and attribute nodes, with intra- and inter-step edges added.
- Perturbation Selection: For various steps and logical atom types, nodes are identified for perturbation. GPT-4o supplies candidates, against which the local and contextual plausibility scores are computed and the highest-scoring candidate is selected for injection.
- Evaluation: Perturbed graphs are serialized back to text. For each, sampled continuations are drawn from the evaluated model. Each continuation yields a final answer and is classified behaviorally: contamination (0), passive reflection (1), explicit reflection/self-correction (2), collapse (3).
- Aggregation: Majority votes and metric computation complete the analysis cycle.
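The sampling, classification, and majority-vote steps above can be sketched as a single evaluation loop. This is an illustrative harness, not the paper's code: `model_sample` and `classify_behavior` are stand-ins for the model call and the behavior scorer, which the protocol treats as external components.

```python
from collections import Counter

BEHAVIORS = {0: "contamination", 1: "passive_reflection",
             2: "explicit_reflection", 3: "collapse"}

def evaluate_sample(perturbed_text, model_sample, classify_behavior, n_samples=3):
    """Draw continuations from the model, classify each, aggregate by majority vote.

    model_sample(text) -> (continuation, final_answer)
    classify_behavior(continuation) -> behavior label in {0, 1, 2, 3}
    """
    labels, answers = [], []
    for _ in range(n_samples):
        continuation, answer = model_sample(perturbed_text)
        labels.append(classify_behavior(continuation))
        answers.append(answer)
    majority_label = Counter(labels).most_common(1)[0][0]
    majority_answer = Counter(answers).most_common(1)[0][0]
    return majority_label, majority_answer
```

With `n_samples=3` this mirrors the pass@3 sampling described in Section 5; the per-sample labels feed directly into the metrics of Section 4.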
4. Quantitative Evaluation and Metrics
Outcomes are quantitatively analyzed using a suite of metrics:
- Accuracy: $\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$, task correctness post-perturbation.
- Behavior Rates: $R_b = N_b / N$, where $b \in \{0, 1, 2, 3\}$ indexes contamination, passive reflection, explicit self-correction, and collapse.
- Self-Correction Rate: $\text{SCR} = R_2$.
- Error Propagation: $\text{EP} = R_0$.
- Hallucination Amplification: $\text{HA} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[a^* \in \hat{c}_i]$, the fraction of sampled continuations $\hat{c}_i$ that restate the injected atom $a^*$, quantifies persistence of injected hallucinations.
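Given the per-sample behavior labels produced by the evaluation stage, the rate-based metrics reduce to simple counting. A minimal sketch, assuming labels follow the 0-3 scheme defined in the pipeline (contamination, passive reflection, explicit self-correction, collapse); the dictionary key names are our own.

```python
def compute_metrics(labels):
    """Behavior rates and derived metrics over per-sample behavior labels.

    labels: list of ints in {0, 1, 2, 3}, one per evaluated sample.
    Self-correction rate is the explicit-reflection rate (class 2);
    error propagation is the contamination rate (class 0).
    """
    n = len(labels)
    rates = {b: labels.count(b) / n for b in range(4)}
    return {
        "behavior_rates": rates,
        "self_correction_rate": rates[2],
        "error_propagation": rates[0],
    }
```

Accuracy and hallucination amplification require the final answers and continuation texts respectively, so they are computed separately from this label-only pass.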
5. Experimental Parameters and Protocols
LPP experimentation employs the STAR dataset reformulated for open-ended QA, with a curated subset of 100 samples (50 feasibility, 50 prediction) and a frame rate of 5 fps. Models evaluated include native-reasoning (Keye-preview-8B, Keye-1.5-8B, LongVILA-7B) and prompt-driven (InternVL3-8B, Qwen2.5-VL-7B) architectures. Generation uses pass@3 sampling, temperature 0.7, and a maximal CoT extension of 256 tokens. Perturbations target the first three reasoning steps and all atom types, with candidate generation and behavior scoring strictly following the protocol specifications.
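The experimental parameters above can be collected into a single configuration object, a convenient form for reproduction scripts. The key names are our own; the values restate Section 5 verbatim.

```python
# Experimental configuration for LPP, as reported in Section 5.
LPP_CONFIG = {
    "dataset": "STAR (reformulated for open-ended QA)",
    "n_samples": 100,                  # 50 feasibility + 50 prediction
    "fps": 5,
    "models_native": ["Keye-preview-8B", "Keye-1.5-8B", "LongVILA-7B"],
    "models_prompted": ["InternVL3-8B", "Qwen2.5-VL-7B"],
    "sampling": {
        "pass_at": 3,                  # pass@3 sampling
        "temperature": 0.7,
        "max_cot_tokens": 256,         # maximal CoT extension
    },
    "perturbed_steps": [1, 2, 3],      # first three reasoning steps
    "atom_types": ["entity", "relation", "attribute"],
}
```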
6. Analysis of Findings: Reflection, Propagation, and Mitigation
Empirical results show explicit reflection/self-correction rates universally below 10% for baseline LMMs, with error propagation exceeding 60% on entity perturbations in the initial reasoning step. Passive reflection constitutes roughly 20–30%. Error propagation abates modestly when later steps are perturbed, yielding marginal improvements in both accuracy and reflection rates.
Ablation results indicate that decreasing the hallucination token count reduces contamination and elevates passive reflection, with negligible effect on explicit correction. Notably, Active Visual-Context Refinement (AVCR), a training-free scheme incorporating an uncertainty-driven frame check (<check>) and reasoning-history denoising (<fold>), substantially increases explicit reflection rates (from 5% to 29% for Keye-preview-8B and from 1% to 31% for Qwen2.5-VL-7B), while reducing error propagation and boosting accuracy. Component ablation reveals that both <check> and <fold> are instrumental for optimal self-correction: removing the visual check degrades explicit reflection to ~4%, and removing the denoising degrades it to ~22%.
7. Significance and Prospects
LPP establishes a rigorous paradigm to diagnose and quantify “textual inertia” in LMM reasoning, enabling comparative assessment across architectures and prompting strategies. The consistently low rates of self-correction documented suggest robustness deficits in current LMMs when faced with plausibility-optimized, cross-modal reasoning perturbations. The efficacy of AVCR in stifling hallucination propagation and amplifying self-reflection underscores the impact of inference-time, visually-grounded verification. A plausible implication is that systematic graph-based interrogation and multimodal context strategies may be essential for developing future LMMs with resilient reasoning trajectories and reliable cross-modal alignment (Zhu et al., 7 Jan 2026).