Context Adaptation Bugs (CtxBugs)
- Context Adaptation Bugs (CtxBugs) are semantic errors arising from outdated, partial, or manipulated context that cause systems to fail under adaptation scenarios.
- Empirical studies using frameworks like CtxBugGen and buggy-HumanEval reveal that even state-of-the-art models struggle, with pass@1 rates dropping by over 20% in the presence of such bugs.
- Mitigation strategies include model-level integrity checks, improved memory management, and type-theoretic approaches to enforce robust cross-context consistency.
Context Adaptation Bugs (CtxBugs) represent a principled class of semantic errors that occur when a computational system—such as an autonomous agent, code LLM, or context-oriented program—fails to correctly adapt its behavior or outputs in response to a change or perturbation in its contextual environment or stored state. Unlike isolated local faults, CtxBugs typically emerge from complex mismatches between persistent context, external requirements, and system memory, and they require nontrivial cross-contextual reasoning or verification to detect and resolve. This article examines CtxBugs across autonomous agent architectures, code adaptation workflows, programming models, and system security, and synthesizes both formal definitions and empirical findings from recent arXiv research.
1. Formal Definitions and Taxonomy
The precise definition of a Context Adaptation Bug (CtxBug) varies by computational paradigm but shares an invariant theme: semantic misalignment caused by partial, outdated, or manipulated context information.
In code adaptation (Zhang et al., 10 Jan 2026), let be a code entity correct in original context (passes tests ) but, when ported to target context , fails at least one test in , strictly due to adaptation-induced mismatches. Formally,
In autonomous agentic systems (Patlan et al., 18 Jun 2025), context is structured as , where is the prompt, is perceived data, is static knowledge, and is the memory or plan. Attackers can inject bounded perturbations to corrupt any component:
Critical forms include direct prompt injection , indirect prompt injection , memory injection , and specifically, plan injection , corrupting a high-level plan so that the agent's multi-step reasoning is hijacked.
In LLM code completion, CtxBugs (Dinh et al., 2023) are edits to the code prefix such that all previously valid completions become invalid, even though the prefix remains syntactically and semantically plausible in isolation.
In type systems for context-oriented programming, a CtxBug corresponds to a context-stack mismatch at a dispatch point (i.e., no matching layer), but statically ruled out in sound calculi such as ContextML (Degano et al., 2013).
2. Methodologies for Detection and Generation
Robust evaluation of systems with respect to CtxBugs requires principled benchmarks and error-injection frameworks.
The CtxBugGen framework (Zhang et al., 10 Jan 2026) operationalizes CtxBug construction via four steps:
- Adaptation Task Selection: Identify context-dependent adaptation tasks (interface, functionality, identifier, dependency).
- Task-specific Perturbation: Apply parametric rules (e.g., signature masking, operator substitution) to generate perturbed code templates.
- LLM-based Variant Generation: Solicit completions from LLMs based on these templates and requirements.
- CtxBug Identification: Use AST differencing, test execution, and manual plausibility checks to filter variants that fail in the target context due to adaptation mismatch, ensuring >90% validity.
In autonomous agent benchmarks (Patlan et al., 18 Jun 2025), controlled plan injections and context-chained subtasks are crafted to systematically evaluate architecturally distinct agents (e.g., Agent-E, Browser-use).
For code completion, two benchmarks are prominent (Dinh et al., 2023):
- buggy-HumanEval: Flipping operators in known-good solution lines to create minimally invasive, semantics-altering bugs.
- buggy-FixEval: Extracting buggy prefixes from real developer submissions just prior to a correct fix, ensuring the CtxBug reflects realistic developer errors.
3. Empirical Findings Across Domains
Code Adaptation and LLMs
Comprehensive evaluation across four state-of-the-art LLMs (GPT-4o, DeepSeek-V3, Qwen3-Coder-Plus, Kimi-K2) demonstrates that LLMs resolve at most 56% of CtxBugs on code adaptation tasks (Pass@1). For Functionality Customization CtxBugs, the resolution rate drops to approximately 27%. The presence of CtxBugs yields a mean 23% drop in Pass@1 compared to bug-free adaptation contexts, and 10% relative to isolated bug contexts (IsoBugs). LLMs tend to overlook CtxBugs (60% of failures are exact replications), display overconfidence in buggy code (average token probability ≈ 0.983), and struggle particularly with nonlocal, cross-context constraints (Zhang et al., 10 Jan 2026).
In standard code completion, typical open-source Code-LLMs show pass@1 drops exceeding 50 percentage points in the presence of a single context bug: e.g., CodeGen-2B-mono's pass@1 falls from 54.9% (clean) to 3.1% (buggy) on buggy-HumanEval (Dinh et al., 2023). Post-hoc repair strategies (localization, rewriting, or whole-prefix removal) recover only up to ~25% for synthetic bugs, and under 10% for realistic ones.
Autonomous Web Agents and Security
Plan injection attacks targeting client-side or third-party agent memory succeed at substantially higher rates than prompt-based attacks. On privacy exfiltration tasks, context-chained injections increase attack success rate (ASR) by 17.7% over task-aligned injections. With industry-standard prompt-injection defenses enabled, all prompt-based attacks are neutralized (ASR ≈ 0%), but plan injections persist, achieving 46% ASR on Agent-E and 63% on Browser-use (Patlan et al., 18 Jun 2025).
Context-Oriented Programming
In formal programming language settings, static type systems with context annotations (e.g., ContextML) rule out CtxBugs entirely by augmenting function types and dispatch mechanisms with precise layer effects and context invariants. Well-typed programs are guaranteed "dispatch never fails," preventing all runtime context adaptation errors (Degano et al., 2013).
4. Attack Vectors, Threat Models, and Error Taxonomy
CtxBugs are exposed by both benign workflows (code adaptation, completion) and adversarial manipulations (plan injection, memory overwrite) depending on deployment.
- Attack vectors in agentic systems: Attackers with access to unprotected client-side or extension-managed memory—without the ability to alter user prompts or code—can introduce persistent CtxBugs by mutating plan or history entries. Context-chained injections maximize both semantic similarity to the user's goal and attacker objectives, leveraging hierarchical plausibility checks to evade embedding-based or rule-based mitigation (Patlan et al., 18 Jun 2025).
- Failure modes in code adaptation: Error analysis reveals primary categories: overlooked (replicated bug), invalid (incorrect repair semantics), unexpected (fix introduces new bug), and miscellaneous. Representative errors include mishandling of operator semantics (e.g., misusing '+' for bitwise '|'), incorrect context-sensitive handling (e.g., dependency omission), and misunderstanding adaptation intent (Zhang et al., 10 Jan 2026).
- Benchmark construction: The CtxBugGen and buggy-HumanEval/FixEval benchmarks systematically inject and validate CtxBugs, supporting fine-grained evaluation of model robustness.
5. Evaluation Metrics and Benchmark Results
The assessment of CtxBugs leverages a blend of functional and structural measures:
| Metric | Description | Representative Value |
|---|---|---|
| Pass@k | Probability at least one of k completions is functionally correct | Pass@1 ≈ 55% (best model) |
| Resolution Rate | Exact textual/AST match repair at each bug location | RR ≈ 52% (best model) |
| Attack Success Rate (ASR) | Fraction of adversarial runs successfully achieving attacker objective | 46–63% post-defenses |
| Relative Drop | Degradation between bug-free and CtxBug contexts | ≈23% (code adaptation) |
Benchmarks confirm that CtxBugs are uniquely challenging: in code adaptation, LLMs exhibit much lower repair rate for CtxBugs versus IsoBugs (isolated bugs), evidencing the nonlocal reasoning required. In web agentic security, plan injection vectors bypass known prompt-injection mitigations, necessitating system-level countermeasures (Patlan et al., 18 Jun 2025, Zhang et al., 10 Jan 2026, Dinh et al., 2023).
6. Mitigation Strategies and Defenses
Effective CtxBug mitigation leverages defensive designs at multiple system layers:
- Model-level Integrity Checking: Incorporates embedding-based or fine-tuned modules to detect semantic inconsistencies between the user's original intent/prompt and any context retrieved from memory or plans. Plan edits are rejected if their cosine similarity with the user goal is below a threshold (Patlan et al., 18 Jun 2025).
- Principled Memory Management: Applies cryptographic signatures to plan entries, enforces append-only (write-once, read-many) strategies, and isolates planning state from untrusted client-side or third-party code.
- Type-Theoretic Safety: In the context of programming languages, effect-style type systems with context annotations prevent dispatch failures at compile time (Degano et al., 2013).
- Repair Pipelines: In code LLMs, pre- and post-completion repair (likelihood-based localization, infilling, and program-repair models) shows only moderate recovery, particularly on complex or realistic CtxBugs (Dinh et al., 2023).
- Benchmark-Aware Training: A plausible implication is that training code LMs jointly on adaptation and repair tasks, or augmenting code prompts with explicit context reflection, may improve future robustness (Zhang et al., 10 Jan 2026).
7. Open Problems and Future Directions
Key open challenges include:
- Cross-Context Reasoning: LLMs frequently fail to integrate information spanning multiple code regions, architectural contexts, or memory stores, motivating work on new prompting paradigms or inductive biases.
- Better Bug Localization: Current heuristics are suboptimal, especially for complex, nonlocal context breaks. Graph-based and hybrid static/dynamic analyses may enhance bug detection and repair selection.
- Security in Agentic Systems: Plan and context-chain injections will remain active threats as long as agent state is externally mutable and unverified. Defenses must integrate memory integrity and formal validity at the architectural level (Patlan et al., 18 Jun 2025).
- Benchmark Coverage: Existing evaluations are limited to selected perturbation types; richer and more representative CtxBug datasets are critical for driving progress.
- IDE and HCI Integration: Real-world deployment will require user-facing code tools that highlight suspicious context-adaptation sites in situ, possibly with interactive repair or warning mechanisms (Dinh et al., 2023).
Research consensus indicates that CtxBugs remain a critical weakness of today's code generation and autonomous agent systems, with best-in-class models resolving at most half of realistic adaptation bugs. Comprehensive progress will require advances in context reasoning, memory integrity, and robust evaluation methodologies (Patlan et al., 18 Jun 2025, Zhang et al., 10 Jan 2026, Dinh et al., 2023, Degano et al., 2013).