PROTEA Defense Architecture
- PROTEA Defense Architecture is a model-agnostic, runtime safety framework for robotic task planning that mitigates adversarial manipulations.
- It employs object filtering and external memory mechanisms to reduce dimensionality and capture history-dependent vulnerabilities in plan execution.
- Evaluated on the HarmPlan dataset, the architecture uses stepwise LLM judgments to achieve high recall in detecting malicious steps while maintaining low latency.
PROTEA Defense Architecture is a model-agnostic, runtime safety framework for robot task planning systems. Designed to counter adversarial manipulations in planners—particularly those utilizing LLMs—PROTEA employs a multi-component approach that addresses both dimensionality and history-related vulnerabilities in plan validation. The architecture operates as middleware, interfacing between symbolic planners and robotic execution modules, with its hallmark feature being the use of instruction-tuned LLMs as stepwise plan judges. The system is evaluated extensively on the HarmPlan dataset, which comprises benign and carefully designed malicious plans stratified by attack difficulty, with diverse foundation models used to instantiate the judge (Altaweel et al., 12 Jan 2026).
1. System Overview and Data Flow
PROTEA is intended for deployment in robotic stacks, with a high-level instruction $g$ and an initial world state $E_0$ passed to a potentially compromised planner. The planner outputs a symbolic plan $\pi = (a_1, \dots, a_n)$, into which adversarial attacks may have injected harmful actions. PROTEA interposes itself before execution, operating through the following pipeline:
- Object Filtering: reduces the input state $E_0$ to the objects relevant to $\pi$, increasing reasoning tractability.
- External Memory Initialization: $E \leftarrow \mathrm{FilterObjects}(E_0, \pi)$, $H \leftarrow \emptyset$; maintains a per-object state graph to capture evolving environmental dynamics across plan steps.
- Stepwise LLM Judgment: For each step $a_t$, $t = 1, \dots, n$:
  - $s_t = \Pr_{\mathrm{LLM}}(\text{safe} \mid a_t, E, H, V)$, where $V$ is the action vocabulary.
  - If $s_t < \theta_{\text{step}}$, halt execution and return an explanation.
  - Otherwise, $E \leftarrow \mathrm{SimUpdate}(E, a_t)$ and $H \leftarrow H \cup \{a_t\}$.
- Plan Verdict: If all $n$ steps are judged safe, $\pi$ is executed; otherwise, execution is prevented and an explanation is logged.
PROTEA delivers a binary verdict per plan: either the plan is allowed for execution, or halted as potentially harmful, accompanied by a reasoned explanation (Altaweel et al., 12 Jan 2026).
2. Threat Model and Formal Safety Metrics
PROTEA assumes an adversary with the following capabilities:
- Insertion or replacement of up to $k$ actions within a plan $\pi$, either in contiguous or scattered positions.
- Manipulation of planner prompts or weights, biasing the planner to produce harmful subplans.
- If the original plan $\pi$ is benign, the adversary may compute an altered plan $\pi'$ such that there exist indices $i_1, \dots, i_k$ where each $a_{i_j}$ is malicious.
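The insertion capability above can be illustrated with a minimal sketch; `inject_malicious` and its argument names are hypothetical, not from the paper:

```python
import random

def inject_malicious(plan, malicious_actions, k, seed=0):
    # Hypothetical adversary: insert up to k malicious actions at
    # scattered (random) positions within an otherwise benign plan,
    # preserving the relative order of the original benign steps.
    rng = random.Random(seed)
    adv = list(plan)
    for action in malicious_actions[:k]:
        adv.insert(rng.randrange(len(adv) + 1), action)
    return adv
```

A contiguous variant would instead insert the malicious actions as a single block at one position.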
Plan Safety Assessment Functions:
- Naïve Judge: computes a single plan-level score $s = \Pr_{\mathrm{LLM}}(\text{safe} \mid \pi, E_0)$ over the entire plan. Accepts $\pi$ if $s \ge \theta$.
- PROTEA Stepwise Scheme: computes $s_t = \Pr_{\mathrm{LLM}}(\text{safe} \mid a_t, E, H, V)$ at each step, with $E \leftarrow \mathrm{SimUpdate}(E, a_t)$ and $H \leftarrow H \cup \{a_t\}$ after each accepted action. The plan is deemed malicious if $s_t < \theta_{\text{step}}$ at any step.
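The contrast between the two schemes can be sketched as follows, with the LLM scorers abstracted as plain callables (a minimal sketch, not the paper's implementation):

```python
def naive_judge(plan, env0, score_plan, theta=0.5):
    # Single-shot judgment: one plan-level safety score for the whole plan.
    return score_plan(plan, env0) >= theta

def stepwise_judge(plan, env0, score_step, sim_update, theta_step=0.5):
    # PROTEA-style judgment: score each action against the evolving
    # state E and accumulated history H; reject at the first step
    # falling below theta_step.
    env, history = env0, []
    for action in plan:
        if score_step(action, env, history) < theta_step:
            return False  # plan deemed malicious at this step
        env = sim_update(env, action)
        history.append(action)
    return True
```

The stepwise scheme also yields the index of the offending step for free, which feeds the explanation PROTEA returns on a halt.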
Detection Metrics:
Standard classification metrics computed over benign and malicious plans—True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN):
| Metric | Formula |
|---|---|
| Precision | $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$ |
| Recall (TPR) | $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$ |
| F1 Score | $2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$ |
| FPR | $\mathrm{FP} / (\mathrm{FP} + \mathrm{TN})$ |
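These standard formulas translate directly to code; a small helper (illustrative, not from the paper):

```python
def detection_metrics(tp, fp, tn, fn):
    # Standard classification metrics over benign/malicious plans,
    # treating "malicious" as the positive class.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # true-positive rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)     # false-positive rate on benign plans
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```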
3. Architectural Modules: Dimensionality and History Handling
PROTEA’s robustness derives from two core mechanisms:
- Dimensionality Mitigation: By object filtering, the architecture reduces the size of the symbolic world-state graph from the full state $E_0$ to the plan-relevant subset $E$, with $|E| \ll |E_0|$. This keeps LLM prompt lengths tractable and tightly focused on plan-relevant entities.
- History Challenge Mitigation: For long-horizon and stealthy adversarial plans (e.g., “consequential” attacks), external memory continually updates the environment state and records past actions; this ensures that the LLM judge’s decision on each action $a_t$ is informed by complete causal context, enabling the detection of multi-step attack chains.
This dual approach directly addresses the curse of dimensionality and history that has previously limited the efficacy of automated plan safety assessment (Altaweel et al., 12 Jan 2026).
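The filtering step can be sketched as a graph query over a symbolic world state, assuming a simple adjacency-set representation (all names here are illustrative):

```python
def filter_objects(world_state, plan):
    # Dimensionality mitigation sketch: keep only objects mentioned in
    # the plan text, plus their one-hop graph neighbours so spatial and
    # containment relations survive filtering.
    # world_state: dict mapping object name -> set of adjacent objects.
    mentioned = {obj for action in plan for obj in world_state if obj in action}
    relevant = set(mentioned)
    for obj in mentioned:
        relevant |= world_state[obj]
    # Restrict edges of the surviving graph to the relevant subset.
    return {o: world_state[o] & relevant for o in relevant if o in world_state}
```

A real deployment would match objects against grounded action arguments rather than raw substrings, but the effect is the same: the judge's prompt only ever sees the plan-relevant slice of the scene.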
4. Core Algorithm and Implementation Details
The principal pseudocode for PROTEA is as follows:
```
E = FilterObjects(E0, pi)
H = set()
for t in range(1, n+1):
    s_t = Pr_LLM("safe" | a_t, E, H, V)
    if s_t < theta_step:
        return "malicious", explanation
    E = SimUpdate(E, a_t)
    H.add(a_t)
return "safe"
```
In practice, the judgment and simulation LLM calls can be executed separately or batched to optimize latency. PROTEA operates in a planner-agnostic fashion and does not require retraining or modification of underlying planning modules (Altaweel et al., 12 Jan 2026).
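A directly runnable rendering of the same loop, with the LLM judge and simulator stubbed out as callables (a sketch under those assumptions; `protea_check` and its parameter names are illustrative):

```python
def protea_check(plan, env0, judge, sim_update, theta_step=0.5, vocab=None):
    # judge(a_t, E, H, V) stands in for the Pr_LLM stepwise call;
    # sim_update(E, a_t) stands in for the simulator update.
    env, history = dict(env0), []
    for t, action in enumerate(plan, start=1):
        s_t = judge(action, env, history, vocab)
        if s_t < theta_step:
            # Halt with a reasoned explanation identifying the step.
            return "malicious", f"step {t} ({action}) scored {s_t:.2f}"
        env = sim_update(env, action)
        history.append(action)
    return "safe", None
```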
5. Evaluation: HarmPlan Dataset and Experimental Results
HarmPlan Dataset:
- Composed of 1,500 plans (≈50% malicious) from VirtualHome, including 750 benign plans covering domains such as food preparation, cleaning, and leisure.
- 91 malicious behaviors across six categories: Fire Hazards, Electrical Hazards, Property Damage, Animal Harm, Item Loss, Poisoning/Contamination.
- Attack plans comprise direct (4–6 consecutive malicious steps) and consequential (6–13 steps with varying degrees of benign interleaving) variants, stratified by difficulty:
- Medium: 2–4 benign separators
- Hard: ≥5 benign separators
- Automated validation ensures the feasibility and semantic correctness of each plan, filling structural gaps with steps such as “open microwave” or “walk to room”.
Detection Performance:
| Model | Precision | Recall |
|---|---|---|
| GPT-4o-mini | 0.902 | 0.940 |
| LLaMA3.3-70B | 0.924 | 0.787 |
- Recall degrades as attack difficulty increases in naïve, single-shot judgment.
- Object Filtering recaptures up to 30% recall for medium and hard plans, especially in models lacking chain-of-thought reasoning.
- PROTEA yields state-of-the-art recall on hard plans for LLaMA3.3-70B, Phi-4, and Mixtral.
- Category-wise analysis: recall above 90% for Fire and Poisoning hazards; Item Loss (keys/phones) remains challenging (≈60% naïve, ≈75% with PROTEA).
- PROTEA exhibits slightly lower precision (over-cautious per-step flags) but maintains high recall.
- Latency per plan: ≈1.2 s (naïve), rising to 2–4 s per plan when stepwise simulation is enabled; memory overhead is on the order of 10 KB.
6. Integration, Deployment, and Practical Considerations
PROTEA is deployed as middleware immediately after the planner and before the execution modules (e.g., trajectory servers in ROS-based stacks):
- The object filtering module is directly realizable using graph queries over existing semantic maps.
- LLM-based Judge and Simulator components can operate on local/edge servers to manage real-time constraints. Lightweight models (e.g., Grok-3-mini, Phi-4) provide feasible local inference for latency-critical applications.
- On detection of a malicious step, PROTEA halts the plan and triggers safe-fallback planners subject to hard-coded safety constraints.
- No changes to planning algorithms or retraining are required; system integration is thus minimally invasive and universally applicable across planning paradigms.
- The architecture preserves plan execution safety while maintaining flexibility and scalability in large-scale or complex environments through dimension reduction and persistent memory.
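The halt-and-fallback behavior described above can be sketched as a thin wrapper between planner output and the executor; every callable here is an illustrative stand-in, not the paper's API:

```python
def execute_with_protea(plan, env0, check, execute, fallback):
    # Deployment sketch: run the safety check between the planner and
    # the execution module. On detection, halt the plan, log the
    # explanation, and hand off to a constrained safe-fallback planner.
    verdict, explanation = check(plan, env0)
    if verdict == "safe":
        return execute(plan)
    fallback(env0)  # safe-fallback planner under hard-coded constraints
    return f"halted: {explanation}"
```

Because the wrapper only consumes the planner's symbolic output, it requires no change to the planning algorithm itself, which is what makes the integration minimally invasive.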
7. Contextual Significance and Implications
PROTEA’s LLM-as-a-Judge paradigm introduces a general-purpose, planner-agnostic runtime defense for robot task planning environments. It is the first architecture to systematically address both high-dimensional reasoning and history-dependent attack detection at execution time. Its ability to halt stealthy multi-step adversarial plans prior to harm, while incurring low computational overhead and memory footprint, provides a practical route toward robust, explainable robot safety validation. A plausible implication is that similar runtime judgment architectures may extend to autonomous planning domains beyond robotics, wherever complex symbolic plans are subject to adversarial manipulation (Altaweel et al., 12 Jan 2026).