- The paper presents a novel hybrid verification framework that integrates an uncertainty-calibrated LSTM detector with a role-specialized LLM council to achieve high fault validation rates.
- The methodology leverages a hierarchical, adaptive clone-and-promote strategy to adjust system parameters in real time while avoiding catastrophic forgetting.
- Experimental results demonstrate significant improvements in reducing false positives and enhancing fault validation in high-stakes autonomous UUV environments.
AIVV: Neuro-Symbolic LLM Agent-Integrated Verification and Validation for Trustworthy Autonomous Systems
Motivation and Problem Statement
Autonomous control systems deployed in high-stakes settings, such as those on Unmanned Underwater Vehicles (UUVs), require rigorous mechanisms for Verification and Validation (V&V). Standard deep learning-based anomaly detectors, especially those based on LSTM and MC Dropout, achieve high sensitivity but cannot reliably classify root causes, so they frequently raise nuisance faults that are actually due to environmental noise or legitimate transient dynamics. This impedes their deployment in mission-critical contexts and places an unsustainable burden on Human-in-the-Loop (HITL) analysts. Despite recent advances in LLM-based reasoning and Multi-Agent Systems (MAS), direct application of LLMs remains untrustworthy because of hallucination risk and excessive latency. The AIVV framework targets three bottlenecks of existing solutions: scalable semantic validation of anomalies, real-time actionable V&V, and adaptive system retuning without introducing catastrophic model drift.
Architectural Overview
AIVV is structured as a hierarchical hybrid neuro-symbolic system, tightly coupling a high-frequency, uncertainty-calibrated LSTM detector with a deliberative, role-specialized LLM-based agent council, followed by an adaptive retuning pipeline.
Figure 1: AIVV framework, illustrating the sequential flow of the system.
The first tier, the Mathematical Engine, uses an MC Dropout LSTM to estimate point-wise residuals and epistemic uncertainty, refined with a conformal prediction layer that provides finite-sample statistical guarantees. An anomaly is flagged when a residual exceeds a dynamic conformal bound. This check acts as the “Sentry” gate: a sample is escalated to the LLM Council only upon violation, which bounds LLM compute costs and preserves system responsiveness.
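The gating logic described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a split-conformal threshold over uncertainty-normalized residuals, and the names `conformal_quantile`, `sentry_gate`, `mc_samples`, and `q_hat` are hypothetical. The LSTM itself is left out; `mc_samples` stands in for predictions from repeated stochastic forward passes.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of calibration scores.

    Uses rank ceil((n + 1) * (1 - alpha)); if that rank exceeds n,
    the calibration set is too small and the bound is infinite.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def sentry_gate(y_true, mc_samples, q_hat, eps=1e-8):
    """Return (escalate, score) for a single time step.

    mc_samples: predictions from repeated MC Dropout forward passes
    (assumed to be produced elsewhere by the detector).
    """
    m = len(mc_samples)
    mean = sum(mc_samples) / m
    # Population variance of the MC samples as an epistemic-uncertainty proxy.
    var = sum((s - mean) ** 2 for s in mc_samples) / m
    score = abs(y_true - mean) / (math.sqrt(var) + eps)
    # Escalate to the council only when the normalized residual violates the bound.
    return score > q_hat, score
```

Under these assumptions, `q_hat` would be computed once from held-out nominal data and recomputed whenever the miscoverage level `alpha` is recalibrated.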
Upon escalation, three role-specialized LLM council agents (Requirements Engineer, Failure Manager, and System Engineer) perform collaborative semantic adjudication. Each agent executes a domain-specific protocol over telemetry, requirements, and model-generated uncertainty, casting a structured vote and providing requirement-referenced justifications. Council decisions use a strict 2-of-3 majority rule, reducing the risk of individual agent hallucination. When a sample is judged to be a nuisance rather than a true fault, the adaptation pipeline is triggered, employing Inspector and Tuner agents to synthesize parameter adjustment strategies (either recalibration of conformal bounds or fine-tuning).
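The 2-of-3 majority rule can be illustrated with a small sketch. The `Vote` structure, the role identifiers, and the verdict labels `"fault"` and `"nuisance"` are assumptions made for illustration; the source does not specify the vote schema.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical identifiers for the three council roles named in the text.
ROLES = ("requirements_engineer", "failure_manager", "system_engineer")
VERDICTS = ("fault", "nuisance")


@dataclass(frozen=True)
class Vote:
    role: str
    verdict: str        # one of VERDICTS
    justification: str  # requirement-referenced rationale


def adjudicate(votes):
    """Return the council verdict under a strict 2-of-3 majority rule."""
    if sorted(v.role for v in votes) != sorted(ROLES):
        raise ValueError("expected exactly one vote per council role")
    if any(v.verdict not in VERDICTS for v in votes):
        raise ValueError("verdict must be 'fault' or 'nuisance'")
    # With three binary votes, one verdict always holds at least two.
    verdict, _count = Counter(v.verdict for v in votes).most_common(1)[0]
    return verdict
```

Because the verdict is binary and exactly three votes are required, a tie cannot occur, so a single dissenting agent is always outvoted.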
The pipeline integrates a clone-and-promote strategy: model adaptations are initially applied to a shadow copy and only promoted after passing the mathematical Sentry post-adaptation, thereby mitigating catastrophic forgetting.
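A minimal sketch of the clone-and-promote idea follows. The callables `adapt` and `passes_sentry` are hypothetical placeholders for the Tuner's adjustment and the post-adaptation Sentry check; how the paper implements either is not specified here.

```python
import copy


def clone_and_promote(live_model, adapt, passes_sentry):
    """Apply an adaptation to a shadow copy and promote it only if it passes.

    Returns (model_to_serve, promoted). The live model is never mutated,
    so a rejected adaptation leaves the deployed system unchanged.
    """
    shadow = copy.deepcopy(live_model)  # isolate the candidate from the live model
    adapt(shadow)                       # recalibrate or fine-tune the shadow in place
    if passes_sentry(shadow):
        return shadow, True
    return live_model, False
```

For example, with a model represented as a plain dictionary:

```python
live = {"alpha": 0.05}
candidate, promoted = clone_and_promote(
    live,
    adapt=lambda m: m.update(alpha=0.10),
    passes_sentry=lambda m: m["alpha"] <= 0.10,
)
```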
Experimental Protocol
Evaluations are conducted on sensor-fusion-based dynamic yaw time-series data from a REMUS 100 UUV Simulink model, encompassing three scenarios: Hovering, Lawnmower Mapping, and Complex Missions, each under realistic sensor drift and non-Gaussian environmental noise conditions.


Figure 2: Dataset 1.
Structured fault injections—both electrical (sensor) and mechanical (damper)—are introduced to assess the capacity for distinguishing between nuisance and genuine system failures.
The AIVV council is tested across 75 seeds with 409 test points per seed, measuring both the Fault Validation Rate (FVR) and improvement in detection accuracy pre- and post-adaptation for each experimental condition.
Results and Empirical Analysis
Fault Validation and Suppression of False Positives
Purely mathematical gating, as provided by the MC Dropout LSTM plus Sentry, yields a high false positive rate (FPR), especially in scenarios with complex transient behavior. Integrating the LLM Council suppresses this FPR: for the Hovering dataset, FVR improves from 45.33% (math-only) to 98.67% with council adjudication, and to 100% once the adaptation pipeline is included. For the Complex Mission, FVR increases from 0% (math-only) to 73.33% (council), and to 93.33% (full AIVV integration).
Figure 3: Ablation study comparing three framework stages (rows) across three test scenarios (columns). The mathematical baseline exhibits a high false-positive rate (FPR), which is visibly reduced when the LLM council is introduced. The full AIVV framework integrates the adaptation pipeline to achieve optimal validation.
These results indicate that council-based semantic validation, grounded in operational requirements and telemetry context, distinguishes genuine faults from noise-induced anomalies that the statistical detector alone cannot tell apart. Notably, in the most challenging condition (Complex Mission), adaptation provides a 23.11% accuracy improvement post-tuning, highlighting the necessity of the clone-and-promote corrective loop.
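As a quick arithmetic check, the FVR percentages reported in this section coincide with whole-number counts out of the 75 seeds. Whether the paper defines FVR as the fraction of seeds validated is an assumption. The sketch below only shows that the numbers are consistent with that reading, and the helper `fvr_percent` is hypothetical.

```python
SEEDS = 75  # number of seeds stated in the experimental protocol


def fvr_percent(validated, total=SEEDS):
    """Fault Validation Rate as a percentage, rounded to two decimals.

    Assumes FVR = validated seeds / total seeds (not confirmed by the source).
    """
    return round(100.0 * validated / total, 2)


# Counts inferred from the reported percentages, not taken from the paper.
hovering = [fvr_percent(34), fvr_percent(74), fvr_percent(75)]
complex_mission = [fvr_percent(0), fvr_percent(55), fvr_percent(70)]
```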
Adaptation Pipeline and Model Safety
The adaptation protocol, mediated by Inspector and Tuner agents, leverages JSON logs and context-driven analysis to synthesize conservative recalibration (adjusting α for conformal bounds within [0.01,0.10]) or network fine-tuning (with tuned epochs and learning rates). Promotions to the live model only occur when a post-adjustment candidate passes the Sentry, ensuring that safety is never compromised by adaptation and that incremental retraining does not induce catastrophic memory loss.
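The sketch below shows one way a Tuner proposal could be validated before it reaches the shadow model. The JSON field names (`action`, `alpha`, `epochs`, `learning_rate`) are hypothetical, since the source does not give the schema. Only the α range [0.01, 0.10] comes from the text, and clamping out-of-range values rather than rejecting them is a design choice made here for illustration.

```python
import json

ALPHA_MIN, ALPHA_MAX = 0.01, 0.10  # conformal alpha range stated in the text


def parse_proposal(raw):
    """Parse and sanity-check a JSON adaptation proposal (hypothetical schema)."""
    proposal = json.loads(raw)
    action = proposal.get("action")
    if action == "recalibrate":
        alpha = float(proposal["alpha"])
        # Clamp alpha into the conservative range instead of trusting the LLM.
        return {"action": action, "alpha": min(max(alpha, ALPHA_MIN), ALPHA_MAX)}
    if action == "fine_tune":
        epochs = int(proposal["epochs"])
        lr = float(proposal["learning_rate"])
        if epochs <= 0 or lr <= 0:
            raise ValueError("epochs and learning_rate must be positive")
        return {"action": action, "epochs": epochs, "learning_rate": lr}
    raise ValueError(f"unsupported action: {action!r}")
```

A proposal that passes this check would still only be applied to the shadow copy and would need to pass the Sentry before promotion.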

Figure 4: Before gain-tuning.
The adaptation pathway yields substantial gains in non-stationary, high-complexity regimes while incurring negligible extra latency or context overhead in routine scenarios.
Role-Specific LLM Model Selection
A systematic ablation of council agent model assignments shows that optimal performance depends on aligning the LLM architecture family with the council role. High-parameter-count GPT-OSS models perform best at sequential trajectory analysis (Failure Manager), medium-scale LLaMA excels at static rule enforcement (Requirements Engineer), and only 70B+ LLMs can robustly synthesize domain reasoning and produce valid JSON proposals (System Engineer). Arbitrary changes to agent assignments cause a precipitous drop in FVR, highlighting the necessity of heterogeneous, role-specialized LLM selection.
Practical and Theoretical Implications
AIVV establishes a scalable, digitized analog of HITL V&V by integrating symbolic natural-language reasoning with statistical anomaly detection. This two-tiered architecture preserves mathematical rigor, mitigates hallucinatory semantic noise through majority voting, and enables structured, documented system adaptation proposals, such as automatic gain-tuning recommendations for PID controllers.
Practically, this reduces human V&V bottlenecks in multi-sensor, high-frequency data streams, yielding system designs that are robust to both evolving environments and distributional drift. The clone-and-promote adaptation guarantees that operational safety envelopes are never compromised, and that system upgrades are time-stamped and auditable, facilitating certification in regulated domains.
Theoretically, AIVV suggests that coupling formal conformal bounds with MAS-based LLM deliberation produces a hybrid architecture that is more contextually aware, robust, and actionable than either approach alone, while remaining computationally tractable through selective escalation.
Conclusion
AIVV presents a rigorous, role-aligned neuro-symbolic V&V pipeline for autonomous system control, capable of reducing the operator workload inherent in existing HITL systems while providing high-precision anomaly validation, adaptation, and engineering artifact generation. The demonstrated improvements in FVR, suppression of FPR, and safe online adaptation point to a scalable path for fully autonomous, LLM-integrated oversight architectures.
Future research directions include deeper integration of gain-tuning recommendations into inner closed-loop controllers, further MAS council diversity, and rigorous real-world deployment in varied cyber-physical domains.
Reference:
AIVV: Neuro-Symbolic LLM Agent-Integrated Verification and Validation for Trustworthy Autonomous Systems (2604.02478)