AIVV: Neuro-Symbolic LLM Agent-Integrated Verification and Validation for Trustworthy Autonomous Systems

Published 2 Apr 2026 in cs.AI | (2604.02478v1)

Abstract: Deep learning models excel at detecting anomaly patterns in normal data. However, they do not provide a direct solution for anomaly classification and scalability across diverse control systems, frequently failing to distinguish genuine faults from nuisance faults caused by noise or the control system's large transient response. Consequently, because algorithmic fault validation remains unscalable, full Verification and Validation (V&V) operations are still managed by Human-in-the-Loop (HITL) analysis, resulting in an unsustainable manual workload. To automate this essential oversight, we propose Agent-Integrated Verification and Validation (AIVV), a hybrid framework that deploys LLMs as a deliberative outer loop. Because rigorous system verification strictly depends on accurate validation, AIVV escalates mathematically flagged anomalies to a role-specialized LLM council. The council agents perform collaborative validation by semantically validating nuisance and true failures based on natural-language (NL) requirements to secure a high-fidelity system-verification baseline. Building on this foundation, the council then performs system verification by assessing post-fault responses against NL operational tolerances, ultimately generating actionable V&V artifacts, such as gain-tuning proposals. Experiments on a time-series simulator for Unmanned Underwater Vehicles (UUVs) demonstrate that AIVV successfully digitizes the HITL V&V process, overcoming the limitations of rule-based fault classification and offering a scalable blueprint for LLM-mediated oversight in time-series data domains.


Authors (4)

Summary

The paper presents a novel hybrid verification framework that integrates an uncertainty-calibrated LSTM detector with a role-specialized LLM council to achieve high fault validation rates.
The methodology leverages a hierarchical, adaptive clone-and-promote strategy to adjust system parameters in real time while avoiding catastrophic forgetting.
Experimental results demonstrate significant improvements in reducing false positives and enhancing fault validation in high-stakes autonomous UUV environments.

Motivation and Problem Statement

Automated control systems in high-stakes environments, such as Unmanned Underwater Vehicles (UUVs), require rigorous mechanisms for Verification and Validation (V&V). Standard deep learning-based anomaly detectors, especially those based on LSTM and MC Dropout, achieve high sensitivity but cannot reliably classify root causes, so they frequently raise nuisance faults caused by environmental noise or legitimate transient dynamics. This impedes their deployment in mission-critical contexts and places an unsustainable burden on Human-in-the-Loop (HITL) analysts. Despite recent advances in LLM-based reasoning and Multi-Agent Systems (MAS), applying LLMs directly remains untrustworthy because of hallucination risks and excessive latency. The AIVV framework targets three bottlenecks of existing solutions: scalable semantic validation of anomalies, real-time actionable V&V, and adaptive system retuning without catastrophic model drift.

Architectural Overview

AIVV is structured as a hierarchical hybrid neuro-symbolic system, tightly coupling a high-frequency, uncertainty-calibrated LSTM detector with a deliberative, role-specialized LLM-based agent council, followed by an adaptive retuning pipeline.

Figure 1: AIVV framework, illustrating the sequential flow of the system.

The first tier, the Mathematical Engine, utilizes an MC Dropout LSTM for calibrated point-wise residual and epistemic uncertainty estimation, refined with a conformal prediction layer to operationalize finite-sample statistical guarantees. An anomaly is flagged when residuals exceed a dynamic conformal bound, invoking the “Sentry” gate and escalating the sample to the LLM Council only upon violation. This tightly bounds LLM compute costs and preserves system responsiveness.
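The gating step described above can be sketched in a few lines. This is a minimal illustration that assumes a split-conformal calibration over absolute residuals; the summary does not say which conformal variant AIVV uses, and the function names `conformal_bound` and `sentry_escalates` and the calibration values are hypothetical. The LSTM and the MC Dropout uncertainty estimate are omitted.

```python
import math

def conformal_bound(calibration_residuals, alpha):
    """Split-conformal bound on |residual| at miscoverage level alpha (assumed variant)."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Finite-sample rank ceil((n + 1) * (1 - alpha)), 1-indexed.
    # The inner round() guards against floating-point noise before the ceiling.
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]

def sentry_escalates(residual, bound):
    """True when the residual violates the bound and should go to the council."""
    return abs(residual) > bound

# Hypothetical calibration residuals: 1.0, 2.0, ..., 19.0
calibration = [float(i) for i in range(1, 20)]
bound = conformal_bound(calibration, alpha=0.10)

print(bound)                          # rank 18 of 19 sorted scores
print(sentry_escalates(18.5, bound))  # above the bound: escalate
print(sentry_escalates(-3.0, bound))  # within the bound: no LLM call
```

Because only samples that violate the bound reach the council, the number of LLM calls scales with the violation rate rather than with the telemetry rate.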

Upon escalation, three role-specialized LLM council agents (Requirements Engineer, Failure Manager, and System Engineer) perform collaborative semantic adjudication. Each agent executes a domain-specific protocol over telemetry, requirements, and model-generated uncertainty, casts a structured vote, and provides a requirement-referenced justification. Council decisions use a strict 2-of-3 majority rule, reducing the risk that a single agent's hallucination determines the outcome. When a sample is judged to be a nuisance rather than a true fault, the adaptation pipeline is triggered: Inspector and Tuner agents synthesize a parameter adjustment strategy, either recalibration of the conformal bounds or fine-tuning of the network.
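A minimal sketch of the 2-of-3 majority rule follows. The `Vote` structure, the label strings `"true_fault"` and `"nuisance"`, and the example justifications are assumptions made for illustration; the summary does not give the council's actual vote schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Vote:
    agent: str          # council role casting the vote
    label: str          # hypothetical labels: "true_fault" or "nuisance"
    justification: str  # requirement-referenced rationale

def council_decision(votes):
    """Return the label backed by at least 2 of 3 votes, or None if no label reaches 2."""
    if len(votes) != 3:
        raise ValueError("the council is assumed to have exactly three agents")
    label, count = Counter(v.label for v in votes).most_common(1)[0]
    return label if count >= 2 else None

votes = [
    Vote("Requirements Engineer", "nuisance", "yaw error within stated tolerance"),
    Vote("Failure Manager", "nuisance", "residual decays after the transient"),
    Vote("System Engineer", "true_fault", "residual persists in one channel"),
]
decision = council_decision(votes)
print(decision)
```

Returning `None` when no label reaches two votes (for example, if an agent emits a malformed label) leaves the fallback policy to the caller; that choice is part of the sketch, not something the source specifies.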

The pipeline integrates a clone-and-promote strategy: model adaptations are initially applied to a shadow copy and only promoted after passing the mathematical Sentry post-adaptation, thereby mitigating catastrophic forgetting.
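The clone-and-promote pattern can be sketched as follows. The model is represented as a plain dictionary and the Sentry check is a stand-in predicate; both are hypothetical simplifications, since the real pipeline would re-run the mathematical gate on the adapted LSTM.

```python
import copy

def clone_and_promote(live_model, adapt, passes_sentry):
    """Adapt a shadow copy; return (model, promoted) without mutating live_model."""
    shadow = copy.deepcopy(live_model)  # adaptation never touches the live model
    adapt(shadow)
    if passes_sentry(shadow):
        return shadow, True             # promote the adapted copy
    return live_model, False            # discard the shadow, keep the live model

# Hypothetical model state and checks, for illustration only.
live = {"alpha": 0.05, "weights": [0.2, -0.1]}

def tighten_alpha(model):
    model["alpha"] = 0.02

def break_alpha(model):
    model["alpha"] = 0.50

def sentry_check(model):
    # Stand-in for the post-adaptation Sentry; here it only checks the alpha range.
    return 0.01 <= model["alpha"] <= 0.10

candidate, promoted = clone_and_promote(live, tighten_alpha, sentry_check)
rejected, was_promoted = clone_and_promote(live, break_alpha, sentry_check)
print(promoted, candidate["alpha"], live["alpha"])
print(was_promoted, rejected is live)
```

Working on a deep copy means a rejected adaptation leaves the live model byte-for-byte unchanged, which is the property the text relies on.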

Experimental Protocol

Evaluations are conducted on sensor-fusion-based dynamic yaw time-series data from a REMUS 100 UUV Simulink model, encompassing three scenarios: Hovering, Lawnmower Mapping, and Complex Missions, each under realistic sensor drift and non-Gaussian environmental noise conditions.

Figure 2: Dataset 1.

Structured fault injections, both electrical (sensor) and mechanical (damper), are introduced to assess the capacity for distinguishing between nuisance and genuine system failures.

The AIVV council is tested across 75 seeds with 409 test points per seed, measuring both the Fault Validation Rate (FVR) and improvement in detection accuracy pre- and post-adaptation for each experimental condition.

Results and Empirical Analysis

Fault Validation and Suppression of False Positives

Purely mathematical gating, as provided by the MC Dropout LSTM plus Sentry, yields a high false positive rate (FPR), especially in scenarios with complex transient behavior. Adding the LLM Council sharply reduces it: for the Hovering dataset, FVR improves from 45.33% (math-only) to 98.67% with council adjudication, and to 100% once the adaptation pipeline is included. For the Complex Mission, FVR increases from 0% (math-only) to 73.33% (council) and to 93.33% (full AIVV integration).
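As a quick arithmetic check, the reported FVR percentages each match a whole-number fraction of the 75 seeds. The sketch below only verifies that consistency; reading FVR as a per-seed rate is an assumption, because the summary does not define how FVR is computed.

```python
SEEDS = 75
POINTS_PER_SEED = 409

# FVR values as reported in the text (percent).
reported_fvr = {
    ("Hovering", "math-only"): 45.33,
    ("Hovering", "council"): 98.67,
    ("Hovering", "full AIVV"): 100.0,
    ("Complex Mission", "math-only"): 0.0,
    ("Complex Mission", "council"): 73.33,
    ("Complex Mission", "full AIVV"): 93.33,
}

def implied_count(pct, total=SEEDS):
    """Nearest whole count out of `total` that would produce the percentage."""
    return round(pct / 100 * total)

implied = {key: implied_count(pct) for key, pct in reported_fvr.items()}
consistent = all(
    round(100 * implied[key] / SEEDS, 2) == pct
    for key, pct in reported_fvr.items()
)
total_points = SEEDS * POINTS_PER_SEED

for key, count in implied.items():
    print(key, f"{count}/{SEEDS}")
print(consistent, total_points)
```

Under that assumption, the Hovering improvement corresponds to 34, 74, and 75 of 75 seeds, and the Complex Mission improvement to 0, 55, and 70 of 75.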

Figure 3: Ablation study comparing three framework stages (rows) across three test scenarios (columns). The mathematical baseline exhibits a high false-positive rate (FPR), which is visibly reduced when the LLM council is introduced. The full AIVV framework integrates the adaptation pipeline to achieve optimal validation.

These results demonstrate that council-based semantic validation, based on operational requirements and telemetry context, robustly distinguishes genuine faults from noise-induced anomalies that elude classic statistical detection. Notably, in the most challenging conditions (Complex Mission), adaptation provides a 23.11% accuracy improvement post-tuning, highlighting the necessity of the clone-and-promote corrective loop.

Adaptation Pipeline and Model Safety

The adaptation protocol, mediated by Inspector and Tuner agents, uses JSON logs and context-driven analysis to synthesize either a conservative recalibration (adjusting $\alpha$ for the conformal bounds within $[0.01, 0.10]$) or network fine-tuning (with tuned epochs and learning rates). A candidate is promoted to the live model only if it passes the Sentry after adjustment, so that adaptation does not weaken the safety gate and incremental retraining does not induce catastrophic forgetting.
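One way to keep an LLM-generated proposal inside the stated limits is to validate it before it reaches the shadow model. The sketch below assumes a hypothetical JSON schema with fields `action`, `alpha`, `epochs`, and `learning_rate`; only the $[0.01, 0.10]$ range for $\alpha$ comes from the text.

```python
import json

ALPHA_MIN, ALPHA_MAX = 0.01, 0.10  # range stated in the text

def parse_proposal(raw):
    """Validate a Tuner proposal (hypothetical schema); raise ValueError if unusable."""
    proposal = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    action = proposal.get("action")
    if action == "recalibrate":
        alpha = float(proposal["alpha"])
        if not ALPHA_MIN <= alpha <= ALPHA_MAX:
            raise ValueError(f"alpha {alpha} outside [{ALPHA_MIN}, {ALPHA_MAX}]")
        return {"action": action, "alpha": alpha}
    if action == "fine_tune":
        epochs = int(proposal["epochs"])
        learning_rate = float(proposal["learning_rate"])
        if epochs <= 0 or learning_rate <= 0:
            raise ValueError("epochs and learning_rate must be positive")
        return {"action": action, "epochs": epochs, "learning_rate": learning_rate}
    raise ValueError(f"unknown action: {action!r}")

ok = parse_proposal('{"action": "recalibrate", "alpha": 0.05}')
print(ok)
```

Rejecting an out-of-range value, rather than silently clamping it, is a design choice of this sketch: it keeps a faulty proposal visible in the logs instead of masking it.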

Figure 4: Before gain-tuning.

The adaptation pathway yields substantial gains in non-stationary, high-complexity regimes while adding negligible latency or context overhead in routine scenarios.

Role-Specific LLM Model Selection

A systematic ablation of council agent model assignments shows that performance depends on matching the LLM family and size to the council role. High-parameter GPT-OSS models perform best at sequential trajectory analysis (Failure Manager), medium-scale LLaMA models excel at static rule enforcement (Requirements Engineer), and only LLMs with 70B or more parameters robustly synthesize domain reasoning and produce valid JSON proposals (System Engineer). Arbitrarily reassigning models to roles causes a precipitous drop in FVR, which underlines the need for heterogeneous, role-specialized LLM selection.

Practical and Theoretical Implications

AIVV establishes a scalable, digitized analog of HITL V&V by integrating symbolic natural-language reasoning with statistical anomaly detection. This two-tiered architecture preserves mathematical rigor, limits the influence of hallucinated agent outputs through majority voting, and enables structured, documented system adaptation proposals, such as automatic gain-tuning recommendations for PID controllers.

Practically, this reduces human V&V bottlenecks in multi-sensor, high-frequency data streams and supports system designs that are robust to evolving environments and distributional drift. Under clone-and-promote adaptation, a change reaches the live model only after passing the Sentry, and system upgrades are time-stamped and auditable, which facilitates certification in regulated domains.

Theoretically, AIVV indicates that coupling formal conformal bounds with MAS-based LLM deliberation produces a hybrid architecture that is more contextually aware, robust, and actionable than either approach alone, while remaining computationally tractable because only Sentry violations are escalated to the council.

Conclusion

AIVV presents a rigorous, role-aligned neuro-symbolic V&V pipeline for autonomous system control, capable of reducing the operator workload inherent in existing HITL systems while providing high-precision anomaly validation, adaptation, and engineering artifact generation. The demonstrated improvements in FVR, suppression of FPR, and safe online adaptation point to a scalable path for fully autonomous, LLM-integrated oversight architectures.

Future research directions include deeper integration of gain-tuning recommendations into inner closed-loop controllers, further MAS council diversity, and rigorous real-world deployment in varied cyber-physical domains.
