Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

Published 12 May 2026 in cs.NI, cs.AI, and cs.CR | (2605.12729v1)

Abstract: LLMs are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In both NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations work as workflows, from gathering evidence to taking action, following permissions, policies, and checks, and providing rollback options when necessary. This is crucial because operational decisions can have instant impacts. To make the argument concrete, we organise the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute. They also define the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. Operational reliability does not come chiefly from the model itself. It depends on the machinery around the model. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain too fragile. Finally, we examine security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem, whose outputs must be reliable, auditable, and securely deployable.


Authors (5)

Summary

The paper introduces a formal autonomy ladder for LLM agents in NetOps/AIOps, establishing safety contracts and performance benchmarks.
It details structured tool-use boundaries, non-bypassable verification gates, and strict budget constraints to maintain operational integrity.
Evaluation focuses on workflow performance, drift resistance, and auditability, ensuring robust, self-healing network and IT operations.

LLMs for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

Introduction and Context

The paper "LLMs for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety" (2605.12729) provides a detailed technical treatment of the integration, constraints, and operational requirements for deploying LLM-based agents in network operations (NetOps) and artificial intelligence for IT operations (AIOps). As networks and services have grown in scale, rate of change, and tooling complexity, the operational burden on engineering teams has intensified. Since the early 2000s, advanced automation has been framed as the pursuit of self-managing, self-healing systems; LLMs add a new interface modality in which language serves as both an expressive control plane and an evidence synthesizer. The work situates agentic LLM deployment in NetOps/AIOps as an evolution towards closed-loop, partially or fully autonomous workflows with non-bypassable safety boundaries.

Architectural Framework: Autonomy Rungs, Tool Scopes, and Verification

The authors introduce a formal ladder of autonomy, ranging from read-only copilots, through tool-grounded analysts, to planner-executor agents and closed-loop self-healing systems. Each rung is formalized as an assurance contract $(T_k, R_k, G_k, U_k, B_k)$, specifying permitted tools, required evidence, mandatory gates/verifiers, rollout protocol, and operational budgets.
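To make the contract tuple concrete, the following is a minimal sketch of how one rung's contract might be encoded. The class name, field names, and tool names are hypothetical illustrations, not the paper's notation beyond the five components $(T_k, R_k, G_k, U_k, B_k)$.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class AssuranceContract:
    """Hypothetical encoding of one rung's contract (T_k, R_k, G_k, U_k, B_k)."""
    tools: frozenset[str]              # T_k: tools the agent may call
    required_evidence: frozenset[str]  # R_k: evidence kinds that must be present
    gates: tuple[str, ...]             # G_k: names of mandatory gates/verifiers
    rollout: str                       # U_k: rollout protocol, e.g. "canary"
    budgets: dict[str, float]          # B_k: operational budgets

    def permits_tool(self, tool: str) -> bool:
        # Anything not explicitly listed is denied.
        return tool in self.tools

    def missing_evidence(self, collected: set[str]) -> frozenset[str]:
        # Evidence kinds still required before the agent may proceed.
        return self.required_evidence - collected


# Illustrative read-only rung; all values are assumptions.
copilot = AssuranceContract(
    tools=frozenset({"query_metrics", "read_config"}),
    required_evidence=frozenset({"telemetry"}),
    gates=(),
    rollout="none",
    budgets={"tool_calls": 20},
)
```

Keeping the contract immutable means a running agent cannot widen its own tool set; a different rung would be a different contract object.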

Central claims:

Operational safety is derived not from the LLM but from the surrounding machinery: constrained interfaces, tool typing, provenance- and freshness-aware retrieval, explicit budgets, non-bypassable verification gates, and audit-integrable action proposals. Agent actions, especially write operations, are mediated by policy, invariant, and approval gates (e.g., header space analysis [2], Batfish-style simulation [1], real-time verification such as VeriFlow [3]) that check isolation, reachability, loop-freedom, and update safety. The same gating pattern transfers to AIOps via SLO guards, error budgets, and blast radius checks.

Crucially, the tool-use boundary (read/query vs. propose/commit) is identified as the core trust perimeter; all agent output intended to effect change must traverse typed, independently verified interfaces and be registered in provenance logs. This apparatus constitutes the verification wall: a non-bypassable gate function $g$ evaluated over the proposed action $a$, the evidence trace $E$, and the applicable policies and invariants.
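The verification wall can be sketched as a function that every write path must call before anything is applied. This is an illustrative sketch only: the gate names, the five-target blast-radius limit, and the dictionary shape of an action are assumptions, not details from the paper.

```python
from typing import Callable

# A gate inspects a proposed action and its evidence, returning (passed, reason).
Gate = Callable[[dict, list], tuple[bool, str]]


def has_evidence(action: dict, evidence: list) -> tuple[bool, str]:
    return (len(evidence) > 0, "evidence trace is empty")


def within_blast_radius(action: dict, evidence: list) -> tuple[bool, str]:
    # Hypothetical limit: at most 5 targets per change.
    return (len(action.get("targets", [])) <= 5, "too many targets")


def verify(action: dict, evidence: list, gates: list[Gate]) -> tuple[bool, list[str]]:
    """Run every gate; the action is allowed only if all of them pass."""
    failures = []
    for gate in gates:
        ok, reason = gate(action, evidence)
        if not ok:
            failures.append(reason)
    return (not failures, failures)


def commit(action: dict, evidence: list, gates: list[Gate], apply: Callable[[dict], object]):
    """The only route to `apply` goes through `verify`."""
    ok, failures = verify(action, evidence, gates)
    if not ok:
        raise PermissionError("; ".join(failures))
    return apply(action)
```

Collecting all failures rather than stopping at the first one gives the audit log a complete picture of why a proposal was rejected.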

Technical Taxonomy of Agentic Roles and Artefacts

A highly structured taxonomy is defined as follows:

Copilot (Read-only): Summarization, evidence aggregation, query suggestion; cannot execute mutations.
Analyst (Tool-Grounded): Hypothesis management, causal graph traversal, RCA with structured evidence linkage.
Planner-Executor (Write-Limited): Propose and (under constraint) execute diffs; non-bypassable verification, canary-based rollout, rollback readiness mandatory.
Closed-Loop (Self-Healing): Continuous, bounded effectuation integrated with autonomous monitoring/rollback; only feasible for tightly-scoped, strongly verified subsystems.
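The four roles above form an ordered ladder, which suggests a simple deny-by-default capability check. The sketch below is one possible encoding; the capability names and the mapping from capability to minimum rung are assumptions made for illustration.

```python
from enum import IntEnum


class Rung(IntEnum):
    COPILOT = 1           # read-only
    ANALYST = 2           # tool-grounded
    PLANNER_EXECUTOR = 3  # write-limited
    CLOSED_LOOP = 4       # self-healing


# Hypothetical mapping: the lowest rung at which each capability is available.
MIN_RUNG = {
    "summarize": Rung.COPILOT,
    "query_tools": Rung.ANALYST,
    "propose_diff": Rung.PLANNER_EXECUTOR,
    "execute_gated_change": Rung.PLANNER_EXECUTOR,
    "auto_remediate": Rung.CLOSED_LOOP,
}


def allowed(rung: Rung, capability: str) -> bool:
    """Deny by default: unknown capabilities are never allowed."""
    required = MIN_RUNG.get(capability)
    return required is not None and rung >= required
```

For example, `allowed(Rung.COPILOT, "propose_diff")` is false, while `allowed(Rung.CLOSED_LOOP, "propose_diff")` is true because higher rungs inherit lower-rung capabilities in this sketch.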

Each role interacts with artefacts classified by trust and freshness. Authoritative artefacts (policy-as-code, config baselines) are separated from advisory/untrusted artefacts (runbooks, tickets, chat transcripts). Agentic evidence traces $E = \{(T_i, y_i)\}$ constitute first-class operational objects, directly supporting auditability and RCA.
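As a rough illustration of the trust split and the evidence trace, the sketch below tags each trace step with the kind of artefact it drew on. It assumes that $T_i$ denotes a tool invocation and $y_i$ its output, which the summary does not state explicitly; the class and function names are hypothetical.

```python
from dataclasses import dataclass

# Artefact kinds taken from the text; the string labels are illustrative.
AUTHORITATIVE_KINDS = {"policy_as_code", "config_baseline"}
ADVISORY_KINDS = {"runbook", "ticket", "chat_transcript"}


@dataclass(frozen=True)
class EvidenceStep:
    tool_call: str    # assumed meaning of T_i: the tool invocation
    output: str       # assumed meaning of y_i: what the tool returned
    source_kind: str  # kind of artefact the step relied on


def trust_level(step: EvidenceStep) -> str:
    # Anything not known to be authoritative is treated as advisory/untrusted.
    if step.source_kind in AUTHORITATIVE_KINDS:
        return "authoritative"
    return "advisory"


def authoritative_only(trace: list[EvidenceStep]) -> list[EvidenceStep]:
    """Keep only steps that may justify a write action."""
    return [s for s in trace if trust_level(s) == "authoritative"]
```

A gate could then require that at least one authoritative step supports any proposed change, while advisory steps inform diagnosis only.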

Retrieval-Augmented Reasoning, Schema Discipline, and Adaptation

The operational knowledge ground is continually shifting; the paper mandates retrieval-augmented generation with explicit focus on provenance and freshness. Time-indexed retrieval, redaction prior to model ingestion, and differentiated trust levels amongst artefacts are positioned as non-negotiable controls. Prompts are insufficient for tasks beyond summarization; all agent workflows must generate structured proposals conforming to typed schemas, versioned to guard against drift-induced failures. Schema adaptation is emphasized for tool-selection reliability, ambiguity handling, and risk-sensitive stopping behavior.
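A typed, versioned proposal schema of the kind described here might be enforced with a strict parser that rejects anything it does not recognize. The field names and the version string below are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical set of schema versions the verifier understands.
SUPPORTED_SCHEMA_VERSIONS = {"1.0"}

REQUIRED_FIELDS = {
    "schema_version": str,
    "target": str,
    "diff": str,
    "evidence_ids": list,
    "rollback": str,
}


@dataclass(frozen=True)
class ChangeProposal:
    schema_version: str
    target: str
    diff: str
    evidence_ids: tuple
    rollback: str


def parse_proposal(raw: dict) -> ChangeProposal:
    """Reject free-form output: every field must be present and correctly typed."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in raw:
            raise ValueError(f"missing field: {name}")
        if not isinstance(raw[name], expected):
            raise ValueError(f"wrong type for field: {name}")
    extra = set(raw) - set(REQUIRED_FIELDS)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    if raw["schema_version"] not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError("unsupported schema version")
    if not raw["evidence_ids"]:
        raise ValueError("proposal cites no evidence")
    return ChangeProposal(
        raw["schema_version"], raw["target"], raw["diff"],
        tuple(raw["evidence_ids"]), raw["rollback"],
    )
```

Checking the version explicitly means a proposal produced against an older or newer schema fails loudly instead of being silently misread.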

Constraints as Budgets and Stopping Rules

Operational agents are constrained along a tool-call budget ($B_\text{tool}$), token budget ($B_\text{tok}$), time budget ($B_\text{time}$), and action risk budget ($B_\text{risk}$). Stopping rules are first-class, covering budget exhaustion, insufficient evidence, failed gates, unresolved contradiction, and ambiguity requiring escalation. This makes system reliability, calibrated caution, and workflow auditability explicit and quantifiable.
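The budgets and stopping rules can be sketched as a single check evaluated before each agent step. The precedence order, the `min_evidence` threshold, and all field names below are assumptions for illustration; the paper's summary lists the stopping conditions but not how they are prioritized.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stop(Enum):
    BUDGET_EXHAUSTED = "budget_exhausted"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"
    GATE_FAILED = "gate_failed"
    CONTRADICTION = "unresolved_contradiction"
    ESCALATE = "ambiguity_requires_escalation"


@dataclass
class Budgets:
    tool_calls: int  # B_tool
    tokens: int      # B_tok
    seconds: float   # B_time
    risk: float      # B_risk


@dataclass
class RunState:
    tool_calls: int = 0
    tokens: int = 0
    seconds: float = 0.0
    risk: float = 0.0
    gate_failed: bool = False
    contradiction: bool = False
    ambiguous: bool = False
    ready_to_act: bool = False
    evidence_items: int = 0


def stop_reason(b: Budgets, s: RunState, min_evidence: int = 2) -> Optional[Stop]:
    """Return why the agent must stop, or None if it may take another step."""
    if s.gate_failed:
        return Stop.GATE_FAILED
    if (s.tool_calls >= b.tool_calls or s.tokens >= b.tokens
            or s.seconds >= b.seconds or s.risk > b.risk):
        return Stop.BUDGET_EXHAUSTED
    if s.contradiction:
        return Stop.CONTRADICTION
    if s.ambiguous:
        return Stop.ESCALATE
    if s.ready_to_act and s.evidence_items < min_evidence:
        return Stop.INSUFFICIENT_EVIDENCE
    return None
```

Returning a named reason instead of a boolean lets each stop be logged and later counted, which is what makes abstention behaviour measurable.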

NetOps and AIOps: Domains, Primitives, and Safety

NetOps tasks are mapped as configuration synthesis, closed-loop intent enforcement, and post-change monitoring. The state abstraction and gate formalism connect to properties like reachability, isolation, waypoint enforcement, and transient safety during rollout, codified as runtime invariants. Update protocols are dissected into canary, phased expansion, and continuous rollback hooks, with explicit linkages to coverage-driven risk scoring [56,112].
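The canary, phased expansion, and rollback sequence can be illustrated with a small staged-rollout loop. This is a sketch under stated assumptions: the stage fractions, the per-device health check, and the callback names `apply`, `healthy`, and `rollback` are hypothetical, and a real system would check invariants across the network rather than per device.

```python
import math
from typing import Callable, Sequence


def staged_rollout(
    devices: Sequence[str],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: Sequence[float] = (0.05, 0.25, 1.0),
) -> tuple[bool, list[str]]:
    """Apply a change in growing stages; undo everything on the first failure."""
    done: list[str] = []
    for fraction in stages:
        # Cumulative number of devices that should be changed by this stage.
        upto = max(1, math.ceil(len(devices) * fraction))
        for device in devices[len(done):upto]:
            apply(device)
            done.append(device)
            if not healthy(device):
                # Roll back in reverse order of application.
                for changed in reversed(done):
                    rollback(changed)
                return False, done
    return True, done
```

With 20 devices and the default stages, the first stage touches one canary device, the second extends to five, and the last covers all twenty.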

AIOps is characterized by multi-modal evidence integration (logs, traces, metrics, tickets), tool-grounded RCA, and mitigation planning. Tool-augmented LLMs are positioned as coordinators of evidence gathering, querying, and diagnosis. Their conclusions are to be substantiated through structured artefacts and explicitly scored hypotheses, rather than judged by the textual coherence of the final answer.

Causal reasoning is promoted as a complementary backbone, with LLMs orchestrating evidence gathering and disambiguation above a causal inference substrate. The survey reviews recent advances in event-graph causal RCA [117–119], effect estimation in telco settings [114–116], and multi-modal fusion [38–39].

Evaluation: Task and Metrics Definition, Trace Scoring, Benchmarking Practice

Workflow performance, not static accuracy, is mandated as the axis of evaluation. The metrics glossary covers MTTD, MTTR, RCA@k, policy violation rate, rollback rate, unnecessary-action cost, tool invocation counts, latency, and explainability (causal path quality). Stopping behavior and abstention calibration are measured directly.
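Several of these metrics are simple to compute once traces are recorded. The sketch below assumes RCA@k means the fraction of incidents whose true root cause appears among the top k ranked candidates, and that MTTR is measured from detection to resolution; both are common conventions but are assumptions here, as the summary does not define them.

```python
from typing import Sequence


def rca_at_k(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of incidents whose true root cause is in the top-k candidates."""
    if not truth:
        return 0.0
    hits = sum(1 for candidates, cause in zip(ranked, truth) if cause in candidates[:k])
    return hits / len(truth)


def mean_duration(intervals: Sequence[tuple[float, float]]) -> float:
    """Mean of (end - start); use (onset, detected) for MTTD, (detected, resolved) for MTTR."""
    if not intervals:
        return 0.0
    return sum(end - start for start, end in intervals) / len(intervals)


def policy_violation_rate(violations: int, actions: int) -> float:
    """Share of executed actions that violated a policy."""
    return violations / actions if actions else 0.0
```

For instance, with three incidents where the true cause is ranked second, second, and absent, `rca_at_k` is 0 at k=1 and 2/3 at k=2.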

Agentic benchmarks are required to expose:

Realistic tool surfaces and APIs.
Tool traces and evidence trails.
Ground truth for action outcomes, including rollbacks, under perturbed, drifted, and adversarial data.

Offline evaluation protocols stress drift-stress testing, duplicate detection, provenance integrity, and explicit reporting of policy violations and recoverability. LLM-based "judging" is considered at most auxiliary; ground-truth trace compliance is primary.

Security, Safety, and Governance as Enforceable System Properties

The threat model for agentic NetOps/AIOps is articulated with (i) prompt injection via operational artefacts, (ii) RAG poisoning, (iii) telemetry integrity attacks, (iv) tool misuse/excessive agency, and (v) data exfiltration. Technical controls are mapped to each threat:

Structured input filtering, provenance-aware retrieval, tool allow-lists, role-based redaction, and tamper-evident audit trails.
Non-bypassable gating mechanisms for any privileged tool use.
Privacy by minimization, pre-model redaction, incident-scoped retention, and schema-level governance.
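Two of the controls listed above, tool allow-lists and tamper-evident audit trails, can be sketched together using a hash chain: each log entry commits to the previous entry's hash, so any later edit breaks verification. The allow-listed tool names and the record layout are assumptions for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64
ALLOWED_TOOLS = {"query_metrics", "read_config"}  # hypothetical allow-list


def _digest(prev: str, record: dict) -> str:
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> None:
    """Append a record whose hash covers the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev, "hash": _digest(prev, record)})


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry is detected."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or _digest(prev, entry["record"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


def record_tool_call(log: list, tool: str, args: dict) -> bool:
    """Log every attempted call, allowed or not, and return the decision."""
    allowed = tool in ALLOWED_TOOLS
    append_entry(log, {"tool": tool, "args": args, "allowed": allowed})
    return allowed
```

Logging denied calls as well as allowed ones is deliberate: attempted misuse is itself evidence worth preserving. A hash chain detects tampering but does not prevent it, so the latest hash would still need to be anchored somewhere the agent cannot write.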

Security evaluation protocols require explicit attacker surfaces, injected/poisoned variant runs, and reporting of tool-call/approval bypasses and refusal-storms.

Open Problems and Research Agenda

The paper codifies the open research questions as formal contract satisfaction problems with measurable evaluation handles:

Operationalizing autonomy rungs as contracts: measuring how often system traces satisfy all contract clauses ($T \models C_k$), rather than relying on superficial metrics.
Compositional and transient safety: verifying invariants across both steady-state and rollout timelines.
Causal-LLM integration: cost, efficiency, and explainability trade-offs.
Robustness under continual tool, signal, and artefact drift.
Schema standardization for tool calls, evidence, diffs, approvals, and auditability.
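The first item in this list, measuring how often traces satisfy a contract, can be sketched as a clause-by-clause check followed by a rate. The dictionary layout of traces and contracts is hypothetical; it simply mirrors the five contract components (tools, evidence, gates, rollout, budgets).

```python
def satisfies(trace: dict, contract: dict) -> bool:
    """True if the trace meets every clause of the contract (T |= C_k)."""
    used_tools = {step["tool"] for step in trace["steps"]}
    evidence = {step["evidence_kind"] for step in trace["steps"]}
    return (
        used_tools <= contract["tools"]                        # only permitted tools
        and contract["required_evidence"] <= evidence          # required evidence present
        and contract["gates"] <= set(trace["gates_passed"])    # all mandatory gates passed
        and trace["rollout"] == contract["rollout"]            # prescribed rollout used
        and len(trace["steps"]) <= contract["max_tool_calls"]  # within tool-call budget
    )


def satisfaction_rate(traces: list, contract: dict) -> float:
    """Fraction of traces that satisfy the whole contract."""
    if not traces:
        return 0.0
    return sum(satisfies(t, contract) for t in traces) / len(traces)
```

A trace counts only if every clause holds, so a run that reaches the right answer while skipping a gate still scores as a failure.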

Central, non-negotiable assertion: closed-loop autonomy in critical infrastructure requires exogenous, system-level checks and strong audit/archeology, validated through realistic, adversarial, and drifted evaluations. Unconstrained model-centric autonomy remains fragile.

Conclusion

The deployment of LLM-based agents for NetOps and AIOps mandates a system view in which the locus of reliability, security, and operational accountability sits outside the model: structured, typed, verification-heavy interfaces, staged execution with audit and rollback, and workflow contracts enforceable by external gates. Real progress demands explicit measurement of workflow trace compliance, policy-violation resilience, and response to drift and adversarial conditions.

Benchmarking and evaluation must report operational artifacts, not only final task accuracy, and security claims must be substantiated by concrete, attacker-aware protocols. The survey advocates a transition from model-centric to contract-centric deployment and evaluation, with autonomy defined by auditable and enforceable commitments at each operational rung. Future developments must address compositional safety, hybridization with causal inference, and schema-level standardization for scalable, robust, and governable autonomy in production-grade NetOps and AIOps systems.
