
Multi-Agent Penetration Testing AI for the Web

Published 28 Aug 2025 in cs.CR and cs.AI | (2508.20816v1)

Abstract: AI-powered development platforms are making software creation accessible to a broader audience, but this democratization has triggered a scalability crisis in security auditing. With studies showing that up to 40% of AI-generated code contains vulnerabilities, the pace of development now vastly outstrips the capacity for thorough security assessment. We present MAPTA, a multi-agent system for autonomous web application security assessment that combines LLM orchestration with tool-grounded execution and end-to-end exploit validation. On the 104-challenge XBOW benchmark, MAPTA achieves 76.9% overall success with perfect performance on SSRF and misconfiguration vulnerabilities, 83% success on broken authorization, and strong results on injection attacks including server-side template injection (85%) and SQL injection (83%). Cross-site scripting (57%) and blind SQL injection (0%) remain challenging. Our comprehensive cost analysis across all challenges totals $21.38 with a median cost of $0.073 for successful attempts versus $0.357 for failures. Success correlates strongly with resource efficiency, enabling practical early-stopping thresholds at approximately 40 tool calls or $0.30 per challenge. MAPTA's real-world findings are impactful given both the popularity of the respective scanned GitHub repositories (8K-70K stars) and MAPTA's low average operating cost of $3.67 per open-source assessment: MAPTA discovered critical vulnerabilities including RCEs, command injections, secret exposure, and arbitrary file write vulnerabilities. Findings are responsibly disclosed, 10 findings are under CVE review.

Summary

  • The paper presents MAPTA, a multi-agent AI that autonomously assesses web vulnerabilities using LLM-guided orchestration and sandboxed execution.
  • It achieves a 76.9% success rate on 104 challenges, excelling in SSRF, injection, and misconfiguration tests while minimizing resource costs.
  • MAPTA couples strategic planning with secure, isolated execution and proof-of-concept validation to optimize penetration testing and resource allocation.

Multi-Agent Architectures for Autonomous Web Application Penetration Testing: An Analysis of MAPTA

Introduction

The paper "Multi-Agent Penetration Testing AI for the Web" (2508.20816) presents MAPTA, a multi-agent system for autonomous web application security assessment. MAPTA is designed to address the scalability crisis in security auditing precipitated by the proliferation of AI-generated code, which is frequently insecure. The system leverages LLM orchestration, tool-grounded execution, and mandatory end-to-end exploit validation to autonomously discover and verify vulnerabilities in web applications. This article provides a technical analysis of MAPTA’s architecture, evaluation methodology, empirical results, and implications for the future of AI-driven security assessment.

System Architecture and Agent Design

MAPTA’s architecture is predicated on a three-role agent model: Coordinator, Sandbox, and Validation agents. The Coordinator agent is responsible for high-level attack-path reasoning, tool orchestration, and report synthesis. Sandbox agents execute tactical steps within a shared per-job Docker container, providing isolated LLM contexts for focused sub-tasks. The Validation agent is tasked with converting candidate findings into verified, end-to-end proof-of-concept (PoC) exploits, returning pass/fail evidence based on concrete execution.

Figure 1: MAPTA multi-agent architecture with single-pass controller and evidence-gated branching. The Coordinator orchestrates, Sandbox agents execute in a shared container, and the Validation agent verifies PoCs.

This separation of concerns enables MAPTA to couple high-level strategic planning with secure, isolated execution and empirical validation. The architecture supports both blackbox (CTF-style) and whitebox (source-available) assessment modes, with dynamic orchestration logic that adapts to the operational context. All agents for a given assessment share a single Docker container, amortizing setup costs and enabling stateful reuse of artifacts, while maintaining LLM context isolation to prevent cross-talk and prompt bloat.
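The three-role split can be sketched as a simple control loop. The sketch below is illustrative only: the class and method names are our own, and the LLM calls and container plumbing that MAPTA actually uses are stubbed out with canned values.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    name: str
    evidence: str = ""
    validated: bool = False

@dataclass
class Assessment:
    """Illustrative three-role loop: a Coordinator plans, Sandbox agents
    execute in a shared container, and a Validation step gates findings."""
    container: str                       # shared per-job Docker container id
    findings: list = field(default_factory=list)

    def coordinator_plan(self, target: str) -> list:
        # High-level attack-path reasoning would call an LLM here;
        # canned sub-tasks stand in for illustration.
        return [f"probe {target} for SSRF", f"test {target} for SQLi"]

    def sandbox_execute(self, task: str) -> Finding:
        # Each Sandbox agent gets a fresh LLM context but reuses
        # self.container, amortizing setup costs across tasks.
        return Finding(name=task, evidence=f"ran in {self.container}")

    def validate(self, f: Finding) -> Finding:
        # Validation agent: accept only findings backed by concrete
        # execution evidence (an end-to-end PoC); otherwise reject.
        f.validated = bool(f.evidence)
        return f

    def run(self, target: str) -> list:
        for task in self.coordinator_plan(target):
            candidate = self.sandbox_execute(task)
            if self.validate(candidate).validated:
                self.findings.append(candidate)
        return self.findings
```

The key design point this mirrors is evidence gating: nothing reaches the report without passing the validation step.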

Evaluation Methodology

MAPTA is evaluated on the XBOW benchmark, comprising 104 web application security challenges designed for autonomous penetration testing. The evaluation is conducted under blackbox conditions, with MAPTA receiving only the target URL and challenge description. The system’s performance is measured using four objective metrics: binary success (flag discovery), time to solution, computational cost (LLM API usage), and tool execution efficiency.

MAPTA achieves a 76.9% overall success rate, with perfect performance on SSRF and misconfiguration vulnerabilities, 83% on broken authorization, and strong results on injection attacks (SSTI: 85%, SQLi: 83%). Cross-site scripting (XSS) and blind SQL injection remain challenging, with 57% and 0% success rates, respectively.

Figure 2: Cumulative distribution of challenge completion times. Solved challenges have a median time of 96.1s; unsolved challenges, 508.9s.


Figure 3: CDF of total costs and per-challenge cost by token type. Solved challenges have a median cost of $0.073; unsolved, $0.357. Output tokens are the largest cost component.


Figure 4: Cumulative distribution of token usage by type. Cached token utilization contributes to cost efficiency; reasoning tokens reflect analytical processing.

MAPTA’s resource accounting is rigorous, with a total cost of $21.38 for all challenges, a median cost of $0.073 for successful attempts, and $0.357 for failures. Output tokens are the primary cost driver, reflecting the system’s analytical reasoning requirements. The system demonstrates adaptive tool selection, with a preference for direct command execution (notably curl and bash), and an average of 25.1 tool calls per challenge.

Figure 5: Tool usage patterns across challenges. Command execution dominates over Python runtime calls; total tool invocations per challenge are shown.


Figure 6: Command usage heatmap. Curl dominates, reflecting HTTP-centric testing; bash usage indicates complex exploitation.
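The per-challenge resource accounting described above (tool calls, token types, dollar cost) amounts to a small ledger. The sketch below is an assumption about what a tracker such as the paper's UsageTracker might record; the field names and the per-million-token pricing parameters are hypothetical, supplied by the caller rather than hard-coded.

```python
from collections import Counter

class UsageTracker:
    """Hypothetical per-challenge resource ledger (field names assumed)."""

    def __init__(self):
        self.tool_calls = Counter()   # e.g. {"run_command": 18, "run_python": 3}
        self.tokens = Counter()       # e.g. {"input": ..., "output": ..., "cached": ...}
        self.cost_usd = 0.0

    def record_tool(self, name):
        self.tool_calls[name] += 1

    def record_llm(self, input_toks, output_toks, cached_toks,
                   in_rate, out_rate, cached_rate):
        # Rates are USD per million tokens, passed in by the caller.
        # Output tokens dominate cost in the paper's accounting.
        self.tokens.update(input=input_toks, output=output_toks,
                           cached=cached_toks)
        self.cost_usd += (input_toks * in_rate
                          + output_toks * out_rate
                          + cached_toks * cached_rate) / 1e6

    def total_tool_calls(self):
        return sum(self.tool_calls.values())
```

A ledger like this is what makes the early-stopping thresholds discussed in the next section enforceable at run time.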

Correlation and Efficiency Analysis

A key contribution of MAPTA is the quantification of negative correlations between resource utilization and success. Statistically significant negative correlations are observed between success and tool usage (r = -0.661), cost (r = -0.606), token usage (r = -0.587), and time (r = -0.557).

Figure 7: Correlation analysis between challenge success and resource utilization. Negative correlations indicate efficient solutions for successful challenges.

These findings indicate that successful challenges are solved quickly and efficiently, while failures are characterized by extensive exploration and higher resource consumption. This enables the implementation of practical early-stopping thresholds (e.g., 40 tool calls, $0.30, or 300 seconds), optimizing resource allocation in production deployments.
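The reported thresholds translate directly into a stopping predicate. A minimal sketch, assuming the thresholds from the text (40 tool calls, $0.30, 300 seconds) are checked after every tool invocation:

```python
def should_stop(tool_calls, cost_usd, elapsed_s,
                max_tools=40, max_cost=0.30, max_time=300.0):
    """Early-stopping heuristic built on the reported thresholds:
    abort a challenge once any single resource budget is exhausted."""
    return (tool_calls >= max_tools
            or cost_usd >= max_cost
            or elapsed_s >= max_time)
```

Under the paper's medians, a typical solved run (cost $0.073, 96.1 s) stays well inside all three budgets, while a failing run's extended exploration trips one of them early.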

Vulnerability Category Performance

MAPTA’s performance is analyzed across 13 vulnerability categories, spanning the majority of the OWASP Top 10 (2021) and several API Top 10 (2023) families.

Figure 8: Vulnerability category distribution across 104 XBOW challenges. 13 categories spanning 8/10 OWASP Top 10 (2021).

MAPTA excels at SSRF, misconfiguration, and injection vulnerabilities (SSTI, SQLi, command injection), but underperforms on XSS and blind SQLi. Authorization vulnerabilities (IDOR, privilege escalation) are detected with high accuracy (83%), while authentication and business logic flaws remain challenging. The system’s empirical validation approach reduces false positives but may yield false negatives for complex business logic vulnerabilities that require deeper semantic understanding.

Real-World Application Assessment

MAPTA is further evaluated on 10 popular open-source web applications (8K–70K GitHub stars), representing diverse technology stacks. The system discovers 19 vulnerabilities across 6 applications, with 14 classified as high or critical severity (including RCE, command injection, secret exposure, and arbitrary file write). The average assessment cost is $3.67 per application, with a mean assessment time of 50.7 minutes.

Figure 9: Vulnerability distribution and assessment costs across targets. Stacked bars show severity; orange line indicates assessment costs.


Figure 10: Assessment time versus vulnerability discovery patterns. Labels indicate vulnerability types found.

Notably, the cost of assessment does not directly correlate with the number or severity of findings, indicating that some complex applications are secure while others yield critical vulnerabilities at low computational cost. All findings are responsibly disclosed, with 10 under CVE review.

Implications and Future Directions

MAPTA demonstrates that multi-agent LLM-driven architectures can achieve competitive performance in autonomous web application security assessment, with rigorous cost-performance accounting and empirical exploit validation. The system’s open-source implementation and reproducible evaluation methodology address the reproducibility gap in prior commercial and academic systems.

The negative correlation between resource usage and success provides actionable heuristics for early stopping and resource budgeting, a critical consideration for large-scale or continuous deployment. The empirical validation approach reduces false positives, but further research is needed to address false negatives in business logic and timing-based vulnerabilities (e.g., blind SQLi).

Future developments may include enhanced payload generation, feedback-driven exploration, integration of browser-based agents for DOM-centric vulnerabilities, and automated canary placement for business logic validation. Scaling MAPTA to larger, more complex applications and integrating with CI/CD pipelines for continuous assessment are promising directions.

Conclusion

MAPTA establishes a new standard for autonomous penetration testing by combining multi-agent orchestration, tool-grounded execution, and mandatory exploit validation. The system achieves 76.9% success on the XBOW benchmark, with strong performance on critical vulnerability classes and comprehensive cost accounting. The open-science approach, rigorous evaluation, and actionable resource optimization strategies position MAPTA as a practical and scientifically robust framework for AI-driven security assessment. The results underscore the feasibility of deploying autonomous agents for continuous, scalable web application security testing, while highlighting open challenges in business logic and advanced exploitation scenarios.


Knowledge Gaps

Below is a single, focused list of the paper’s unresolved knowledge gaps, limitations, and open questions that future work could concretely address:

  • Lack of head-to-head, reproducible baselines: no controlled comparison against PentestGPT, PenHeal, or commercial XBOW under identical targets, budgets, and stopping criteria.
  • Model dependence unexamined: evaluation uses only GPT-5; missing ablations across model families/sizes, reasoning modes, temperatures, tool-use prompting styles, and context window sizes.
  • Run-to-run variance not reported: no multiple trials per challenge to quantify stochastic performance, confidence intervals, or stability across seeds.
  • Blind SQL injection capability gap (0% success): absence of time-based inference, out-of-band exfiltration (DNS/HTTP callbacks), second-order strategies, and resilient delay measurement under jitter.
  • XSS handling limited (57% success): no headless browser/DOM instrumentation (e.g., Playwright/Selenium), CSP-aware payload generation, sink discovery, or client-side taint tracking.
  • Weak authentication analysis (33% success): insufficient modeling of session state, cookie/JWT manipulation, MFA flows, password reset/invite flows, lockouts, and account lifecycle exploitation.
  • Business-logic vulnerability validation remains heuristic: no systematic framework for state-space exploration, role-based scenario generation, canary instrumentation, or ground-truth datasets for logic flaws.
  • Out-of-scope OWASP categories: A08 (Software and Data Integrity Failures) and A09 (Logging/Monitoring Failures) not evaluated; API abuse topics like rate limiting, resource consumption, and DoS also omitted.
  • Protocol coverage narrow: focus on HTTP; no support for WebSockets, GraphQL, gRPC, or SOAP with schema-aware exploration and mutation.
  • Limited evaluation on modern SPAs: CTF apps are simple; no robust crawling and execution of complex front-end code (routing, state management, API orchestration).
  • Real-world validation scope small and not benchmarked: 10 repositories, whitebox only, anonymized targets; lacks comparison to expert pen-test results, bug bounty outcomes, and quantification of recall/precision.
  • False-positive/negative rates not measured: PoC validation reduces false positives but no labeled benchmarks to quantify both FP and FN at scale.
  • Early-stopping thresholds proposed but untested: need prospective evaluation of adaptive stopping policies to avoid missing slow-to-exploit vulnerabilities.
  • Agent architecture ablations missing: no study of the number of sandbox agents, concurrency, scheduling, role configurations, or evidence-gated branching strategies on performance and cost.
  • Tooling breadth limited and inconsistent: paper mentions nmap/ffuf but table shows only run_command/run_python; lacks integration with specialized tools (e.g., sqlmap, nuclei, wafw00f) and automatic parameter tuning.
  • No headless browser/JS instrumentation pipeline: needed for XSS, CSRF, DOM-based vulns, postMessage issues, and client-side security policies (CSP/SRI).
  • No endpoint/code coverage metrics: no measurement of breadth/depth of exploration (endpoints hit, code paths), leaving blind spot detection unresolved.
  • Reproducibility constrained by proprietary models: GPT-5 dependence; need publishable prompts, seeds, versioned container images, toolchain scripts, and finalized Docker fixes to ensure full reproducibility.
  • Isolation and safety analysis limited: only Docker; no assessment of exploit side-effects, container escape risks, resource quotas, or secure handling/disposal of discovered secrets.
  • Ethical deployment on live targets not explored: governance, consent, rate limiting, safe operating envelopes, and organizational workflows for continuous testing in production remain open.
  • Stealth and evasion strategies unaddressed: no methods for WAF/IDS evasion, low-noise scanning, adaptive throttling, or polymorphic payloads.
  • Cloud-specific SSRF coverage narrow: examples target 169.254.169.254 only; need standardized testbeds across cloud providers (AWS/GCP/Azure) and IAM role/instance-profile exploitation validation.
  • Training data contamination not tested: no checks for memorization/regurgitation of known challenges, nor defenses against prompt injection or tool-output poisoning.
  • Observability/logging trade-offs unclear: “minimal logs” may hinder auditability; need privacy-preserving telemetry that still supports debugging, replay, and evidence provenance.
  • Benchmark diversity limited: results bound to XBOW; need cross-benchmark validation (Juice Shop, WebGoat, DVWA, modern enterprise apps) to test generalization.
  • Robustness against hardened defenses not assessed: performance against prepared statements, strict CSP, WAF, rate limiting, and defense-in-depth configurations remains unknown.
  • Role and tenant modeling missing: automated account creation, privilege matrices, multi-tenant isolation checks, and cross-tenant data leakage testing not implemented.
  • Out-of-band channels absent: lack of Collaborator-like infrastructure for blind and asynchronous vulns (DNS/SMTP callbacks, webhook relays).
  • Portfolio-scale orchestration unproven: per-job container design not evaluated for large portfolios, parallel scans, scheduling, and resource contention.
  • Validation oracle design needs standardization: real-world validation relies on side-effects; need class-specific oracles and reproducible evidence criteria across vuln types.
  • Formal planning/search not explored: no attack graphs, POMDP/RL planners, or curriculum learning for multi-step exploit chains and strategic exploration.
  • Prompt/tool safety not enforced: run_command/run_python expose risks if the model issues unsafe instructions; need guardrails, allowlists, sandboxed interpreters, and policy enforcement.
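On the blind SQL injection gap specifically, the "resilient delay measurement under jitter" called for above can be approached by taking medians over repeated probes rather than trusting a single timing. A minimal sketch of that idea (not MAPTA's implementation; `probe` is any caller-supplied zero-argument callable that issues one request):

```python
import statistics
import time

def median_latency(probe, trials=9):
    """Time a request callable repeatedly and take the median,
    damping network jitter that defeats single-shot timing checks."""
    samples = []
    for _ in range(trials):
        t0 = time.perf_counter()
        probe()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def timing_differs(baseline_probe, delayed_probe, threshold_s=0.5):
    # Evidence of a time-based oracle: the delayed payload's median
    # latency exceeds the baseline's by more than the chosen margin
    # (set below the injected delay, above expected jitter).
    return (median_latency(delayed_probe)
            - median_latency(baseline_probe)) > threshold_s
```

A median over an odd number of trials is robust to a few jitter outliers in either direction, which is the property single measurements lack.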

Practical Applications

Immediate Applications

The following applications can be deployed now using MAPTA’s current capabilities, performance characteristics, and open-source artifacts.

  • Continuous web application security scanning in CI/CD
    • Description: Integrate MAPTA as a pre-merge or nightly job to run autonomous penetration tests on each build and environment. Enforce low-cost early stopping (≈40 tool calls, ≈$0.30, ≈300 seconds) and require end-to-end exploit validation before blocking merges.
    • Sectors: Software, cloud platforms, e-commerce, fintech, SaaS.
    • Tools/products/workflows: MAPTA CI plugin; “Security Budget Controller” using resource caps; “Exploit Validation Oracle” generating PoCs with evidence; Slack alert workflow via send_slack_alert/send_slack_summary.
    • Assumptions/dependencies: Permissioned testing on dev/staging; Dockerized per-job environments; LLM access; organizational policy to accept automated blocking based on validated exploits.
  • Low-cost internal red teaming and attack surface monitoring
    • Description: Periodic autonomous blackbox scans of internet-facing apps and microservices, focusing on categories with strong performance (SSRF, misconfiguration, SQLi, SSTI, command injection).
    • Sectors: Finance, healthcare, critical SaaS, government.
    • Tools/products/workflows: MAPTA orchestrator with scheduled jobs; evidence packs (PoC, logs, reproduction steps) to feed vulnerability management platforms.
    • Assumptions/dependencies: Legal authorization for scans; isolated target environments to avoid production impact; acceptance of current blind SQLi/XSS limitations.
  • Open-source repository reviews and maintainer assistance
    • Description: Use whitebox mode to clone, deploy, and assess open-source projects; generate PoC-backed issues and advisories; streamline CVE submissions where appropriate.
    • Sectors: Open-source ecosystems, developer tooling.
    • Tools/products/workflows: “Maintainer Audit Bot” that opens issues with validated PoCs; repository action to auto-fire MAPTA on PRs affecting security-sensitive code.
    • Assumptions/dependencies: Responsible disclosure workflows; maintainers’ consent; reproducible local deployment in Docker; support for diverse stacks (Node.js, Python, Next.js).
  • Bug bounty triage and validation
    • Description: Security teams use MAPTA’s Validation agent to verify bounty submissions with concrete exploit execution before reward decisions.
    • Sectors: Platform companies, marketplaces, large web properties.
    • Tools/products/workflows: Intake pipeline that converts reported vectors into candidate PoCs and runs them in MAPTA’s sandbox; evidence gating to reduce false positives.
    • Assumptions/dependencies: Structured submission formats; safe sandboxing; legal controls for exploit execution; alignment with program rules.
  • SOC/DevSecOps augmentation for new deployments
    • Description: Automatically run a short autonomous pentest after each deployment, emphasizing high-yield categories (SSRF, misconfig, authorization flaws).
    • Sectors: Enterprise IT, cloud-native platforms.
    • Tools/products/workflows: Deployment hooks that trigger MAPTA; “Risk dashboard” summarizing findings with severity and exploit evidence.
    • Assumptions/dependencies: Access to deployment metadata/URLs; agreed escalation path for critical findings; budget controls per deployment.
  • Secrets and configuration exposure detection
    • Description: Detect client-side secret leaks and misconfigurations (e.g., API keys in JavaScript configs, overly permissive CORS, exposed admin interfaces).
    • Sectors: SaaS, AI-enabled apps, front-end heavy platforms.
    • Tools/products/workflows: Static-dynamic hybrid scan of served assets; rule packs emphasizing secret discovery and SSRF to cloud metadata endpoints.
    • Assumptions/dependencies: Availability of served assets and endpoints; handling of obfuscated builds; acceptance of automated discovery in staging/prod mirrors.
  • Rapid M&A/due diligence security checks
    • Description: Quick autonomous assessments of target companies’ web assets with PoC-backed results to inform acquisition risk.
    • Sectors: Finance, private equity, corporate development.
    • Tools/products/workflows: Standardized due diligence workbook populated from MAPTA runs; evidence-backed risk scoring.
    • Assumptions/dependencies: Legal permissions; scoped assets; time and cost budgets based on resource caps.
  • Security education and lab exercises
    • Description: Use XBOW + MAPTA to build hands-on courses on web security, multi-agent orchestration, and exploit validation.
    • Sectors: Academia, bootcamps, corporate training.
    • Tools/products/workflows: Course modules that replicate CTF runs; “observability worksheets” leveraging UsageTracker to teach tradeoffs in tokens/cost/time/tool calls.
    • Assumptions/dependencies: Classroom sandboxes; GPU/LLM access; curated exercises emphasizing categories where MAPTA performs well.
  • “Autonomous Pentest as a Service” for SMEs and startups
    • Description: Offer MAPTA-backed assessments at predictable costs, with optional retesting after remediation.
    • Sectors: SMBs, startups, agencies.
    • Tools/products/workflows: Managed service portal; scheduling; report generation with PoCs and remediation guidance; early-stop thresholds to protect budgets.
    • Assumptions/dependencies: Intake forms defining scope/authorization; scalable Docker infrastructure; customer alignment on exploit execution in isolated environments.
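As an illustration of the secrets-and-configuration-exposure use case above, a scanner over served assets can start from a handful of regular-expression rules. The patterns below are a deliberately tiny, hypothetical rule set; production scanners combine far larger pattern libraries with entropy checks and source context.

```python
import re

# Illustrative patterns only; real rule sets are much broader.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(
        r"(?i)api[_-]?key['\"]?\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_asset(text):
    """Return {pattern_name: [matches]} for one served asset
    (JavaScript bundle, HTML page, or config file body)."""
    hits = {}
    for name, pattern in SECRET_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

In a pipeline like the one described above, each fetched asset would be passed through `scan_asset` and any hit escalated for end-to-end validation before reporting.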

Long-Term Applications

These applications require further research, scaling, or development to reach production-grade reliability.

  • Timing-based and browser-centric exploitation improvements
    • Description: Extend agents with headless browser instrumentation and robust timing channels to tackle blind SQLi (currently 0% success) and advanced XSS variants (57% success).
    • Sectors: Software, e-commerce, ad-tech, social platforms.
    • Tools/products/workflows: “Browser Agent” with DOM inspection, CSP analysis, event simulation; “Timing Oracle” for latency-based inference.
    • Assumptions/dependencies: Reliable browser automation at scale; defenses against anti-bot; new training data for timing attacks; safe test harnesses.
  • Autonomous code remediation and secure-by-default CI gates
    • Description: Couple validated findings with automated patch generation and secure configuration hardening; propose PRs that include tests and PoC regression blockers.
    • Sectors: Software, cloud, platforms with strict compliance.
    • Tools/products/workflows: “AutoFixer” that drafts patches; policy engine requiring exploit-proof regression tests; integration with PenHeal-like remediation strategies.
    • Assumptions/dependencies: High-quality code understanding; organizational acceptance of AI-suggested changes; human-in-the-loop review.
  • Sector-specific adapters for regulated domains
    • Description: Tailor agents for healthcare (FHIR/HL7), finance (PCI, payment flows), and public sector workflows, including business logic vulnerabilities and multi-step attack chains.
    • Sectors: Healthcare, finance, government.
    • Tools/products/workflows: Domain-specific exploit libraries; compliance-aware testing policies; canary placement for business workflow validation.
    • Assumptions/dependencies: Deep domain models; legal and ethical guardrails; access to realistic test environments; data privacy constraints.
  • Enterprise vulnerability management integration and governance
    • Description: Embed MAPTA into enterprise platforms for asset inventory, risk scoring, SLA tracking, and budget-based scanning policies tied to risk tiers.
    • Sectors: Large enterprises, MSPs, MSSPs.
    • Tools/products/workflows: “Security Budget Controller” aligned with risk appetite; auto-prioritization based on exploit evidence; dashboards for audit readiness.
    • Assumptions/dependencies: API integrations with existing VM tools; organizational change management; reliable models for cost-to-risk tradeoff.
  • Standards and policy for autonomous penetration testing
    • Description: Develop guidelines for permissioning, safe PoC execution, evidence handling, responsible disclosure, and audit trail requirements; advocate inclusion in frameworks (ISO/IEC, NIST, PCI).
    • Sectors: Policy, compliance, audit.
    • Tools/products/workflows: Standard operating procedures; certification schemes for autonomous pentesting; regulatory playbooks for AI-driven security audits.
    • Assumptions/dependencies: Multi-stakeholder consensus; legal clarity on automated testing; regulator acceptance of PoC-based evidence.
  • Marketplace products: on-prem appliances and sovereign deployments
    • Description: Offer MAPTA as an on-prem appliance for privacy-sensitive sectors, or sovereign deployments with local LLMs to avoid data egress.
    • Sectors: Defense, healthcare, finance, public sector.
    • Tools/products/workflows: Hardened “Pentest Appliance” with local model serving; air-gapped Docker orchestration; compliance logging.
    • Assumptions/dependencies: High-quality local LLMs; performance parity with cloud models; operational maturity.
  • Multi-step business logic validation with canary instrumentation
    • Description: Improve validation of complex workflows by inserting canaries and telemetry hooks into application paths; reduce false negatives for theoretical-but-stateful exploits.
    • Sectors: Complex B2B SaaS, marketplaces, ERP/CRM systems.
    • Tools/products/workflows: “Workflow Canary Injector”; exploit path tracing; policy-driven validation oracles for intended vs. unintended behavior.
    • Assumptions/dependencies: Modifiable test environments; collaboration with app teams; standardized hooks.
  • Security economics and risk quantification
    • Description: Use MAPTA’s cost-performance telemetry (negative correlations between success and resource use) to inform risk models and budget policies.
    • Sectors: CFO office, GRC, cybersecurity leadership.
    • Tools/products/workflows: Budget-aware scanning strategies; cost-to-coverage analytics; risk thresholds that trigger deeper manual reviews.
    • Assumptions/dependencies: Data aggregation across portfolios; mapping exploit classes to business impact; governance adoption.
  • Continuous protection for no-code/low-code platforms
    • Description: Embed agents directly into citizen developer platforms for real-time feedback and exploit blocking while users build apps.
    • Sectors: No-code/low-code ecosystems, SMBs, internal tooling.
    • Tools/products/workflows: Live “Security Coach” with exploit validation; default-safe component libraries; auto-hardening templates.
    • Assumptions/dependencies: Platform integration; UX that balances guidance with freedom; scalable sandboxing; legal consent mechanisms.
  • Expanded benchmarks and open-science artifacts
    • Description: Grow reproducible datasets beyond XBOW, include modern SPAs, API-only backends, and microservice meshes; publish standardized protocols and Docker fixes.
    • Sectors: Academia, industry R&D.
    • Tools/products/workflows: New benchmark suites; reproducibility toolchains; community challenges for agent orchestration.
    • Assumptions/dependencies: Sustained funding; community participation; coverage of excluded OWASP categories (A08/A09) and network-level issues.

