
G-CTR: Game-Theoretic Guidance for LLM Pen Testing

Updated 13 January 2026
  • The paper introduces G-CTR, a lightweight game-theoretic guidance layer that extracts attack graphs and computes Nash equilibria within milliseconds.
  • G-CTR leverages a three-phase pipeline—graph extraction, equilibrium computation, and digest integration—to anchor LLM decisions with tactical insights.
  • Empirical benchmarks show up to 2.67× faster success times and 23× lower costs, highlighting significant improvements in strategic effectiveness.

Generative Cut-the-Rope (G-CTR) is a lightweight game-theoretic guidance layer designed to augment LLM-driven penetration-testing agents with machine-level strategic reasoning and reproducible performance. The system automatically extracts attack graphs from the agent's log context, computes Nash equilibria reflecting both attacker and defender strategies with effort-aware edge scoring, and feeds a concise digest of tactical guidance into the agent's prompt loop. By continuously closing this loop—agent generates graph, G-CTR analyzes, agent re-plans under guidance—the system reduces ambiguity, suppresses hallucinations, and anchors the model's actions to statistically promising paths. Empirical results demonstrate substantial gains in success rate, consistency, and cost-efficiency, matching expert graph structure 70–90% of the time at far higher speed and far lower cost (Mayoral-Vilches et al., 9 Jan 2026).

1. System Architecture and Workflow

The G-CTR pipeline consists of three primary phases, executed at regular intervals (typically every ~80 tool calls or ~20 seconds):

Phase 1 (Game-Theoretic Analysis):

The agent’s log (ℓ), consisting of a sequence of J messages, is processed by an LLM to extract a structured attack graph G = (V, E). NetworkX is employed to enforce acyclic paths and prune leaf nodes. The attack graph is represented with a merged root entry node v_entry and vulnerable leaf nodes. G-CTR computes a Nash equilibrium over the graph—typically with ≪5 ms overhead—informing optimal inspection strategies for defenders and attack paths for adversaries.
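Phase 1's graph sanitation can be sketched with NetworkX, which the paper names explicitly; the function name `prune_to_dag`, the cycle-breaking order, and the `vulnerable` node attribute are illustrative choices, not the paper's exact code:

```python
import networkx as nx

def prune_to_dag(G: nx.DiGraph, entry: str) -> nx.DiGraph:
    """Break cycles, then recursively drop leaves that are not
    marked vulnerable (sketch of G-CTR's Phase 1 sanitation)."""
    G = G.copy()
    # Remove one edge per detected cycle until the graph is a DAG.
    while not nx.is_directed_acyclic_graph(G):
        cycle_edges = nx.find_cycle(G)
        G.remove_edge(*cycle_edges[-1][:2])
    # Recursively prune non-vulnerable leaf nodes (the entry is kept).
    pruned = True
    while pruned:
        pruned = False
        for v in list(G.nodes):
            if (v != entry and G.out_degree(v) == 0
                    and not G.nodes[v].get("vulnerable", False)):
                G.remove_node(v)
                pruned = True
    return G
```

The result is the invariant the paper relies on: an acyclic graph whose every leaf is either vulnerable or the entry node.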

Phase 2 (Digest Generation):

Equilibrium outputs—the defender's mixed strategy σ_d*, the attacker's probability-weighted paths, and the overall success probability u*—are summarized in a digest using either algorithmic or LLM-based methods.

  • Algorithmic mode uses rule-based templates to mark bottlenecks (p < 0.95) and high-risk transitions (p > 0.90), with digest generation under 10 ms.
  • LLM mode sends a 350-word structured prompt to an LLM ("alias1"), generating linguistically rich summaries in ≈28 s. Fallback to algorithmic mode occurs on API failure.

Phase 3 (Agent Execution):

The digest is prepended to the agent's system prompt. Subsequent action selection (ReAct framework) is performed with both tool outputs and G-CTR-derived strategic hints, iteratively updating G until a termination condition is met (flag discovery or step limit) (Mayoral-Vilches et al., 9 Jan 2026).

2. Formal Game-Theoretic Model

G-CTR models attack graphs as directed acyclic graphs G = (V, E), where each edge i is assigned an effort score:

E_i = w_msg · φ_msg(i) + w_tok · φ_tok(i) + w_cost · φ_cost(i),  with  w_msg + w_tok + w_cost = 1

with

  • φ_msg(i) = 1 − (m_i − 1)/(J − 1)  (message distance)
  • φ_tok(i) = 1 − t_i/T  (token count)
  • φ_cost(i) = 1 − c_i/C  (estimated cost)

E_i ∈ [0, 1] quantifies attacker effort; higher values indicate greater difficulty traversing the edge.
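The effort score is a direct convex combination of the three normalized features. A minimal transcription (function and parameter names are mine; J, T, C are the log-wide message, token, and cost totals):

```python
def edge_effort(m_i: int, t_i: float, c_i: float,
                J: int, T: float, C: float,
                w_msg: float = 1/3, w_tok: float = 1/3,
                w_cost: float = 1/3) -> float:
    """E_i in [0, 1]; higher means the edge is harder to traverse."""
    assert abs(w_msg + w_tok + w_cost - 1.0) < 1e-9, "weights must sum to 1"
    phi_msg = 1 - (m_i - 1) / (J - 1)   # message distance
    phi_tok = 1 - t_i / T               # token count
    phi_cost = 1 - c_i / C              # estimated cost
    return w_msg * phi_msg + w_tok * phi_tok + w_cost * phi_cost
```

An edge appearing in the first message with zero tokens and zero cost scores E_i = 1; one consuming the entire log's budget scores E_i = 0. The uniform default weights are an assumption; the paper only constrains them to sum to 1.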

The defender’s expected detection probability (for an inspection distribution σ_d over nodes AS₁ ⊂ V ∖ {v_entry}) is:

U_D(σ_d, π, θ) = Σ_{c ∈ AS₁} σ_d(c) · P_π(c|θ)

where P_π(c|θ) ∝ f_Pois(d_π(θ, c); λ_a) and λ_a = 2 (edges per defender inspection window). The attacker’s success is U_A = 1 − U_D.
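The Poisson detection kernel can be sketched as follows; measuring d_π(θ, c) as hop distance along the path and assigning zero mass to off-path nodes are my simplifying assumptions:

```python
from math import exp, factorial

def pois_pmf(k: int, lam: float) -> float:
    """Poisson probability mass f_Pois(k; λ)."""
    return lam ** k * exp(-lam) / factorial(k)

def detection_probs(path: list, theta: str,
                    inspect_nodes: list, lam: float = 2.0) -> dict:
    """P_π(c|θ) ∝ f_Pois(d_π(θ, c); λ_a), normalized over the
    candidate inspection nodes. Off-path nodes get zero mass."""
    theta_idx = path.index(theta)
    scores = {c: pois_pmf(abs(path.index(c) - theta_idx), lam)
                 if c in path else 0.0
              for c in inspect_nodes}
    Z = sum(scores.values()) or 1.0   # guard against an all-zero row
    return {c: s / Z for c, s in scores.items()}
```

With λ_a = 2 the pmf peaks at distances 1–2, so the model concentrates detection probability on nodes one or two edges from the attacker's current position.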

The Nash equilibrium is computed by solving the zero-sum minimax:

σ_d* = argmin_{σ_d} max_{π, θ} U_A(σ_d, π, θ)

or equivalently, by numerically minimizing u:

minimize u
subject to  u ≥ 1 − Σ_c σ_d(c) · P_π(c|θ)  ∀(π, θ),
            Σ_c σ_d(c) = 1,  σ_d(c) ≥ 0

3. Attack Graph Extraction and Heuristics

Extraction proceeds via the following steps:

  1. Merge entry points: All nodes with minimal message_id are consolidated into the single root v_entry.
  2. Prune cycles: Acyclicity enforced via NetworkX all_simple_paths; cycles are removed.
  3. Remove non-vulnerable leaves: Non-vulnerable leaves are recursively pruned.
  4. Enforce leaf vulnerabilities: Artificial leaf nodes (“leaf_X”) are attached to all vulnerable nodes, each with E = 1.
  5. Reconnect components: Components are reconnected to root, and incoming edges to root are removed.
  6. Node-count bounds: Node count is set as a piecewise percentage of J (message sequence length):
    • Short (<70 msgs): 12–16% of J
    • Medium (70–199): 6–12%
    • Long (≥200): 3.5–5%, with hard bounds at [4, 25] total nodes.

This approach enables efficient graph extraction, balancing computational tractability with faithful representation of agent reasoning (Mayoral-Vilches et al., 9 Jan 2026).
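The piecewise node budget from step 6 reduces to a small helper. Applying the [4, 25] clamp across all regimes is my reading; the paper states the hard bounds explicitly only for long logs:

```python
def node_count_bounds(J: int) -> tuple:
    """Target node-count range for a J-message log (Section 3, step 6)."""
    if J < 70:            # short logs: 12-16% of J
        lo, hi = 0.12 * J, 0.16 * J
    elif J < 200:         # medium logs: 6-12%
        lo, hi = 0.06 * J, 0.12 * J
    else:                 # long logs: 3.5-5%
        lo, hi = 0.035 * J, 0.05 * J
    clamp = lambda x: max(4, min(25, round(x)))  # hard bounds [4, 25]
    return clamp(lo), clamp(hi)
```

For example, a 50-message log targets 6–8 nodes, while a 1000-message log saturates at the 25-node ceiling, keeping equilibrium computation tractable.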

4. Equilibrium Computation and Digest Integration

G-CTR employs the following pseudocode for Nash equilibrium computation, using precomputed Poisson probabilities and a linear program:

def ComputeNashEquilibrium(G, λ):
    # G = (V,E) weighted by E_i on edges; AS1 = V\{leaves,entry}, AS2 = all_simple_paths(entry→vuln)
    paths = list(all_simple_paths(G, entry, any_leaf))
    # Precompute P_π(c|θ) via Poisson with rate λ
    P = { (π,θ,c): PoisPMF(dist(π,θ,c),λ)/Z(π) for π in paths for θ in π for c in AS1 }
    # LP variables: σ_d(c) ∀c∈AS1, u
    lp = LinearProgram()
    σ = {c: lp.var(f"sigma_{c}", lb=0) for c in AS1}
    u = lp.var("u")
    lp.set_objective(u, minimize=True)
    for π in paths:
        for θ in π:
           lhs = 1 - sum(σ[c]*P[(π,θ,c)] for c in AS1)
           lp.add_constraint(u >= lhs)
    lp.add_constraint(sum(σ.values()) == 1)
    sol = lp.solve()
    σ_d = {c: sol.value(σ[c]) for c in AS1}
    # Attacker best responses: filter paths with max U_A under σ_d
    attacker_paths = sorted(
        paths, key=lambda π: max(1-sum(σ_d[c]*P[(π,θ,c)] for c in AS1) for θ in π), reverse=True
    )
    return σ_d, attacker_paths, sol.value(u)
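The pseudocode above can be instantiated as an actual linear program. The toy instance below is mine (two disjoint attack paths, detection probability 1 when the defender inspects the node on the attacker's path, 0 otherwise); it recovers the expected uniform defender mix with attacker success u* = 0.5:

```python
import numpy as np
from scipy.optimize import linprog

AS1 = ["A", "B"]                 # inspectable nodes
# P[row][col]: detection prob of inspecting AS1[col] against
# attacker scenario `row` (one scenario per path here).
P = np.array([[1.0, 0.0],        # attacker on the path through A
              [0.0, 1.0]])       # attacker on the path through B

n = len(AS1)
# Variables x = [sigma_A, sigma_B, u]; objective: minimize u.
c = np.zeros(n + 1)
c[-1] = 1.0
# u >= 1 - P·sigma  rewritten as  -P·sigma - u <= -1
A_ub = np.hstack([-P, -np.ones((P.shape[0], 1))])
b_ub = -np.ones(P.shape[0])
A_eq = np.array([[1.0] * n + [0.0]])   # sum of sigma = 1
b_eq = np.array([1.0])
bounds = [(0, None)] * n + [(None, None)]  # sigma >= 0, u free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds)
sigma_d = dict(zip(AS1, res.x[:n]))   # uniform: 0.5 each
u_star = float(res.x[-1])             # attacker success: 0.5
```

This symmetric case is the graph analogue of matching pennies: any defender bias toward one path lets the attacker route through the other, so the equilibrium mixes uniformly.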

Digest generation employs algorithmic or LLM-based summarization, with thresholds for bottlenecks (p < 0.95) and high-risk transitions (p > 0.90):

G-CTR Security Analysis
Identified attack paths:
  • Path 1: entry→… [67%] → target
  • Path 2: entry→… [32%] → target
Critical chokepoints (bottlenecks):
  • node X→Y: 3.1% success
High-risk transitions:
  • node A→B: 95% success
Tactical Guidance:
  • Focus next on exploiting …
  • Defend inspection at …
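Algorithmic-mode digests amount to rule-based templating over the equilibrium outputs. The function below is illustrative (its name and input schema are mine), with the paper's thresholds as defaults:

```python
def make_digest(path_probs: dict, edge_probs: dict,
                bottleneck_p: float = 0.95,
                high_risk_p: float = 0.90) -> str:
    """Rule-based digest sketch.
    path_probs: {path label: success probability}
    edge_probs: {(u, v): transition success probability}"""
    lines = ["G-CTR Security Analysis", "Identified attack paths:"]
    for name, prob in sorted(path_probs.items(), key=lambda kv: -kv[1]):
        lines.append(f"  • {name} [{prob:.0%}]")
    lines.append("Critical chokepoints (bottlenecks):")
    lines += [f"  • {u}→{v}: {p:.1%} success"
              for (u, v), p in edge_probs.items() if p < bottleneck_p]
    lines.append("High-risk transitions:")
    lines += [f"  • {u}→{v}: {p:.0%} success"
              for (u, v), p in edge_probs.items() if p > high_risk_p]
    return "\n".join(lines)
```

Because the rules are pure string templating over already-computed numbers, this mode stays well under the 10 ms budget cited above.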

Digest is injected directly into the agent’s system prompt:

System: You are a red/blue agent. Use the following G-CTR digest to guide your next actions: {D}
User: {…}
Assistant:

This prompt replacement anchors subsequent LLM outputs to the strategic guidance (Mayoral-Vilches et al., 9 Jan 2026).
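In code, the injection is a one-line prepend to the system message. The role/content message schema below is the common chat-API convention, not a detail the paper specifies:

```python
def build_messages(digest: str, user_msg: str) -> list:
    """Prepend the G-CTR digest to the agent's system prompt."""
    system = ("You are a red/blue agent. Use the following G-CTR digest "
              f"to guide your next actions: {digest}")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user_msg}]
```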

5. Empirical Performance and Quantitative Benchmarks

G-CTR demonstrates significant improvements in speed, cost, and agent effectiveness across multiple testbeds:

Attack-Graph Generation vs. Human Experts:

  • Node correspondence: 70–90% match to expert graphs (five domains).
  • Time: LLMs 10–46 s versus human 30–90 min (60–245× faster).
  • Cost: $0.05–$0.64 (API) versus $22.5–$67.5 (human) (62–450× cheaper).
  • Equilibrium overhead: <5 ms per run.

Shellshock CVE-2014-6271 Cyber-Range (44 runs):

| Mode | Success Rate | Avg Duration | Tool-use Variance | Cost/Succ |
|---|---|---|---|---|
| No G-CTR | 13.3% (2/15) | 16.7 min | 1.6× | $2.71 |
| G-CTR Algorithmic | 20.0% (3/15) | 22.5 min | 6.2× | $0.32 |
| G-CTR LLM | 42.9% (6/14) | 20.2 min | 1.2× | $0.12 |

  • Expected time to success E[T] = T_avg/P_succ: 126 min → 47 min (2.67× faster).
  • Cost per success: $2.71 → $0.12 (23× cheaper).
  • Variance reduction: 6.2× → 1.2× (5.2× lower).

Attack-and-Defense CTFs (25 matches each):

| Team Configuration | Cowsay Win% | Loss% | Pingpong Win% | Loss% |
|---|---|---|---|---|
| No G-CTR (baseline) | 28.6 | 52.4 | 28.6 | 52.4 |
| Red G-CTR (attacker only) | 33.3 | 42.9 | 19.0 | 61.9 |
| Blue G-CTR (defense only) | 57.1 | 28.6 | 25.0 | 75.0 |
| Purple G-CTR (dual, separate) | 52.9 | 23.5 | 13.6 | 86.4 |
| Purple G-CTR merged (shared) | 55.0 | 15.0 | 52.4 | 28.6 |

Sharing a single G-CTR graph (Purple merged) achieves the best outcomes: roughly a 2:1 win ratio over baseline and roughly 3.7:1 over separate dual-guided teams (Mayoral-Vilches et al., 9 Jan 2026).

6. Strategic Impact: Search Collapse and Hallucination Suppression

Closed-loop integration of game-theoretic equilibrium insights fundamentally constrains the LLM’s reasoning space. By providing a continuous external equilibrium signal, G-CTR anchors action selection to statistically valid paths and chokepoints. This re-anchoring yields:

  • 5.2× reduction in tool-use variance
  • 2× increase in success rates beyond raw LLM action selection
  • Dramatic reduction in “hallucinated” actions (irrelevant or dead-end behaviors)

This suggests closed-loop guidance is essential for superintelligent cybersecurity agents, providing machine-scale reproducibility that approaches human strategic intuition (Mayoral-Vilches et al., 9 Jan 2026).

7. Limitations and Future Directions

Several inherent limitations define current G-CTR deployments:

  • Graph-size and complexity bounds employ heuristics; adaptive or domain-specific tuning could further improve fidelity.
  • Prompting strategies remain ad hoc; targeted prompt engineering might enhance both extraction and summarization quality.
  • Existing graphs (6–15 nodes for CTFs) may be too coarse for large-scale enterprise networks, motivating hierarchical or multi-scale representations.
  • Equilibrium solvers scale efficiently to ≤25 nodes, but larger graphs could require approximations or Monte Carlo algorithms.
  • Future work targets dynamic LLM temperature schedules (balancing creativity and control), adversarial robustness against poisoned logs, and integration with probabilistic vulnerability databases (e.g., CVSS alongside effort scores).

A plausible implication is that ongoing advances in adaptive graph construction, scalable equilibrium computation, and robust prompting will further expand G-CTR’s domain applicability and strategic capabilities.
