G-CTR: Game-Theoretic Guidance for LLM Pen Testing

Updated 13 January 2026

The paper introduces G-CTR, a lightweight game-theoretic guidance layer that extracts attack graphs and computes Nash equilibria within milliseconds.
G-CTR leverages a three-phase pipeline—graph extraction, equilibrium computation, and digest integration—to anchor LLM decisions with tactical insights.
Empirical benchmarks show up to 2.67× faster success times and 23× lower costs, highlighting significant improvements in strategic effectiveness.

Generative Cut-the-Rope (G-CTR) is a lightweight game-theoretic guidance layer designed to augment LLM-driven penetration-testing agents with machine-level strategic reasoning and reproducible performance. The system automatically extracts attack graphs from the agent's log context, computes Nash equilibria reflecting both attacker and defender strategies with effort-aware edge scoring, and feeds a concise digest of tactical guidance into the agent's prompt loop. By continuously closing this loop—agent generates graph, G-CTR analyzes, agent re-plans under guidance—the system reduces ambiguity, suppresses hallucinations, and anchors the model’s actions to statistically promising paths. Empirical results demonstrate substantial gains in success rate, consistency, and cost-efficiency, matching expert structure 70–90% of the time but at vastly increased speed and lower cost (Mayoral-Vilches et al., 9 Jan 2026).

1. System Architecture and Workflow

The G-CTR pipeline consists of three primary phases, executed at regular intervals (typically every ~80 tool calls or ~20 seconds):

Phase 1 (Game-Theoretic Analysis):

The agent’s log (ℓ), consisting of a sequence of J messages, is processed by an LLM to extract a structured attack graph $G = (V, E)$ . NetworkX is employed to enforce acyclic paths and prune leaf nodes. The attack graph is represented with a merged root entry node $v_{entry}$ and vulnerable leaf nodes. G-CTR computes a Nash equilibrium over the graph—typically within $\ll 5$ ms overhead—informing optimal inspection strategies for defenders and attack paths for adversaries.

Phase 2 (Digest Generation):

Equilibrium outputs—defender mixed strategy $\sigma_d^*$ , attacker’s probability-weighted paths, and overall success probability $u^*$ —are summarized in a digest using either algorithmic or LLM-based methods.

Algorithmic mode uses rule-based templates to mark bottlenecks ( $p < 0.95$ ) and high-risk transitions ( $p > 0.90$ ), with digest generation under 10 ms.
LLM mode sends a 350-word structured prompt to an LLM ("alias1"), generating lingustically rich summaries in $\approx 28$ s. Fallback to algorithmic occurs on API failure.

Phase 3 (Agent Execution):

The digest is prepended to the agent's system prompt. Subsequent action selection (ReAct framework) is performed with both tool outputs and G-CTR-derived strategic hints, iteratively updating $G$ until a termination condition is met (flag discovery or step limit) (Mayoral-Vilches et al., 9 Jan 2026).

2. Formal Game-Theoretic Model

G-CTR models attack graphs as directed acyclic graphs $G = (V, E)$ , where each edge $i$ is assigned an effort score:

$E_i = w_{msg} \cdot \varphi_{msg}(i) + w_{tok} \cdot \varphi_{tok}(i) + w_{cost} \cdot \varphi_{cost}(i), \qquad w_{msg} + w_{tok} + w_{cost} = 1$

with

$\varphi_{msg}(i) = 1 - \frac{m_i-1}{J-1}$ (message distance)
$\varphi_{tok}(i) = 1 - \frac{t_i}{T}$ (token count)
$\varphi_{cost}(i) = 1 - \frac{c_i}{C}$ (estimated cost)

$E_i \in [0,1]$ quantifies attacker effort; higher values indicate greater difficulty traversing the edge.

Defender’s expected detection probability (inspecting distribution $\sigma_d$ over nodes $AS_1 \subset V \setminus \{v_{entry}\}$ ) is:

$U_D(\sigma_d, \pi, \theta) = \sum_{c \in AS_1} \sigma_d(c) \cdot P_\pi(c|\theta)$

where $P_\pi(c|\theta) \propto f_{Pois}(d_\pi(\theta,c); \lambda_a)$ and $\lambda_a = 2$ (edges per defender inspection window). The attacker’s success is $U_A = 1 - U_D$ .

The Nash equilibrium is computed by solving the zero-sum minimax:

$\sigma_d^* = \arg\min_{\sigma_d} \max_{\pi, \theta} U_A(\sigma_d, \pi, \theta)$

or equivalently, by numerically minimizing $u$ :

$\text{minimize } u \ \text{subject to } u \geq 1 - \sum_c \sigma_d(c) \cdot P_\pi(c|\theta),\quad \forall (\pi,\theta) \ \sum_c \sigma_d(c) = 1,\quad \sigma_d(c) \geq 0$

3. Attack Graph Extraction and Heuristics

Extraction proceeds via the following steps:

Merge entry points: All nodes with minimal message_id are consolidated into the single root $v_{entry}$ .
Prune cycles: Acyclicity enforced via NetworkX all_simple_paths; cycles are removed.
Remove non-vulnerable leaves: Non-vulnerable leaves are recursively pruned.
Enforce leaf vulnerabilities: Artificial leaf nodes (“leaf_X”) are attached to all vulnerable nodes, each with $E = 1$ .
Reconnect components: Components are reconnected to root, and incoming edges to root are removed.
Node-count bounds: Node count is set as a piecewise percentage of $J$ $J$ (message sequence length):
- Short ( $<70$ msgs): $12$– $16\%$ of $J$
- Medium ($70$–$199$): $6$– $12\%$
- Long ( $\geq 200$ ): $3.5$– $5\%$ with hard bounds at $[4,25]$ total nodes.

This approach enables efficient graph extraction, balancing computational tractability with faithful representation of agent reasoning (Mayoral-Vilches et al., 9 Jan 2026).

4. Equilibrium Computation and Digest Integration

G-CTR employs the following pseudocode for Nash equilibrium computation, using precomputed Poisson probabilities and a linear program:

def ComputeNashEquilibrium(G, λ):
    # G = (V,E) weighted by E_i on edges; AS1 = V\{leaves,entry}, AS2 = all_simple_paths(entry→vuln)
    paths = list(all_simple_paths(G, entry, any_leaf))
    # Precompute P_π(c|θ) via Poisson with rate λ
    P = { (π,θ,c): PoisPMF(dist(π,θ,c),λ)/Z(π) for π in paths for θ in π for c in AS1 }
    # LP variables: σ_d(c) ∀c∈AS1, u
    lp = LinearProgram()
    σ = {c: lp.var(f"sigma_{c}", lb=0) for c in AS1}
    u = lp.var("u")
    lp.set_objective(u, minimize=True)
    for π in paths:
        for θ in π:
           lhs = 1 - sum(σ[c]*P[(π,θ,c)] for c in AS1)
           lp.add_constraint(u >= lhs)
    lp.add_constraint(sum(σ.values()) == 1)
    sol = lp.solve()
    σ_d = {c: sol.value(σ[c]) for c in AS1}
    # Attacker best responses: filter paths with max U_A under σ_d
    attacker_paths = sorted(
        paths, key=lambda π: max(1-sum(σ_d[c]*P[(π,θ,c)] for c in AS1) for θ in π), reverse=True
    )
    return σ_d, attacker_paths, sol.value(u)

Digest generation employs algorithmic or LLM-based summarization, with thresholds for bottlenecks ( $p<0.95$ ) and high-risk transitions ( $p>0.90$ ):

G-CTR Security Analysis
Identified attack paths:
  • Path 1: entry→… [67%] → target
  • Path 2: entry→… [32%] → target
Critical chokepoints (bottlenecks):
  • node X→Y: 3.1% success
High-risk transitions:
  • node A→B: 95% success
Tactical Guidance:
  • Focus next on exploiting …
  • Defend inspection at …

Digest is injected directly into the agent’s system prompt:

1
2
3

System: You are a red/blue agent. Use the following G-CTR digest to guide your next actions: {D}
User: {…}
Assistant:

This prompt replacement anchors subsequent LLM outputs to the strategic guidance (Mayoral-Vilches et al., 9 Jan 2026).

5. Empirical Performance and Quantitative Benchmarks

G-CTR demonstrates significant improvements in speed, cost, and agent effectiveness across multiple testbeds:

Attack-Graph Generation vs. Human Experts:

Node correspondence: $70$– $90\%$ match to expert graphs (five domains).
Time: LLMs $10$–$46$ s versus human $30$–$90$ min ($60$– $245\times$ faster).
Cost: $\$0.05 $–$ \$0.64 $(API) versus$ \$22.5 $–$ \$67.5 $(human) ($ 62 $–$ 450\times $cheaper).</li> <li>Equilibrium overhead:$ <5$ ms per run.</li> </ul> <p><strong>Shellshock CVE-2014-6271 Cyber-Range (44 runs):</strong></p> <div class='overflow-x-auto max-w-full my-4'><table class='table border-collapse w-full' style='table-layout: fixed'><thead><tr> <th>Mode</th> <th>Success Rate</th> <th>Avg Duration</th> <th>Tool-use Variance</th> <th>Cost/Succ</th> </tr> </thead><tbody><tr> <td>No G-CTR</td> <td>13.3% (2/15)</td> <td>16.7 min</td> <td>1.6×</td> <td>\$2.71 G-CTR Algorithmic 20.0% (3/15) 22.5 min 6.2× \$0.32 G-CTR LLM 42.9% (6/14) 20.2 min 1.2× \$0.12

Expected time to success $E[T]=T_{avg}/P_{succ}$ : $126$ min $\rightarrow 47$ min ( $2.67\times$ faster).
Cost per success: $\$2.71 \rightarrow \$0.12 $($ 23\times $cheaper).</li> <li>Variance reduction:$ 6.2\times \rightarrow 1.2\times $($ 5.2\times $lower).</li> </ul> <p><strong>Attack-and-Defense CTFs (25 matches each):</strong></p> <div class='overflow-x-auto max-w-full my-4'><table class='table border-collapse w-full' style='table-layout: fixed'><thead><tr> <th>Team Configuration</th> <th style="text-align: right">Cowsay Win%</th> <th style="text-align: right">Loss%</th> <th style="text-align: right">Pingpong Win%</th> <th style="text-align: right">Loss%</th> </tr> </thead><tbody><tr> <td>No G-CTR (baseline)</td> <td style="text-align: right">28.6</td> <td style="text-align: right">52.4</td> <td style="text-align: right">28.6</td> <td style="text-align: right">52.4</td> </tr> <tr> <td>Red G-CTR (attacker only)</td> <td style="text-align: right">33.3</td> <td style="text-align: right">42.9</td> <td style="text-align: right">19.0</td> <td style="text-align: right">61.9</td> </tr> <tr> <td>Blue G-CTR (defense only)</td> <td style="text-align: right">57.1</td> <td style="text-align: right">28.6</td> <td style="text-align: right">25.0</td> <td style="text-align: right">75.0*</td> </tr> <tr> <td>Purple G-CTR (dual, sep)</td> <td style="text-align: right">52.9</td> <td style="text-align: right">23.5</td> <td style="text-align: right">13.6</td> <td style="text-align: right">86.4</td> </tr> <tr> <td>Purple G-CTRₘₑᵣgₑd (shared)</td> <td style="text-align: right">55.0</td> <td style="text-align: right">15.0</td> <td style="text-align: right">52.4</td> <td style="text-align: right">28.6</td> </tr> </tbody></table></div> <p>Sharing a single G-CTR graph (Purple merged) achieves best outcomes:$ \sim2{:}1 $win over baseline,$ \sim3.7{:}1 $over separate dual-guided teams (<a href="/papers/2601.05887" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Mayoral-Vilches et al., 9 Jan 2026</a>).</p> <h2 class='paper-heading' id='strategic-impact-search-collapse-and-hallucination-suppression'>6. Strategic Impact: Search Collapse and Hallucination Suppression</h2> <p><a href="https://www.emergentmind.com/topics/closed-loop-integration" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Closed-loop integration</a> of game-theoretic equilibrium insights fundamentally constrains the LLM’s reasoning space. By providing a continuous external equilibrium signal, G-CTR anchors action selection to statistically valid paths and choke points. This re-anchoring yields:</p> <ul> <li>$ 5.2\times $reduction in tool-use variance</li> <li>$ 2\times $increase in success rates beyond raw LLM action selection</li> <li>Dramatic reduction in “hallucinated” actions—irrelevant or dead-end behaviors</li> </ul> <p>This suggests closed-loop guidance is essential for superintelligent cybersecurity agents, providing machine-scale reproducibility that approaches human strategic intuition (<a href="/papers/2601.05887" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Mayoral-Vilches et al., 9 Jan 2026</a>).</p> <h2 class='paper-heading' id='limitations-and-future-directions'>7. Limitations and Future Directions</h2> <p>Several inherent limitations define current G-CTR deployments:</p> <ul> <li>Graph-size and complexity bounds employ heuristics; adaptive or domain-specific tuning could further improve <a href="https://www.emergentmind.com/topics/fidelity-alpha-precision" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">fidelity</a>.</li> <li><a href="https://www.emergentmind.com/topics/prompting-strategies" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Prompting strategies</a> remain ad hoc; targeted <a href="https://www.emergentmind.com/topics/prompt-engineering" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">prompt engineering</a> might enhance both extraction and summarization quality.</li> <li>Existing graphs (6–15 nodes for CTFs) may be too coarse for large-scale enterprise networks, motivating hierarchical or multi-scale representations.</li> <li>Equilibrium solvers scale efficiently to$ \leq25$ nodes, but larger graphs could require approximations or Monte Carlo algorithms.
Future work targets dynamic LLM temperature schedules (balancing creativity and control), adversarial robustness against poisoned logs, and integration with probabilistic vulnerability databases (e.g., CVSS alongside effort scores).

A plausible implication is that ongoing advances in adaptive graph construction, scalable equilibrium computation, and robust prompting will further expand G-CTR’s domain applicability and strategic capabilities.

Markdown Report Issue Upgrade to Chat

References (1)

Cybersecurity AI: A Game-Theoretic AI for Guiding Attack and Defense (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Generative Cut-the-Rope (G-CTR).

G-CTR: Game-Theoretic Guidance for LLM Pen Testing

1. System Architecture and Workflow

2. Formal Game-Theoretic Model

3. Attack Graph Extraction and Heuristics

4. Equilibrium Computation and Digest Integration

5. Empirical Performance and Quantitative Benchmarks

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

G-CTR: Game-Theoretic Guidance for LLM Pen Testing

1. System Architecture and Workflow

2. Formal Game-Theoretic Model

3. Attack Graph Extraction and Heuristics

4. Equilibrium Computation and Digest Integration

5. Empirical Performance and Quantitative Benchmarks

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research