AlphaZero-Based Topology Optimization
- The paper introduces an AlphaZero-inspired reinforcement learning framework that minimizes costly redispatch and curtailment in power grids.
- It formulates topology optimization as a Markov Decision Process with a constrained action set and a shaped reward function to maintain grid integrity.
- The method leverages neural networks and Monte Carlo Tree Search to efficiently navigate a vast action space, validated through large-scale simulation results.
AlphaZero-based topology optimization is a reinforcement learning–driven framework for congestion management in power grids that employs an AlphaZero-inspired agent to select optimal grid topology actions. The primary goal is to minimize costly redispatch and curtailment while preserving grid security, especially in the context of increasing renewable generation. This approach customizes the AlphaZero methodology for grid operations by adapting the action space, neural network structure, and reward design to the specifics of power system constraints and objectives, demonstrated at large scale in the WCCI 2022 Learning to Run a Power Network (L2RPN) competition (Dorfer et al., 2022).
1. Problem Definition and Markov Decision Process Formulation
Topology optimization in power grids is formulated as a Markov Decision Process (MDP) where the agent sequentially selects topology-switching actions to maintain secure, reliable operation.
- State $s_t$: At each time step $t$, the state vector encodes the full grid configuration, including topology (bus assignments), generator injections (controllable and renewable), load consumptions, storage states, current line flows $F_l$, line-loading ratios $\rho_l$ (flow relative to thermal capacity), and the number of lines offline.
- Action $a_t$: Actions are “unitary” topology switches, each corresponding to a substation bus reassignment, drawn from a reduced catalogue of approximately 2,000 frequent switches (out of about 72,000 possible). For the extended joint agent, actions also include redispatch vectors subject to generation and operational limits.
- Transition dynamics: Applying action $a_t$ (and, if relevant, a redispatch vector), the grid transitions from $s_t$ to $s_{t+1}$ via a full AC load-flow simulation. Lines are automatically disconnected under severe overload (instantaneous overflow or sustained overcapacity), and the episode ends in a blackout upon grid islanding or an infeasible load-generation mismatch.
- Terminal conditions: An episode terminates at a grid blackout or after a fixed horizon of $T = 2016$ steps (one week at 5-minute intervals).
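The episode structure above can be sketched with a toy stand-in for the load-flow simulator. The dynamics below (relief amounts, drift ranges, the blackout rule) are invented for illustration and are far simpler than an AC load flow; only the loading-ratio / automatic-tripping / blackout structure mirrors the text:

```python
import random

def step(rho, action_relief):
    """Toy stand-in for the AC load-flow transition. `rho` is a list of
    line-loading ratios; `action_relief` models how much a topology switch
    reduces loading on the most-stressed line. Lines above 100% of capacity
    trip automatically (loading set to 0), mimicking protection behavior."""
    worst = max(range(len(rho)), key=lambda l: rho[l])
    rho = list(rho)
    rho[worst] = max(0.0, rho[worst] - action_relief)
    # exogenous load/renewable fluctuation on connected lines only
    rho = [r + random.uniform(-0.02, 0.05) if r > 0 else 0.0 for r in rho]
    tripped = [l for l, r in enumerate(rho) if r > 1.0]
    for l in tripped:
        rho[l] = 0.0
    # crude blackout proxy: more than half the lines are offline
    blackout = sum(1 for r in rho if r == 0.0) > len(rho) // 2
    return rho, len(tripped), blackout

def run_episode(horizon=2016, n_lines=6, seed=0):
    """Run one episode of at most `horizon` steps (one week at 5-minute
    resolution in the paper); terminate early on blackout."""
    random.seed(seed)
    rho = [random.uniform(0.4, 0.9) for _ in range(n_lines)]
    for t in range(horizon):
        relief = 0.1 if max(rho) > 0.9 else 0.0  # act only when stressed
        rho, n_off, blackout = step(rho, relief)
        if blackout:
            return t + 1
    return horizon

steps = run_episode()
```

The same loop structure (observe, act, simulate, check termination) is what the Grid2Op environment provides in the real setup.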
2. Reward Specification and Objective
The framework specifies a shaped reward function that penalizes violations of thermal constraints and loss of grid components to balance short-term congestion management and long-term system integrity.
- Overflow penalty $p_{\text{ov}}(s_t)$:
 - If $\rho_l \le 1$ for all lines $l$, then $p_{\text{ov}} = 0$.
 - If $\rho_l > 1$ for any line $l$, then $p_{\text{ov}} > 0$, growing with the severity of the overload.
- Reward: $r_t = -p_{\text{ov}}(s_t) - 0.5\, n_{\text{off}}(s_t)$, where $n_{\text{off}}(s_t)$ is the number of disconnected lines.
- Objective: $\max_\pi \; \mathbb{E}\big[\sum_{t=0}^{T-1} \gamma^t r_t\big]$, with discount factor $\gamma$.
For the joint topology+redispatch agent, the composite reward adds redispatch ($c_{\text{rd}}$) and curtailment ($c_{\text{curt}}$) cost terms:
$r_t^{\text{joint}} = r_t - \alpha\, c_{\text{rd},t} - \beta\, c_{\text{curt},t}$,
where the weighting parameters $\alpha$ and $\beta$ are chosen so that congestion relief dominates.
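A minimal sketch of the shaped reward: the $0.5$ penalty per offline line is stated in the text, while the magnitude of the overflow penalty is an assumed constant here (the exact penalty form is not specified above):

```python
def shaped_reward(rho, n_offline, overflow_penalty=1.0, offline_penalty=0.5):
    """Shaped reward sketch: no overflow penalty while every line-loading
    ratio stays at or below 1.0; a fixed penalty (assumed value) once any
    line overloads; plus 0.5 penalty per disconnected line, as in the text."""
    p_ov = overflow_penalty if any(r > 1.0 for r in rho) else 0.0
    return -p_ov - offline_penalty * n_offline
```

For example, a secure state with all lines connected yields reward 0, while any overload plus a tripped line is penalized by both terms.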
3. AlphaZero Algorithm Adaptation
The AlphaZero approach is specialized for the high-dimensional, constrained topology optimization problem. Key components are:
3.1 Neural Network Architecture
- Inputs: Features concatenated per line and per bus (line-loading ratios $\rho_l$, flows, voltage magnitudes, bus assignments), plus generator outputs, load demands, and a time-of-day encoding.
- Backbone: 8 residual blocks (convolutions with 128 channels, ReLU and BatchNorm; approximately 1.2M parameters).
- Policy head: Two convolutions, flattened and passed to a fully connected layer producing logits; the output is a softmax distribution over the reduced action space.
- Value head: Replaced with a non-parametric heuristic computed directly from the grid state; no value network is trained.
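The value heuristic can be illustrated as follows. The exact form used in the paper is not given above, so this is an assumed variant that simply maps the worst line loading to a score in $[-1, 1]$, consistent with the idea of scoring states from the grid state alone:

```python
def value_heuristic(rho):
    """Assumed non-parametric value sketch (the paper's exact heuristic is
    not reproduced here): relaxed states (low max loading) score near +1,
    heavily loaded or overloaded states approach -1."""
    return max(-1.0, 1.0 - 2.0 * max(rho))
```

Because this function is fixed, MCTS backups need no value-network training, which is exactly why only the policy head is learned.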
3.2 Monte Carlo Tree Search (MCTS) and Training
- MCTS with PUCT: Node selection maximizes
$Q(s,a) + c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$,
with $c_{\text{puct}}$ tuned to $1.5$.
- Action pruning: Only actions in the reduced catalogue $\mathcal{A}$ are allowed; those immediately leading to subnet disconnection are masked.
- Early stopping: Search halts once enough “recovery” nodes are found, where a safety monitor verifies that the following steps remain below a safe fraction of line capacity (controlled by the $t_{\text{stop}}$ and $k_{\text{skip}}$ search parameters).
- Self-play and policy training:
At each step, MCTS yields a visit-count distribution $\pi_t$; training minimizes the cross-entropy between $\pi_t$ and the policy output $P_\theta(s_t)$. The value head is not trained.
- Training regime: Episodes are conducted over realistic fixed load/generation profiles and end at a terminal blackout or at horizon $T$.
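The PUCT selection rule with action masking can be sketched directly. This is a generic AlphaZero-style implementation (not the paper's code), using the $c_{\text{puct}} = 1.5$ value stated above:

```python
import math

def puct_select(prior, visit, q, c_puct=1.5, mask=None):
    """One PUCT selection step: argmax over legal actions of
    Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
    `mask[a] = False` marks pruned actions (e.g., switches that would
    disconnect a subnet)."""
    total = sum(visit)
    best, best_score = None, -math.inf
    for a, (p, n, qv) in enumerate(zip(prior, visit, q)):
        if mask is not None and not mask[a]:
            continue  # skip pruned actions entirely
        score = qv + c_puct * p * math.sqrt(total) / (1 + n)
        if score > best_score:
            best, best_score = a, score
    return best
```

Note how an unvisited action with a modest prior can outscore a heavily visited one: the exploration term shrinks as $N(s,a)$ grows.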
3.3 Constraint Handling
- Simulation-level enforcement: All environment transitions are computed via an AC load-flow simulator (Grid2Op + lightsim2grid), which applies Kirchhoff’s laws and voltage and thermal limits. Invalid actions (e.g., those that would isolate a generator) are pruned.
- Feasibility: Any MCTS branch resulting in line overload is flagged terminal with a heavy negative reward.
- Soft constraints: The shaped reward function includes a penalty of $0.5$ per offline line.
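Pruning at the policy level amounts to a renormalized softmax over legal actions only. A minimal sketch (generic masking logic, not the paper's implementation):

```python
import math

def masked_policy(logits, legal):
    """Softmax over legal actions only: illegal actions (e.g., switches
    that would isolate a generator) receive probability exactly 0, and the
    remaining probabilities are renormalized to sum to 1."""
    m = max(z for z, ok in zip(logits, legal) if ok)       # for stability
    exps = [math.exp(z - m) if ok else 0.0 for z, ok in zip(logits, legal)]
    s = sum(exps)
    return [e / s for e in exps]
```

MCTS then only ever expands actions with nonzero prior, so infeasible branches are never simulated.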
4. Experimental Evaluation
Experiments are conducted on a large-scale simulated transmission grid corresponding to the WCCI 2022 L2RPN competition testbed:
- Grid: 118 substations, 186 lines, 62 generators, 91 loads, 7 storages.
- Episode: One week (2016 time steps at 5-minute intervals), real-world profiles, with random line-disconnection events to test security.
Baselines
Multiple baseline and ablation settings are compared:
| Agent Name | Description |
|---|---|
| NoOp (BL) | Do nothing |
| R (BL) | Redispatch + curtailment via Cross-Entropy optimizer |
| T (brute-force) | Simulate all ~2,000 switches each step, pick the best |
| T (arg-max) | Greedy policy network top-1 |
| T (top-25) | Simulate top-25 policy actions, pick best |
| T (MCTS, oracle) | MCTS with full simulator lookahead (upper bound) |
| T (top-5)+R | Combine top-5 policy candidates with redispatch (superimposed) |
Key Results
Performance is assessed on “steps survived” (fraction of the 2016-step horizon), average decision time, and average redispatch and curtailment per step.
| Agent | Steps Survived (%) | Step Time (ms) | Redispatch (MW) | Curtailment (MW) |
|---|---|---|---|---|
| NoOp (BL) | 19.2 | 8.9 | 0.0 | 0.0 |
| R (BL) | 74.5 | 31.0 | 504.2 | 484.4 |
| T (brute-force) | 61.1 | 153.3 | 0.0 | 0.0 |
| T (arg-max) | 50.4 | 13.4 | 0.0 | 0.0 |
| T (top-25) | 65.3 | 34.2 | 0.0 | 0.0 |
| T (MCTS,oracle) | 76.9 | 1714.2 | 0.0 | 0.0 |
| T (top-5)+R | 82.1 | 53.3 | 202.8 | 193.4 |
The T(top-5)+R agent achieves 82.1% survival (versus 74.5% for redispatch-only) and reduces redispatch to 40% of R(BL). The method ranked 1st in the WCCI 2022 L2RPN competition (Dorfer et al., 2022).
5. Technical Challenges and Research Directions
5.1 Deployment Challenges
- Scalability: The present use of a reduced action catalogue (2,000 out of roughly 72,000 theoretical topology switches) is ad hoc. Further progress is needed in defining domain-informed macro-actions or hierarchical RL decompositions.
- Market Coordination: Topology changes must integrate with market schedules, redispatch contracts, and remedial action pricing mechanisms.
- Uncertainty: Handling forecast errors in renewables, load, and contingencies remains an outstanding challenge.
- Computational Efficiency: Real-time deployment requires substantially reduced inference and search latency, motivating advances in hardware and algorithmic optimization.
5.2 Algorithmic Enhancements
- Hybrid Value Estimation: Co-training a neural value head alongside the heuristic or integrating a learned transition model (MuZero-style) may enhance planning capability.
- Hierarchical/Hindsight Action Pooling: Aggregating substation-level action patterns based on electrical characteristics could enable effective exploration of larger action spaces.
- Joint Training: Formulating joint topology and redispatch control within multi-agent RL or centralized frameworks may improve overall grid performance.
- Curriculum Learning: Progressive scaling from IEEE-14/118-bus systems to large grids is recommended for stable policy learning.
6. Implementation Summary
A sketch of the training algorithm follows:
Input: reduced action set A, policy net P_θ, Grid2Op simulator ℰ, horizon T, discount γ
for epoch = 1 → N do
    for each scenario i in training set do
        s ← ℰ.reset(i)
        for t = 0 → T−1 do
            if safe(s) then
                a ← RecoveryOrNoop(s); π ← ⊥        // no search needed in safe states
            else
                (π, a) ← MCTS(s, P_θ, γ, t_stop, k_skip)
            end
            s′, r ← ℰ.step(s, a)
            if π ≠ ⊥ then store (s, π, r)            // policy targets only from searched states
            if blackout(s′) then break
            s ← s′
        end
    end
    θ ← θ − η·∇_θ Σ_samples [ −πᵀ log P_θ(s) ]
end
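The parameter update in the last line of the sketch can be made concrete. For a softmax policy, the gradient of the cross-entropy $-\pi^\top \log P_\theta(s)$ with respect to the logits is simply $P_\theta(s) - \pi$; the toy example below (pure Python, no real network) shows the logits converging toward an MCTS visit-count distribution:

```python
import math

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def policy_update(logits, pi, lr=0.1):
    """One gradient step on the cross-entropy -pi^T log softmax(logits);
    for softmax outputs the logit gradient is (p - pi)."""
    p = softmax(logits)
    return [z - lr * (pz - piz) for z, pz, piz in zip(logits, p, pi)]

# Drive the policy toward a fixed MCTS visit-count target pi.
logits = [0.0, 0.0, 0.0]
pi = [0.7, 0.2, 0.1]
for _ in range(1000):
    logits = policy_update(logits, pi)
```

In the full system the same update is applied to the shared network weights θ via backpropagation rather than to raw logits, but the loss and its gradient structure are identical.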
This design enables reproducibility and extensibility for AlphaZero-style topology optimization applied to high-fidelity, operationally constrained power grids, as detailed in (Dorfer et al., 2022).