AlphaZero-Based Topology Optimization

Updated 3 February 2026
  • The paper introduces an AlphaZero-inspired reinforcement learning framework that minimizes costly redispatch and curtailment in power grids.
  • It formulates topology optimization as a Markov Decision Process with a constrained action set and a shaped reward function to maintain grid integrity.
  • The method leverages neural networks and Monte Carlo Tree Search to efficiently navigate a vast action space, validated through large-scale simulation results.

AlphaZero-based topology optimization is a reinforcement learning–driven framework for congestion management in power grids that employs an AlphaZero-inspired agent to select optimal grid topology actions. The primary goal is to minimize costly redispatch and curtailment while preserving grid security, especially in the context of increasing renewable generation. This approach customizes the AlphaZero methodology for grid operations by adapting the action space, neural network structure, and reward design to the specifics of power system constraints and objectives, demonstrated at large scale in the WCCI 2022 Learning to Run a Power Network (L2RPN) competition (Dorfer et al., 2022).

1. Problem Definition and Markov Decision Process Formulation

Topology optimization in power grids is formulated as a Markov Decision Process (MDP) where the agent sequentially selects topology-switching actions to maintain secure, reliable operation.

  • State $s_t$: At each time step $t$, the state vector encodes the full grid configuration, including topology (bus assignments), generator injections (controllable and renewable), load consumptions, storage states, current line flows $\theta_{i,j}$, line-loading ratios $\rho_i = |\text{flow}_i| / \text{capacity}_i$, and the number of lines offline.
  • Action $a_t$: Actions are "unitary" topology switches, each corresponding to a substation bus reassignment, drawn from a reduced catalogue $A$ of approximately 2,000 frequent switches (out of about 72,000 possible). For the extended joint agent, actions also include redispatch vectors $\Delta g$ subject to generation and operational limits.
  • Transition dynamics: Applying action $a_t$ (and, if relevant, redispatch $\Delta g_t$) takes the agent to $s_{t+1} = \mathcal{T}(s_t, a_t)$, computed with a full AC load-flow simulator. Lines are automatically disconnected under severe overload (loading $> 200\%$ or sustained overcapacity), leading to a blackout upon grid islanding or an infeasible load-generation mismatch.
  • Terminal conditions: An episode terminates at a grid blackout or after a fixed horizon $T$ (2016 steps, simulating one week at 5-minute intervals).
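The line-loading and automatic-disconnection rule above can be sketched in a few lines of Python. This is an illustrative stand-in, not the simulator's actual implementation; the names `LineState`, `loading_ratio`, and `OVERLOAD_TRIP` are assumptions.

```python
from dataclasses import dataclass

OVERLOAD_TRIP = 2.0  # lines trip automatically above 200% loading


@dataclass
class LineState:
    flow: float      # current power flow (MW)
    capacity: float  # thermal limit (MW)


def loading_ratio(line: LineState) -> float:
    """rho_i = |flow_i| / capacity_i."""
    return abs(line.flow) / line.capacity


def tripped_lines(lines: list[LineState]) -> list[int]:
    """Indices of lines the simulator would disconnect for severe overload."""
    return [i for i, ln in enumerate(lines) if loading_ratio(ln) > OVERLOAD_TRIP]


lines = [LineState(90.0, 100.0), LineState(-250.0, 100.0), LineState(50.0, 100.0)]
print([round(loading_ratio(l), 2) for l in lines])  # [0.9, 2.5, 0.5]
print(tripped_lines(lines))                         # [1]
```

In the actual environment a trip can cascade (islanding, load-generation mismatch), which is why a single severe overload can end an episode.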

2. Reward Specification and Objective

The framework specifies a shaped reward function that penalizes violations of thermal constraints and loss of grid components to balance short-term congestion management and long-term system integrity.

  • Overflow penalty $u_t$:
    • If $\max_i \rho_{i,t} \leq 1$: $u_t = \max(\rho_{\text{max},t} - 0.5,\ 0)$.
    • If any $\rho_{i,t} > 1$: $u_t = \sum_{i:\ \rho_{i,t} > 1} (\rho_{i,t} - 0.5)$.
  • Reward: $r_t = \exp(-u_t - 0.5 \cdot n_{\text{off},t})$, where $n_{\text{off},t}$ is the number of disconnected lines.
  • Objective:

R = \sum_{t=0}^{T-1} \gamma^t r_t

with discount factor $\gamma \approx 0.999$.

For the joint topology+redispatch agent, the composite reward also charges redispatch ($C_{\text{red},t} = \sum_g |\Delta g_t|$) and curtailment ($C_{\text{cur},t}$) costs:

r_t^{\text{joint}} = r_t - \lambda_{\text{red}} C_{\text{red},t} - \lambda_{\text{cur}} C_{\text{cur},t}

The weights $\lambda_{\text{red}}, \lambda_{\text{cur}}$ are chosen small enough that congestion relief takes priority over cost reduction.
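The shaped reward can be written directly from the definitions above. A minimal sketch follows; the function names and the placeholder $\lambda$ values are assumptions, not values from the paper.

```python
import math

def overflow_penalty(rhos: list[float]) -> float:
    """u_t from the line-loading ratios rho_{i,t}."""
    overloaded = [r for r in rhos if r > 1.0]
    if overloaded:
        return sum(r - 0.5 for r in overloaded)
    return max(max(rhos) - 0.5, 0.0)

def step_reward(rhos: list[float], n_off: int) -> float:
    """r_t = exp(-u_t - 0.5 * n_off)."""
    return math.exp(-overflow_penalty(rhos) - 0.5 * n_off)

def joint_reward(rhos, n_off, redispatch_mw, curtail_mw,
                 lam_red=1e-4, lam_cur=1e-4):
    """r_t^joint; the lambda weights here are illustrative placeholders."""
    c_red = sum(abs(g) for g in redispatch_mw)
    return step_reward(rhos, n_off) - lam_red * c_red - lam_cur * curtail_mw

# A lightly loaded grid with no outages yields the maximum reward of 1.0:
print(step_reward([0.3, 0.4], n_off=0))  # 1.0
```

Note how the penalty kicks in only above $50\%$ loading, so the agent is not punished for routine operation but is pushed away from the thermal limit well before lines trip.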

3. AlphaZero Algorithm Adaptation

The AlphaZero approach is specialized for the high-dimensional, constrained topology optimization problem. Key components are:

3.1 Neural Network Architecture

  • Inputs: Per-line and per-bus features ($\rho_i$, flows, voltage magnitudes, bus assignments), generator outputs, load demands, and a time-of-day encoding, concatenated.
  • Backbone: 8 residual blocks (Conv $3\times3$, 128 channels, with ReLU and BatchNorm; $\sim$1.2M parameters).
  • Policy head: Two $1\times1$ convolutions, flattened and passed to a fully connected layer producing $|A|$ logits; the output is a softmax action distribution $P(s, \cdot)$ over the reduced action space.
  • Value head: Replaced with a non-parametric heuristic $\nu(s) = \sum_{j=t}^{t+h} \gamma^{j-t} r_j$, i.e., the discounted reward accumulated over a short lookahead of $h$ simulator steps.
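The non-parametric value heuristic is just a truncated discounted return over a short rollout. A sketch, with illustrative rollout rewards as input:

```python
def heuristic_value(rewards: list[float], gamma: float = 0.999) -> float:
    """nu(s) = sum_{j=t}^{t+h} gamma^(j-t) * r_j for a rollout of length h+1."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# With gamma = 0.5 and three unit rewards: 1 + 0.5 + 0.25 = 1.75
print(round(heuristic_value([1.0, 1.0, 1.0], gamma=0.5), 3))  # 1.75
```

Because the rewards come from actual simulator rollouts, no value network needs to be trained; the trade-off is the extra simulation cost per evaluated node.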

3.2 Monte Carlo Tree Search (MCTS) and Training

  • MCTS with PUCT: Node selection maximizes

U(s,a) = Q(s,a) + c_{\text{puct}} \cdot P(s,a) \cdot \frac{\sqrt{N(s)}}{1 + N(s,a)}

with $c_{\text{puct}}$ tuned to $1.5$.

  • Action pruning: Only actions in $A$ are allowed; actions that would immediately disconnect part of the grid are masked.
  • Early stopping: Search halts once at least $t_{\text{stop}} = 6$ "recovery" nodes are found, i.e., nodes for which a safety monitor verifies that the next $k_{\text{skip}} = 10$ steps stay below $98\%$ of line capacity ($\max_i \rho_i < 0.98$).
  • Self-play and policy training: At each step, MCTS yields a visit-count distribution $\pi$; training minimizes the cross-entropy between $\pi$ and the network policy $P_\theta(s)$. The value head is not trained.

L = -\sum_{\text{samples}} \pi^\top \log P_\theta(s)

  • Training regime: Episodes are run over realistic fixed load/generation profiles and end at a blackout or at the horizon $T$.
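The PUCT selection rule and the policy cross-entropy loss can be sketched as follows. This is a simplified stand-in (scalar lists instead of a search tree, and the masking scheme is an assumption about how pruned actions are handled):

```python
import math

C_PUCT = 1.5  # value reported in the paper

def puct_score(q: float, prior: float, n_parent: int, n_edge: int) -> float:
    """U(s,a) = Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    return q + C_PUCT * prior * math.sqrt(n_parent) / (1 + n_edge)

def select_action(q, priors, n_edges, mask):
    """Pick the unmasked action maximizing the PUCT score."""
    n_parent = sum(n_edges)
    scores = [puct_score(q[a], priors[a], n_parent, n_edges[a]) if mask[a]
              else float("-inf") for a in range(len(priors))]
    return max(range(len(scores)), key=scores.__getitem__)

def policy_loss(visit_dist, probs):
    """Cross-entropy between the MCTS visit distribution pi and P_theta(s)."""
    return -sum(p * math.log(q) for p, q in zip(visit_dist, probs) if p > 0)

# An unvisited action with a high prior wins early in the search:
print(select_action(q=[0.0, 0.0], priors=[0.8, 0.2],
                    n_edges=[0, 5], mask=[True, True]))  # 0
```

The `1 + N(s,a)` denominator makes the exploration bonus decay per edge as visits accumulate, while $\sqrt{N(s)}$ keeps siblings competitive; masked (pruned) actions simply never win the argmax.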

3.3 Constraint Handling

  • Simulation-level enforcement: All environment transitions are computed via an AC load-flow simulator (Grid2Op + lightsim2grid), which enforces Kirchhoff's laws and voltage and thermal limits. Invalid actions (e.g., those that would isolate a generator) are pruned.
  • Feasibility: Any MCTS branch resulting in a line overload $> 200\%$ is flagged terminal with a heavy negative reward.
  • Soft constraints: The shaped reward function includes a penalty of $0.5$ per offline line (the $n_{\text{off},t}$ term in the exponent).

4. Experimental Evaluation

Experiments are conducted on a large-scale simulated transmission grid corresponding to the WCCI 2022 L2RPN competition testbed:

  • Grid: 118 substations, 186 lines, 62 generators, 91 loads, 7 storage units.
  • Episode: One week (2016 time steps at 5-minute intervals), real-world profiles, with random line-disconnection events to test $n-1$ security.

Baselines

Multiple baseline and ablation settings are compared:

| Agent name       | Description                                            |
|------------------|--------------------------------------------------------|
| NoOp (BL)        | Do nothing                                             |
| R (BL)           | Redispatch + curtailment via cross-entropy optimizer   |
| T (brute-force)  | Try all ~2,000 switches each step, pick the best       |
| T (arg-max)      | Greedy policy-network top-1 action                     |
| T (top-25)       | Simulate the top-25 policy actions, pick the best      |
| T (MCTS, oracle) | MCTS with full simulator lookahead (upper bound)       |
| T (top-5)+R      | Top-5 policy candidates combined with redispatch       |

Key Results

Performance is assessed on "steps survived" (fraction of the horizon $T$), average decision time, and average redispatch and curtailment per step.

| Agent            | Steps survived (%) | Step time (ms) | Redispatch (MW) | Curtailment (MW) |
|------------------|--------------------|----------------|-----------------|------------------|
| NoOp (BL)        | 19.2               | 8.9            | 0.0             | 0.0              |
| R (BL)           | 74.5               | 31.0           | 504.2           | 484.4            |
| T (brute-force)  | 61.1               | 153.3          | 0.0             | 0.0              |
| T (arg-max)      | 50.4               | 13.4           | 0.0             | 0.0              |
| T (top-25)       | 65.3               | 34.2           | 0.0             | 0.0              |
| T (MCTS, oracle) | 76.9               | 1714.2         | 0.0             | 0.0              |
| T (top-5)+R      | 82.1               | 53.3           | 202.8           | 193.4            |

The T (top-5)+R agent achieves 82.1% survival (versus 74.5% for redispatch-only) and reduces average redispatch to roughly 40% of the R (BL) baseline. The method ranked 1st in the WCCI 2022 L2RPN competition (Dorfer et al., 2022).

5. Technical Challenges and Research Directions

5.1 Deployment Challenges

  • Scalability: The reduced action catalogue (roughly 2,000 actions out of $2^{532}$ theoretically possible topology configurations) is ad hoc; further progress requires domain-informed macro-actions or hierarchical RL decompositions.
  • Market coordination: Topology changes must be integrated with market schedules, redispatch contracts, and remedial-action pricing mechanisms.
  • Uncertainty: Handling forecast errors in renewables and load, as well as $n-1$ contingencies, remains an outstanding challenge.
  • Computational efficiency: Real-time deployment requires substantially lower inference and search latency, motivating both hardware and algorithmic optimization.

5.2 Algorithmic Enhancements

  • Hybrid Value Estimation: Co-training a neural value head alongside the heuristic or integrating a learned transition model (MuZero-style) may enhance planning capability.
  • Hierarchical/Hindsight Action Pooling: Aggregating substation-level action patterns based on electrical characteristics could enable effective exploration of larger action spaces.
  • Joint Training: Formulating joint topology and redispatch control within multi-agent RL or centralized frameworks may improve overall grid performance.
  • Curriculum Learning: Progressive scaling from IEEE-14/118-bus systems to large grids is recommended for stable policy learning.

6. Implementation Summary

A sketch of the training algorithm follows:

Input: reduced action set A, policy net P_θ, Grid2Op simulator ℰ, horizon T, discount γ
for epoch = 1 → N do
  for each scenario i in training set do
    s ← ℰ.reset(i)
    for t = 0 → T−1 do
      if safe(s) then
        a ← RecoveryOrNoop(s)              // no search needed in safe states
      else
        (π, a) ← MCTS(s, P_θ, γ, t_stop, k_skip)
        store (s, π)                       // training samples come from search steps only
      end
      s′, r ← ℰ.step(s, a)
      if blackout(s′) then break
      s ← s′
    end
  end
  θ ← θ − η · ∇_θ Σ_samples [ −πᵀ log P_θ(s) ]
end
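The control flow of the sketch above can be exercised end-to-end with stub components. In this toy version the environment, safety check, and "search" are all placeholders (`StubGrid`, `safe`, `search` are illustrative names, and the state is reduced to a single loading scalar); only the structure — no-op when safe, search when unsafe, break on blackout, store search samples for the policy update — mirrors the pseudocode.

```python
import math
import random

N_ACTIONS = 4
HORIZON = 20

class StubGrid:
    """Stand-in for the Grid2Op simulator: state is just the max line loading."""
    def reset(self, seed):
        self.rng = random.Random(seed)
        self.rho = 0.6
        return self.rho

    def step(self, action):
        # Action 0 relieves loading; other actions let congestion drift upward.
        self.rho += self.rng.uniform(-0.05, 0.10) - (0.12 if action == 0 else 0.0)
        self.rho = max(self.rho, 0.0)
        reward = math.exp(-max(self.rho - 0.5, 0.0))
        blackout = self.rho > 2.0
        return self.rho, reward, blackout

def safe(rho):
    return rho < 0.95

def search(rho, logits):
    """Stub for MCTS: softmax prior, one-hot 'visit distribution' at the argmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(N_ACTIONS), key=probs.__getitem__)
    return [1.0 if a == best else 0.0 for a in range(N_ACTIONS)], best

logits = [0.5, 0.0, 0.0, 0.0]      # toy "policy net" parameters
samples = []
env = StubGrid()
for scenario in range(3):
    rho = env.reset(scenario)
    for t in range(HORIZON):
        if safe(rho):
            a = 1                      # no-op in safe states
        else:
            pi, a = search(rho, logits)
            samples.append((rho, pi))  # only search steps are stored
        rho, r, blackout = env.step(a)
        if blackout:
            break
print(f"collected {len(samples)} search samples")
```

A real implementation would replace `StubGrid` with the Grid2Op environment and `search` with the MCTS of Section 3.2, and would apply the gradient step on the stored `(s, π)` pairs after each epoch.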

This design enables reproducibility and extensibility for AlphaZero-style topology optimization applied to high-fidelity, operationally constrained power grids, as detailed in (Dorfer et al., 2022).
