SigmaRL Policy for Multi-agent CAVs

Updated 30 January 2026
  • A SigmaRL-trained policy is a multi-agent motion-planning policy for CAVs produced by a sample-efficient framework with centralized training and decentralized execution.
  • It leverages MAPPO with a 32-dimensional ego-centric observation vector to optimize continuous controls through a kinematic vehicle model.
  • The approach delivers robust zero-shot generalization and effective sim-to-real transfer, achieving low collision rates and precise lane adherence across varied scenarios.

A SigmaRL-trained policy refers to a multi-agent motion-planning policy for connected and automated vehicles (CAVs) derived from the SigmaRL framework, characterized by sample-efficient centralized training, decentralized execution, and strong zero-shot generalization properties. SigmaRL-trained policies are optimized using Multi-Agent Proximal Policy Optimization (MAPPO), leveraging information-dense, ego-centric observations designed to facilitate robust transfer across diverse traffic scenarios and hardware platforms (Xu et al., 2024; Beerwerth et al., 23 Jan 2026).

1. Problem Formulation and Policy Objective

SigmaRL addresses the multi-agent motion-planning problem in connected mobility as a partially observable Markov game (POMG) with $N$ decentralized vehicle agents. Each agent $i$ operates over an observation space $\mathcal{O}^{(i)}$ and continuous action space $\mathcal{A}^{(i)}$, receiving local, structured observations encoding both self features and the geometry and states of its nearest neighbors: $o_t^{(i)} = \left(o_{\mathrm{self},t}^{(i)},\,\left\{o_{\mathrm{sur},t}^{(i,j)}\right\}_{j=1}^{n_{\mathrm{sur}}}\right)$. The training objective for each policy $\pi_\theta^{(i)}$ is to maximize the expected discounted return across stochastic transitions under a reward structure that penalizes collisions and lane departures and rewards adherence to reference trajectories and efficient velocities. Centralized training is conducted with access to joint state information, while execution is fully decentralized.
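The per-agent objective above can be sketched as a discounted return over a shaped reward. The reward weights below are illustrative assumptions, not the values used by SigmaRL:

```python
import numpy as np

# Hypothetical per-step reward combining the penalty/reward terms described
# above; the actual SigmaRL weights are not reproduced here.
def step_reward(collided, off_lane, centerline_dev, speed, ref_speed=0.8):
    r = 0.0
    if collided:
        r -= 10.0          # heavy collision penalty (assumed weight)
    if off_lane:
        r -= 2.0           # lane-departure penalty (assumed weight)
    r -= 0.5 * centerline_dev          # adherence to the reference trajectory
    r -= 0.1 * abs(speed - ref_speed)  # encourage efficient velocities
    return r

def discounted_return(rewards, gamma=0.99):
    """The quantity each policy pi_theta^{(i)} maximizes, for one rollout."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

Computing the return backwards avoids re-summing the tail for every timestep.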

2. Observation, State, and Action Design Principles

SigmaRL-trained policies rely on a 32-dimensional ego-centric observation vector per agent, with features engineered to improve both sample efficiency and generalization:

  • Self-features: own longitudinal speed, distances to lane boundaries, deviation from the lane centerline, and $n_{\mathrm{p,RP}}=3$ centerline reference points in local coordinates.
  • Neighbor-features: for up to $n_{\mathrm{sur}}=2$ nearest neighbors, four polygon vertices (projected onto the ego frame), neighbor speed, and Euclidean distance.
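The observation assembly above can be sketched as fixed-length concatenation with zero-padding for missing neighbors. The exact split of the 32 dimensions and the padding scheme are assumptions for illustration:

```python
import numpy as np

def ego_observation(self_feats, neighbors):
    """Concatenate self features with up to n_sur = 2 neighbor blocks,
    zero-padding empty neighbor slots (padding scheme is an assumption)."""
    n_sur, block = 2, 10            # 4 vertices * (x, y) + speed + distance
    parts = [np.asarray(self_feats, dtype=float)]
    for nb in neighbors[:n_sur]:
        vertices, speed, dist = nb  # vertices already projected into ego frame
        parts.append(np.concatenate([np.ravel(vertices), [speed, dist]]))
    for _ in range(n_sur - min(len(neighbors), n_sur)):
        parts.append(np.zeros(block))  # keep the vector length fixed
    return np.concatenate(parts)
```

A fixed-length vector is what allows a single shared MLP to consume observations regardless of how many neighbors are currently visible.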

Actions are two-dimensional continuous controls, $[u_v, u_\delta]$, representing velocity and steering rate, and are input directly to a kinematic single-track (bicycle) vehicle model. These actions are bounded and squashed using $\tanh$ before linear rescaling to the hardware control ranges.
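The squashing and vehicle model can be sketched as follows; the wheelbase `L` and timestep `dt` are illustrative values, not SigmaRL's:

```python
import numpy as np

def squash(raw, lo, hi):
    """tanh-squash an unbounded network output into the range [lo, hi]."""
    return lo + (np.tanh(raw) + 1.0) * 0.5 * (hi - lo)

def bicycle_step(x, y, psi, v, delta, u_v, u_delta, L=0.1, dt=0.05):
    """One Euler step of a kinematic single-track (bicycle) model.
    State: position (x, y), heading psi, speed v, steering angle delta.
    Controls: commanded velocity u_v and steering rate u_delta."""
    x += v * np.cos(psi) * dt
    y += v * np.sin(psi) * dt
    psi += v / L * np.tan(delta) * dt
    v = u_v                  # velocity applied directly, as in simulation
    delta += u_delta * dt    # steering rate integrates into steering angle
    return x, y, psi, v, delta
```

The $\tanh$ squash guarantees the raw Gaussian samples can never exceed the hardware limits after rescaling.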

3. Neural Network Architecture and Algorithmic Framework

SigmaRL employs homogeneous parameter sharing in MAPPO, yielding two main network modules:

  • Actor Network: A feed-forward MLP (32 input → 3×256 hidden units, Tanh activations) outputs mean and log-std for a diagonal Gaussian over the action space, enabling stochastic policy sampling.
  • Critic Network: Centralized state-value estimator with the same backbone, inputting the concatenated observations of all $N$ agents and producing a scalar estimate $V_\phi(\mathbf{o})$.
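The actor's forward pass can be sketched in plain numpy. The random weights and the "mean plus log-std in one output head" parameterization are assumptions; the layer widths follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, sizes):
    """Feed-forward pass with Tanh on hidden layers (random weights, for
    illustration only -- trained parameters would replace them)."""
    for i, (din, dout) in enumerate(zip(sizes[:-1], sizes[1:])):
        W = rng.standard_normal((din, dout)) * 0.1
        x = x @ W
        if i < len(sizes) - 2:      # no activation on the output layer
            x = np.tanh(x)
    return x

obs = np.zeros(32)                        # 32-dim ego-centric observation
out = mlp(obs, [32, 256, 256, 256, 4])    # 2 action means + 2 log-stds
mean, log_std = out[:2], out[2:]
action = mean + np.exp(log_std) * rng.standard_normal(2)  # diagonal Gaussian
```

Sampling from the diagonal Gaussian gives the stochastic exploration PPO relies on; at deployment the mean alone can be used for deterministic control.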

Optimization uses the PPO clipped surrogate loss (Schulman et al., 2017), extended to the multi-agent setting: $L(\theta,\phi) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\right)\right] - c_v\,\mathbb{E}_t\left[(V_\phi(s_t)-R_t)^2\right] + c_E\,\mathbb{E}_t\left[H(\pi_\theta(\cdot\mid s_t))\right]$, with typical hyperparameters $\gamma=0.99$, $\epsilon=0.2$, $c_v=0.5$, $c_E=0.01$.
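The objective translates almost term-for-term into code. A minimal sketch, working with log-probabilities so the ratio $r_t(\theta)$ stays numerically stable:

```python
import numpy as np

def ppo_objective(logp_new, logp_old, adv, values, returns, entropy,
                  eps=0.2, c_v=0.5, c_e=0.01):
    """Clipped surrogate objective from the equation above (to be maximized).
    logp_* are per-sample log pi(a_t|s_t); adv is the advantage estimate."""
    ratio = np.exp(logp_new - logp_old)              # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    policy_term = np.minimum(ratio * adv, clipped * adv).mean()
    value_term = ((values - returns) ** 2).mean()    # critic regression loss
    return policy_term - c_v * value_term + c_e * entropy.mean()
```

With identical old and new policies the ratio is exactly 1, so the clip has no effect and the objective reduces to the mean advantage minus the weighted value error.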

4. Training Protocol and Zero-Shot Evaluation

SigmaRL policies are trained solely on the intersection sub-map of the CPM Scenario, with randomized starting configurations and reference paths. The regime consists of 250 episodes (4096 samples per episode; ≈1 million samples total), with mini-batch updates over the collected rollouts to ensure sample efficiency (runtime under 1 hr on a single CPU). Episodes reset on collision, driving rapid learning of collision-avoidant behaviors.
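The sample budget stated above is a quick arithmetic check:

```python
# 250 episodes of 4096 environment samples each
episodes, samples_per_episode = 250, 4096
total_samples = episodes * samples_per_episode
print(total_samples)  # 1_024_000, i.e. roughly one million samples
```

By on-policy MARL standards this is a small budget, which is what the information-dense observation design is credited with enabling.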

Zero-shot generalization is then assessed by deploying the policy in four scenarios with distinct geometries (CPM full map, new intersection, on-ramp, roundabout), none included during training. Evaluation uses standardized metrics:

  • Agent-agent collision rate ($\mathrm{CR}_{\mathrm{A\text{-}A}}$)
  • Agent-lane collision rate ($\mathrm{CR}_{\mathrm{A\text{-}L}}$)
  • Centerline deviation (CD, cm)
  • Average speed (AS, m/s)

Performance consistently shows $\mathrm{CR}_{\mathrm{total}} < 1\%$ and $\mathrm{AS} > 0.67$ m/s across all domains.
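The metrics above can be computed from evaluation logs along these lines; the exact normalizations used by the benchmark (e.g. per 100 m driven) are assumptions here:

```python
import numpy as np

def collision_rate(n_events, total_distance_m):
    """Collision events per 100 m driven (assumed normalization)."""
    return 100.0 * n_events / total_distance_m

def centerline_deviation_cm(lateral_offsets_m):
    """Mean absolute lateral offset from the centerline, reported in cm."""
    return 100.0 * np.mean(np.abs(lateral_offsets_m))

def average_speed(speeds_mps):
    return float(np.mean(speeds_mps))
```

Normalizing collisions by distance rather than by episode makes rates comparable across maps of different sizes.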

5. Sim-to-Real Transfer and Domain Adaptation

SigmaRL-trained policies were subjected to sim-to-real transfer within the Cyber-Physical Mobility Lab benchmark (Beerwerth et al., 23 Jan 2026), which spans simulation (SigmaRL, direct control), a high-fidelity digital twin (grey-box vehicle models with realistic noise and hierarchical MPC control stack), and a physical testbed (1:18-scale “μCars” with camera-based localization, sensor/actuator latencies). Performance degradation arises chiefly from two sources:

  • Architectural mismatch: actions are applied directly in simulation but pass through a hierarchical model predictive control (MPC) stack in the digital twin and lab, which smooths inputs and reduces lane violations and centerline deviation.
  • Environmental realism: the physical domain introduces actuation delays, measurement noise, and localization latency, increasing collision rates ($\mathrm{CR}_{\mathrm{A}}$ rises by up to $+2.39$ events/100 m from simulation to lab).

Recommended mitigations include incorporating the MPC stack during training and domain randomization for hardware-related uncertainties.
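One way to realize the domain-randomization mitigation is to wrap the training environment so each episode samples a plausible actuation delay and observation noise level. The wrapper below is a hypothetical sketch; the delay range and noise magnitude are assumptions, not measured lab values:

```python
import numpy as np

rng = np.random.default_rng(1)

class RandomizedActuation:
    """Injects a per-episode actuation delay and observation noise, mimicking
    hardware latency and localization error (parameter values are assumed)."""

    def __init__(self, max_delay_steps=3, obs_noise_std=0.01):
        self.delay = int(rng.integers(0, max_delay_steps + 1))  # per episode
        self.obs_noise_std = obs_noise_std
        self.queue = [np.zeros(2)] * self.delay  # pending delayed actions

    def apply(self, action):
        """Return the action the plant actually executes this step."""
        self.queue.append(np.asarray(action, dtype=float))
        return self.queue.pop(0)

    def observe(self, obs):
        """Corrupt the observation with Gaussian localization noise."""
        return obs + rng.normal(0.0, self.obs_noise_std, size=np.shape(obs))
```

Training against a distribution of delays, rather than one fixed value, is what encourages policies that remain stable when the real hardware latency is unknown.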

6. Ablation Studies and Feature Importance

A comprehensive ablation across five observation-design strategies demonstrated that information-dense features (ego-view, neighbor vertices, geometric distances, boundary metrics, centerline deviation) are critical for generalization and sample efficiency (Xu et al., 2024). Removing any strategy:

  • Impairs safety (increases agent-agent collisions)
  • Reduces lane-following fidelity (raises centerline deviation)
  • Yields conservative or erratic velocity profiles

The full SigmaRL model ($M_0$) uniformly outperforms the ablated baselines in composite evaluation score, confirming the efficacy of its observation design.

7. Limitations and Future Directions

SigmaRL-trained policies remain subject to sim-to-real gaps stemming from architectural and physical domain mismatches. Recommendations include:

  • Integrating mid-level MPC in the learning loop
  • Employing domain randomization/system identification for latent hardware parameters
  • Relaxing global state, progressing toward partial observability and onboard perception (e.g., LiDAR/camera inputs)
  • Investigating communication constraints and scaling to full-scale platforms

A plausible implication is that further alignment between simulator and real-world control architectures is critical for robust sim-to-real policy deployment. Ongoing research into observation structure and domain adaptation will continue to inform best practices in MARL for mobility applications.
