Controlling the Solo12 Quadruped Robot with Deep Reinforcement Learning

Published 2 Aug 2023 in cs.RO and cs.LG | (2309.16683v1)

Abstract: Quadruped robots require robust and general locomotion skills to exploit their mobility potential in complex and challenging environments. In this work, we present the first implementation of a robust end-to-end learning-based controller on the Solo12 quadruped. Our method is based on deep reinforcement learning of joint impedance references. The resulting control policies follow a commanded velocity reference while being energy-efficient, robust, and easy to deploy. We detail the learning procedure and the method for transfer to the real robot. In our experiments, we show that the Solo12 robot is a suitable open-source platform for research combining learning and control because of the ease of transferring and deploying learned controllers.


Summary

  • The paper presents a deep RL-based controller achieving zero-shot sim-to-real transfer on the Solo12 quadruped.
  • It integrates an MLP policy with classical impedance control and explicit power loss modeling to optimize energy efficiency and locomotion performance.
  • The study employs curriculum learning and domain randomization to ensure robust velocity tracking and adaptive gait behavior in diverse terrains.

Deep Reinforcement Learning Control of the Solo12 Quadruped

Introduction

The paper "Controlling the Solo12 Quadruped Robot with Deep Reinforcement Learning" (2309.16683) presents a comprehensive methodology for end-to-end control of the Solo12 quadruped robot using deep reinforcement learning (RL). The work addresses the challenges of robust, energy-efficient, and transferable locomotion control on a low-cost, open-source platform. The approach leverages curriculum learning, domain randomization, and a carefully designed reward structure to achieve zero-shot sim-to-real transfer, with extensive empirical validation both in simulation and on physical hardware.

Methodological Framework

RL Formulation and Policy Architecture

The control problem is formulated as a continuous-state, continuous-action Markov Decision Process (MDP). The policy is parameterized by a multi-layer perceptron (MLP) with three hidden layers (256, 128, 32 units, Leaky ReLU activations), mapping proprioceptive and inertial state observations to joint angle displacements. The policy outputs are interpreted as displacements from a nominal pose, which are then used as targets for a PD impedance controller. This hybrid approach combines the stability of classical impedance control with the flexibility of learned policies.
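The policy architecture described above can be sketched as a plain NumPy forward pass. The hidden-layer sizes (256, 128, 32) and Leaky ReLU activations come from the paper; the observation dimension, weight initialization, and negative slope are illustrative assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Leaky ReLU activation; the negative slope is an assumed default
    return np.where(x > 0, x, slope * x)

def mlp_policy(obs, weights, biases):
    """Forward pass of the three-hidden-layer policy MLP.
    Maps an observation vector to 12 joint-angle displacements
    from the nominal pose (linear output layer)."""
    h = obs
    for W, b in zip(weights[:-1], biases[:-1]):
        h = leaky_relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

# Hypothetical dimensions: the observation size is illustrative only.
obs_dim, hidden, act_dim = 48, [256, 128, 32], 12
rng = np.random.default_rng(0)
dims = [obs_dim] + hidden + [act_dim]
weights = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i]))
           for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]

action = mlp_policy(rng.normal(size=obs_dim), weights, biases)
```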

A separate state estimation network, also an MLP, is trained via supervised learning to estimate the base linear velocity from IMU and joint encoder data, compensating for the lack of direct velocity measurements on the real robot.

State and Action Spaces

The state vector includes:

  • Base orientation, angular velocity (from IMU)
  • Joint angles, joint velocities, and histories of joint target errors and actions (to encode contact and terrain information)
  • Commanded 3D velocity
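The components above can be assembled into a single observation vector; the history lengths and component ordering below are illustrative assumptions, not the paper's exact layout:

```python
import numpy as np

def build_observation(base_quat, base_ang_vel, q, dq,
                      err_hist, act_hist, cmd_vel):
    """Concatenate the state components listed above into one vector.
    History lengths and ordering are illustrative assumptions."""
    return np.concatenate([base_quat, base_ang_vel, q, dq,
                           err_hist.ravel(), act_hist.ravel(), cmd_vel])

obs = build_observation(
    base_quat=np.array([0.0, 0.0, 0.0, 1.0]),  # identity orientation
    base_ang_vel=np.zeros(3),
    q=np.zeros(12), dq=np.zeros(12),
    err_hist=np.zeros((2, 12)),   # two past joint target errors (assumed)
    act_hist=np.zeros((2, 12)),   # two past actions (assumed)
    cmd_vel=np.zeros(3),
)
```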

The action space consists of 12-dimensional joint angle displacements, scaled and added to the nominal configuration. The resulting target angles are tracked by the PD controller, which computes torques as $\tau_t = K_p (q^{target}_t - q_t) - K_d \dot{q}_t$. This design choice is motivated by empirical evidence that position-based control is more sample-efficient and robust to sim-to-real transfer than direct torque control.
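The torque law translates directly into code. The gains, action scale, and nominal pose below are placeholder values, not the identified Solo12 parameters:

```python
import numpy as np

def pd_torques(q_target, q, dq, kp=3.0, kd=0.2):
    """Joint impedance control: tau = Kp * (q_target - q) - Kd * dq.
    Gains are placeholders; the real controller uses tuned values."""
    return kp * (q_target - q) - kd * dq

q_nominal = np.zeros(12)        # nominal joint configuration (placeholder)
action = 0.05 * np.ones(12)     # scaled policy output: joint displacements
q_target = q_nominal + action   # targets tracked by the PD controller

tau = pd_torques(q_target, q=np.zeros(12), dq=np.zeros(12))
```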

Reward Function Design

The reward function is a weighted sum of:

  • Velocity tracking: Exponential penalty on deviation from commanded velocity
  • Foot clearance: Penalizes insufficient foot lift during swing
  • Foot slip: Penalizes nonzero foot velocity during stance
  • Base orientation/velocity: Penalizes roll, pitch, and vertical velocity deviations
  • Joint pose: Penalizes deviation from nominal joint angles
  • Power loss: Penalizes actuator energy consumption, explicitly modeling Joule and friction losses
  • Action smoothness: Penalizes first and second differences in target joint angles

The explicit modeling of actuator power loss, using identified friction and electrical parameters, is a notable contribution. The authors demonstrate that tuning a single power loss weight is more effective and interpretable than tuning separate penalties for torque, velocity, and acceleration.
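A minimal sketch of this power loss term, assuming the standard decomposition into Joule heating and Coulomb/viscous friction losses; the coefficients below are illustrative, not the identified Solo12 actuator parameters:

```python
import numpy as np

def power_loss(tau, dq, r_over_k2=0.1, tau_friction=0.02, b_visc=0.01):
    """Actuator power loss: Joule heating plus friction losses.
    All coefficients are placeholder values for illustration."""
    joule = r_over_k2 * np.sum(tau ** 2)  # (R / K^2) * tau^2 per joint
    friction = np.sum((tau_friction + b_visc * np.abs(dq)) * np.abs(dq))
    return joule + friction

def energy_reward(tau, dq, c_E=3.0):
    """A single weight c_E penalizes total power loss in the reward."""
    return -c_E * power_loss(tau, dq)

r = energy_reward(tau=0.3 * np.ones(12), dq=np.ones(12))
```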

Domain Randomization and Curriculum Learning

To bridge the sim-to-real gap, the training incorporates:

  • Observation noise: Uniform noise on all proprioceptive and inertial measurements
  • Dynamics noise: Randomization of PD gains
  • Curriculum learning: Gradual increase of penalty weights and noise magnitudes, and staged introduction of rough terrain

The curriculum is decoupled for reward penalties and noise, allowing the agent to first master velocity tracking before being exposed to the full complexity of the task. Terrain randomization is introduced only in the final training phase, with PPO hyperparameters adjusted to prevent catastrophic forgetting.
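The decoupled curricula can be sketched as simple per-iteration schedules; the linear ramp shape and iteration counts are assumptions for illustration:

```python
def curriculum_weight(it, start_it, end_it, w_final):
    """Linearly ramp a penalty weight (or noise magnitude) from 0 to
    w_final between start_it and end_it; constant outside that window."""
    if it <= start_it:
        return 0.0
    if it >= end_it:
        return w_final
    return w_final * (it - start_it) / (end_it - start_it)

# Penalties ramp early, noise ramps later, rough terrain is introduced
# only in a final phase -- all iteration counts are illustrative.
it = 500
penalty_w = curriculum_weight(it, start_it=0, end_it=1000, w_final=1.0)
noise_mag = curriculum_weight(it, start_it=1000, end_it=2000, w_final=0.05)
terrain_on = it >= 2500
```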

Experimental Results

Velocity Tracking and Gait Adaptation

The learned controller achieves accurate tracking of commanded 3D velocities, as validated by motion capture and state estimation. The policy generalizes to both symmetric and asymmetric leg configurations. Notably, the gait frequency adapts proportionally to the commanded velocity, an emergent property not explicitly encoded in the reward or architecture.

Energy Efficiency

Varying the power loss penalty $c_E$ produces a clear trade-off between energy consumption and velocity tracking accuracy. For $c_E$ in the range $[3, 4]$, the policy achieves a >30% reduction in power consumption with only a modest increase in velocity error. For $c_E > 10$, the policy degenerates, prioritizing energy minimization over locomotion. The explicit power loss penalty outperforms traditional torque/velocity/acceleration penalties in both efficiency and ease of tuning.

Ablation Studies

Ablation experiments confirm the necessity of curriculum learning and staged terrain exposure. Policies trained without curriculum exhibit slower convergence, higher variance, and lower final performance. Training solely on velocity tracking without penalties yields high tracking accuracy but poor overall locomotion quality, with excessive slip and instability.

Sim-to-Real Transfer

The policy, trained entirely in simulation with domain randomization, transfers successfully to the physical Solo12 robot on the first attempt. The controller demonstrates robust locomotion on diverse indoor and outdoor terrains, including slopes and rough ground, without requiring actuator model learning or additional sim-to-real adaptation. This is attributed to the fast, low-inertia dynamics of Solo12 and the sufficiency of simple randomization strategies.

Implementation and Resource Considerations

  • Training is performed in Raisim with 300 parallel simulated robots, requiring approximately 300 million samples for convergence.
  • The policy inference time is approximately $10~\mu s$ on a Raspberry Pi 4, enabling real-time deployment at 100 Hz.
  • The open-source codebase aligns with the Solo12 platform's mission of reproducibility and accessibility.

Implications and Future Directions

This work demonstrates that deep RL, when combined with appropriate reward shaping, domain randomization, and curriculum learning, can yield robust, energy-efficient, and transferable controllers for low-cost quadruped robots. The explicit modeling of actuator power loss and the minimal sim-to-real gap achieved without complex actuator modeling are significant findings.

Practically, this approach lowers the barrier for deploying advanced locomotion controllers on open-source platforms, facilitating broader research and educational use. Theoretically, the results suggest that for certain robot morphologies, simple randomization and impedance-based action spaces may suffice for direct sim-to-real transfer, challenging the necessity of more elaborate sim-to-real techniques in all cases.

Future research directions include:

  • Large-scale empirical comparisons between model-based and RL-based controllers on Solo12
  • Extension to more complex tasks (e.g., perception-driven locomotion, manipulation)
  • Investigation of transferability to heavier or more complex quadruped platforms
  • Integration of vision and exteroceptive sensing for terrain adaptation

Conclusion

The paper provides a rigorous and reproducible methodology for deep RL-based control of the Solo12 quadruped, achieving robust, energy-aware, and transferable locomotion. The combination of curriculum learning, domain randomization, and explicit power modeling sets a strong precedent for future work in learning-based legged locomotion, particularly on open-source hardware platforms.
