Hardware as Policy: Mechanical and Computational Co-Optimization using Deep Reinforcement Learning

Published 11 Aug 2020 in cs.RO | (2008.04460v2)

Abstract: Deep Reinforcement Learning (RL) has shown great success in learning complex control policies for a variety of applications in robotics. However, in most such cases, the hardware of the robot has been considered immutable, modeled as part of the environment. In this study, we explore the problem of learning hardware and control parameters together in a unified RL framework. To achieve this, we propose to model the robot body as a "hardware policy", analogous to and optimized jointly with its computational counterpart. We show that, by modeling such hardware policies as auto-differentiable computational graphs, the ensuing optimization problem can be solved efficiently by gradient-based algorithms from the Policy Optimization family. We present two such design examples: a toy mass-spring problem, and a real-world problem of designing an underactuated hand. We compare our method against traditional co-optimization approaches, and also demonstrate its effectiveness by building a physical prototype based on the learned hardware parameters. Videos and more details are available at https://roamlab.github.io/hwasp/ .

Authors (3)

Citations (43)

Summary

The paper introduces Hardware as Policy (HWasP), a method that jointly optimizes mechanical design and control by treating hardware parameters as part of the reinforcement learning policy.
It models mechanical components, such as tendon-driven transmissions in robotic hands, as differentiable functions so that they can be optimized with gradient-based methods.
In the reported experiments, HWasP learns more efficiently and reaches better designs than traditional co-optimization baselines such as CMA-ES, especially in high-dimensional parameter spaces.

Introduction

In "Hardware as Policy: Mechanical and Computational Co-Optimization using Deep Reinforcement Learning," the authors propose a framework that optimizes a robot's hardware and its computational control policy together within a single reinforcement learning (RL) formulation. The central idea is to model the robot hardware as part of the policy, termed a "hardware policy," so that mechanical and computational elements are optimized jointly. This lets Deep Reinforcement Learning (Deep RL) address co-design problems that are traditionally handled by separate optimization processes.

Figure 1: Traditional perspective. Reinforcement Learning with a purely computational policy.

Methodology

The core innovation of the paper is to treat hardware parameters as differentiable components of the RL policy. The mechanical model of the robot is written as part of a computational graph, so gradient-based optimization can update the hardware parameters alongside the computational control strategy. Concretely, mechanical elements such as the tendon-driven transmissions in robotic hands are modeled as differentiable functions inside the learning algorithm.
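One way to make this concrete is the chain rule sketched below. The notation is introduced here for illustration and is not taken from the paper: $f_\theta$ stands for the computational policy with parameters $\theta$, $h_\phi$ for the differentiable hardware model with hardware parameters $\phi$, $o$ for the observation, $u$ for the motor command, and $a$ for the output that the hardware applies to the rest of the environment.

```latex
u = f_\theta(o), \qquad a = h_\phi(u, o)

\frac{\partial a}{\partial \phi} = \frac{\partial h_\phi}{\partial \phi}(u, o),
\qquad
\frac{\partial a}{\partial \theta}
  = \frac{\partial h_\phi}{\partial u}(u, o)\,
    \frac{\partial f_\theta}{\partial \theta}(o)
```

Under these assumptions, both derivatives come from the same graph, so one backward pass can in principle produce updates for $\theta$ and $\phi$ together.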

The proposed Hardware as Policy (HWasP) method builds on standard Policy Optimization techniques such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). It differs from classical approaches in treating the hardware as part of the agent, so the environment model only needs to cover the remaining, non-hardware interactions.
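The idea of a policy gradient that reaches hardware parameters can be illustrated with a very small score-function (REINFORCE-style) sketch. This is not the paper's implementation, and it is far simpler than TRPO or PPO. Everything named here is an assumption made for illustration: a scalar linear controller with gain `w`, a scalar "hardware" gain `k`, and Gaussian noise placed on the final action.

```python
import math
import random


def mean_action(w, k, obs):
    """Mean of the composite policy: controller gain w, then hardware gain k."""
    motor_cmd = w * obs   # computational policy (hypothetical linear controller)
    return k * motor_cmd  # hardware policy (hypothetical transmission gain)


def log_prob(a, w, k, obs, sigma):
    """Log-density of action a under a Gaussian centred on mean_action."""
    mu = mean_action(w, k, obs)
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def grad_log_prob(a, w, k, obs, sigma):
    """Analytic gradient of log_prob with respect to (w, k) via the chain rule."""
    mu = mean_action(w, k, obs)
    common = (a - mu) / sigma ** 2
    return common * k * obs, common * w * obs


def reinforce_step(w, k, obs, reward_fn, sigma=0.1, lr=0.01, n=256, seed=0):
    """One score-function gradient ascent step on both w and k."""
    rng = random.Random(seed)
    mu = mean_action(w, k, obs)
    samples = []
    for _ in range(n):
        a = rng.gauss(mu, sigma)
        samples.append((a, reward_fn(a)))
    baseline = sum(r for _, r in samples) / n  # mean reward as a simple baseline
    grad_w = grad_k = 0.0
    for a, r in samples:
        dw, dk = grad_log_prob(a, w, k, obs, sigma)
        grad_w += (r - baseline) * dw / n
        grad_k += (r - baseline) * dk / n
    return w + lr * grad_w, k + lr * grad_k
```

Calling `reinforce_step(0.5, 2.0, 1.5, lambda a: -(a - 3.0) ** 2)` updates the controller gain and the hardware gain from the same batch of sampled actions, which is the sense in which the hardware is "part of the agent" in this toy setting.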

Figure 2: Hardware as Policy-Minimal.

Case Studies

Mass-spring Toy Problem

The paper first demonstrates HWasP on a mass-spring toy problem. In this one-dimensional setup, a motor drives two masses connected by springs, and both the motor control policy and the spring stiffnesses are subject to optimization. HWasP handles this by modeling the spring stiffness as part of the action space, so that reinforcement learning optimizes the spring constants and the motor control policy concurrently.
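To give a feel for joint optimization of a controller and a spring, here is a deliberately simplified sketch. It is not the paper's setup or algorithm: it assumes a single hanging mass rather than two, a proportional controller instead of a learned policy, a deterministic cost instead of an RL return, and central finite differences with backtracking as a stand-in for automatic differentiation. All names and constants are hypothetical.

```python
def simulate_cost(gain, stiffness, target=-0.5, mass=1.0, damping=2.0,
                  gravity=9.81, effort_weight=0.01, dt=0.01, steps=300):
    """Mean tracking-plus-effort cost for one rollout of the toy system."""
    x, v, total = 0.0, 0.0, 0.0
    for _ in range(steps):
        u = gain * (target - x)   # computational part: motor force
        spring = -stiffness * x   # hardware part: spring force
        accel = (u + spring - damping * v) / mass - gravity
        v += dt * accel           # semi-implicit Euler integration
        x += dt * v
        err = x - target
        total += err * err + effort_weight * u * u
    return total / steps


def finite_difference_grad(params, eps=1e-5):
    """Central-difference gradient of simulate_cost w.r.t. (gain, stiffness)."""
    grads = []
    for i in range(len(params)):
        hi, lo = list(params), list(params)
        hi[i] += eps
        lo[i] -= eps
        grads.append((simulate_cost(*hi) - simulate_cost(*lo)) / (2 * eps))
    return grads


def co_optimize(params, iters=50, lr=1.0):
    """Gradient descent on both parameters, accepting only improving steps."""
    params = tuple(params)
    best = simulate_cost(*params)
    for _ in range(iters):
        grads = finite_difference_grad(params)
        step = lr
        for _ in range(20):  # backtracking: halve the step until cost drops
            cand = tuple(max(p - step * g, 0.0) for p, g in zip(params, grads))
            cost = simulate_cost(*cand)
            if cost < best:
                params, best = cand, cost
                break
            step *= 0.5
    return params, best


if __name__ == "__main__":
    (gain, stiffness), cost = co_optimize((1.0, 5.0))
    print(f"gain={gain:.3f} stiffness={stiffness:.3f} cost={cost:.5f}")
```

In this sketch the effort penalty gives the spring a real job: a well-chosen stiffness holds the mass near the target against gravity, so the motor has less to do. That is the kind of trade-off between hardware and control that co-optimization is meant to find.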

Figure 3: Problem descriptions, left: HWasP, right: HWasP-Minimal.

Underactuated Hand Design

In a more complex scenario, the method is applied to the design of an underactuated robotic hand. The hand is driven by a single motor through tendon mechanisms and is optimized for grasping tasks. Hardware parameters, including pulley radii and tendon elasticities, are adjusted together with the computational policy that controls the hand. The paper reports simulation results for the learned hand designs and validates them by building a physical prototype from the learned hardware parameters.
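A tendon transmission is a natural candidate for a differentiable hardware model, because a pulley of radius r turns tendon tension T into a joint torque r * T. The sketch below is a static toy model, not the hand model used in the paper: it assumes each joint has a linear return spring (an assumption introduced here), so the joint settles at angle r * T / k, and it tunes only the pulley radii toward a hypothetical target posture.

```python
def joint_angles(tension, radii, stiffness):
    """Static equilibrium angle of each joint: pulley torque r*T balances k*q."""
    return [r * tension / k for r, k in zip(radii, stiffness)]


def posture_loss(tension, radii, stiffness, target):
    """Squared distance between the resulting posture and a target posture."""
    angles = joint_angles(tension, radii, stiffness)
    return sum((q - t) ** 2 for q, t in zip(angles, target))


def loss_grad_radii(tension, radii, stiffness, target):
    """Analytic gradient of posture_loss with respect to each pulley radius."""
    angles = joint_angles(tension, radii, stiffness)
    return [2.0 * (q - t) * tension / k
            for q, t, k in zip(angles, target, stiffness)]


def tune_radii(tension, radii, stiffness, target, lr=0.05, steps=200):
    """Plain gradient descent on the pulley radii."""
    radii = list(radii)
    for _ in range(steps):
        grads = loss_grad_radii(tension, radii, stiffness, target)
        radii = [r - lr * g for r, g in zip(radii, grads)]
    return radii


if __name__ == "__main__":
    tuned = tune_radii(2.0, [0.2, 0.2, 0.2], [1.0, 1.5, 2.0], [0.8, 0.6, 0.4])
    print(tuned)
```

Because one tendon tension drives every joint, the radii alone decide how a single motor command is distributed across the fingers. That is why, in an underactuated design, these parameters matter as much as the controller.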

Figure 4: Hand design optimization problem. Left: hand kinematics, dimension, and tendon routing. Right: successful grasps executed in simulation and on a real hand prototype.

Comparative Analysis

HWasP and its minimal variant, HWasP-Minimal, are compared against traditional co-design approaches based on evolutionary strategies such as CMA-ES. The results indicate that HWasP learns more efficiently and reaches better solutions, especially in high-dimensional parameter spaces, an advantage attributed to its use of gradient-based optimization.

Figure 5: Training curves for the grasping problem. Left: Z-Grasp with a large hardware parameter search range. Middle: Z-Grasp with a small hardware search range. Right: 3D-Grasp with a small search range.

Conclusion

The paper presents a framework for simultaneous mechanical and computational optimization of robotic systems, and reports improvements over traditional co-optimization methods by using RL. HWasP integrates mechanical components into the policy structure so that hardware and control are optimized in the same loop. Its success on both the toy problem and the hand design problem suggests potential for broader use in robotics, where hardware and control interact closely. Future work may extend the framework to more aspects of hardware design and explore real-time adaptation in dynamic environments.
