Using reinforcement learning to probe the role of feedback in skill acquisition

Published 9 Dec 2025 in cs.AI, cs.LG, cs.RO, and eess.SY | (2512.08463v1)

Abstract: Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective -- drag minimization or maximization -- is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.


Authors (2)

Summary

The paper shows that high-dimensional flow feedback lets an RL agent discover high-performance drag-control policies in a chaotic fluid system within minutes of real-world interaction.
The methodology runs DreamerV3 on a tabletop circulating water channel for drag maximization and minimization tasks, with and without flow feedback during training.
The findings show that rich feedback can be essential for policy discovery during training (drag maximization fails without it, while drag minimization still succeeds, more slowly and less reliably), whereas the learned action sequences can be executed without any feedback.

Using Reinforcement Learning to Probe the Role of Feedback in Skill Acquisition

Introduction

The paper "Using reinforcement learning to probe the role of feedback in skill acquisition" (2512.08463) is an experimental investigation of how skill acquisition depends on feedback, using a generalist RL agent interfaced directly with a physical, high-dimensional chaotic system. Rather than employing human subjects, the study uses DreamerV3, a state-of-the-art model-based RL agent, to control a real spinning cylinder immersed in a chaotic flow. The experiments cover two control tasks, drag maximization and drag minimization, which allows a systematic comparison of the information needed to learn a motor skill with the information needed to execute it under varying feedback conditions.

Experimental Setup and Methodology

The experimental platform is a low-cost tabletop circulating water channel (CWC), depicted in Figure 1, housing an actuated cylinder whose rotation rate is directly controlled by the RL agent. Depending on the configuration, observations include drag force measurements, commanded and observed rotation rates, and optionally a high-dimensional flow field estimate computed in real time using particle image velocimetry (PIV). The flow is in principle infinite-dimensional, governed by the incompressible Navier–Stokes equations in a chaotic regime, which makes accurate simulation impractical and makes the setup a demanding testbed for RL algorithms.

Figure 1: Top: The tabletop circulating water channel setup with labeled hardware components and PIV-based vorticity snapshots in the wake of the actuated cylinder.
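The vorticity snapshots mentioned in the caption are derived from a measured velocity field. As a minimal sketch, assuming the PIV estimate is available as two 2-D arrays `u` and `v` on a uniform grid (the function name, array layout, and grid spacing are illustrative assumptions, not the authors' pipeline), the out-of-plane vorticity $\omega_z = \partial v/\partial x - \partial u/\partial y$ could be computed as follows:

```python
import numpy as np

def vorticity(u, v, dx, dy):
    """Out-of-plane vorticity dv/dx - du/dy on a uniform grid.

    u, v: 2-D velocity components indexed [row (y), column (x)].
    dx, dy: grid spacing along x and y (assumed uniform).
    """
    dv_dx = np.gradient(v, dx, axis=1)  # derivative along columns (x)
    du_dy = np.gradient(u, dy, axis=0)  # derivative along rows (y)
    return dv_dx - du_dy

# Sanity check on a synthetic field (not experimental data):
# solid-body rotation u = -Omega*y, v = Omega*x has vorticity 2*Omega.
Omega = 1.5
x = np.linspace(-1.0, 1.0, 21)
y = np.linspace(-1.0, 1.0, 17)
X, Y = np.meshgrid(x, y)  # arrays of shape (17, 21), rows follow y
u = -Omega * Y
v = Omega * X
w = vorticity(u, v, dx=x[1] - x[0], dy=y[1] - y[0])
```

For real PIV data the velocity field is noisy, so some smoothing before differentiation would likely be needed; that step is omitted here.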

The RL agent’s action is the normalized instantaneous angular velocity $\omega(t)$ of the cylinder. The reward signal differs by task: for minimization, it is the episode-average decrease in drag relative to a baseline; for maximization, it is the episode-average increase.
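The task-dependent reward described above can be sketched in a few lines. This is an illustrative reading only: the function name, the normalization by the baseline, and the choice of baseline value (supplied by the caller) are assumptions and are not taken from the paper.

```python
import numpy as np

def episode_reward(drag, baseline_drag, task):
    """Episode-average relative drag change, signed so that higher is better.

    drag: drag samples recorded over one episode.
    baseline_drag: reference drag level (assumed given by the caller).
    task: "maximize" or "minimize".
    """
    drag = np.asarray(drag, dtype=float)
    # Assumption: the drag change is normalized by the baseline.
    relative_change = (drag.mean() - baseline_drag) / baseline_drag
    if task == "maximize":
        return relative_change    # reward an increase in drag
    if task == "minimize":
        return -relative_change   # reward a decrease in drag
    raise ValueError("task must be 'maximize' or 'minimize'")
```

With this sign convention the two tasks share the same code path and differ only in the sign of the reward, which mirrors how the summary describes them.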

Principal Findings

Role of Flow Feedback in Acquisition Versus Execution

A pivotal outcome is the dissociation between the information needed for learning and the information needed for execution. With access to high-dimensional flow feedback during training, DreamerV3 discovers high-performance policies for both the drag maximization and drag minimization tasks within minutes, which is remarkable given the complexity of the flow and the absence of simulation. When the resulting action sequence is replayed open-loop (without any feedback or observations, but from the same initialization), the realized performance nearly matches that of the original closed-loop run.
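The replay protocol can be illustrated with a toy example. The sketch below records the actions of a feedback policy and then replays them with no observations at all. Everything in it is hypothetical: `ToyPlant` is a noisy first-order lag standing in for the water channel, and `feedback_policy` is a placeholder rather than a trained DreamerV3 agent, so it shows the procedure and says nothing about the real flow.

```python
import numpy as np

class ToyPlant:
    """Hypothetical stand-in for the experiment: a noisy first-order lag."""

    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        noise = 0.01 * self.rng.standard_normal()
        self.state = 0.9 * self.state + 0.1 * action + noise
        return self.state  # the observation is the scalar state

def feedback_policy(obs, t):
    """Placeholder closed-loop policy that uses the current observation."""
    return float(np.sin(0.2 * t) - 0.5 * obs)

def run_closed_loop(env, policy, steps):
    """Run the policy with feedback and record actions and outputs."""
    obs = env.reset()
    actions, outputs = [], []
    for t in range(steps):
        action = policy(obs, t)
        obs = env.step(action)
        actions.append(action)
        outputs.append(obs)
    return np.array(actions), np.array(outputs)

def replay_open_loop(env, actions):
    """Replay a recorded action sequence; observations are never used."""
    env.reset()  # same initialization as the closed-loop run
    return np.array([env.step(a) for a in actions])

actions, y_closed = run_closed_loop(ToyPlant(seed=0), feedback_policy, steps=200)
# A different seed means a different noise realization during replay.
y_replay = replay_open_loop(ToyPlant(seed=1), actions)
max_gap = float(np.max(np.abs(y_closed - y_replay)))
```

In this stable linear toy the closed-loop and replayed outputs stay close despite different noise. The notable point in the paper is that a similar agreement is reported on a chaotic physical flow.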

Figure 2: System overview: The RL agent commands the rotation; observations can include varying subsets of proprioceptive and exteroceptive signals, including the PIV flow estimate.

Figure 3: Training curves for drag variation (normalized) for DreamerV3 with and without flow feedback in both tasks, showing mean and min-max bands.

Notably, withholding flow feedback during training produces drastically asymmetric results:

In drag minimization, the agent still discovers good policies without flow feedback, although convergence is slower and results are more variable.
In drag maximization, discovery of high-performance policies fails entirely without flow feedback, even though the two tasks share the same physical system, have open-loop solutions of comparable complexity, and differ only in the sign of the reward.

Feedback as a Catalyst for Skill Discovery

A core claim, supported empirically in the study, is that the information needed to learn a high-performance policy can far exceed that required to execute it. In drag maximization, the flow exhibits deceptive non-minimum-phase transients: actions that ultimately increase drag initially reduce it. This effect, evidenced by anti-correlated early rewards, misleads exploration and leaves the policy search stuck at suboptimal solutions when only limited feedback is available. Including flow feedback disambiguates these misleading transients and enables the agent to discover the open-loop drag-maximizing input.
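The inverse-response behaviour of a non-minimum-phase system can be seen in a textbook example. The sketch below simulates the unit-step response of the transfer function $G(s) = (1 - s)/(s + 1)^2$, which has a right-half-plane zero. This linear system is chosen purely for illustration and is not a model of the cylinder wake; the forward-Euler integration and step size are likewise assumptions made for brevity.

```python
import numpy as np

def step_response(t_end=10.0, dt=0.01):
    """Unit-step response of G(s) = (1 - s) / (s + 1)^2 via forward Euler.

    Controllable canonical form: x' = A x + B u, y = C x.
    """
    A = np.array([[0.0, 1.0], [-1.0, -2.0]])
    B = np.array([0.0, 1.0])
    C = np.array([1.0, -1.0])
    x = np.zeros(2)
    n = int(round(t_end / dt))
    y = np.empty(n)
    for k in range(n):
        y[k] = C @ x
        x = x + dt * (A @ x + B * 1.0)  # constant unit input
    return np.arange(n) * dt, y

t, y = step_response()
early_mean = y[:50].mean()  # average output over the first 0.5 time units
final_value = y[-1]         # output near the end of the simulation
```

The output first moves in the wrong direction (the analytic response $1 - e^{-t} - 2te^{-t}$ dips to about $-0.21$ at $t = 0.5$) before settling near $+1$. An agent that judged this action by `early_mean` alone would see a negative signal for an input whose long-run effect is positive, which is the kind of misleading transient described above.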

Figure 4: Replay of learned policies in open loop shows close agreement between closed-loop (blue) and replayed (red) drag profiles; the advantage of feedback is marginal at execution except for small corrections.

Figure 5: Aggregated results over multiple observation configurations; flow feedback is critical for sample-efficient optimization in drag maximization but not for minimization.

Insensitivity to Model Size and Algorithm Choice

Ablations indicate that the observed phenomena are robust to variations in observation subsets, DreamerV3 model size, and the method of feedback removal (branch omission vs. null input). They also persist across standard RL baselines (PPO, PPO-R, SAC), which likewise fail to solve the maximization task without flow feedback.
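The two feedback-removal methods named above can be sketched on a dictionary-style observation. This is one plausible reading, not the authors' code: "branch omission" is taken to mean that the flow entry (and thus its encoder branch) is dropped entirely, and "null input" to mean that the entry is kept but filled with zeros. The keys `drag`, `rotation`, and `flow` and the array shapes are hypothetical.

```python
import numpy as np

def omit_branch(obs, key="flow"):
    """Branch omission: drop the entry so the agent has no input for it."""
    return {k: v for k, v in obs.items() if k != key}

def null_input(obs, key="flow"):
    """Null input: keep the entry and its shape, but zero out its content."""
    out = dict(obs)  # shallow copy; the original dict is left untouched
    out[key] = np.zeros_like(obs[key])
    return out

# Hypothetical observation with a scalar drag reading, two rotation
# rates (commanded, observed), and a two-component flow field.
obs = {
    "drag": np.array([0.7]),
    "rotation": np.array([0.1, 0.12]),
    "flow": np.ones((32, 32, 2)),
}
obs_omitted = omit_branch(obs)
obs_nulled = null_input(obs)
```

The two variants differ in what the network sees: with omission the architecture itself changes, whereas with a null input the architecture is unchanged and only the information content is removed.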

Figure 6: Learning dynamics under model size variation; performance outcomes are invariant to model parameter count.

Figure 7: PPO, PPO-R, and SAC fail to approach optimal drag-maximizing performance without flow feedback.

Theoretical and Practical Implications

Feedback, Exploration, and Non-Minimum-Phase Dynamics

The experimental outcomes connect directly to robust and adaptive control theory, especially regarding non-minimum-phase systems, where transient responses can actively mislead reward-driven exploration. The findings illustrate empirically that sufficiently rich observations during training can resolve ambiguities and mitigate pathologies that confound reward-driven RL, model-based or model-free, in high-dimensional, partially observable, nonlinear environments.

Privileged Information and Architecture Desiderata

The study motivates a class of observation-adaptive agents: architectures in which the selection of sensor inputs, and even the use of feedback at all, is dynamic and context-dependent, potentially regulated by epistemic uncertainty estimates or predicted performance loss. Existing paradigms such as asymmetric actor-critic, distilled privileged sensing, and event-triggered RL are referenced, but the results here go further, arguing for architectures that can exploit high-dimensional feedback during learning yet disregard it (except for marginal corrections) during execution. This picture is consistent with motor-control phenomena observed in humans and other biological systems and suggests directions for both theory and hardware implementations.

Fluids Control and Benchmarks

On the domain science side, the work contributes an open-source, physically grounded benchmark for fluids control with high-throughput, real-time flow estimation. Such resources are scarce in fluids control, where simulation has severe limitations.

Outlook and Future Directions

The study's methodology is broadly extensible to other nonlinear, non-minimum-phase physical systems. Next steps include theoretically characterizing the regimes under which task/reward design, system dynamics, or observation structure create "kind" versus "wicked" learning conditions; developing adaptive-feedback RL architectures that optimize the cost-performance tradeoff of sensor utilization; and applying these principles to higher-dimensional embodied control in robotics and bio-inspired automation.

The experimental testbed could also serve as a foundation for research into continual learning, adversarial fluid control, and sensor placement optimization, among other directions.

Conclusion

This work provides evidence that skill acquisition in complex physical environments can demand a richer sensorimotor data stream than is required for execution, and that whether it does depends on the task goal encoded in the reward. Flow feedback is shown to be essential for learning a high-performance open-loop policy in drag maximization, where non-minimum-phase transients mislead exploration, but unnecessary for executing the learned policy. These findings underscore the limitations of fixed-observation RL pipelines and motivate research into flexible, observation-adaptive agent designs. The low-cost physical setup and benchmark released by the authors open the way for further and cross-domain investigations into sensorimotor learning.
