Identify critic parameterizations and compute-allocation strategies

Identify the appropriate parameterization for reinforcement learning critics, and determine how to allocate variable test-time compute to integration or iterative computation, so as to best exploit the interplay between fast inner-loop adaptation and slow outer-loop weight updates.

Background

The authors connect their findings to a broader bi-level perspective, in which weight updates act as a slow outer loop and iterative computation during inference acts as a fast inner loop. They suggest that aligning training with this iterative mechanism is crucial for robust learning.

However, the design space for critic parameterizations and policies for allocating variable compute across integration steps is not yet characterized, and the authors explicitly flag these choices as open questions.
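The bi-level structure described above can be made concrete with a toy sketch. The code below is a hypothetical illustration, not the paper's method: a critic whose value estimate is produced by Euler-integrating a small velocity field over pseudo-time, so that the step count is the variable test-time compute knob (the fast inner loop), while the field's weights are what a TD-style outer loop would slowly update. All names here (`velocity`, `critic_value`, the linear field `W`) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1  # toy weights a slow outer loop would train

def velocity(z, t):
    # Toy time-conditioned field; a real flow-matching critic would use a network.
    return np.tanh(W @ z) * (1.0 - t)

def critic_value(state, num_steps):
    # Fast inner loop: Euler-integrate the field over pseudo-time [0, 1].
    # num_steps is the test-time compute budget; more steps refine the estimate.
    z = state.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity(z, k * dt)
    return float(z.sum())  # scalar readout standing in for a value estimate

s = rng.normal(size=4)
coarse = critic_value(s, num_steps=2)   # cheap estimate
fine = critic_value(s, num_steps=32)    # more inner-loop compute
print(round(coarse, 4), round(fine, 4))
```

Because Euler integration converges as the step count grows, successive estimates cluster ever closer together; how to choose (or learn) the step budget per query is exactly the open allocation question flagged above.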

References

"But what is the right parameterization for critics and how to spend variable computation are open questions for critics."

What Does Flow Matching Bring To TD Learning?  (2603.04333 - Agrawalla et al., 4 Mar 2026) in Section 6, Discussion and Perspectives on Future Work