Theoretical guarantees for action chunking Q-learning with arbitrary off-policy data

Establish formal guarantees for Q-learning with action chunking critics trained on arbitrary off-policy datasets, not only on datasets collected by an action chunking policy. Specifically, characterize the conditions under which convergence and near-optimality hold, and quantify any bias that arises, without assuming that the data originates from an action chunking policy.

Background

Decoupled Q-Chunking (Li et al., 2025) studies Q-learning with action chunking critics, which estimate the value of a multi-step action sequence rather than of a single action, with the aim of reducing bootstrapping bias and accelerating value propagation. Existing analyses assume that the dataset is generated by an action chunking policy, which leaves unclear whether their guarantees extend to general off-policy data. The authors introduce an open-loop consistency condition and provide several bounds under it, but they explicitly note that comprehensive guarantees for arbitrary off-policy data remain open.

This open problem is central to understanding when chunked critics offer provable advantages over standard multi-step returns, and what assumptions on the data distribution are sufficient for convergence and near-optimality without restrictive data-collection requirements.
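
To make the object of study concrete, the following is a minimal sketch of the kind of backup target a chunked critic regresses onto, assuming the common form in which the rewards observed along a chunk of h actions are discounted and summed, and the critic's own estimate h steps later is added as the bootstrap term. The function name `chunked_td_target` and its arguments are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def chunked_td_target(
    rewards: Sequence[float],
    bootstrap_value: float,
    gamma: float,
) -> float:
    """Illustrative h-step target for a chunked critic (assumed form).

    rewards:         r_t, ..., r_{t+h-1}, observed while the chunk was executed
    bootstrap_value: critic estimate at the state reached after the chunk
    gamma:           per-step discount factor
    """
    h = len(rewards)
    # Discounted sum of the rewards collected inside the chunk.
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # Bootstrap once, h steps ahead, instead of after every single step.
    return discounted_rewards + (gamma ** h) * bootstrap_value


# Chunk of two unit rewards, gamma = 0.5, bootstrap estimate 10:
# 1 + 0.5 * 1 + 0.25 * 10 = 4.0
print(chunked_td_target([1.0, 1.0], 10.0, 0.5))
```

The arithmetic is the same as for a standard h-step return. The difference lies in what the critic conditions on: a chunked critic takes the state together with the whole action sequence as input, so the rewards in the target were produced by exactly the actions being evaluated. What this sketch does not settle is the open question itself, namely which guarantees survive when the logged action sequences were not produced by an action chunking policy.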

References

However, theoretical guarantees of action chunking Q-learning, especially on arbitrary off-policy data, are still an open problem as existing analysis (e.g., in \citet{li2025reinforcement}) only considers the case where the data is collected by an action chunking policy.

Decoupled Q-Chunking (2512.10926 - Li et al., 11 Dec 2025) in Section 1 (Introduction)