Hierarchical Entity-Centric RL (HECRL)
- HECRL is a two-level framework that integrates a low-level value-based agent with a high-level conditional diffusion model to decompose complex tasks.
- It generates sparse, entity-factored subgoals that simplify long-horizon, multi-entity manipulation tasks and improve reachability.
- Empirical evaluations demonstrate that HECRL outperforms baseline methods in multi-entity settings and exhibits strong zero-shot generalization.
Hierarchical Entity-Centric Reinforcement Learning (HECRL) is a modular, two-level framework for offline goal-conditioned reinforcement learning (GCRL), designed to address the combinatorial and temporal complexity of long-horizon manipulation tasks in domains populated by multiple interacting entities. HECRL decomposes the global goal into sparse, entity-factorized subgoals and integrates conditional diffusion-based subgoal generation with a value-based GCRL agent, yielding marked performance gains in sparse-reward, high-dimensional domains and enabling robust generalization with respect to increasing numbers and arrangements of entities (Haramati et al., 2 Feb 2026).
1. Two-Level Hierarchical Framework
HECRL employs a compositional structure comprising (1) a low-level value-based GCRL agent and (2) a high-level entity-factored subgoal diffusion generator.
Low-Level GCRL Agent:
The agent operates on a goal-conditioned Markov Decision Process whose goal space is factored like the state space and whose reward is strictly sparse (nonzero only upon goal achievement). Its architecture includes:
- A Q-network Q(s, a, g) trained on offline data.
- An implicit policy extracted using Deep Deterministic Policy Gradient with Behavioral Cloning (DDPG+BC).
- A value network V(s, g). The competence radius k defines the maximal subgoal distance (in environment steps) the agent can reliably traverse without significant temporal-difference (TD) error accumulation.
High-Level Subgoal Diffuser:
This component uses a conditional diffusion model p(s_g | s, g) which, given the current state s and final goal g, produces an intermediate subgoal s_g expected to be reachable within k steps. States and goals are represented in factored form s = (s^1, ..., s^N), g = (g^1, ..., g^N) over N entities, with a set-Transformer architecture employed to denoise individual entity subgoal factors, encouraging subgoals that typically modify only a few entities.
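To make the "few entities per subgoal" idea concrete, the illustrative sketch below hand-codes a sparse entity-factored subgoal: it copies the current per-entity factors and moves only the most-displaced entities toward their goal factors. This stands in for the learned set-Transformer diffuser, which the paper trains rather than hand-designs; the function and its `n_modify` parameter are hypothetical.

```python
import numpy as np

def sparse_entity_subgoal(state, goal, n_modify=1):
    """Illustrative entity-factored subgoal: copy the current state's
    per-entity factors and set only the n_modify entities farthest
    from their goal factors to those goal factors.
    state, goal: (N, d) arrays of per-entity features."""
    err = np.linalg.norm(state - goal, axis=-1)   # per-entity displacement
    idx = np.argsort(err)[-n_modify:]             # most-displaced entities
    subgoal = state.copy()
    subgoal[idx] = goal[idx]                      # modify only those factors
    return subgoal
```

The resulting subgoal differs from the current state in at most `n_modify` entity factors, which is the sparsity property the learned diffuser is encouraged to exhibit.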
2. Mathematical Formulation and Model Architectures
Factored Spaces and Reward Structure
- State space: S = S^1 × ⋯ × S^N, factored component-wise over N entities.
- Action space: continuous (e.g., gripper pose changes for robotic manipulation).
- Goal space and reward: G factored like S; r(s, g) = 0 if s achieves g, and −1 otherwise (strictly sparse).
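The factored spaces and strict success-indicator reward can be sketched directly; the per-entity tolerance `eps` below is an illustrative choice, not a value from the paper:

```python
import numpy as np

def sparse_reward(state, goal, eps=0.05):
    """Strictly sparse reward: 0 when every entity factor is within
    eps of its goal factor, -1 otherwise.
    state, goal: (N, d) arrays of per-entity features."""
    per_entity_err = np.linalg.norm(state - goal, axis=-1)  # (N,)
    return 0.0 if np.all(per_entity_err <= eps) else -1.0
```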
Value and Policy Objectives
- Value Loss: Expectile regression of V(s, g) toward target Q-values,
  L_V = E[ |τ − 1(Q̄(s, a, g) − V(s, g) < 0)| · (Q̄(s, a, g) − V(s, g))² ],
  with expectile τ ∈ (0.5, 1).
- Q-Function Loss: Regression to TD targets,
  L_Q = E[ (r(s, g) + γ V(s′, g) − Q(s, a, g))² ].
- Policy Extraction: DDPG+BC objective,
  L_π = E[ −Q(s, π(s, g), g) + α ‖π(s, g) − a‖² ],
  trading off value maximization against behavioral cloning with coefficient α.
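As a concrete reference, the three objectives can be written out in NumPy. This is a minimal sketch assuming the standard IQL expectile-regression and DDPG+BC forms; the `tau`, `gamma`, and `alpha` defaults are illustrative placeholders, not the paper's tuned values.

```python
import numpy as np

def expectile_loss(q_target, v, tau=0.7):
    """Asymmetric L2: weight tau when Q exceeds V, (1 - tau) otherwise."""
    diff = q_target - v
    weight = np.where(diff > 0, tau, 1 - tau)
    return np.mean(weight * diff ** 2)

def td_loss(q, r, v_next, done, gamma=0.99):
    """Q regresses to the sparse-reward TD target r + gamma * V(s', g)."""
    target = r + gamma * (1 - done) * v_next
    return np.mean((q - target) ** 2)

def ddpg_bc_loss(q_pi, pi_action, data_action, alpha=0.1):
    """Maximize Q of the policy's action while staying near the data action."""
    bc = np.mean(np.sum((pi_action - data_action) ** 2, axis=-1))
    return -np.mean(q_pi) + alpha * bc
```

In practice these would be computed on minibatches of transitions with learned networks standing in for the array arguments.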
Subgoal Diffusion Generation
- Training Data: Triplets (s_t, s_{t+k}, g) obtained by sampling a state s_t, a final goal g, and the intermediate subgoal s_{t+k} reached k steps later from offline trajectories.
- Diffusion Process: Forward noise process
  x_i = √ᾱ_i · x_0 + √(1 − ᾱ_i) · ε, ε ~ N(0, I),
  for diffusion step i = 1, …, T; the denoiser ε_θ is trained with the epsilon-prediction loss
  L_diff = E[ ‖ε − ε_θ(x_i, i, s, g)‖² ].
- Generation: At test time, T reverse denoising steps yield a batch of candidate subgoals per query.
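A minimal sketch of the forward noising step and the epsilon-prediction target is below. The linear beta schedule and its endpoints are common DDPM defaults assumed for illustration, not necessarily the paper's schedule:

```python
import numpy as np

def make_schedule(T=100, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule; returns cumulative alpha-bar products."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, i, alpha_bars, rng):
    """q(x_i | x_0): scale the clean subgoal and add Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    xi = np.sqrt(alpha_bars[i]) * x0 + np.sqrt(1.0 - alpha_bars[i]) * eps
    return xi, eps

def denoising_loss(eps_pred, eps):
    """Epsilon-prediction objective ||eps - eps_theta(x_i, i, s, g)||^2."""
    return np.mean((eps_pred - eps) ** 2)
```

The conditioning on (s, g) enters only through the denoiser network, which this sketch leaves abstract.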
3. Training Procedures and Test-Time Composition
Independent Training of Components
- RL Agent: Trained on an offline dataset of several million transitions, optimized with Adam, batch size 512, 2–3M updates, with discount γ, expectile τ, and DDPG+BC coefficient α as additional hyperparameters.
- Diffusion Subgoal Generator: Uses the same data, forming training examples with subgoal horizon k; 8-layer set-Transformer, hidden dim 256, 8 heads, T diffusion steps, 200K optimization steps.
Test-Time Subgoal Selection and Execution
- Every k environment steps, candidate subgoals are sampled from the diffusion model. Only those with V(s, s_g) ≥ δ (reachability threshold) are retained. Among these, the candidate maximizing the value toward the final goal, V(s_g, g), is selected; if no candidate improves over targeting g directly, the agent targets the final goal. The low-level policy acts toward the selected subgoal for one step, repeating until the global goal is achieved.
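The test-time procedure can be sketched end to end as a control loop. This is a simplified sketch: `diffuser`, `value_fn`, and `policy` are hypothetical stand-ins for the trained models, `delta` for the tuned threshold, and the fallback here triggers only when no candidate passes the reachability filter.

```python
import numpy as np

def hecrl_episode(env, diffuser, policy, value_fn, goal,
                  k=25, delta=0.3, max_steps=200):
    """Two-level loop: every k steps the high level proposes a subgoal;
    the low-level policy acts toward it. Returns True on success."""
    state = env.reset()
    subgoal = goal
    for t in range(max_steps):
        if t % k == 0:  # re-plan at the subgoal horizon
            candidates = diffuser(state, goal)  # sampled candidate subgoals
            reachable = [c for c in candidates if value_fn(state, c) >= delta]
            if reachable:
                # pick the reachable candidate closest (in value) to the goal
                subgoal = max(reachable, key=lambda c: value_fn(c, goal))
            else:
                subgoal = goal  # fall back to targeting the final goal
        state, done = env.step(policy(state, subgoal))
        if done:
            return True
    return False
```

With a toy 1D environment and distance-based value function, the loop chains intermediate subgoals until the final goal is reached.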
4. Empirical Evaluation and Benchmarks
HECRL is evaluated on a suite of sparse-reward, multi-entity manipulation tasks:
- PPP-Cube: Pick/place/push with 3 cubes, using state or dual-view RGB observations.
- Stack-Cube: Block stacking.
- Scene: Drawer, window, button, and cube entities.
- Push-Tetris: 2D block pushing to target positions and orientations.
Performance Table (mean ± std over 4 seeds):
| Task | EC-SGIQL | EC-IQL | EC-Diffuser | HIQL | IQL |
|---|---|---|---|---|---|
| PPP³-State | 82.5±3.1 | 51.5±4.4 | 44.8±6.7 | 48.3±7.3 | 34.3±4.9 |
| Stack³-State | 43.5±1.9 | 29.0±2.9 | 43.8±9.2 | 0.0±0.0 | 19.3±3.0 |
| PPP³-Image | 64.3±4.9 | 25.0±5.7 | 0.3±0.5 | 0.0±0.0 | 0.0±0.0 |
| Scene-Image | 61.5±5.9 | 53.0±5.5 | 3.3±2.5 | 8.3±1.3 | 17.5±2.7 |
| Push-Tetris (cov) | 61.4±3.3 | 31.6±1.3 | 7.9±0.5 | 5.2±0.8 | 3.4±0.8 |
Zero-Shot Generalization:
A model trained on 3 cubes in PPP-Cube generalizes zero-shot to 4/5/6 cubes, with EC-SGIQL retaining substantially higher success rates than EC-IQL.
5. Ablation Studies and Analytical Findings
- Subgoal Selection: Ablations demonstrate that both value-threshold filtering and value-guided selection are necessary for high success rates, and that diffusion-based sampling outperforms deterministic AWR-based high-level policies. The ablated variants are:
  - Max-Value (value-guided selection without the reachability filter),
  - Random-Sample (reachability filtering without value guidance),
  - AWR-based (deterministic high level),
  each of which underperforms the full method.
- Factor Sparsity: Entity-centric diffusion yields subgoals that modify only a small number of cubes per step on average, fewer than the AWR-based variant, promoting local reachability and efficient decomposition.
- Hyperparameter Robustness:
  - Reachability threshold δ: robust across a broad range of values.
  - Subgoal horizon k: optimal around 25–50 steps.
  - Number of sampled candidates: performance peaks over a wide range.
6. Implementation Details and Practical Considerations
Model Architectures:
- Low-level: 4-layer MLPs (state-based) or 3-layer Entity Interaction Transformer (hidden dim=256, heads=8).
- Subgoal diffuser: 8-layer set-Transformer, hidden dim=256, heads=8.
- Vision encoders: DLPv2 (latent particles), VQ-VAE (codebook size 2048, embedding dim=16).
Training Regime:
- RL: 2.5–3M gradient steps, Adam, batch 512, on 4–8 A100 GPUs.
- Diffusion: 200K steps, Adam, batch 256.
Compute:
All stages—including encoder pretraining, RL, and diffusion—fit within a single 8-GPU node.
7. Summary of Approach and Ongoing Directions
HECRL's entity-centric, hierarchical architecture enables scalable RL in environments where direct goal-reaching is difficult due to entity combinatorics and sparse rewards. Its factored diffusion subgoal generator, composed modularly with any value-based GCRL agent, empirically achieves substantially higher success on benchmark image-based multi-entity manipulation tasks, demonstrates zero-shot scalability with increasing entity counts, and shows minimal sensitivity to key hyperparameters. Future research directions include automatic inference of the optimal subgoal horizon or competence threshold from data, hybridization of diffusion and value-based guidance for subgoal optimality, and extensions to real-world robotics as structured object-centric video representations mature. All code, datasets, and pretrained models are publicly available (Haramati et al., 2 Feb 2026).