
Hierarchical Entity-Centric RL (HECRL)

Updated 5 February 2026
  • HECRL is a two-level framework that integrates a low-level value-based agent with a high-level conditional diffusion model to decompose complex tasks.
  • It generates sparse, entity-factored subgoals that simplify long-horizon, multi-entity manipulation tasks and improve reachability.
  • Empirical evaluations demonstrate that HECRL outperforms baseline methods in multi-entity settings and exhibits strong zero-shot generalization.

Hierarchical Entity-Centric Reinforcement Learning (HECRL) is a modular, two-level framework for offline goal-conditioned reinforcement learning (GCRL), designed to address the combinatorial and temporal complexity of long-horizon manipulation tasks in domains populated by multiple interacting entities. HECRL decomposes the global goal into sparse, entity-factorized subgoals and integrates conditional diffusion-based subgoal generation with a value-based GCRL agent, yielding marked performance gains in sparse-reward, high-dimensional domains and enabling robust generalization with respect to increasing numbers and arrangements of entities (Haramati et al., 2 Feb 2026).

1. Two-Level Hierarchical Framework

HECRL employs a compositional structure comprising (1) a low-level value-based GCRL agent and (2) a high-level entity-factored subgoal diffusion generator.

Low-Level GCRL Agent:

The agent operates on a goal-conditioned Markov Decision Process $(\mathcal{S}, \mathcal{A}, \mu, p, r)$, where the goal space is $\mathcal{G} = \mathcal{S}$ and the reward is sparse ($r(s, g) = 1_{s=g} - 1$). Its architecture includes:

  • A Q-network $Q_\phi(s,a,g)$ trained on offline data.
  • An implicit policy $\pi(a|s,g)$ extracted via Deep Deterministic Policy Gradient with Behavioral Cloning (DDPG+BC).
  • A value network $V_\psi(s,g) = \max_a Q_\phi(s,a,g)$. The competence radius $R^{V_\pi}$ defines the maximal subgoal distance the agent can reliably traverse without significant temporal-difference (TD) error accumulation.

High-Level Subgoal Diffuser:

This component uses a conditional diffusion model $\mathcal{D}_\theta : (s,g) \rightarrow \tilde{g}$ which, given the current state and final goal, produces an intermediate subgoal $\tilde{g}$ expected to be reachable within $K$ steps. States and goals are represented in factored form $s = (s_1,\ldots,s_M)$, $g = (g_1,\ldots,g_M)$, and a set-Transformer architecture denoises individual entity subgoal factors, encouraging subgoals that typically modify only a few entities.

2. Mathematical Formulation and Model Architectures

Factored Spaces and Reward Structure

  • State space: $\mathcal{S} = \mathcal{S}_1 \times \mathcal{S}_2 \times \dots \times \mathcal{S}_M$, structured component-wise.
  • Action space: $\mathcal{A}$ (e.g., gripper pose changes for robotic manipulation).
  • Goal space and reward: $\mathcal{G} = \mathcal{S}$; $r(s, g) = 0$ if $s = g$ and $-1$ otherwise (strictly sparse).
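As a minimal sketch, the strictly sparse reward can be written as follows; the tolerance-based equality test is an assumption for continuous states (exact matching would apply to discrete representations):

```python
import numpy as np

def sparse_reward(s, g, atol=1e-6):
    # r(s, g) = 0 when the (factored) state matches the goal, -1 otherwise.
    # atol is an illustrative tolerance, not a value from the paper.
    return 0.0 if np.allclose(s, g, atol=atol) else -1.0
```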

Value and Policy Objectives

$$L_V(\psi) = \mathbb{E}_{(s, a, s', g) \sim D}\left[|V_\psi(s, g) - y|_\tau^2\right], \quad y = r(s,g) + \gamma V_\psi(s',g),$$

with $|x|^2_\tau = |\tau - 1_{x < 0}|\, x^2$ (expectile regression).

  • Q-Function Loss: Regression to TD targets,

$$L_Q(\phi) = \mathbb{E}_{(s,a,s',g)}\left[(Q_\phi(s,a,g) - (r + \gamma V_\psi(s',g)))^2\right]$$

  • Policy Extraction: DDPG+BC objective,

$$L_\pi = \mathbb{E}_{(s, a, g)}\left[-Q_\phi(s,\pi_\theta(s,g),g) + \alpha \|\pi_\theta(s, g) - a\|^2\right]$$
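The expectile loss and DDPG+BC objective above can be sketched in NumPy; `Q` is a stand-in callable (not the paper's network), and the scalar interface is an illustrative assumption:

```python
import numpy as np

def expectile_loss(x, tau=0.9):
    # |tau - 1{x<0}| * x^2: with tau > 0.5, negative residuals are
    # down-weighted, pushing V toward an upper expectile of the TD target.
    x = np.asarray(x)
    return np.abs(tau - (x < 0)) * x ** 2

def ddpg_bc_loss(Q, s, g, pi_action, data_action, alpha=0.1):
    # -Q(s, pi(s,g), g) + alpha * ||pi(s,g) - a||^2: value maximization
    # regularized toward the behavior action (the DDPG+BC objective).
    diff = np.asarray(pi_action) - np.asarray(data_action)
    return -Q(s, pi_action, g) + alpha * np.sum(diff ** 2)
```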

Subgoal Diffusion Generation

  • Training Data: Triplets $(s, \tilde{g}, g)$ obtained by sampling $s = s_t$, final goal $g = s_{t_g}$, and intermediate subgoal $\tilde{g} = s_{\min(t+K, t_g)}$ from offline trajectories.
  • Diffusion Process: Forward noise process

$$q(\tilde{g}^0) = \text{data}, \quad q(\tilde{g}^\tau \mid \tilde{g}^{\tau-1}) = \mathcal{N}\left(\sqrt{1-\beta_\tau}\,\tilde{g}^{\tau-1}, \beta_\tau I\right)$$

for $\tau = 1, \dots, T$; the denoiser $\varepsilon_\theta$ is trained with

$$L_\text{diff}(\theta) = \mathbb{E}_{s,g,\tilde{g},\tau,\varepsilon}\left[\|\varepsilon - \varepsilon_\theta(\tilde{g}^\tau, s, g, \tau)\|^2\right]$$

  • Generation: At test time, $T$ reverse steps yield $N$ candidate subgoals per query.
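A minimal sketch of the high-level training-data pipeline, under assumed array shapes (one trajectory of states indexed by time); the closed-form forward marginal used in `q_sample` follows from composing the Gaussian steps above:

```python
import numpy as np

def sample_triplet(traj, K=50, rng=None):
    # Sample (s, g~, g) = (s_t, s_{min(t+K, t_g)}, s_{t_g}) from a trajectory.
    rng = rng or np.random.default_rng()
    t = int(rng.integers(0, len(traj) - 1))
    t_g = int(rng.integers(t + 1, len(traj)))
    return traj[t], traj[min(t + K, t_g)], traj[t_g]

def q_sample(g0, betas, tau, rng):
    # Closed-form forward marginal: g^tau = sqrt(alpha_bar) g^0
    # + sqrt(1 - alpha_bar) eps, with alpha_bar = prod_{i<=tau} (1 - beta_i).
    alpha_bar = np.prod(1.0 - np.asarray(betas)[:tau])
    eps = rng.standard_normal(np.shape(g0))
    return np.sqrt(alpha_bar) * g0 + np.sqrt(1.0 - alpha_bar) * eps, eps
```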

3. Training Procedures and Test-Time Composition

Independent Training of Components

  • RL Agent: Dataset $D$ with $\sim 3$M transitions, trained with the Adam optimizer (learning rate $3 \times 10^{-4}$), batch size $512$, $2$–$3$M updates, $\gamma = 0.99$, expectile $\tau = 0.9$, DDPG+BC $\alpha \in [0.05, 0.3]$.
  • Diffusion Subgoal Generator: Uses the same dataset $D$ for $(s, \tilde{g}, g)$ examples with $K = 50$; 8-layer set-Transformer, hidden dim $256$, $8$ heads, $T = 10$ diffusion steps, $200$K optimization steps.

Test-Time Subgoal Selection and Execution

  • Every $T_\text{sg}$ steps, $N$ subgoal candidates are sampled from $\mathcal{D}_\theta(s_t, g)$. Only those with $V_\psi(s_t, \tilde{g}_j) > \hat{R}$ (a reachability threshold) are retained. Among these, the candidate maximizing $V_\psi(\tilde{g}_j, g)$ is selected; if none improves on $V_\psi(s_t, g)$, the agent targets the final goal directly. The low-level policy acts toward the selected subgoal for one step, repeating until the global goal is achieved.
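The filter-then-guide selection rule can be sketched as follows, with `V` a stand-in callable for the learned value network $V_\psi$ and scalar "states" used purely for illustration:

```python
def select_subgoal(V, s, g, candidates, R_hat):
    # Filter: keep candidates the agent can reliably reach from s
    # (value above the reachability threshold R_hat).
    reachable = [c for c in candidates if V(s, c) > R_hat]
    if not reachable:
        return g  # no reachable subgoal: target the final goal directly
    # Guide: pick the reachable candidate closest to g under the value fn.
    best = max(reachable, key=lambda c: V(c, g))
    # Accept only if it improves on heading straight for the goal.
    return best if V(best, g) > V(s, g) else g
```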

4. Empirical Evaluation and Benchmarks

HECRL is evaluated on a suite of sparse-reward, multi-entity manipulation tasks:

  • PPP-Cube: Pick/place/push with 3 cubes, using state or dual-view RGB observations.
  • Stack-Cube: Block stacking.
  • Scene: Drawer, window, button, and cube entities.
  • Push-Tetris: 2D block pushing to target positions and orientations.

Performance (mean $\pm$ std, 4 seeds):

| Task | EC-SGIQL | EC-IQL | EC-Diffuser | HIQL | IQL |
|---|---|---|---|---|---|
| PPP³-State | 82.5±3.1 | 51.5±4.4 | 44.8±6.7 | 48.3±7.3 | 34.3±4.9 |
| Stack³-State | 43.5±1.9 | 29.0±2.9 | 43.8±9.2 | 0.0±0.0 | 19.3±3.0 |
| PPP³-Image | 64.3±4.9 | 25.0±5.7 | 0.3±0.5 | 0.0±0.0 | 0.0±0.0 |
| Scene-Image | 61.5±5.9 | 53.0±5.5 | 3.3±2.5 | 8.3±1.3 | 17.5±2.7 |
| Push-Tetris (cov) | 61.4±3.3 | 31.6±1.3 | 7.9±0.5 | 5.2±0.8 | 3.4±0.8 |

Zero-Shot Generalization:

PPP-Cube trained on 3 cubes generalizes to 4/5/6 cubes, with EC-SGIQL achieving $65.3\%/49.0\%/25.7\%$ success versus EC-IQL's $31.8\%/19.3\%/10.5\%$.

5. Ablation Studies and Analytical Findings

  • Subgoal Selection: Ablations demonstrate that both value-threshold filtering and value-guided selection are necessary for high success rates; diffusion-based sampling outperforms AWR-based deterministic high-level policies.
    • Max-Value (no reachability filter): $76.3\%$
    • Random-Sample (no $V$-guidance): $73.0\%$
    • AWR-based: $67.8\%$
  • Factor Sparsity: Entity-centric diffusion yields subgoals modifying an average of $1.36 \pm 0.01$ cubes per step (vs. $2.96 \pm 0.01$ for AWR), promoting local reachability and efficient decomposition.
  • Hyperparameter Robustness:
    • $K \in \{10, 25, 50, 100\}$: Robust provided $K \geq R^{V_\pi}$ ($\approx 50$).
    • $T_\text{sg} \in \{10, \dots, 100\}$: Optimal around $25$–$50$.
    • $N \in \{16, 64, 256, 1024\}$: Peak performance for $N = 64$–$256$.
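The factor-sparsity metric above (entities modified per subgoal) can be computed with a simple tolerance-based count; the `[M, d]` entity layout and tolerance are assumptions for illustration:

```python
import numpy as np

def n_entities_modified(s, g_tilde, atol=1e-3):
    # s, g_tilde: [M, d] arrays of M entity factors. Count factors whose
    # subgoal value differs from the current state beyond tolerance.
    changed = ~np.all(np.isclose(np.asarray(s), np.asarray(g_tilde), atol=atol), axis=-1)
    return int(np.sum(changed))
```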

6. Implementation Details and Practical Considerations

Model Architectures:

  • Low-level: 4-layer MLPs (state-based) or 3-layer Entity Interaction Transformer (hidden dim=256, heads=8).
  • Subgoal diffuser: 8-layer set-Transformer, hidden dim=256, heads=8.
  • Vision encoders: DLPv2 (latent particles), VQ-VAE (codebook size 2048, embedding dim=16).

Training Regime:

  • RL: $2.5$–$3$M gradient steps, Adam ($3\times10^{-4}$), batch size $512$, on $4$–$8$ A100 GPUs ($\sim 48$ hours).
  • Diffusion: $200$K steps, Adam ($3\times10^{-4}$), batch size $256$ ($\sim 24$ hours).

Compute:

All stages—including encoder pretraining, RL, and diffusion—fit within a single 8-GPU node.

7. Summary of Approach and Ongoing Directions

HECRL's entity-centric, hierarchical architecture enables scalable RL in environments where direct goal-reaching is difficult due to entity combinatorics and sparse rewards. Its factored diffusion subgoal generator, composed modularly with any value-based GCRL agent, empirically achieves over $150\%$ higher success on benchmark image-based multi-entity manipulation tasks, demonstrates zero-shot scalability with increasing entity counts, and shows minimal sensitivity to key hyperparameters. Future directions include automatic inference of the optimal subgoal horizon $K$ or competence threshold $\hat{R}$ from data, hybridizing diffusion- and value-based guidance for subgoal optimality, and extending to real-world robotics as structured object-centric video representations mature. All code, datasets, and pretrained models are publicly available (Haramati et al., 2 Feb 2026).
