
Hierarchical Entity-Centric RL (HECRL)

Updated 5 February 2026
  • HECRL is a two-level framework that integrates a low-level value-based agent with a high-level conditional diffusion model to decompose complex tasks.
  • It generates sparse, entity-factored subgoals that simplify long-horizon, multi-entity manipulation tasks and improve reachability.
  • Empirical evaluations demonstrate that HECRL outperforms baseline methods in multi-entity settings and exhibits strong zero-shot generalization.

Hierarchical Entity-Centric Reinforcement Learning (HECRL) is a modular, two-level framework for offline goal-conditioned reinforcement learning (GCRL), designed to address the combinatorial and temporal complexity of long-horizon manipulation tasks in domains populated by multiple interacting entities. HECRL decomposes the global goal into sparse, entity-factorized subgoals and integrates conditional diffusion-based subgoal generation with a value-based GCRL agent, yielding marked performance gains in sparse-reward, high-dimensional domains and enabling robust generalization with respect to increasing numbers and arrangements of entities (Haramati et al., 2 Feb 2026).

1. Two-Level Hierarchical Framework

HECRL employs a compositional structure comprising (1) a low-level value-based GCRL agent and (2) a high-level entity-factored subgoal diffusion generator.

Low-Level GCRL Agent:

The agent operates on a goal-conditioned Markov Decision Process $(\mathcal{S}, \mathcal{A}, \mu, p, r)$, where the goal space is $\mathcal{G} = \mathcal{S}$ and the reward is sparse ($r(s, g) = 1_{s=g} - 1$). Its architecture includes:

  • A Q-network $Q_\phi(s,a,g)$ trained on offline data.
  • An implicit policy $\pi(a|s,g)$ extracted via Deep Deterministic Policy Gradient with Behavioral Cloning (DDPG+BC).
  • A value network $V_\psi(s,g) = \max_a Q_\phi(s,a,g)$. The competence radius $R^{V_\pi}$ defines the maximal subgoal distance the agent can reliably traverse without significant temporal-difference (TD) error accumulation.

High-Level Subgoal Diffuser:

This component uses a conditional diffusion model $\mathcal{D}_\theta : (s,g) \rightarrow \tilde{g}$ which, given the current state and final goal, produces an intermediate subgoal $\tilde{g}$ expected to be reachable within $K$ steps. States and goals are represented in factored form $s = (s_1,\ldots,s_M)$, $g = (g_1,\ldots,g_M)$, and a set-Transformer architecture denoises individual entity subgoal factors, encouraging subgoals that typically modify only a few entities.

2. Mathematical Formulation and Model Architectures

Factored Spaces and Reward Structure

  • State space: $\mathcal{S} = \mathcal{S}_1 \times \mathcal{S}_2 \times \dots \times \mathcal{S}_M$, structured component-wise.
  • Action space: $\mathcal{A}$ (e.g., gripper pose changes for robotic manipulation).
  • Goal space and reward: $\mathcal{G} = \mathcal{S}$; $r(s, g) = 0$ if $s = g$ and $-1$ otherwise (strictly sparse).
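As a minimal sketch, the strictly sparse reward can be written as follows; the tolerance-based equality test is an assumption for continuous states (exact matching would apply to discrete representations):

```python
import numpy as np

def sparse_reward(s, g, atol=1e-6):
    # r(s, g) = 0 when the (factored) state matches the goal, -1 otherwise.
    # atol is an illustrative tolerance, not a value from the paper.
    return 0.0 if np.allclose(s, g, atol=atol) else -1.0
```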

Value and Policy Objectives

$$L_V(\psi) = \mathbb{E}_{(s, a, s', g) \sim D}\left[|V_\psi(s, g) - y|_\tau^2\right], \quad y = r(s,g) + \gamma V_\psi(s',g),$$

with $|x|^2_\tau = |\tau - 1_{x < 0}|\, x^2$ (expectile regression).

  • Q-Function Loss: Regression to TD targets,

$$L_Q(\phi) = \mathbb{E}_{(s,a,s',g)}\left[(Q_\phi(s,a,g) - (r + \gamma V_\psi(s',g)))^2\right]$$

  • Policy Extraction: DDPG+BC objective,

$$L_\pi = \mathbb{E}_{(s, a, g)}\left[-Q_\phi(s,\pi_\theta(s,g),g) + \alpha \|\pi_\theta(s, g) - a\|^2\right]$$
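The expectile loss and DDPG+BC objective above can be sketched in NumPy; `Q` is a stand-in callable (not the paper's network), and the scalar interface is an illustrative assumption:

```python
import numpy as np

def expectile_loss(x, tau=0.9):
    # |tau - 1{x<0}| * x^2: with tau > 0.5, negative residuals are
    # down-weighted, pushing V toward an upper expectile of the TD target.
    x = np.asarray(x)
    return np.abs(tau - (x < 0)) * x ** 2

def ddpg_bc_loss(Q, s, g, pi_action, data_action, alpha=0.1):
    # -Q(s, pi(s,g), g) + alpha * ||pi(s,g) - a||^2: value maximization
    # regularized toward the behavior action (the DDPG+BC objective).
    diff = np.asarray(pi_action) - np.asarray(data_action)
    return -Q(s, pi_action, g) + alpha * np.sum(diff ** 2)
```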

Subgoal Diffusion Generation

  • Training Data: Triplets $(s, \tilde{g}, g)$ obtained by sampling $s = s_t$, final goal $g = s_{t_g}$, and intermediate subgoal $\tilde{g} = s_{\min(t+K, t_g)}$ from offline trajectories.
  • Diffusion Process: Forward noise process

$$q(\tilde{g}^0) = \text{data}, \quad q(\tilde{g}^\tau \mid \tilde{g}^{\tau-1}) = \mathcal{N}\left(\sqrt{1-\beta_\tau}\,\tilde{g}^{\tau-1}, \beta_\tau I\right)$$

for $\tau = 1, \dots, T$; the denoiser $\varepsilon_\theta$ is trained with

$$L_\text{diff}(\theta) = \mathbb{E}_{s,g,\tilde{g},\tau,\varepsilon}\left[\|\varepsilon - \varepsilon_\theta(\tilde{g}^\tau, s, g, \tau)\|^2\right]$$

  • Generation: At test time, $T$ reverse steps yield $N$ candidate subgoals per query.
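A minimal sketch of the high-level training-data pipeline, under assumed array shapes (one trajectory of states indexed by time); the closed-form forward marginal used in `q_sample` follows from composing the Gaussian steps above:

```python
import numpy as np

def sample_triplet(traj, K=50, rng=None):
    # Sample (s, g~, g) = (s_t, s_{min(t+K, t_g)}, s_{t_g}) from a trajectory.
    rng = rng or np.random.default_rng()
    t = int(rng.integers(0, len(traj) - 1))
    t_g = int(rng.integers(t + 1, len(traj)))
    return traj[t], traj[min(t + K, t_g)], traj[t_g]

def q_sample(g0, betas, tau, rng):
    # Closed-form forward marginal: g^tau = sqrt(alpha_bar) g^0
    # + sqrt(1 - alpha_bar) eps, with alpha_bar = prod_{i<=tau} (1 - beta_i).
    alpha_bar = np.prod(1.0 - np.asarray(betas)[:tau])
    eps = rng.standard_normal(np.shape(g0))
    return np.sqrt(alpha_bar) * g0 + np.sqrt(1.0 - alpha_bar) * eps, eps
```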

3. Training Procedures and Test-Time Composition

Independent Training of Components

  • RL Agent: Dataset $D$ with $\sim 3$M transitions, trained with the Adam optimizer (learning rate $3 \times 10^{-4}$), batch size $512$, $2$–$3$M updates, $\gamma = 0.99$, expectile $\tau = 0.9$, DDPG+BC $\alpha \in [0.05, 0.3]$.
  • Diffusion Subgoal Generator: Uses the same dataset $D$ for $(s, \tilde{g}, g)$ examples with $K = 50$; 8-layer set-Transformer, hidden dim $256$, $8$ heads, $T = 10$ diffusion steps, $200$K optimization steps.

Test-Time Subgoal Selection and Execution

  • Every $T_\text{sg}$ steps, $N$ subgoal candidates are sampled from $\mathcal{D}_\theta(s_t, g)$. Only those with $V_\psi(s_t, \tilde{g}_j) > \hat{R}$ (a reachability threshold) are retained. Among these, the candidate maximizing $V_\psi(\tilde{g}_j, g)$ is selected; if none improves on $V_\psi(s_t, g)$, the agent targets the final goal directly. The low-level policy acts toward the selected subgoal for one step, repeating until the global goal is achieved.
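The filter-then-guide selection rule can be sketched as follows, with `V` a stand-in callable for the learned value network $V_\psi$ and scalar "states" used purely for illustration:

```python
def select_subgoal(V, s, g, candidates, R_hat):
    # Filter: keep candidates the agent can reliably reach from s
    # (value above the reachability threshold R_hat).
    reachable = [c for c in candidates if V(s, c) > R_hat]
    if not reachable:
        return g  # no reachable subgoal: target the final goal directly
    # Guide: pick the reachable candidate closest to g under the value fn.
    best = max(reachable, key=lambda c: V(c, g))
    # Accept only if it improves on heading straight for the goal.
    return best if V(best, g) > V(s, g) else g
```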

4. Empirical Evaluation and Benchmarks

HECRL is evaluated on a suite of sparse-reward, multi-entity manipulation tasks:

  • PPP-Cube: Pick/place/push with 3 cubes, using state or dual-view RGB observations.
  • Stack-Cube: Block stacking.
  • Scene: Drawer, window, button, and cube entities.
  • Push-Tetris: 2D block pushing to target positions and orientations.

Performance (mean $\pm$ std, 4 seeds):

| Task | EC-SGIQL | EC-IQL | EC-Diffuser | HIQL | IQL |
|---|---|---|---|---|---|
| PPP³-State | 82.5±3.1 | 51.5±4.4 | 44.8±6.7 | 48.3±7.3 | 34.3±4.9 |
| Stack³-State | 43.5±1.9 | 29.0±2.9 | 43.8±9.2 | 0.0±0.0 | 19.3±3.0 |
| PPP³-Image | 64.3±4.9 | 25.0±5.7 | 0.3±0.5 | 0.0±0.0 | 0.0±0.0 |
| Scene-Image | 61.5±5.9 | 53.0±5.5 | 3.3±2.5 | 8.3±1.3 | 17.5±2.7 |
| Push-Tetris (cov) | 61.4±3.3 | 31.6±1.3 | 7.9±0.5 | 5.2±0.8 | 3.4±0.8 |

Zero-Shot Generalization:

PPP-Cube trained on 3 cubes generalizes to 4/5/6 cubes, with EC-SGIQL achieving $65.3\%/49.0\%/25.7\%$ success versus EC-IQL's $31.8\%/19.3\%/10.5\%$.

5. Ablation Studies and Analytical Findings

  • Subgoal Selection: Ablations demonstrate that both value-threshold filtering and value-guided selection are necessary for high success rates; diffusion-based sampling outperforms AWR-based deterministic high-level policies.
    • Max-Value (no reachability filter): $76.3\%$
    • Random-Sample (no $V$-guidance): $73.0\%$
    • AWR-based: $67.8\%$
  • Factor Sparsity: Entity-centric diffusion yields subgoals modifying an average of $1.36 \pm 0.01$ cubes per step (vs. $2.96 \pm 0.01$ for AWR), promoting local reachability and efficient decomposition.
  • Hyperparameter Robustness:
    • $K \in \{10, 25, 50, 100\}$: Robust provided $K \geq R^{V_\pi}$ ($\approx 50$).
    • $T_\text{sg} \in \{10, \dots, 100\}$: Optimal around $25$–$50$.
    • $N \in \{16, 64, 256, 1024\}$: Peak performance for $N = 64$–$256$.
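The factor-sparsity metric above (entities modified per subgoal) can be computed with a simple tolerance-based count; the `[M, d]` entity layout and tolerance are assumptions for illustration:

```python
import numpy as np

def n_entities_modified(s, g_tilde, atol=1e-3):
    # s, g_tilde: [M, d] arrays of M entity factors. Count factors whose
    # subgoal value differs from the current state beyond tolerance.
    changed = ~np.all(np.isclose(np.asarray(s), np.asarray(g_tilde), atol=atol), axis=-1)
    return int(np.sum(changed))
```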

6. Implementation Details and Practical Considerations

Model Architectures:

  • Low-level: 4-layer MLPs (state-based) or 3-layer Entity Interaction Transformer (hidden dim=256, heads=8).
  • Subgoal diffuser: 8-layer set-Transformer, hidden dim=256, heads=8.
  • Vision encoders: DLPv2 (latent particles), VQ-VAE (codebook size 2048, embedding dim=16).

Training Regime:

  • RL: $2.5$–$3$M gradient steps, Adam ($3\times10^{-4}$), batch size $512$, on $4$–$8$ A100 GPUs ($\sim 48$ hours).
  • Diffusion: $200$K steps, Adam ($3\times10^{-4}$), batch size $256$ ($\sim 24$ hours).

Compute:

All stages—including encoder pretraining, RL, and diffusion—fit within a single 8-GPU node.

7. Summary of Approach and Ongoing Directions

HECRL's entity-centric, hierarchical architecture enables scalable RL in environments where direct goal-reaching is difficult due to entity combinatorics and sparse rewards. Its factored diffusion subgoal generator, composed modularly with any value-based GCRL agent, empirically achieves over $150\%$ higher success on benchmark image-based multi-entity manipulation tasks, demonstrates zero-shot scalability with increasing entity counts, and shows minimal sensitivity to key hyperparameters. Future directions include automatic inference of the optimal subgoal horizon $K$ or competence threshold $\hat{R}$ from data, hybridizing diffusion- and value-based guidance for subgoal optimality, and extending to real-world robotics as structured object-centric video representations mature. All code, datasets, and pretrained models are publicly available (Haramati et al., 2 Feb 2026).
