Hierarchical Inverse Q-Learning Framework
- Hierarchical inverse Q-learning frameworks are advanced IRL methods that incorporate latent options and intention segments to model structured expert behavior.
- They utilize semi-MDP formulations, EM algorithms, and Bayesian inference to jointly infer latent rewards and optimal high-level policies.
- Empirical results in grid worlds, web navigation, and transfer learning show improved reward recovery and policy interpretability compared to flat IRL.
A hierarchical inverse Q-learning framework is a class of inverse reinforcement learning (IRL) methodologies that generalize the inverse Q-learning paradigm to settings where expert behavior exhibits temporal abstraction, latent intentions, or explicit option structure. These frameworks introduce additional levels of latent variables (such as options, intentions, or segment boundaries) and associated Q- or value-function computations that reflect the structure of hierarchical or multi-step expert planning. They are distinguished from classical (flat) inverse Q-learning by incorporating either option-based semi-Markov models, latent phase segmentation, or bi-level policy/evaluation architectures. Such frameworks combine advances in the options framework, reward identifiability, and Bayesian inference to capture and infer the hierarchical planning strategies, structure, and reward mechanisms underlying complex expert or animal behavior.
1. Generative Models and Hierarchical Structure
Hierarchical inverse Q-learning frameworks rest on the assumption that expert agents utilize temporally extended or hierarchical policies. The underlying environment is formalized as a Markov decision process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$. Hierarchical models posit an additional set of options $\mathcal{O}$, where each option $o \in \mathcal{O}$ is defined by its initiation set $I_o \subseteq \mathcal{S}$, internal policy $\pi_o$, and stochastic termination function $\beta_o : \mathcal{S} \to [0, 1]$. Crucially, primitive actions are treated as degenerate single-step options, ensuring consistency with the standard MDP structure (Cundy et al., 2018, Hwang et al., 2019).
Another approach, as in multi-intention IRL, models an agent as exhibiting discrete intention phases: the agent’s trajectory is segmented into contiguous regions, each corresponding to a different latent reward and associated policy (Zhu et al., 2023). This segmentation forms a two-level structure: lower-level Q-functions $Q_k$ parameterized by the intention index $k$, and a higher-level segmentation variable $z_t$ (phase index or option) assigning each time step to an intention.
The generative story for such frameworks is typically: at each (high-level) step, the agent chooses an option (or intention segment), executes its corresponding primitive policy until termination, then repeats. The observed state-action trajectory reveals only low-level actions, while the hierarchical decompositions are latent.
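The generative story above can be sketched in a few lines. Everything here is a toy illustration under assumed dynamics and options (a 5-state chain with two hypothetical navigation options), not any paper's model: sample an option, run its internal policy until termination, repeat, and note that only the (state, action) pairs would be observed.

```python
import random

random.seed(0)

def step(state, action):
    # Deterministic 5-state chain: move left/right, clipped to [0, 4].
    return max(0, min(4, state + (1 if action == "right" else -1)))

# Hypothetical options: an internal policy plus a termination predicate.
OPTIONS = {
    "go_right": {"policy": lambda s: "right", "term": lambda s: s == 4},
    "go_left":  {"policy": lambda s: "left",  "term": lambda s: s == 0},
}

def generate_trajectory(state, n_options=3):
    observed, latent = [], []
    for _ in range(n_options):
        name = random.choice(sorted(OPTIONS))   # latent high-level choice
        latent.append(name)
        opt = OPTIONS[name]
        while True:
            action = opt["policy"](state)
            observed.append((state, action))    # only this part is observed
            state = step(state, action)
            if opt["term"](state):
                break
    return observed, latent

observed, latent = generate_trajectory(state=2)
```

The inference problem is the reverse direction: given only `observed`, recover both the rewards and the latent option boundaries.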
2. Likelihoods, Posterior Inference, and Marginalization
The central inference challenge is to compute, for observed trajectories $\mathcal{D} = \{\tau_1, \dots, \tau_N\}$, the likelihood of the observed data under the hierarchical model and then infer latent reward parameters and structural variables.
In the Bayesian Inverse Hierarchical RL (BIHRL) framework (Cundy et al., 2018), the full posterior is

$$P(R, \mathcal{O} \mid \mathcal{D}) \;\propto\; P(\mathcal{D} \mid R, \mathcal{O})\, P(R)\, P(\mathcal{O}),$$

where the hierarchical likelihood

$$P(\mathcal{D} \mid R, \mathcal{O}) \;=\; \prod_{\tau \in \mathcal{D}} \sum_{h \in H(\tau)} P(h \mid R, \mathcal{O})$$

integrates over all option-trajectory decompositions $H(\tau)$ consistent with the observed low-level state-action sequence, with option selection modeled by Boltzmann policies over options, $\pi(o \mid s) \propto \exp(\eta\, Q(s, o))$. Priors over rewards and option sets regularize learning or enforce structural assumptions.
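The marginalization over option-trajectory decompositions can be made concrete with a toy enumeration. Everything here is illustrative: the per-segment score is a hypothetical geometric placeholder, and real implementations replace brute-force enumeration (which is exponential in trajectory length) with sampling.

```python
from itertools import combinations
from math import prod

def segmentations(n):
    """All ways to cut time steps 0..n-1 into contiguous segments."""
    for cuts in range(n):
        for points in combinations(range(1, n), cuts):
            bounds = (0,) + points + (n,)
            yield [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

def segment_prob(segment):
    """Placeholder: probability that one option explains one segment."""
    start, end = segment
    return 0.5 ** (end - start)

def marginal_likelihood(n):
    # Sum, over all latent decompositions, of the product of segment scores.
    return sum(prod(segment_prob(seg) for seg in decomposition)
               for decomposition in segmentations(n))
```

For a length-`n` sequence there are `2**(n-1)` decompositions, which is why enumeration only scales to small domains.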
In the HIQL (multi-intention) framework (Zhu et al., 2023), the log-likelihood is maximized jointly over per-segment rewards $\{r_k\}$ and the latent trajectory segmentation $z_{1:T}$:

$$\max_{\{r_k\},\, z_{1:T}} \;\sum_{t=1}^{T} \log \pi_{z_t}(a_t \mid s_t),$$

where $\pi_k$ is the Boltzmann policy induced by the Q-function of reward $r_k$. The posterior over rewards and segmentation is addressed via EM-style alternation between segmentation (E-step, e.g., via dynamic programming) and per-phase inverse Q-learning (M-step).
In the Option Compatible Reward IRL (OCR-IRL) framework (Hwang et al., 2019), the hierarchical likelihood emerges from the structure imposed by observed option trajectories; inference proceeds by satisfying necessary gradient-driven stationarity and optimality criteria in the Q- and reward-feature spaces.
3. Hierarchical Q-Function and Value Computations
Central to hierarchical inverse Q-learning is the computation of value functions and Q-functions at the appropriate level of abstraction. In option-based formulations, the option-level value equations generalize the MDP Bellman equations to a semi-MDP setting:

$$Q(s, o) = \mathbb{E}\!\left[\sum_{t=0}^{k-1} \gamma^{t} r_{t} + \gamma^{k}\, V(s_{k}) \,\middle|\, s_{0} = s,\ o\right], \qquad V(s) = \sum_{o} \pi(o \mid s)\, Q(s, o),$$

where $k$ is the (random) duration of option $o$, and with $\pi(o \mid s) \propto \exp(\eta\, Q(s, o))$ defining a Boltzmann-rational high-level policy (Cundy et al., 2018). Value iteration in this augmented state-option space yields self-consistent Q-values required for likelihood computation and posterior evaluation.
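The option-level value iteration can be sketched on a toy semi-MDP. All numbers below are hypothetical: the model maps each (state, option) pair to a deterministic cumulative reward, duration, and successor state, and the state value averages option Q-values under a Boltzmann high-level policy.

```python
import math

GAMMA, ETA = 0.9, 2.0
STATES = ["A", "B"]
OPTIONS = ["stay", "move"]

# MODEL[(s, o)] = (cumulative_reward, duration_k, next_state)  (made-up values)
MODEL = {
    ("A", "stay"): (0.0, 1, "A"), ("A", "move"): (1.0, 2, "B"),
    ("B", "stay"): (0.5, 1, "B"), ("B", "move"): (0.0, 2, "A"),
}

def boltzmann(qs):
    # Numerically stable softmax over option Q-values.
    m = max(qs.values())
    w = {o: math.exp(ETA * (q - m)) for o, q in qs.items()}
    z = sum(w.values())
    return {o: wi / z for o, wi in w.items()}

def value_iteration(iters=200):
    Q = {key: 0.0 for key in MODEL}
    for _ in range(iters):
        V = {}
        for s in STATES:
            qs = {o: Q[(s, o)] for o in OPTIONS}
            pi = boltzmann(qs)                       # Boltzmann high-level policy
            V[s] = sum(pi[o] * qs[o] for o in OPTIONS)
        # Semi-MDP backup: discount the successor value by gamma**duration.
        Q = {(s, o): r + GAMMA ** k * V[s2]
             for (s, o), (r, k, s2) in MODEL.items()}
    return Q

Q = value_iteration()
```

Because every option discounts by at least one step, the backup is a contraction and the iteration converges to the self-consistent Q-values used in the likelihood.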
Within the HIQL approach (Zhu et al., 2023), each intention segment $k$ uses classical Q-learning over observed per-segment transitions, solving

$$Q_k(s, a) = r_k(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q_k(s', a').$$

These per-phase Q-functions define segment-specific Boltzmann policies $\pi_k(a \mid s) \propto \exp(\beta\, Q_k(s, a))$.
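The per-segment computation can be sketched as follows. This is a toy stand-in, not the paper's code: a tabular Q-function is fit to one segment's transitions under a candidate reward, and the segment's actions are then scored under the resulting Boltzmann policy (the quantity the M-step drives up). Transitions and reward values are hypothetical.

```python
import math
from collections import defaultdict

ALPHA, GAMMA, BETA = 0.5, 0.9, 3.0
ACTIONS = [0, 1]

def fit_q(transitions, reward, sweeps=100):
    # Repeated tabular Q-learning sweeps over the segment's transitions.
    Q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, s2 in transitions:
            target = reward[s] + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
    return Q

def segment_log_likelihood(transitions, reward):
    Q = fit_q(transitions, reward)
    ll = 0.0
    for s, a, _ in transitions:
        logits = [BETA * Q[(s, b)] for b in ACTIONS]
        m = max(logits)   # stable log-softmax
        ll += logits[a] - (m + math.log(sum(math.exp(x - m) for x in logits)))
    return ll

# Hypothetical segment heading toward a rewarding state 2.
segment = [(0, 1, 1), (1, 1, 2), (2, 0, 2)]
reward = {0: 0.0, 1: 0.0, 2: 1.0}
```

A reward consistent with the segment's apparent goal yields a higher segment log-likelihood than a flat reward, which is exactly the signal the E-step's segmentation exploits.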
OCR-IRL (Hwang et al., 2019) constructs admissible Q-feature spaces for option-based policies via the intra-option and termination gradient constraints, then maps those to reward features through reward shaping and further second-order optimality selection.
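The nullspace construction can be reduced, in spirit, to elementary linear algebra. The constraint matrix below is purely hypothetical (each row standing in for one stationarity constraint on Q-feature coefficients); OCR-IRL's actual constraints come from the intra-option and termination gradients.

```python
import numpy as np

def nullspace(A, tol=1e-10):
    # Right nullspace via SVD: rows of Vh beyond the numerical rank
    # span {x : A @ x = 0}; return them as columns.
    _, s, vh = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return vh[rank:].T

# Hypothetical 2-constraint matrix over 3 feature coefficients.
A = np.array([[1.0, -1.0, 0.0],
              [0.0,  1.0, -1.0]])
basis = nullspace(A)   # one free direction remains
```

The surviving directions form the admissible feature space; the subsequent quadratic selection step then picks a unique reward inside it.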
4. Inference Algorithms and Practical Implementation
Inference in hierarchical inverse Q-learning frameworks typically involves iterative or sampling-based procedures comprising:
- In BIHRL: Joint MCMC over reward $R$ and option set $\mathcal{O}$. At each iteration, propose new $(R', \mathcal{O}')$, solve the semi-MDP Bellman equations to update Q-values and soft policies, marginalize over option-trajectory decompositions to compute the data likelihood, and accept/reject based on the posterior (Cundy et al., 2018). Enumeration or importance sampling addresses latent decomposition; scaling remains computationally challenging due to the exponential trajectory-decomposition space.
- In HIQL: Coordinate ascent via an EM-style algorithm. The E-step infers optimal segment boundaries given current rewards (using, e.g., dynamic programming to maximize cumulative log-likelihood plus regularization), and the M-step updates per-segment rewards (using standard inverse Q-learning) given segmentation. The process alternates until convergence to a local likelihood optimum under mild regularity conditions (Zhu et al., 2023).
- In OCR-IRL: Feature construction by extracting a minimal Q-feature basis via closed-form linear-algebraic constraints (gradient nullspaces), deriving reward features via intra-option shaping, and finally selecting a unique reward via a quadratic (Hessian-trace-weighted) objective that enforces second-order stationarity and optimality for the empirical expert policies (Hwang et al., 2019). Full pseudocode for feature computation, reward selection, and value propagation is specified.
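The E-step's dynamic program can be sketched as a change-point search. The setup is illustrative, not the paper's exact objective: per-step log-likelihoods under each candidate intention are assumed precomputed, and a fixed switching penalty regularizes the number of segments.

```python
import math

SWITCH_PENALTY = 1.0

def best_segmentation(loglik):
    """loglik[t][k] = log-likelihood of step t under intention k.
    Returns the maximizing intention index per step."""
    T, K = len(loglik), len(loglik[0])
    dp = [[-math.inf] * K for _ in range(T)]   # best score ending in k at t
    back = [[0] * K for _ in range(T)]
    dp[0] = list(loglik[0])
    for t in range(1, T):
        for k in range(K):
            for j in range(K):
                score = dp[t - 1][j] - (SWITCH_PENALTY if j != k else 0.0)
                if score + loglik[t][k] > dp[t][k]:
                    dp[t][k] = score + loglik[t][k]
                    back[t][k] = j
    # Backtrack the argmax path of intentions.
    k = max(range(K), key=lambda i: dp[T - 1][i])
    path = [k]
    for t in range(T - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path

# Hypothetical scores: intention 0 fits the first two steps, intention 1 the rest.
loglik = [[-0.1, -2.0], [-0.1, -2.0], [-2.0, -0.1], [-2.0, -0.1]]
```

The penalty trades off fit against segment count: with it set to zero the program happily switches intention every step, while a large penalty forces a single segment.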
5. Theoretical Guarantees
Local convergence of the hierarchical EM procedure is established under standard regularity (transition kernel and IRL subproblem convexity) (Zhu et al., 2023). In the BIHRL Bayesian setting, the full posterior integrates over latent decompositions and randomizes segmentation, but in small domains, enumeration is tractable and yields tight posteriors (Cundy et al., 2018).
Identifiability in segmented (HIQL) models is established up to additive constants if state-action visitation covers the relevant domain and segments are sufficiently long. For option-based approaches, the solution space is further regularized by enforcing second-order optimality, yielding a unique reward within the nullspace of Q-feature constraints (Hwang et al., 2019).
In the bi-level CQL-ML framework (for offline IRL), alternation between conservative value estimation and maximum likelihood reward fitting is proven to yield a unique fixed point under mild Lipschitz conditions; the expert’s soft policy is the optimal solution under the recovered reward (Park, 27 Nov 2025).
6. Empirical Performance and Applications
Empirical evaluations demonstrate that hierarchical inverse Q-learning yields substantive advantages in environments where temporal abstraction, latent intentions, or subgoals structure expert behavior.
- Toy Gridworlds and Taxi: BIHRL recovers ground-truth reward more accurately and robustly than flat BIRL as data quantity increases. The marginal posterior mass and policy explainability favor the hierarchical approach (Cundy et al., 2018).
- Web Navigation (Wikispeedia): Incorporation of ~150 top-page options in hierarchical modeling reduces negative log-marginal-likelihood by a factor of two and improves goal-prediction accuracy by ~4% over flat IRL, matching strong hand-crafted baselines (Cundy et al., 2018).
- Mouse Navigation and Behavioral Neuroscience: HIQL segmentation discovers interpretable “Forage → Home” and “Home → Forage” phases. Stepwise intention segmentation yields better prediction and interpretation than continuous-time reward models (DIRL) (Zhu et al., 2023). Bandit and gridworld results further corroborate the advantage when behavior consists of discrete, phase-dependent reward structures.
- Transfer Learning and Noisy Data: OCR-IRL accelerates transfer and remains robust under noisy demonstrations in both discrete (Four-Rooms) and continuous (Car-on-the-Hill) domains, outperforming MaxEnt-IRL, LP Apprenticeship, and Behavioral Cloning (Hwang et al., 2019).
- Offline RL Benchmarks: BiCQL-ML achieves state-of-the-art returns in D4RL (MuJoCo) settings, in both low- and medium-data regimes, outperforming BC, DAC, and ValueDICE (Park, 27 Nov 2025).
| Framework | Domain Type | Key Advantage |
|---|---|---|
| BIHRL (Cundy et al., 2018) | Grid/Web | Posterior accuracy, subgoal prediction |
| HIQL (Zhu et al., 2023) | Behavioral | Interpretable reward phases, segmentation |
| OCR-IRL (Hwang et al., 2019) | Discrete/Cont. | Transfer learning, robustness |
| BiCQL-ML (Park, 27 Nov 2025) | Offline RL | Generalization, return, sample efficiency |
7. Relationship to Flat Inverse Q-Learning
Flat inverse Q-learning and traditional Bayesian IRL posit a single reward structure $R$ and infer it by explaining expert state-action pairs under a Boltzmann-rational policy $\pi(a \mid s) \propto \exp(\eta\, Q_R(s, a))$. Hierarchical inverse Q-learning extends this by (a) introducing a space of temporally abstract actions (options or intention segments) and (b) marginalizing over latent decompositions, thereby regularizing inference and reflecting the structure of compositional decision making (Cundy et al., 2018, Zhu et al., 2023, Hwang et al., 2019).
Hierarchical frameworks avoid overfitting individual primitive moves, instead matching observed compositionality in expert strategies. This adds an implicit regularization: reward inference does not require each primitive action to appear optimal, so long as the larger-scale option-level (or intention-level) choices are explained by the hierarchical value structure. As a result, reward recovery is more faithful to the demonstrator's true objectives, particularly when experts employ skills, subroutines, or latent shifts in intention that would make flat models appear irrational or highly suboptimal.
References
- "Exploring Hierarchy-Aware Inverse Reinforcement Learning" (Cundy et al., 2018)
- "Multi-intention Inverse Q-learning for Interpretable Behavior Representation" (Zhu et al., 2023)
- "Option Compatible Reward Inverse Reinforcement Learning" (Hwang et al., 2019)
- "BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning" (Park, 27 Nov 2025)