Hierarchical Preference Optimization (HPO)
- Hierarchical Preference Optimization (HPO) is a framework that aligns neural policies using multi-granular, preference-based supervision at trajectory, group, and step levels.
- It employs hierarchical loss decompositions with DPO-style losses and dual-axis curriculum strategies to systematically integrate diverse preference signals.
- Empirical results across domains like language models, video generation, and reinforcement learning demonstrate significant performance gains and improved bias-variance tradeoffs.
Hierarchical Preference Optimization (HPO) is a principled framework for leveraging multi-granular, preference-based supervision to align neural policies—such as LLM agents, multimodal models, video generators, and reinforcement learning controllers—with complex, long-horizon or multimodal tasks. HPO systematically integrates disparate preference signals across multiple levels of granularity using hierarchical loss decompositions and curriculum strategies, yielding strong empirical results and favorable theoretical properties in several domains. At its core, HPO resolves the fundamental granularity mismatch between coarse, trajectory- or instance-level guidance and fine-grained, step- or segment-level credit assignment, by constructing intermediary groupings or structural decompositions and directing optimization with dual- or multi-axis curricula (Gao et al., 26 Sep 2025).
1. Preference Granularity: Multi-Scale Supervision
HPO’s principal innovation is the orchestration of preference data at distinct temporal or structural levels:
- Trajectory/Instance-Level Preferences: Preferences between entire episodes or generated samples provide global alignment with task outcomes. In LLM agents, expert ("win") vs. suboptimal ("lose") rollouts under shared instructions are compared using a Bradley–Terry-type loss, supporting outcome-driven training but suffering from low credit-assignment resolution (Gao et al., 26 Sep 2025).
- Group/Segment/Clip-Level Preferences: Mid-level structures—action ‘groups’ in LLMs, segments in MLLMs, clips in video models—are formed by semantic, heuristic, or context-aware segmentation. Preferences here capture the synergistic effects of multi-step sub-tasks with reduced variance relative to trajectory-level signals and lower bias than step-level signals (see Section 5 on bias-variance trade-off) (Gao et al., 26 Sep 2025, Li et al., 28 May 2025, Huang et al., 17 Apr 2025).
- Step/Token/Perceptive-Level Preferences: Atomic decisions (tokens, primitive actions, perceptual details) are contrasted via local preferences; for example, token-level sequential KL divergence aligns every output with reference distributions, regularizing the fine structure of behavior but potentially missing higher-order sub-task dependencies (Gao et al., 26 Sep 2025, Fu et al., 28 Jan 2025).
Table: HPO Granularities and Representative Domains

| Level      | Domain Example        | Typical Objective                   |
|------------|-----------------------|-------------------------------------|
| Trajectory | LLM/agent synthesis   | Logit difference of full rollouts   |
| Group      | Action clusters       | Logit difference over action groups |
| Step/Token | Single actions/tokens | Stepwise suffix/single-token KL     |
This arrangement enables alignment at scales most appropriate for the inherent compositionality of the task, as shown in LLMs, hierarchical RL, and multimodal reasoning (Gao et al., 26 Sep 2025, Fu et al., 28 Jan 2025, Singh et al., 2024).
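The multi-scale decomposition above can be sketched concretely. The following minimal Python illustration shows how one win/lose rollout pair yields preference pairs at all three granularities; fixed-window segmentation stands in for the semantic or uncertainty-based cuts used in practice, and all function names are hypothetical:

```python
def segment_trajectory(actions, group_size=3):
    """Split a rollout into contiguous action groups. Fixed windows are a
    stand-in here; HPO instantiations also use semantic or uncertainty-based
    segmentation."""
    return [actions[i:i + group_size] for i in range(0, len(actions), group_size)]

def preference_views(win, lose, group_size=3):
    """Build the three granularities of preference pairs from a single
    win/lose rollout pair sharing the same instruction."""
    return {
        "trajectory": [(win, lose)],                       # one global pair
        "group": list(zip(segment_trajectory(win, group_size),
                          segment_trajectory(lose, group_size))),
        "step": list(zip(win, lose)),                      # atomic decisions
    }
```

Each view then feeds the corresponding level of the hierarchical objective described in Section 2.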
2. Hierarchical Loss Decomposition and Optimization
HPO introduces a multi-term objective aggregating losses at each granularity, frequently employing DPO-style (Direct Preference Optimization) negative-log-sigmoid or cross-entropy losses over log-likelihood differences between 'win' and 'lose' exemplars, calibrated against a reference policy $\pi_{\mathrm{ref}}$. The general form is:

$$\mathcal{L}_{\mathrm{HPO}} = \sum_{\ell \in \{\mathrm{traj},\, \mathrm{group},\, \mathrm{step}\}} \lambda_\ell\, \mathcal{L}_\ell,$$
where $\mathcal{L}_\ell$ is the DPO-style loss for level $\ell$ (trajectory, group, step), and the $\lambda_\ell$ are stage- or curriculum-adjusted weights (Gao et al., 26 Sep 2025, Huang et al., 17 Apr 2025, Chen et al., 14 Aug 2025). Each loss is explicitly defined:
- Trajectory-level:

$$\mathcal{L}_{\mathrm{traj}} = -\,\mathbb{E}_{(x,\,\tau^w,\,\tau^l)}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(\tau^w \mid x)}{\pi_{\mathrm{ref}}(\tau^w \mid x)} - \beta \log \tfrac{\pi_\theta(\tau^l \mid x)}{\pi_{\mathrm{ref}}(\tau^l \mid x)}\Big)\Big]$$

- Group-level, with winning/losing action groups $g^w, g^l$ conditioned on a shared history $h$:

$$\mathcal{L}_{\mathrm{group}} = -\,\mathbb{E}_{(x,\,h,\,g^w,\,g^l)}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(g^w \mid x, h)}{\pi_{\mathrm{ref}}(g^w \mid x, h)} - \beta \log \tfrac{\pi_\theta(g^l \mid x, h)}{\pi_{\mathrm{ref}}(g^l \mid x, h)}\Big)\Big]$$
- Step/token-level: Similar form, operating on suffixes or tokens.
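A minimal numerical sketch of the hierarchical objective, assuming the standard DPO negative-log-sigmoid form over reference-adjusted log-likelihood margins (the function names and the per-level averaging are illustrative, not the authors' implementation):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: negative log-sigmoid of the reference-adjusted
    log-likelihood margin between the 'win' and 'lose' sample."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

def hpo_loss(level_pairs, weights, beta=0.1):
    """Weighted sum of DPO-style losses across granularity levels.
    level_pairs[name] holds (logp_w, logp_l, ref_logp_w, ref_logp_l) tuples
    for that level; weights are the curriculum-adjusted lambdas."""
    total = 0.0
    for name, pairs in level_pairs.items():
        level = sum(dpo_loss(*p, beta=beta) for p in pairs) / max(len(pairs), 1)
        total += weights.get(name, 0.0) * level
    return total
```

With equal policy and reference log-likelihoods the margin is zero and each DPO term evaluates to log 2; a policy that prefers the winning sample drives the loss below that baseline.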
In particular, the group-level loss achieves a fundamental statistical trade-off: for a group length $k$, group-DPO exhibits bias bounded by the minimum of the trajectory- and step-DPO biases plus an additive term controlled by $k$, and variance bounded by a factor depending on $k$ and the horizon length $H$ times the smaller of the trajectory- and step-DPO variances (Gao et al., 26 Sep 2025).
Curriculum schedulers—either staged (as in (Gao et al., 26 Sep 2025)) or weighted (as in (Wu et al., 21 Oct 2025))—further prioritize samples according to group complexity (length) and discriminability (reward gap), incrementally advancing training from easy, simple preferences to hard, compound ones.
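A toy sketch of such a dual-axis scheduler, with hypothetical thresholds on the two axes (the papers' exact bucketing rules may differ):

```python
def dual_axis_buckets(samples, len_thresh, gap_thresh):
    """Bucket group-level preference samples along two curriculum axes:
    group length (complexity) and reward gap (discriminability). Short
    groups with clear reward gaps are 'easy'; long groups with small gaps
    are 'hard'. Thresholds here are illustrative."""
    buckets = {"easy": [], "medium": [], "hard": []}
    for s in samples:
        short = s["length"] <= len_thresh
        clear = s["reward_gap"] >= gap_thresh
        if short and clear:
            buckets["easy"].append(s)
        elif short or clear:
            buckets["medium"].append(s)
        else:
            buckets["hard"].append(s)
    return buckets

def staged_schedule(buckets, stage):
    """Stage 0 trains on easy samples only; later stages add harder pools,
    advancing from simple preferences to compound ones."""
    order = ["easy", "medium", "hard"]
    pool = []
    for name in order[:stage + 1]:
        pool.extend(buckets[name])
    return pool
```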
3. Architectural and Algorithmic Instantiations
Implementations of HPO follow a general pipeline:
- Reference model initialization: the reference policy $\pi_{\mathrm{ref}}$ is typically a supervised fine-tuned (SFT) model trained on expert data.
- Data construction: Preference pairs (a shared context with a winning and a losing sample) are collected at all relevant granularities, with losing samples generated from $\pi_{\mathrm{ref}}$ via controlled rollouts or perturbations. Group or segment extraction employs semantic clustering, uncertainty-based cuts, or fixed windows (Gao et al., 26 Sep 2025, Chen et al., 14 Aug 2025, Li et al., 28 May 2025).
- Bucketing and curriculum: Group-level samples are sorted by length and reward gap into buckets; a scheduler selectively exposes them to the main objective in distinct phases (simple to complex) (Gao et al., 26 Sep 2025). Alternatively, adaptive weighting based on intra/inter-group diversity may be used (Wu et al., 21 Oct 2025).
- Hierarchical optimization: Staged or simultaneous minibatch sampling from each relevant preference pool, followed by parameter updates via sum of hierarchical losses. Cross-modal objectives are included as required in multimodal/video domains (Fu et al., 28 Jan 2025, Huang et al., 17 Apr 2025, Chen et al., 14 Aug 2025).
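Putting the pipeline together, an illustrative skeleton of the hierarchical optimization loop (not the authors' pseudocode; the per-level loss functions and the parameter update are abstracted as callables):

```python
import random

def hpo_train(pref_pools, loss_fns, weights, n_iters, batch_size, update):
    """Skeleton HPO loop: each iteration draws a minibatch from every
    granularity pool, forms the weighted sum of per-level losses, and
    hands the scalar to a caller-supplied update step."""
    for _ in range(n_iters):
        total = 0.0
        for level, pool in pref_pools.items():
            batch = random.sample(pool, min(batch_size, len(pool)))
            level_loss = sum(loss_fns[level](x) for x in batch) / len(batch)
            total += weights[level] * level_loss
        update(total)  # e.g. backprop + optimizer step in a real system
```

A curriculum scheduler would swap `pref_pools` between iterations to expose progressively harder buckets.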
Pseudocode instantiations are given for LLMs (Gao et al., 26 Sep 2025), diffusion models (Chen et al., 14 Aug 2025), video-text models (Huang et al., 17 Apr 2025), super-resolution (Wu et al., 21 Oct 2025), HRL (Singh et al., 2024, Singh et al., 2024), and MLLMs (Fu et al., 28 Jan 2025, Li et al., 28 May 2025).
4. Domain-Specific Applications and Empirical Results
4.1 Language Agents
Hierarchical Preference Optimization (as Hierarchical Preference Learning, HPL) for long-context LLM agents leverages trajectory, group, and step preferences with a dual-axis curriculum scheduler. Empirical ablations demonstrate that excluding group-DPO yields the largest performance drop and that both curriculum axes (length and difficulty) are required for optimal agent performance on multi-step benchmarks (Gao et al., 26 Sep 2025).
4.2 Multimodal and Video Domains
In CHiP and VistaDPO, HPO generalizes to multimodal LLMs and large video models via instance, segment/temporal, and token/perceptive-level alignment. This approach leads to significant reductions in hallucination and large relative gains in question answering and captioning performance (Fu et al., 28 Jan 2025, Huang et al., 17 Apr 2025). In PhysHPO, four levels (instance, state, motion, semantic) are optimized for physically-plausible video generation, with ablations attributing performance gains to the presence of each distinct level (Chen et al., 14 Aug 2025).
4.3 Image Super-Resolution
In DPO-SR, HPO adaptively weights DPO losses across multiple grouped model rollouts, focusing learning on pairwise image preferences with large intra-group reward gaps and emphasizing groups with high inter-group diversity. This yields improved perceptual alignment and output stability, especially for smaller models (Wu et al., 21 Oct 2025).
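One way such adaptive weighting could look, as an illustrative sketch (the intra-gap and inter-diversity measures and the alpha/beta knobs are assumptions, not DPO-SR's exact formulas):

```python
def adaptive_group_weights(group_rewards, alpha=1.0, beta=1.0):
    """Assign each group of rollout rewards a weight that grows with its
    intra-group reward gap (max - min, a proxy for a clear pairwise
    preference) and with its distance from the mean reward of all groups
    (a proxy for inter-group diversity). Weights are normalized to sum to 1."""
    means = [sum(g) / len(g) for g in group_rewards]
    overall = sum(means) / len(means)
    weights = []
    for g, m in zip(group_rewards, means):
        intra = max(g) - min(g)
        inter = abs(m - overall)
        weights.append(alpha * intra + beta * inter)
    total = sum(weights) or 1.0
    return [w / total for w in weights]
```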
4.4 Hierarchical Reinforcement Learning
Hierarchical Preference Optimization in RL contexts transforms the bi-level HRL problem into a primitive-regularized DPO objective, ensuring feasibility of subgoals and mitigating non-stationarity. Experimental results show up to 35% gains over strong HRL baselines in sparse-reward tasks, and ablations validate the necessity of lower-level value regularization (Singh et al., 2024, Singh et al., 2024).
5. Theoretical Properties: Bias-Variance and Feasibility
A key theoretical result establishes that group-level DPO realizes a middle ground in the bias-variance spectrum for preference loss. For appropriate group size , bias is closely tied to the better of trajectory- or step-level DPO, while variance is significantly reduced for long-horizon tasks (Gao et al., 26 Sep 2025). In HRL, the Lagrangian regularization in HPO enables provable feasibility: subgoals are penalized according to the lower-level policy’s value, enforcing attainability and ruling out infeasible or degenerate solution modes (Singh et al., 2024, Singh et al., 2024).
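The feasibility mechanism can be illustrated schematically: in the sketch below, the preference margin for a proposed subgoal is reduced whenever the lower-level policy's value for that subgoal falls short of a threshold (the names and the hinge form are illustrative, not the papers' exact Lagrangian):

```python
def feasibility_penalized_margin(margin, subgoal_value, value_thresh, lam):
    """Penalize a higher-level preference margin by the lower-level
    policy's value shortfall for the proposed subgoal: attainable
    subgoals (value >= threshold) pass through unchanged, while
    infeasible ones are down-weighted by the multiplier lam."""
    shortfall = max(0.0, value_thresh - subgoal_value)
    return margin - lam * shortfall
```

This captures the qualitative effect described above: infeasible or degenerate subgoal proposals cannot dominate the preference objective.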
6. Practical Implementation and Ablation Insights
Practical deployment of HPO hinges on structured curriculum design and careful segmentation/grouping algorithms. Empirical studies consistently show:
- Largest performance drops occur when group-/mid-level losses are ablated.
- Static or single-axis curricula are markedly inferior to dual-axis staged schedules (Gao et al., 26 Sep 2025).
- Semantic segmentation for group extraction outperforms simple or uncertainty-based approaches.
- In image, video, and multimodal tasks, the inclusion of each hierarchical level contributes distinct, additive gains, with cross-modal and semantic levels providing critical improvements in alignment quality and reduction of hallucination (Chen et al., 14 Aug 2025, Huang et al., 17 Apr 2025).
- Adaptive intra-/inter-group weighting in image super-resolution robustifies training and accelerates convergence (Wu et al., 21 Oct 2025).
7. Extension and Significance
The HPO framework has seen broad extension across autonomy, multimodal understanding, generative modeling, and control. By codifying a general recipe—identify a natural hierarchy of structural granularity, construct preference pairs at each level, and jointly optimize a weighted sum of DPO-style losses—HPO enables robust credit assignment and alignment in settings where reward engineering, dense supervision, or scalar feedback is infeasible.
As the scope of AI models and tasks continues to scale in complexity and modality, HPO provides a flexible, theoretically-motivated, and empirically validated architecture for leveraging preference data of varying granularity to ensure consistent alignment with human intent and task objectives (Gao et al., 26 Sep 2025, Wu et al., 21 Oct 2025, Singh et al., 2024, Singh et al., 2024).