
Preference Transformer Architecture

Updated 17 February 2026
  • Preference Transformer is a neural sequence model that infers and models user preferences using transformer architectures to capture long-range non-Markovian dependencies.
  • It employs multi-scale temporal modeling, hierarchical attention, and interaction-level graphs to process heterogeneous inputs and uncover fine-grained behavioral patterns.
  • The model outperforms traditional methods in recommendation and reinforcement learning by enabling dynamic, interpretable, and efficient preference-based reward assignments.

A Preference Transformer is a neural sequence modeling architecture designed to infer, represent, and exploit user or human preferences from observed behaviors or human feedback. The approach leverages the capacity of transformer models to capture long-range dependencies and heterogeneous signals, enabling adaptive, non-Markovian, and multi-granular preference modeling across domains such as sequential recommendation and preference-based reinforcement learning.

1. Motivation and Conceptual Evolution

Historically, modeling user preference in sequential settings was limited by Markovian or uniformly aggregated schemes that ignored interaction heterogeneity, temporal structure, and intra- and inter-modality dependencies. In recommender systems, prior models collapsed events into single-typed sequences (losing behavioral heterogeneity), behavior-level sequences (pooling, e.g., all clicks), or item-level sequences (pooling all events for the same item), and thus failed to encode fine-grained interaction-level dependencies. In preference-based RL, reward models assumed rewards or preferences to be Markovian or to depend on time-uniform weighting, which poorly reflects human judgment driven by salient events or temporally extended context.

The Preference Transformer family of models addresses these limitations via multi-level graph modeling, multi-granular and hierarchical attention, and explicit design for heterogeneous input modalities (He et al., 2024, Huang et al., 2024, Zhao et al., 2024, Kim et al., 2023).

2. Architectural Principles and Key Methodologies

The core principles of Preference Transformer architectures are:

  • Interaction-Level and Multi-Order Dependency Modeling: Captures nuanced correlations between distinct (item, behavior) pairs (e.g., "purchase→click" or "click→cart") at the graph node level by constructing a fully-connected interaction graph and applying graph convolution. This approach overcomes the limitations of pooling at the behavior or item level (He et al., 2024).
  • Multi-Scale and Multi-Grained Temporal Modeling: Utilizes transformer blocks capable of encoding preferences at multiple temporal granularities, uncovering both session-level (coarse) and intra-session (fine) dynamics (He et al., 2024, Huang et al., 2024).
  • Hierarchical and Multi-Stream Processing: Employs parallel transformer structures to simultaneously track low-level preferences (e.g., item-IDs) and high-level preferences (e.g., categories), capturing slow- and fast-changing signals (Huang et al., 2024).
  • Explicit Preference Weighting and Non-Markovian Rewards: Integrates attention-based mechanisms to learn weighted, non-Markovian rewards or preferences, allowing the model to emphasize decision-critical events in trajectories, a critical aspect for reward inference in RL (Kim et al., 2023, Zhao et al., 2024).
  • Multimodal Sequence Modeling: Decomposes trajectories in RL or user sessions in recommendation into modality-specific streams (e.g., state/action), fuses them via intra- and inter-modal (cross-attention) transformer blocks, and aggregates for preference prediction (Zhao et al., 2024).
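As a concrete illustration of the inter-modal fusion principle above, here is a minimal numpy sketch of a single cross-attention step between a state stream and an action stream. The single-head, unbatched form and all names and shapes (`states`, `actions`, `d`) are illustrative assumptions, not the published architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_seq, key_value_seq, W_q, W_k, W_v):
    """One inter-modal cross-attention step: queries come from one
    modality (e.g. states), keys/values from the other (e.g. actions)."""
    Q = query_seq @ W_q
    K = key_value_seq @ W_k
    V = key_value_seq @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T, T) attention logits
    return softmax(scores, axis=-1) @ V        # fused (T, d) representation

rng = np.random.default_rng(0)
T, d = 8, 16                        # trajectory length, embedding dim (assumed)
states = rng.normal(size=(T, d))    # stand-ins for intra-modal encoder outputs
actions = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused = cross_attention(states, actions, W_q, W_k, W_v)
print(fused.shape)
```

In the full model this step is stacked and applied in both directions (states attending to actions and vice versa) before aggregation for preference prediction.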

3. Key Model Components and Algorithms

  • Constructs the sequence S = \{(v_1, b_1), \ldots, (v_T, b_T)\} with embeddings e_i for items and b_i for behaviors.
  • Forms the interaction-aware embedding h_i = e_i + b_i, stacked into H^{(0)}.
  • The adjacency matrix A encodes cross-type affinities via element-wise and inner products of e_i and b_i.
  • Multi-order dependencies are computed by an L-layer graph convolution:

H^{(l+1)} = \mathrm{LeakyReLU}\left( D^{-1/2} \tilde{A} D^{-1/2} H^{(l)} W^{(l)} \right)
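The graph convolution step can be sketched in a few lines of numpy. The adjacency construction is simplified here (a random symmetric matrix with self-loops stands in for the affinity computation from item and behavior embeddings), so this is an illustrative sketch under stated assumptions, not the paper's implementation.

```python
import numpy as np

def gcn_layer(H, A_tilde, W, slope=0.01):
    """One layer: H_next = LeakyReLU(D^{-1/2} A_tilde D^{-1/2} H W)."""
    D = A_tilde.sum(axis=1)                    # degree vector
    D_inv_sqrt = np.diag(1.0 / np.sqrt(D))
    H_next = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W
    return np.where(H_next > 0, H_next, slope * H_next)  # LeakyReLU

rng = np.random.default_rng(0)
T, d = 6, 8                          # number of interactions, embedding dim
H = rng.normal(size=(T, d))          # interaction-aware embeddings h_i = e_i + b_i
A = rng.random(size=(T, T))          # placeholder cross-type affinities (assumed)
A = (A + A.T) / 2                    # symmetrize
A_tilde = A + np.eye(T)              # add self-loops
W = rng.normal(size=(d, d)) * 0.1
H1 = gcn_layer(H, A_tilde, W)        # stack L of these for multi-order dependencies
print(H1.shape)
```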

  • Applies positional encoding to H^{(l)} to obtain \tilde{H}^{(l)}.
  • Linearized global self-attention computes:

Q = \operatorname{elu}(\tilde{H} W_Q), \quad K = \operatorname{elu}(\tilde{H} W_K), \quad V = \tilde{H} W_V

H_{\text{lin}}^{(l)} = Q \left( K^\top V \right)
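A minimal numpy sketch of this linearized attention: by associativity, the d×d matrix K^T V is formed first, reducing cost from the O(T²d) of softmax attention to O(Td²). Practical linear-attention implementations usually add a positive shift (elu + 1) and row-wise normalization; these are omitted here to mirror the formula as stated.

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def linearized_attention(H, W_q, W_k, W_v):
    """Computes Q (K^T V); grouping K^T V first avoids the T x T matrix."""
    Q = elu(H @ W_q)
    K = elu(H @ W_k)
    V = H @ W_v
    return Q @ (K.T @ V)      # (T, d) @ ((d, T) @ (T, d)) -> O(T d^2)

rng = np.random.default_rng(0)
T, d = 100, 16
H = rng.normal(size=(T, d))   # positionally-encoded interaction embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = linearized_attention(H, W_q, W_k, W_v)
print(out.shape)
```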

  • Multi-grained preference modeling partitions the sequence into sessions of length t and queries short-term preferences at multiple granularity levels using session-based multi-head attention and L_p-pooling across granularities.
  • Final output concatenates global and multi-resolution session outputs, refines with transformer feed-forward layers.
  • Dual-Transformer modules process item-ID and category sequences in parallel via self-attention and feed-forward layers, yielding low-level (v_f) and high-level (c_f) preference vectors.
  • Semantics-enhanced embeddings for target items or categories leverage time-decayed relational signals (e.g., “also_buy”, “also_view”) for context adaptation.
  • Splits the robot/agent trajectory \sigma into state and action streams, encoding each via intra-modal (causal self-attention) transformers.
  • Fuses outputs via inter-modal (cross-attention) transformer layers to model complex state-action dependencies for preference modeling.
  • Final trajectory reward is computed via mean-pooling of multimodal representations; preference likelihood follows a Bradley-Terry softmax.
  • A causal transformer backbone produces non-Markovian rewards \hat{r}_t; a preference attention layer computes attention weights w_t over the trajectory, yielding the overall preference score R(\sigma) = \sum_t w_t \hat{r}_t.
  • Preference likelihood is computed as:

P[\sigma^1 \succ \sigma^0] = \frac{\exp\left(\sum_t w_t^{(1)} \hat{r}_t^{(1)}\right)}{\sum_{j \in \{0,1\}} \exp\left(\sum_t w_t^{(j)} \hat{r}_t^{(j)}\right)}
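The attention-weighted preference score and Bradley-Terry likelihood can be sketched as follows. The softmax parameterization of the weights w_t and the toy reward values are illustrative assumptions; in the actual model both \hat{r}_t and w_t come from the learned transformer.

```python
import numpy as np

def preference_score(rewards, attn_logits):
    """R(sigma) = sum_t w_t * r_hat_t, with weights w_t taken as a softmax
    over per-step attention logits (an assumed parameterization)."""
    w = np.exp(attn_logits - attn_logits.max())
    w = w / w.sum()
    return float(np.dot(w, rewards))

def bt_likelihood(R1, R0):
    """Bradley-Terry probability that trajectory 1 is preferred:
    exp(R1) / (exp(R1) + exp(R0)), computed stably."""
    m = max(R1, R0)
    e1, e0 = np.exp(R1 - m), np.exp(R0 - m)
    return e1 / (e1 + e0)

# Toy per-step rewards and attention logits for two trajectories (assumed).
r1 = np.array([0.2, 1.5, 0.1]); a1 = np.array([0.0, 2.0, -1.0])
r0 = np.array([0.3, 0.2, 0.4]); a0 = np.array([0.0, 0.0, 0.0])
p = bt_likelihood(preference_score(r1, a1), preference_score(r0, a0))
print(round(p, 3))
```

Note how the attention logits in the first trajectory concentrate weight on the high-reward step, raising its score and hence its preference probability.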

4. Training Regimes and Objective Functions

Preference Transformer variants are trained via supervised learning on human or user feedback, or via direct item prediction. The principal objectives include:

  • Cloze-style Masked Prediction (He et al., 2024): Masks randomly chosen interactions; model predicts masked items conditioned on historical context using cross-entropy across all dependency orders, regularized for graph sparsity and model complexity.
  • Contrastive Learning (Huang et al., 2024): Employs dual InfoNCE-style contrastive losses at both item and category levels to enhance discrimination between user preferences, jointly with a BPR ranking loss.
  • Preference-based Pairwise Losses (Zhao et al., 2024, Kim et al., 2023): Trains by cross-entropy minimizing disagreement with human preference labels, using Bradley-Terry likelihoods for sampled trajectory pairs.
  • Regularization and Optimization: All approaches use Adam or AdamW, commonly with dropout and various regularization (L1, L2, early stopping, etc.).
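As an illustration of the contrastive objective, here is a minimal numpy sketch of a single InfoNCE term; the dual-contrastive setup applies such a term at both the item and category level. The vectors, the cosine similarity, and the temperature value are illustrative assumptions.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """Single-anchor InfoNCE: -log( exp(sim(a,p)/tau) /
    (exp(sim(a,p)/tau) + sum_k exp(sim(a,n_k)/tau)) )."""
    def sim(u, v):  # cosine similarity
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives]) / tau
    logits -= logits.max()   # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

rng = np.random.default_rng(0)
d = 16
u = rng.normal(size=d)                         # user preference vector (assumed)
pos = u + 0.1 * rng.normal(size=d)             # embedding of the positive item
negs = [rng.normal(size=d) for _ in range(5)]  # in-batch negatives (assumed)
loss = info_nce(u, pos, negs)
print(loss)
```

Minimizing this term pulls the user representation toward interacted items (or categories) while pushing it away from the negatives, complementing the BPR ranking loss.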

5. Distinguishing Properties and Performance Insights

Preference Transformer models are characterized by several innovations:

  • Fine-Grained Dependency Modeling: Unlike prior art pooling on the item or behavior level, the interaction-level graph and multi-order convolutions model cross-type chains of influence and complex behavioral heterogeneity (He et al., 2024).
  • Multi-Granularity and Hierarchical Attention: Local and global sequential patterns are captured over multiple temporal scales or feature hierarchies (He et al., 2024, Huang et al., 2024).
  • Preference Weighting for Non-Markovian Credit Assignment: Dynamic importance assigned to trajectory segments aligns with human judgment in RL and supports interpretable attention highlighting (Kim et al., 2023, Zhao et al., 2024).
  • Modality-Aware Fusion: In RL domains, multi-stream and cross-attention mechanisms enable the reward model to jointly reason over action and state space, outperforming Markovian or single-modality baselines (Zhao et al., 2024).
  • Empirical Superiority: Across sequential recommendation (Amazon subsets, HR@K, NDCG@K) and RL (D4RL, Meta-World, expert-normalized score, success rate), Preference Transformer baselines consistently outperform single-stream, Markovian, or shallow alternatives (He et al., 2024, Huang et al., 2024, Zhao et al., 2024, Kim et al., 2023).

6. Applications, Interpretability, and Future Directions

Preference Transformers are deployed in:

  • Multi-Behavior Sequential Recommendation: Multi-grained models robustly address behavioral heterogeneity and session dynamics, advancing state-of-the-art on large-scale e-commerce datasets (He et al., 2024).
  • Reinforcement Learning from Human Preferences: Non-Markovian, attention-weighted reward inference enables robust offline policy learning via models aligned with nuanced human feedback (Kim et al., 2023, Zhao et al., 2024).
  • Interpretability: Learned attention weights naturally highlight decisive interaction or trajectory segments, enabling model transparency and guiding future data collection or query selection (Kim et al., 2023, Zhao et al., 2024).
  • Scalability and Efficiency: Linearized attention and sessioned processing make Preference Transformers tractable for long trajectories, with complexity on par with vanilla transformer recommender models but with significant semantic enrichment (He et al., 2024).

Future directions include scaling to additional data modalities (vision, audio, language), active preference sampling to reduce annotation burden, integrating into online preference-in-the-loop learning, and hybridizing with regret or successor-feature predictors for greater robustness (Zhao et al., 2024, Kim et al., 2023).


References:

  • Multi-Grained Preference Enhanced Transformer for Multi-Behavior Sequential Recommendation (He et al., 2024)
  • Dual Contrastive Transformer for Hierarchical Preference Modeling in Sequential Recommendation (Huang et al., 2024)
  • PrefMMT: Modeling Human Preferences in Preference-based Reinforcement Learning with Multimodal Transformers (Zhao et al., 2024)
  • Preference Transformer: Modeling Human Preferences using Transformers for RL (Kim et al., 2023)
