
Markovian Pre-trained Transformer (MPT)

Updated 14 January 2026
  • MPT is a universal sequential recommendation model that pre-trains on synthetic Markov chains to emphasize the most recent user interactions.
  • The architecture combines a Transformer backbone with a lightweight adaptor, enabling effective adaptation to diverse recommendation datasets.
  • Empirical results show significant accuracy gains over traditional models, validating the method's scalability and efficiency.

The Markovian Pre-trained Transformer (MPT) is a universal, transferable recommendation model that achieves state-of-the-art next-item recommendation performance through a novel pre-training regime on synthetic Markov chains and the use of a lightweight adaptor. MPT exploits the empirical finding that advanced sequential recommenders primarily utilize the most recent interaction—reflecting a Markovian, first-order dependency—while earlier interactions serve as non-sequential signals for inferring user identity. By pre-training a Transformer backbone on synthetic Markov chain data and adapting with minimal parameters, MPT combines scalability, universality, and competitive accuracy for sequential recommendation tasks (Xu et al., 13 Jan 2026).

1. Markovian Principle in Sequential Recommendation

A crucial empirical insight guiding MPT is the “Markovian” nature observed across advanced sequential recommendation models. Given a user interaction sequence $[v_1, v_2, \dots, v_t]$, typical models estimate $\Pr(v_{t+1}\mid v_t, v_{t-1},\dots,v_1;\Theta)$. However, experiments demonstrate that shuffling all but the last interaction $\{v_1,\dots,v_{t-1}\}$ hardly degrades next-item prediction accuracy; in contrast, shuffling the entire sequence reduces performance, yet prediction still relies disproportionately on the final item.

This empirical property is formalized as:

$$\Pr(v_{t+1}\mid v_t,v_{t-1},\ldots,v_1) \approx \Pr\bigl(v_{t+1}\mid v_t,\{v_1,\dots,v_{t-1}\}\bigr),$$

i.e., next-item prediction is nearly first-order Markovian, where $v_t$ captures short-term intent and $\{v_1,\dots,v_{t-1}\}$ aggregates into a user profile. Thus, sequence modeling in recommendation decomposes into two capabilities:

  1. Summarizing general user preferences from historical interactions (profile).
  2. Emphasizing the last interaction for immediate next-item prediction (Xu et al., 13 Jan 2026).
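The shuffling diagnostic behind this observation can be sketched as follows. This is an illustrative reconstruction, not the paper's exact evaluation protocol: permute every interaction except the most recent one, then compare a trained model's accuracy on the permuted versus the original ordering.

```python
import random

def shuffle_keep_last(seq, rng=None):
    """Permute all interactions except the most recent one.

    Feeding such sequences to a trained recommender and comparing
    accuracy against the original ordering tests whether prediction
    depends on anything beyond the last item plus the unordered set
    of earlier items (the Markovian property described above).
    """
    rng = rng or random.Random(0)
    prefix = list(seq[:-1])
    rng.shuffle(prefix)
    return prefix + [seq[-1]]

history = ["v1", "v2", "v3", "v4", "v5"]
shuffled = shuffle_keep_last(history)
```

The permuted sequence preserves both signals the section identifies: the final item (short-term intent) and the multiset of earlier items (user profile).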

2. Model Architecture and Pre-training

2.1 Transformer Backbone

The MPT employs a standard Transformer architecture:

$$\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\Bigl(\frac{QK^\top}{\sqrt{d_k}}\Bigr)V$$

used in $L$ stacked layers with $h$ attention heads.

  • Feed-Forward Network (FFN): Per-layer FFN comprised of two linear projections and a nonlinearity (e.g., GeLU).
  • Residual/RMSNorm: Each block is equipped with RMSNorm and residual connections.
  • Typical hyperparameters: $L = 4$, $d = 256$, $h = 2$ (Xu et al., 13 Jan 2026).
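The attention formula above can be made concrete with a minimal single-head sketch in pure Python (no masking or multi-head splitting; MPT's actual backbone stacks $L=4$ causal layers with $h=2$ heads):

```python
import math

def softmax(row):
    m = max(row)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)           # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, with weights given by the scaled, softmax-normalized query-key similarities.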

2.2 Markov Chain Pre-training

MPT is pre-trained exclusively on synthetic Markov chains:

  • Trajectory Generation: Trajectories $(s_1,\dots,s_T)$ are generated from random $|\mathcal S|\times|\mathcal S|$ transition matrices $\mathbf P$; each row $\mathbf p_i$ follows $\mathrm{Dir}(\alpha,\dots,\alpha)$ with $\alpha\approx 0.05$ for sparsity.
  • Trajectory length: Fixed $T=1024$; trajectories are sampled on the fly, so the supply of pre-training data is effectively unlimited.
  • Input diversity: Each trajectory employs a fresh random orthogonal projection from one-hot states to vector inputs.
  • Objective: Next-State Prediction (NSP),

$$\mathcal{L}_{\mathrm{NSP}}(\Theta) = \mathbb{E}_{\mathbf P\sim\mathrm{Dir}(\alpha)}\; \mathbb{E}_{(s_t)\sim \mathbf P} \left[-\sum_{t=1}^{T-1} \log \Pr(s_{t+1}\mid s_t,\dots,s_1;\Theta)\right],$$

effectively training the model to estimate state-transition probabilities and to attend sharply to the current state.

Pre-training on synthetic Markov data offers architectural universality and exposes the model to unlimited, controlled trajectories (Xu et al., 13 Jan 2026).
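The NSP objective can be sketched as an average next-state cross-entropy over a trajectory. In this toy illustration (hypothetical helper names, not the paper's code) the predictor is the Bayes-optimal first-order reference — the true transition row for the current state — which is exactly what the objective pushes the Transformer to approximate:

```python
import math

def nsp_loss(trajectory, predict):
    """Average next-state cross-entropy over one trajectory.

    `predict(prefix)` is any model mapping a state prefix to a
    probability distribution (dict: state -> prob) over the next
    state. Plugging in the true transition row gives the
    Bayes-optimal reference loss for the chain.
    """
    total = 0.0
    for t in range(len(trajectory) - 1):
        probs = predict(trajectory[: t + 1])
        total += -math.log(probs[trajectory[t + 1]])
    return total / (len(trajectory) - 1)

# Toy 2-state chain with a known transition matrix (a stand-in for
# the random Dirichlet-sampled matrices used in pre-training).
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
oracle = lambda prefix: P[prefix[-1]]   # first-order Markov predictor
loss = nsp_loss([0, 0, 1, 1, 1, 0], oracle)
```

Because the oracle conditions only on the last state, its loss is a floor that any well-trained backbone on this chain should approach, mirroring the first-order structure MPT is meant to internalize.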

3. Fine-tuning via Lightweight Adaptor

3.1 Adaptor Module

For alignment with recommendation tasks and datasets, MPT introduces a lightweight input adaptor while keeping the pre-trained backbone frozen. The adaptor, inserted at each time step, consists of:

  • RMSNorm
  • Linear → LeakyReLU
  • Linear

This adaptor maps item semantic features (e.g., text embeddings) into the token space of MPT. The additional parameter count $\phi$ is only a few percent of the frozen parameter set $\Theta$ (Xu et al., 13 Jan 2026).
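The three-stage adaptor can be sketched in pure Python as follows. Shapes and weights here are toy illustrations, not the paper's configuration:

```python
import math

def rmsnorm(x, eps=1e-8):
    """Root-mean-square normalization (no mean subtraction)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(x, W, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]

def adaptor(features, W1, b1, W2, b2):
    """RMSNorm -> Linear -> LeakyReLU -> Linear, mapping item
    semantic features into the frozen backbone's token space."""
    h = rmsnorm(features)
    h = leaky_relu(linear(h, W1, b1))
    return linear(h, W2, b2)

# Toy shapes: 3-dim item feature -> 2-dim token embedding.
W1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]
token = adaptor([1.0, 2.0, 3.0], W1, b1, W2, b2)
```

Only these small weight matrices ($\phi$) are trained per dataset; the backbone parameters ($\Theta$) stay frozen.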

3.2 Fine-tuning and Prediction

Fine-tuning maximizes the likelihood of the next item:

$$\mathcal{L}_{\mathrm{NIP}}(\Theta,\phi) = -\sum_{(u, v_1, \dots, v_t)\in\mathcal{D}_{\mathrm{rec}}} \log \Pr\left(v_{t+1}\mid v_t,\dots,v_1,u;\,\Theta,\phi\right),$$

where $\Pr(\cdot)$ is implemented as a cosine similarity with a learnable temperature. With $\Theta$ frozen, only $\phi$ is trained, encouraging universal adaptation to varying item spaces while retaining generalizable sequence summarization and last-item emphasis (Xu et al., 13 Jan 2026).
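A minimal sketch of the temperature-scaled cosine scoring (function names and the temperature value are illustrative assumptions; in MPT the temperature is learned):

```python
import math

def cosine_score(u, v, temperature):
    """Temperature-scaled cosine similarity used as the logit for
    the next-item distribution."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv * temperature)

def next_item_probs(query, item_embs, temperature=0.07):
    """Softmax over cosine logits against every candidate item."""
    logits = [cosine_score(query, e, temperature) for e in item_embs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Query aligned with item 0, orthogonal to item 1, opposed to item 2.
probs = next_item_probs([1.0, 0.0],
                        [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

A small temperature sharpens the distribution, so the best-aligned item dominates; training the temperature lets the model calibrate this sharpness per dataset.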

4. Synthetic Data Generation

MPT’s pre-training regime leverages the following synthetic Markov chain configuration:

  • State space: $|\mathcal S|\approx 30$
  • Dirichlet $\alpha$: $0.05$, to induce sparse transitions
  • Trajectory length: $T=1024$
  • Input transformation: New random orthogonal projections per trajectory to prevent memorization
  • Scale: On-the-fly trajectory generation totaling approximately 10 billion tokens.

This pipeline enables unlimited, diverse trajectories, controlling transition sparsity and state-space cardinality to match real-world recommendation environments (Xu et al., 13 Jan 2026).

5. Empirical Performance and Analysis

5.1 Datasets and Evaluation Metrics

MPT is evaluated on five public datasets:

Dataset        Users    Items    Interactions  Avg. Length
Beauty         22,363   12,101   198,502       8.9
Toys           19,412   11,924   167,597       8.6
Sports         35,598   18,357   296,337       8.3
Yelp           77,277   45,638   2,103,896     27.2
Online Retail  16,520   3,469    519,906       31.5

Metrics: HR@$K$ and NDCG@$K$ ($K = 10, 20$) (Xu et al., 13 Jan 2026).

5.2 Baselines and Results

Benchmarked methods:

  • From-scratch: GRU4Rec, SASRec, FMLPRec, HSTU, SASRec+
  • Pre-trained recommenders: UniSRec, RecFormer
  • LLMs: E4SRec (LLaMA-2-based), Qwen2.5-7B

MPT+Adaptor achieves, on average,

  • $\approx 12\%$ relative gain over the strongest from-scratch model (SASRec+)
  • $\approx 35\%$ gain over the best pre-trained baseline

For instance, Beauty (NDCG@10):

Model        NDCG@10
SASRec+      0.0591
UniSRec      0.0382
MPT+Adaptor  0.0673

(+13.8% vs. SASRec+, +43.3% vs. UniSRec). Comparable gains are reported across Toys, Sports, Yelp, and Online Retail (Xu et al., 13 Jan 2026).

6. Ablations and Behavioral Analysis

Ablation experiments support the centrality of Markovian pre-training:

  • Pre-training effect: Removing pre-training reduces the model to SASRec+, while including it yields consistent $10$–$15\%$ improvements.
  • Fine-tuning: Adaptor-only tuning suffices on most datasets; LoRA-based strategies may overfit, except on Yelp, where LoRA offers a marginal benefit.
  • Synthetic sequence parameters: $T\approx 1024$ (trajectory length) and $|\mathcal S|\approx 30$ (state-space size) are empirically optimal, and the sparsity level ($\alpha=0.05$) matches real-world data.
  • Attention analysis: Lower layers of MPT maintain sharp diagonal attention—prioritizing the most recent interaction, in contrast to pre-trained LLMs, which produce diffuse, less effective attention for recommendation (Xu et al., 13 Jan 2026).

7. Contributions, Limitations, and Prospects

MPT’s primary contributions:

  • Empirical demonstration that advanced sequential recommenders display first-order, Markovian behavior.
  • Introduction of MPT: a Transformer pre-trained solely on synthetic Markov chains, yielding universal sequence summarization and robust short-term preference modeling.
  • State-of-the-art performance with minimal adaptation, exceeding both specialist recommenders and large LLM-based systems at lower cost.

Noted limitations include:

  • Restriction to first-order dynamics and absence of higher-order or side-channel information in pre-training.
  • Occasional requirement for more flexible parameter-efficient adaptation (e.g., LoRA) in complex domains.
  • Underfitting in domains with highly specialized or extreme user behaviors.

Future research directions include:

  • Augmenting synthetic data to reflect higher-order dependencies, rich side-info, or variable Markov orders.
  • Synthetic user simulation via LLM-based agents for more realistic data.
  • Extending pre-training to multi-task collaborative settings (ratings, multi-behavior, reviews).
  • Theoretical investigation of scaling laws and Bayes-optimality beyond first-order Markov processes (Xu et al., 13 Jan 2026).