
Markovian Pre-trained Transformer (MPT)

Updated 14 January 2026
  • MPT is a universal sequential recommendation model that pre-trains on synthetic Markov chains to emphasize the most recent user interactions.
  • The architecture combines a Transformer backbone with a lightweight adaptor, enabling effective adaptation to diverse recommendation datasets.
  • Empirical results show significant accuracy gains over traditional models, validating the method's scalability and efficiency.

The Markovian Pre-trained Transformer (MPT) is a universal, transferable recommendation model that achieves state-of-the-art next-item recommendation performance through a novel pre-training regime on synthetic Markov chains and the use of a lightweight adaptor. MPT exploits the empirical finding that advanced sequential recommenders primarily utilize the most recent interaction—reflecting a Markovian, first-order dependency—while earlier interactions serve as non-sequential signals for inferring user identity. By pre-training a Transformer backbone on synthetic Markov chain data and adapting with minimal parameters, MPT combines scalability, universality, and competitive accuracy for sequential recommendation tasks (Xu et al., 13 Jan 2026).

1. Markovian Principle in Sequential Recommendation

A crucial empirical insight guiding MPT is the “Markovian” nature observed across advanced sequential recommendation models. Given a user interaction sequence $[v_1, v_2, \dots, v_t]$, typical models estimate $\Pr(v_{t+1}\mid v_t, v_{t-1},\dots,v_1;\Theta)$. However, experiments demonstrate that shuffling all but the last interaction $\{v_1,\dots,v_{t-1}\}$ hardly degrades next-item prediction accuracy; in contrast, shuffling the entire sequence reduces performance, yet prediction still relies disproportionately on the final item.

This empirical property is formalized as:

$$\Pr(v_{t+1}\mid v_t,v_{t-1},\ldots,v_1) \approx \Pr\bigl(v_{t+1}\mid v_t,\{v_1,\dots,v_{t-1}\}\bigr),$$

i.e., next-item prediction is nearly first-order Markovian, where $v_t$ captures short-term intent and $\{v_1,\dots,v_{t-1}\}$ aggregates into a user profile. Thus, sequence modeling in recommendation decomposes into two capabilities:

  1. Summarizing general user preferences from historical interactions (profile).
  2. Emphasizing the last interaction for immediate next-item prediction (Xu et al., 13 Jan 2026).
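The shuffling diagnostic behind this observation can be sketched as follows. This is an illustrative reconstruction, not the paper's exact evaluation protocol: permute every interaction except the most recent one, then compare a trained model's accuracy on the permuted versus the original ordering.

```python
import random

def shuffle_keep_last(seq, rng=None):
    """Permute all interactions except the most recent one.

    Feeding such sequences to a trained recommender and comparing
    accuracy against the original ordering tests whether prediction
    depends on anything beyond the last item plus the unordered set
    of earlier items (the Markovian property described above).
    """
    rng = rng or random.Random(0)
    prefix = list(seq[:-1])
    rng.shuffle(prefix)
    return prefix + [seq[-1]]

history = ["v1", "v2", "v3", "v4", "v5"]
shuffled = shuffle_keep_last(history)
```

The permuted sequence preserves both signals the section identifies: the final item (short-term intent) and the multiset of earlier items (user profile).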

2. Model Architecture and Pre-training

2.1 Transformer Backbone

The MPT employs a standard Transformer architecture:

$$\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\Bigl(\frac{QK^\top}{\sqrt{d_k}}\Bigr)V$$

used in $L$ stacked layers with $h$ attention heads.

  • Feed-Forward Network (FFN): Per-layer FFN comprised of two linear projections and a nonlinearity (e.g., GeLU).
  • Residual/RMSNorm: Each block is equipped with RMSNorm and residual connections.
  • Typical hyperparameters: $L = 4$, $d = 256$, $h = 2$ (Xu et al., 13 Jan 2026).
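The attention formula above can be made concrete with a minimal single-head sketch in pure Python (no masking or multi-head splitting; MPT's actual backbone stacks $L=4$ causal layers with $h=2$ heads):

```python
import math

def softmax(row):
    m = max(row)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)           # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, with weights given by the scaled, softmax-normalized query-key similarities.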

2.2 Markov Chain Pre-training

MPT is pre-trained exclusively on synthetic Markov chains:

  • Trajectory Generation: Trajectories $(s_1,\dots,s_T)$ are generated from random $|\mathcal S|\times|\mathcal S|$ transition matrices $\mathbf P$; each row $\mathbf p_i$ follows $\mathrm{Dir}(\alpha,\dots,\alpha)$ with $\alpha\approx 0.05$ for sparsity.
  • Trajectory length: Fixed $T=1024$; trajectories are sampled on the fly, so the supply of pre-training data is effectively unlimited.
  • Input diversity: Each trajectory employs a fresh random orthogonal projection from one-hot states to vector inputs.
  • Objective: Next-State Prediction (NSP),

$$\mathcal{L}_{\mathrm{NSP}}(\Theta) = \mathbb{E}_{\mathbf P\sim\mathrm{Dir}(\alpha)}\; \mathbb{E}_{(s_t)\sim \mathbf P} \left[-\sum_{t=1}^{T-1} \log \Pr(s_{t+1}\mid s_t,\dots,s_1;\Theta)\right],$$

effectively training the model to estimate state-transition probabilities and to attend sharply to the current state.

Pre-training on synthetic Markov data offers architectural universality and exposes the model to unlimited, controlled trajectories (Xu et al., 13 Jan 2026).
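The NSP objective can be sketched as an average next-state cross-entropy over a trajectory. In this toy illustration (hypothetical helper names, not the paper's code) the predictor is the Bayes-optimal first-order reference — the true transition row for the current state — which is exactly what the objective pushes the Transformer to approximate:

```python
import math

def nsp_loss(trajectory, predict):
    """Average next-state cross-entropy over one trajectory.

    `predict(prefix)` is any model mapping a state prefix to a
    probability distribution (dict: state -> prob) over the next
    state. Plugging in the true transition row gives the
    Bayes-optimal reference loss for the chain.
    """
    total = 0.0
    for t in range(len(trajectory) - 1):
        probs = predict(trajectory[: t + 1])
        total += -math.log(probs[trajectory[t + 1]])
    return total / (len(trajectory) - 1)

# Toy 2-state chain with a known transition matrix (a stand-in for
# the random Dirichlet-sampled matrices used in pre-training).
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
oracle = lambda prefix: P[prefix[-1]]   # first-order Markov predictor
loss = nsp_loss([0, 0, 1, 1, 1, 0], oracle)
```

Because the oracle conditions only on the last state, its loss is a floor that any well-trained backbone on this chain should approach, mirroring the first-order structure MPT is meant to internalize.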

3. Fine-tuning via Lightweight Adaptor

3.1 Adaptor Module

For alignment with recommendation tasks and datasets, MPT introduces a lightweight input adaptor while keeping the pre-trained backbone frozen. The adaptor, inserted at each time step, consists of:

  • RMSNorm
  • Linear → LeakyReLU
  • Linear

This adaptor maps item semantic features (e.g., text embeddings) into the token space of MPT. The additional parameter count $\phi$ is only a few percent of the frozen parameter set $\Theta$ (Xu et al., 13 Jan 2026).
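The three-stage adaptor can be sketched in pure Python as follows. Shapes and weights here are toy illustrations, not the paper's configuration:

```python
import math

def rmsnorm(x, eps=1e-8):
    """Root-mean-square normalization (no mean subtraction)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(x, W, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]

def adaptor(features, W1, b1, W2, b2):
    """RMSNorm -> Linear -> LeakyReLU -> Linear, mapping item
    semantic features into the frozen backbone's token space."""
    h = rmsnorm(features)
    h = leaky_relu(linear(h, W1, b1))
    return linear(h, W2, b2)

# Toy shapes: 3-dim item feature -> 2-dim token embedding.
W1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]
token = adaptor([1.0, 2.0, 3.0], W1, b1, W2, b2)
```

Only these small weight matrices ($\phi$) are trained per dataset; the backbone parameters ($\Theta$) stay frozen.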

3.2 Fine-tuning and Prediction

Fine-tuning maximizes the likelihood of the next item:

$$\mathcal{L}_{\mathrm{NIP}}(\Theta,\phi) = -\sum_{(u, v_1, \dots, v_t)\in\mathcal{D}_{\mathrm{rec}}} \log \Pr\left(v_{t+1}\mid v_t,\dots,v_1,u;\,\Theta,\phi\right),$$

where $\Pr(\cdot)$ is implemented as a cosine similarity with a learnable temperature. With $\Theta$ frozen, only $\phi$ is trained, encouraging universal adaptation to varying item spaces while retaining generalizable sequence summarization and last-item emphasis (Xu et al., 13 Jan 2026).
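A minimal sketch of the temperature-scaled cosine scoring (function names and the temperature value are illustrative assumptions; in MPT the temperature is learned):

```python
import math

def cosine_score(u, v, temperature):
    """Temperature-scaled cosine similarity used as the logit for
    the next-item distribution."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv * temperature)

def next_item_probs(query, item_embs, temperature=0.07):
    """Softmax over cosine logits against every candidate item."""
    logits = [cosine_score(query, e, temperature) for e in item_embs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Query aligned with item 0, orthogonal to item 1, opposed to item 2.
probs = next_item_probs([1.0, 0.0],
                        [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

A small temperature sharpens the distribution, so the best-aligned item dominates; training the temperature lets the model calibrate this sharpness per dataset.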

4. Synthetic Data Generation

MPT’s pre-training regime leverages the following synthetic Markov chain configuration:

  • State space: $|\mathcal S|\approx 30$
  • Dirichlet $\alpha$: $0.05$, to induce sparse transitions
  • Trajectory length: $T=1024$
  • Input transformation: New random orthogonal projections per trajectory to prevent memorization
  • Scale: On-the-fly trajectory generation totaling approximately 10 billion tokens.

This pipeline enables unlimited, diverse trajectories, controlling transition sparsity and state-space cardinality to match real-world recommendation environments (Xu et al., 13 Jan 2026).

5. Empirical Performance and Analysis

5.1 Datasets and Evaluation Metrics

MPT is evaluated on five public datasets:

Dataset        Users    Items    Interactions  Avg. Length
Beauty         22,363   12,101   198,502       8.9
Toys           19,412   11,924   167,597       8.6
Sports         35,598   18,357   296,337       8.3
Yelp           77,277   45,638   2,103,896     27.2
Online Retail  16,520   3,469    519,906       31.5

Metrics: HR@$K$ and NDCG@$K$ ($K = 10, 20$) (Xu et al., 13 Jan 2026).

5.2 Baselines and Results

Benchmarked methods:

  • From-scratch: GRU4Rec, SASRec, FMLPRec, HSTU, SASRec+
  • Pre-trained recommenders: UniSRec, RecFormer
  • LLMs: E4SRec (LLaMA-2-based), Qwen2.5-7B

MPT+Adaptor achieves, on average,

  • $\approx 12\%$ relative gain over the strongest from-scratch model (SASRec+)
  • $\approx 35\%$ gain over the best pre-trained baseline

For instance, Beauty (NDCG@10):

Model        NDCG@10
SASRec+      0.0591
UniSRec      0.0382
MPT+Adaptor  0.0673

(+13.8% vs. SASRec+, +43.3% vs. UniSRec). Comparable gains are reported across Toys, Sports, Yelp, and Online Retail (Xu et al., 13 Jan 2026).

6. Ablations and Behavioral Analysis

Ablation experiments support the centrality of Markovian pre-training:

  • Pre-training effect: Removing pre-training reduces the model to SASRec+, while including it yields consistent $10$–$15\%$ improvements.
  • Fine-tuning: Adaptor-only tuning suffices on most datasets; LoRA-based strategies may overfit, except on Yelp, where LoRA offers a marginal benefit.
  • Synthetic sequence parameters: $T\approx 1024$ (trajectory length) and $|\mathcal S|\approx 30$ (state-space size) are empirically optimal, and the sparsity level ($\alpha=0.05$) matches real-world data.
  • Attention analysis: Lower layers of MPT maintain sharp diagonal attention—prioritizing the most recent interaction, in contrast to pre-trained LLMs, which produce diffuse, less effective attention for recommendation (Xu et al., 13 Jan 2026).

7. Contributions, Limitations, and Prospects

MPT’s primary contributions:

  • Empirical demonstration that advanced sequential recommenders display first-order, Markovian behavior.
  • Introduction of MPT: a Transformer pre-trained solely on synthetic Markov chains, yielding universal sequence summarization and robust short-term preference modeling.
  • State-of-the-art performance with minimal adaptation, exceeding both specialist recommenders and large LLM-based systems at lower cost.

Noted limitations include:

  • Restriction to first-order dynamics and absence of higher-order or side-channel information in pre-training.
  • Occasional requirement for more flexible parameter-efficient adaptation (e.g., LoRA) in complex domains.
  • Underfitting in domains with highly specialized or extreme user behaviors.

Future research directions include:

  • Augmenting synthetic data to reflect higher-order dependencies, rich side-info, or variable Markov orders.
  • Synthetic user simulation via LLM-based agents for more realistic data.
  • Extending pre-training to multi-task collaborative settings (ratings, multi-behavior, reviews).
  • Theoretical investigation of scaling laws and Bayes-optimality beyond first-order Markov processes (Xu et al., 13 Jan 2026).