Markovian Pre-trained Transformer (MPT)
- MPT is a universal sequential recommendation model that pre-trains on synthetic Markov chains to emphasize the most recent user interactions.
- The architecture combines a Transformer backbone with a lightweight adaptor, enabling effective adaptation to diverse recommendation datasets.
- Empirical results show significant accuracy gains over traditional models, validating the method's scalability and efficiency.
The Markovian Pre-trained Transformer (MPT) is a universal, transferable recommendation model that achieves state-of-the-art next-item recommendation performance through a novel pre-training regime on synthetic Markov chains and the use of a lightweight adaptor. MPT exploits the empirical finding that advanced sequential recommenders primarily utilize the most recent interaction—reflecting a Markovian, first-order dependency—while earlier interactions serve as non-sequential signals for inferring user identity. By pre-training a Transformer backbone on synthetic Markov chain data and adapting with minimal parameters, MPT combines scalability, universality, and competitive accuracy for sequential recommendation tasks (Xu et al., 13 Jan 2026).
1. Markovian Principle in Sequential Recommendation
A crucial empirical insight guiding MPT is the “Markovian” behavior observed across advanced sequential recommendation models. Given a user interaction sequence $s_1, s_2, \dots, s_t$, typical models estimate $p(s_{t+1} \mid s_1, \dots, s_t)$. However, experiments demonstrate that shuffling all but the last interaction hardly degrades next-item prediction accuracy; shuffling the entire sequence, by contrast, reduces performance markedly, indicating that prediction relies disproportionately on the final item.
This empirical property is formalized as:

$$p(s_{t+1} \mid s_1, \dots, s_t) \approx p(s_{t+1} \mid s_t, u), \qquad u = f(s_1, \dots, s_{t-1}),$$

i.e., next-item prediction is nearly first-order Markovian, where $s_t$ captures short-term intent and $u$ aggregates the earlier interactions into a user profile. Thus, sequence modeling in recommendation decomposes into two capabilities:
- Summarizing general user preferences from historical interactions (profile).
- Emphasizing the last interaction for immediate next-item prediction (Xu et al., 13 Jan 2026).
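The Markovian claim can be illustrated on toy data (a hypothetical sketch, not the paper's experiment): on genuinely first-order Markov data, the empirical statistic $p(s_{t+1} \mid s_t)$ recovers the full predictive distribution, so a predictor conditioning only on the last state loses nothing, and permuting the prefix cannot hurt it.

```python
import numpy as np

# Illustrative sketch (not from the paper): on first-order Markov data,
# a predictor that conditions only on the last state matches one that
# sees the full history.
rng = np.random.default_rng(0)
N = 4                                        # toy state space
P = rng.dirichlet(np.ones(N), size=N)        # row-stochastic transition matrix

def rollout(P, T, rng):
    s = [rng.integers(len(P))]
    for _ in range(T - 1):
        s.append(rng.choice(len(P), p=P[s[-1]]))
    return np.array(s)

seq = rollout(P, 50_000, rng)

# Estimate P(next | last) from bigram counts -- the only statistic a
# "Markovian" recommender needs.
counts = np.zeros((N, N))
np.add.at(counts, (seq[:-1], seq[1:]), 1)
P_hat = counts / counts.sum(axis=1, keepdims=True)

print(np.abs(P_hat - P).max())  # small: the last state suffices
```

Because each next state depends only on its predecessor, the bigram estimate converges to the true transition matrix regardless of how the rest of the history is ordered.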
2. Model Architecture and Pre-training
2.1 Transformer Backbone
The MPT employs a standard Transformer architecture:
- Input Encoding: Each state $s_t$ is projected into a continuous embedding space.
- Positional Encoding: Standard absolute or rotary embeddings inject sequence position.
- Multi-Head Self-Attention (MHSA): $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\big(QK^\top / \sqrt{d_k}\big)V$, used in stacked layers with multiple attention heads.
- Feed-Forward Network (FFN): Per-layer FFN comprised of two linear projections and a nonlinearity (e.g., GeLU).
- Residual/RMSNorm: Each block is equipped with RMSNorm and residual connections.
- Typical hyperparameters: number of layers $L$, attention heads $H$, and hidden width $d$, with specific values as reported in the paper (Xu et al., 13 Jan 2026).
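The backbone components above follow standard Transformer conventions. A minimal NumPy sketch of causal multi-head self-attention (generic, not the paper's exact implementation; weight shapes are illustrative):

```python
import numpy as np

# Generic causal MHSA sketch: T tokens, model width d, H heads.
def mhsa(x, Wq, Wk, Wv, Wo, H):
    T, d = x.shape
    dh = d // H
    def split(W):
        # project, then split the width into H heads of size dh
        return (x @ W).reshape(T, H, dh).transpose(1, 0, 2)  # (H, T, dh)
    q, k, v = split(Wq), split(Wk), split(Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)          # (H, T, T)
    # causal mask: position t attends only to positions <= t
    scores = np.where(np.tril(np.ones((T, T), bool)), scores, -np.inf)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    out = (attn @ v).transpose(1, 0, 2).reshape(T, d)        # merge heads
    return out @ Wo

rng = np.random.default_rng(0)
T, d, H = 5, 8, 2
x = rng.standard_normal((T, d))
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
y = mhsa(x, *Ws, H)
print(y.shape)  # (5, 8)
```

The causal mask matters here: next-state prediction during pre-training requires that position $t$ never sees $s_{t+1}$.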
2.2 Markov Chain Pre-training
MPT is pre-trained exclusively on synthetic Markov chains:
- Trajectory Generation: Trajectories are generated on the fly via random transition matrices $P$; each row is sampled as $P_{i,\cdot} \sim \mathrm{Dirichlet}(\alpha \mathbf{1})$ with small $\alpha$ (0.05) to induce sparsity.
- Trajectory length: fixed at a constant length $T$ per trajectory.
- Input diversity: Each trajectory employs a fresh random orthogonal projection from one-hot states to vector inputs.
- Objective: Next-State Prediction (NSP), $\mathcal{L}_{\mathrm{NSP}} = -\sum_{t} \log p_\theta(s_{t+1} \mid s_1, \dots, s_t)$, which effectively forces the model to estimate state transition probabilities and to attend sharply to the current state.
Pre-training on synthetic Markov data offers architectural universality and exposes the model to unlimited, controlled trajectories (Xu et al., 13 Jan 2026).
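The generation procedure above can be sketched as follows (a minimal illustration; the state count and length here are toy values, not the paper's settings):

```python
import numpy as np

# Sketch of the synthetic pre-training data pipeline: sample a sparse
# transition matrix (Dirichlet rows, alpha = 0.05 as in the paper), roll
# out one trajectory, and pair inputs with next-state prediction targets.
rng = np.random.default_rng(0)
N, T, alpha = 32, 64, 0.05                    # N, T are illustrative

P = rng.dirichlet(np.full(N, alpha), size=N)  # sparse row-stochastic matrix

states = np.empty(T, dtype=int)
states[0] = rng.integers(N)
for t in range(1, T):
    states[t] = rng.choice(N, p=P[states[t - 1]])

# NSP: at each step, the target is the following state
inputs, targets = states[:-1], states[1:]
print(inputs.shape, targets.shape)  # (63,) (63,)
```

A small $\alpha$ concentrates each Dirichlet row on a few successors, so the chains exhibit the sparse, peaked transitions typical of real item-to-item behavior.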
3. Fine-tuning via Lightweight Adaptor
3.1 Adaptor Module
For alignment with recommendation tasks and datasets, MPT introduces a lightweight input adaptor while keeping the pre-trained backbone frozen. The adaptor, inserted at each time step, consists of:
- RMSNorm
- Linear → LeakyReLU
- Linear
This adaptor maps item semantic features (e.g., text embeddings) into the token space of MPT. The additional parameter count is only a few percent of the frozen parameter set (Xu et al., 13 Jan 2026).
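A minimal sketch of this adaptor's forward pass (dimensions are hypothetical; only the layer sequence follows the description above):

```python
import numpy as np

# Adaptor sketch: RMSNorm, then a two-layer MLP with LeakyReLU, mapping
# item semantic features (dim d_in) into the frozen backbone's token
# space (dim d_model).
def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def adaptor(feat, W1, b1, W2, b2):
    h = rmsnorm(feat)
    h = leaky_relu(h @ W1 + b1)
    return h @ W2 + b2

rng = np.random.default_rng(0)
d_in, d_hid, d_model = 384, 256, 512  # hypothetical sizes
W1, b1 = rng.standard_normal((d_in, d_hid)) * 0.02, np.zeros(d_hid)
W2, b2 = rng.standard_normal((d_hid, d_model)) * 0.02, np.zeros(d_model)

# e.g., 10 items' text embeddings -> 10 backbone tokens
tokens = adaptor(rng.standard_normal((10, d_in)), W1, b1, W2, b2)
print(tokens.shape)  # (10, 512)
```

Only these few matrices are trained per dataset, which is what keeps the added parameter count to a few percent of the frozen backbone.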
3.2 Fine-tuning and Prediction
Fine-tuning maximizes the likelihood of the next item,

$$\mathcal{L}_{\mathrm{ft}} = -\sum_{t} \log p(s_{t+1} \mid s_1, \dots, s_t),$$

where the item probability is implemented as a softmax over cosine similarities with a learnable temperature $\tau$. With the backbone parameters frozen, only the adaptor parameters are trained, encouraging universal adaptation to varying item spaces while retaining generalizable sequence summarization and last-item emphasis (Xu et al., 13 Jan 2026).
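The cosine-similarity scoring head can be sketched as follows (the temperature value and dimensions are illustrative, not the paper's):

```python
import numpy as np

# Prediction-head sketch: score each candidate item by cosine similarity
# between the sequence representation h_t and the item embedding, scale
# by a temperature tau, then softmax over the catalog.
def cosine_scores(h, item_emb, tau):
    h_n = h / np.linalg.norm(h)
    e_n = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    logits = (e_n @ h_n) / tau
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()

rng = np.random.default_rng(0)
h = rng.standard_normal(64)              # backbone output at step t
items = rng.standard_normal((100, 64))   # candidate item embeddings
probs = cosine_scores(h, items, tau=0.1) # tau = 0.1 is a hypothetical value
print(probs.sum())                       # sums to 1 over the catalog
```

Normalizing both sides makes the score invariant to embedding magnitude, which helps when item embeddings come from heterogeneous text encoders across datasets.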
4. Synthetic Data Generation
MPT’s pre-training regime leverages the following synthetic Markov chain configuration:
- State space: a fixed number $N$ of discrete states
- Dirichlet concentration: $\alpha = 0.05$, to induce sparse transitions
- Trajectory length: a fixed $T$ per trajectory
- Input transformation: New random orthogonal projections per trajectory to prevent memorization
- Scale: On-the-fly trajectory generation for approximately 10 billion tokens.
This pipeline enables unlimited, diverse trajectories, controlling transition sparsity and state-space cardinality to match real-world recommendation environments (Xu et al., 13 Jan 2026).
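One standard way to draw the fresh orthogonal projection per trajectory (an assumption: the paper does not specify the sampling method) is a QR decomposition of a Gaussian matrix:

```python
import numpy as np

# Per-trajectory input transform sketch: map one-hot states through a
# fresh random orthogonal matrix so the model cannot memorize state
# identities across trajectories. Dimensions are illustrative.
rng = np.random.default_rng(0)
N, d = 32, 32                 # state count, input dim (N <= d here)

G = rng.standard_normal((d, d))
Q, _ = np.linalg.qr(G)        # Q is orthogonal: Q.T @ Q = I

one_hot = np.eye(N, d)        # one-hot state codes
embedded = one_hot @ Q        # this trajectory's state embeddings
print(np.allclose(embedded @ embedded.T, np.eye(N)))  # True
```

Orthogonality preserves pairwise distances between state codes, so each trajectory presents the same geometric structure under a different random basis, forcing the model to infer transition statistics in-context rather than memorize embeddings.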
5. Empirical Performance and Analysis
5.1 Datasets and Evaluation Metrics
MPT is evaluated on five public datasets:
| Dataset | Users | Items | Interactions | Avg. Length |
|---|---|---|---|---|
| Beauty | 22,363 | 12,101 | 198,502 | 8.9 |
| Toys | 19,412 | 11,924 | 167,597 | 8.6 |
| Sports | 35,598 | 18,357 | 296,337 | 8.3 |
| Yelp | 77,277 | 45,638 | 2,103,896 | 27.2 |
| Online Retail | 16,520 | 3,469 | 519,906 | 31.5 |
Metrics: HR@$K$ and NDCG@$K$ with $K = 10, 20$ (Xu et al., 13 Jan 2026).
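These metrics follow their standard definitions, which can be computed per test interaction as follows (a generic sketch, not the paper's evaluation code):

```python
import numpy as np

# HR@K is 1 if the held-out item appears in the top-K ranked list;
# NDCG@K additionally discounts the hit by its rank position.
def hr_ndcg_at_k(scores, target, k):
    rank = int(np.sum(scores > scores[target]))   # 0-based rank of target
    hit = rank < k
    ndcg = 1.0 / np.log2(rank + 2) if hit else 0.0
    return float(hit), ndcg

scores = np.array([0.1, 0.9, 0.3, 0.7])  # model scores over a 4-item catalog
print(hr_ndcg_at_k(scores, target=1, k=2))  # (1.0, 1.0): ranked first
print(hr_ndcg_at_k(scores, target=2, k=2))  # (0.0, 0.0): ranked third, outside top-2
```

Reported numbers average these per-user values over the test set.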
5.2 Baselines and Results
Benchmarked methods:
- From-scratch: GRU4Rec, SASRec, FMLPRec, HSTU, SASRec+
- Pre-trained recommenders: UniSRec, RecFormer
- LLMs: E4SRec (LLaMA-2-based), Qwen2.5-7B
MPT+Adaptor achieves, on average,
- a substantial relative gain over the strongest from-scratch model (SASRec+), and
- a larger relative gain over the best pre-trained baseline.
For instance, Beauty (NDCG@10):
| Model | NDCG@10 |
|---|---|
| SASRec+ | 0.0591 |
| UniSRec | 0.0382 |
| MPT+Adaptor | 0.0673 |
(+13.8% vs. SASRec+, +43.3% vs. UniSRec). Comparable gains are reported across Toys, Sports, Yelp, and Online Retail (Xu et al., 13 Jan 2026).
6. Ablations and Behavioral Analysis
Ablation experiments support the centrality of Markovian pre-training:
- Pre-training effect: Removing pre-training reduces the model to SASRec+, while including it yields consistent relative improvements of at least $10\%$.
- Fine-tuning: Adaptor-only tuning suffices on most datasets; LoRA-based strategies tend to overfit, except on Yelp, where LoRA offers a marginal benefit.
- Synthetic sequence parameters: The chosen trajectory length $T$ and state-space size $N$ are empirically optimal, and the Dirichlet sparsity ($\alpha = 0.05$) matches real-world transition sparsity.
- Attention analysis: Lower layers of MPT maintain sharp diagonal attention—prioritizing the most recent interaction, in contrast to pre-trained LLMs, which produce diffuse, less effective attention for recommendation (Xu et al., 13 Jan 2026).
7. Contributions, Limitations, and Prospects
MPT’s primary contributions:
- Empirical demonstration that advanced sequential recommenders display first-order, Markovian behavior.
- Introduction of MPT: a Transformer pre-trained solely on synthetic Markov chains, yielding universal sequence summarization and robust short-term preference modeling.
- State-of-the-art performance with minimal adaptation, exceeding both specialist recommenders and large LLM-based systems at lower cost.
Noted limitations include:
- Restriction to first-order dynamics and absence of higher-order or side-channel information in pre-training.
- Occasional requirement for more flexible parameter-efficient adaptation (e.g., LoRA) in complex domains.
- Underfitting in domains with highly specialized or extreme user behaviors.
Future research directions include:
- Augmenting synthetic data to reflect higher-order dependencies, rich side-info, or variable Markov orders.
- Synthetic user simulation via LLM-based agents for more realistic data.
- Extending pre-training to multi-task collaborative settings (ratings, multi-behavior, reviews).
- Theoretical investigation of scaling laws and Bayes-optimality beyond first-order Markov processes (Xu et al., 13 Jan 2026).