
Efficient Sequential Recommendation for Long Term User Interest Via Personalization

Published 7 Jan 2026 in cs.IR and cs.AI | (2601.03479v1)

Abstract: Recent years have witnessed the success of sequential modeling, generative recommenders, and LLMs for recommendation. Although the scaling law has been validated for sequential models, they are computationally inefficient in real-world applications such as recommendation, because the transformer's cost grows non-linearly (quadratically) with sequence length. To improve the efficiency of sequential models, we introduce a novel approach to sequential recommendation that leverages personalization techniques to enhance efficiency and performance. Our method compresses long user interaction histories into learnable tokens, which are then combined with recent interactions to generate recommendations. This approach significantly reduces computational costs while maintaining high recommendation accuracy. It can be applied to existing transformer-based recommendation models, e.g., HSTU and HLLM. Extensive experiments on multiple sequential models demonstrate its versatility and effectiveness. Source code is available at https://github.com/facebookresearch/PerSRec.

Summary

  • The paper proposes a method that compresses long user histories via personalized learnable tokens, reducing inference cost while maintaining accuracy.
  • It employs segment-wise token generation and a modified causal attention mask to integrate historical data with recent interactions efficiently.
  • Empirical results show comparable recall performance to full-sequence models with inference speed improvements exceeding 75% on large datasets.

Efficient Long-Term Sequential Recommendation via Personalized Experts

Introduction

The challenge of efficiently incorporating long user interaction histories into recommender systems is acute for large-scale applications. Traditional transformer-based sequential recommenders such as HSTU and HLLM have empirically demonstrated that extending sequence length improves recommendation performance. However, the quadratic computational complexity with respect to sequence length presents significant scalability barriers. The paper "Efficient Sequential Recommendation for Long Term User Interest Via Personalization" (2601.03479) proposes a principled approach to decouple sequence length from inference-time cost by compressing historical user interactions into personalized learnable tokens (termed 'personalized experts'). These compressed representations are then combined with short recent contexts for efficient high-quality sequential recommendation.

Scaling Sequential Recommendation Models

Empirical evidence establishes that recall metrics for both HSTU and HLLM scale positively with sequence length on the MerRec dataset.

Figure 1: Performance (Recall@5) of HSTU and HLLM models increases as the user interaction sequence length grows, highlighting the importance of long-term history modeling.

Despite this, traditional approaches such as sequence truncation or clustering-based two-stage pipelines (SIM, TWIN, KuaiFormer) suffer from either computation-accuracy tradeoffs or lack of modeling alignment. The proposed method addresses these by maintaining fully end-to-end training and integrating compression operations directly into the model's training and inference procedure.

Methodology

The core proposal is to divide the user's interaction history into manageable segments and compress each segment into one or more learnable tokens using a dedicated segment decoder, which may share parameters with the main model.

Figure 2: Architecture overview; long sequences are divided into segments, each compressed into segment embeddings (personalized experts) that, concatenated with recent segment item embeddings, enable efficient sequential recommendation.

By appending learnable tokens at segment boundaries and manipulating attention masks, the approach restricts information flow, ensuring tokens attend only to items within the same segment (and to previous segments' experts), thus enforcing localized compression.

Figure 3: Attention structure for training—each position may attend only to its segment's past and the accumulated learnable tokens, avoiding information leak and ensuring segment-local compression.

This is implemented via a modified causal attention mask that disconnects segments except via their compressed tokens.

Figure 4: Left: Standard causal attention used for next-item prediction. Right: Segment-wise mask restricting cross-segment attention to pass only through compressed tokens.
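
To make the mask concrete, here is a small sketch (not the paper's code) of the segment-wise boolean mask described above, under the assumption that each segment's items are immediately followed by its expert tokens:

```python
import numpy as np

def segment_expert_mask(num_segments, seg_len, experts_per_seg):
    """Boolean attention mask (True = may attend) for the segment-wise
    scheme: items attend causally within their own segment and to expert
    tokens of earlier segments; expert tokens compress only their own
    segment's items (plus earlier experts). The layout assumption here
    (items followed by their segment's experts) is illustrative."""
    block = seg_len + experts_per_seg
    n = num_segments * block
    seg_id = np.arange(n) // block
    offset = np.arange(n) % block
    is_expert = offset >= seg_len
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):                  # causal: past or self only
            same_seg = seg_id[q] == seg_id[k]
            if is_expert[k]:
                # expert keys: visible from later segments, or causally
                # within the same segment for expert queries
                mask[q, k] = seg_id[k] < seg_id[q] or (same_seg and is_expert[q])
            else:
                # item keys: visible only within the same segment
                mask[q, k] = same_seg
    return mask
```

With two segments of three items and one expert each, a segment-2 item can see the segment-1 expert but not segment-1 items, matching the cross-segment restriction in the figure.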

Training is performed with a standard next-item prediction loss, omitting the learnable token positions from the loss computation.

Inference uses a progressive scheme: past segments are compressed once (caching the activations of their learnable tokens, effectively a segment-wise KV cache) and these are used, together with recent uncompressed events, for downstream predictions.

Figure 5: Inference procedure—learnable token activations are cached per segment and used as context for new item prediction, minimizing recomputation.
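
The caching logic can be sketched as follows; `compress_fn` and `predict_fn` are stand-ins for the real model components, not from the paper's code:

```python
class ExpertCache:
    """Minimal sketch of the inference scheme: each closed segment is
    compressed exactly once and its expert activation cached; prediction
    runs only over [cached experts] + [recent uncompressed items]."""
    def __init__(self, compress_fn, predict_fn, seg_len):
        self.compress_fn = compress_fn  # segment items -> expert state
        self.predict_fn = predict_fn    # (experts, recent) -> prediction
        self.seg_len = seg_len
        self.experts = []               # one cached state per closed segment
        self.recent = []                # items of the still-open segment

    def add_item(self, item):
        self.recent.append(item)
        if len(self.recent) == self.seg_len:    # segment closed: compress once
            self.experts.append(self.compress_fn(self.recent))
            self.recent = []

    def predict(self):
        # context length is O(num_experts + recent), not O(full history)
        return self.predict_fn(self.experts, self.recent)
```

After seven items with `seg_len=3`, only two cached experts and one recent item form the prediction context, so per-request cost no longer grows with the full history.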

Computational Complexity

Standard transformer inference cost is $\mathcal{O}(n^2 d)$ per layer for sequence length $n$ and hidden size $d$. The proposed method reduces the quadratic term by amortizing compression. For $n$ items split into $m$ segments with $k$ total experts, the total computation for inference is approximately:

$$\mathcal{O}\!\left(\frac{(n+k)^2}{m}\, d\right)$$

with $k \ll n$, achieving a significant reduction in wall-clock latency and resource usage, especially when segment compression is reused for multiple predictions.

Figure 6: Decoder layer structure demonstrating how cached keys and values (from learnable tokens) are concatenated to contextualize new input at inference.
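
Plugging numbers into the two complexity expressions gives a feel for the savings; the example values below are illustrative, not from the paper:

```python
def attention_cost_ratio(n, m, k, d=1):
    """Per-layer attention cost of the segment-compressed scheme,
    O(((n + k)^2 / m) * d), relative to full causal attention, O(n^2 * d),
    following the complexity expressions above."""
    full = n ** 2 * d
    compressed = (n + k) ** 2 / m * d
    return compressed / full

# e.g. 8192 history items, 16 segments, 64 experts total:
# the compressed scheme costs under 7% of full attention per layer.
ratio = attention_cost_ratio(8192, 16, 64)
```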

Empirical Results

Experiments on MerRec and EB-NeRD datasets show that compressing large history segments into a small set of personalized experts, combined with a short "recent" uncompressed context, yields retrieval performance close to (or matching) full-sequence baselines, with dramatic inference speed gains.

Figure 7: Experimental protocol for temporal distance analysis—performance as recent segment slides farther from the pretrain (expert-generating) segment.

Figure 8: Retrieval performance (Recall@5) versus the temporal gap between recent segment and expert segment; compressed models maintain performance close to full-history baselines regardless of the offset.

Table-based analysis additionally shows:

  • Training-time overhead is negligible (<5%);
  • Inference-time savings can exceed 75% versus processing full-length sequences;
  • The method is robust to the number of experts (except at pathological settings);
  • Placement strategy for learnable tokens matters, with all experts placed after a single long pretrain segment performing better than more granular splits.

Analysis of Token Compression

Interpretability investigations show that the learned expert tokens capture salient aspects of historical behavior. Non-negative matrix factorization (NMF) of token activations reveals they are typically linearly spanned by a sparse set of relevant history items, matching the target's semantic characteristics.

Figure 9: NMF-based visualization demonstrating that learnable experts encode history information closely related to the target, as measured by top-weighted items.
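
For readers wanting to reproduce this style of analysis, a plain multiplicative-update NMF (Lee and Seung) suffices; this is a generic sketch, not the paper's analysis code:

```python
import numpy as np

def nmf(V, rank, iters=300, seed=0):
    """Multiplicative-update NMF: factor a nonnegative matrix V (n x m)
    as W @ H with W (n x rank) and H (rank x m) both nonnegative. Applied
    to nonnegative token activations, the top-weighted columns of H
    indicate which history items each component is spanned by."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    eps = 1e-9  # guards against division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In practice one would stack expert-token activations (made nonnegative, e.g. via ReLU) as rows of `V`, then inspect the largest entries of each factor to identify the history items driving each expert.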

Practical and Theoretical Implications

This work unifies key LLM-based context compression ideas (e.g., SepLLM (Chen et al., 2024), Gist (Mu et al., 2023), AutoCompressor (Chevalier et al., 2023)) with the specific requirements of large-scale recommendation. The approach retains the benefits of long-term history modeling while making sequential models tractable for real-world deployment—an essential property for recommender systems at scale. Notably, the personalized expert framework provides a general mechanism for fixed-cost historical summarization, applicable to both ID-based and content-based architectures.

This also offers a path for integrating pre-computed representations across multimodal or multi-behavior user traces, given the segment-agnostic nature of expert token generation.

Conclusion

The proposed personalized expert compression paradigm enables scalable sequential recommendation, maintaining accuracy comparable to full-sequence models with significantly reduced inference cost. Analysis demonstrates robustness with respect to expert number and placement, and interpretability evidence supports that compressed tokens retain relevant historical signal. This method, being generic, could lay the groundwork for a new class of scalable, personalized, and efficient recommendation systems and contextual LLM architectures for user modeling.

Future work may explore joint multi-behavior or multimodal compression, expert adaptation for continual learning, and extending the caching paradigm to hierarchical contexts for further scalability advances.
