
FAIR: Focused Attention Is All You Need for Generative Recommendation

Published 12 Dec 2025 in cs.IR | (2512.11254v2)

Abstract: Recently, transformer-based generative recommendation has garnered significant attention for user behavior modeling. However, it often requires discretizing items into multi-code representations (e.g., typically four code tokens or more), which sharply increases the length of the original item sequence. This expansion poses challenges to transformer-based models for modeling user behavior sequences with inherent noises, since they tend to overallocate attention to irrelevant or noisy context. To mitigate this issue, we propose FAIR, the first generative recommendation framework with focused attention, which enhances attention scores to relevant context while suppressing those to irrelevant ones. Specifically, we propose (1) a focused attention mechanism integrated into the standard Transformer, which learns two separate sets of Q and K attention weights and computes their difference as the final attention scores to eliminate attention noise while focusing on relevant contexts; (2) a noise-robustness objective, which encourages the model to maintain stable attention patterns under stochastic perturbations, preventing undesirable shifts toward irrelevant context due to noise; and (3) a mutual information maximization objective, which guides the model to identify contexts that are most informative for next-item prediction. We validate the effectiveness of FAIR on four public benchmarks, demonstrating its superior performance compared to existing methods.

Summary

  • The paper introduces a focused attention mechanism that computes differential attention maps to filter out noisy context and enhance relevant dependencies.
  • The paper employs a noise-robust self-supervision task alongside a contrastive InfoNCE objective, ensuring representation invariance and more predictive feature extraction.
  • The paper demonstrates significant improvements in Recall and NDCG metrics across multiple datasets, validating its effectiveness over standard models.

FAIR: Focused Attention for Generative Sequential Recommendation

Introduction

Generative recommendation frameworks have advanced the modeling of user interaction sequences by leveraging Transformer-based architectures to generate semantic item codes. A persistent challenge in this regime is the expansion of sequence length due to multi-code item discretization, which amplifies the susceptibility of standard Transformers to attention diffusion and allocation to irrelevant or noisy context. This noise degrades modeling efficiency and impairs next-item prediction accuracy.

The paper introduces FAIR (Focused Attention Is All You Need for Generative Recommendation) (2512.11254), a non-autoregressive generative recommendation framework explicitly designed to address attention noise. FAIR introduces a focused attention mechanism, a noise-robustness self-supervision task, and a mutual information maximization objective, driving the model to prioritize informative contextual dependencies and suppress noise in behavioral sequences.

Figure 1: Standard Transformers over-allocate attention to irrelevant context, while FAIR sharpens focus on relevant items and suppresses noise.

Methodology

Focused Attention Mechanism

FAIR departs from classical self-attention by introducing a dual-branch design wherein two separate sets of query/key matrices yield independent attention maps. The framework computes the difference between these two attention distributions, normalizes the result, and applies this differential attention to the value representations. This approach is formally defined as:

A = \mathrm{Norm}(\lambda_1 A_1 - \lambda_2 A_2)

where A_1 and A_2 are the softmax-normalized attention maps of the two branches, and \lambda_1, \lambda_2 modulate their relative importance. The subtraction operation enhances the model’s capability to filter out redundant or spurious correlations while amplifying informative dependencies.
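As a minimal NumPy sketch of this computation: the exact normalization Norm, the \lambda values, and the weight shapes below are illustrative assumptions (clipping plus row renormalization stands in for the paper's unspecified choice of Norm).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def focused_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam1=1.0, lam2=0.5):
    """Differential attention: subtract the second attention map from the
    first, then renormalize. Clipping + row normalization is one plausible
    realization of Norm; lam1/lam2 values are placeholders."""
    d = Wq1.shape[1]
    A1 = softmax((X @ Wq1) @ (X @ Wk1).T / np.sqrt(d))
    A2 = softmax((X @ Wq2) @ (X @ Wk2).T / np.sqrt(d))
    A = lam1 * A1 - lam2 * A2
    A = np.clip(A, 0.0, None)                       # drop negative scores
    A = A / (A.sum(axis=-1, keepdims=True) + 1e-9)  # rows sum to one again
    return A @ (X @ Wv), A

# Toy usage: 6 positions, 16-dim embeddings, random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
Wq1, Wk1, Wq2, Wk2, Wv = (rng.normal(size=(16, 16)) * 0.1 for _ in range(5))
out, A = focused_attention(X, Wq1, Wk1, Wq2, Wk2, Wv)
```

Because the two softmax maps each sum to one per row, the differential map retains positive mass wherever the first branch outweighs the second, so the renormalization is well defined.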

A multi-head extension applies focused attention independently in each head and linearly recombines the head outputs.

Figure 2: Overview of the FAIR architecture with Focused Attention Mechanism, Noise-Robustness Task, and Mutual Information Maximization.

Noise-Robustness Objective

To directly enhance robustness to sequence perturbations that induce attention drift, the training regime includes a self-supervised task. The model generates noisy variants of the input by stochastic masking or random substitution (with probabilities p_{\text{mask}} and p_{\text{sub}}). A triplet loss brings the hidden representations of the clean and noisy sequences closer while pushing them away from negative batch samples.

This objective enforces representation invariance to input corruption, counteracting Transformer tendencies to shift attention under input noise.
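A minimal sketch of the corruption step and triplet objective; the masking/substitution probabilities, the margin, and the use of Euclidean distance are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def corrupt(seq, mask_id, n_items, p_mask=0.2, p_sub=0.1, rng=None):
    """Build a noisy view of a code sequence by stochastic masking or
    random substitution (p_mask and p_sub are placeholder values)."""
    rng = rng or np.random.default_rng()
    noisy = seq.copy()
    for i in range(len(noisy)):
        r = rng.random()
        if r < p_mask:
            noisy[i] = mask_id            # mask this position
        elif r < p_mask + p_sub:
            noisy[i] = rng.integers(0, n_items)  # substitute a random item
    return noisy

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the clean (anchor) and noisy (positive) representations together
    while pushing a negative batch sample at least `margin` farther away."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

seq = np.array([3, 7, 1, 9, 4])
noisy = corrupt(seq, mask_id=0, n_items=10, rng=np.random.default_rng(1))
loss = triplet_loss(np.ones(8), np.ones(8) * 1.1, np.zeros(8))
```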

Mutual Information Maximization

FAIR further introduces a contrastive InfoNCE objective to maximize the mutual information between pooled contextual embeddings and their associated target representations. Positive sample similarity is reinforced, and negatives are discouraged, effectively constraining the model to extract features maximally predictive of the next item. This guides attention towards statistically informative context while diminishing irrelevant attention allocation.
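The contrastive objective can be sketched as a batch InfoNCE loss, where each context's matching target is the positive and the other targets in the batch serve as negatives; the temperature value and cosine similarity are illustrative assumptions.

```python
import numpy as np

def info_nce(context, target, temperature=0.1):
    """Batch InfoNCE: maximizes a lower bound on the mutual information
    between pooled context embeddings and their target representations.
    Row i's target is the positive; all other rows act as negatives."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = (c @ t.T) / temperature                     # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
ctx = rng.normal(size=(8, 16))
aligned = info_nce(ctx, ctx)                        # perfectly aligned pairs
shuffled = info_nce(ctx, rng.normal(size=(8, 16)))  # random targets
```

Aligned context/target pairs yield a much lower loss than random pairings, which is exactly the pressure that steers the encoder toward predictive features.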

Model Training

The final loss function is a weighted aggregation:

\mathcal{L} = \mathcal{L}_{\text{MTP}} + \alpha \mathcal{L}_{\text{NR}} + \beta \mathcal{L}_{\text{MIM}}

where \mathcal{L}_{\text{MTP}} is the parallel multi-token prediction loss, and \alpha, \beta are balancing coefficients for the noise-robustness and mutual information terms.
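As a worked instance of the aggregation (the \alpha and \beta values here are placeholders, not the paper's tuned settings):

```python
def total_loss(l_mtp, l_nr, l_mim, alpha=0.1, beta=0.1):
    """Weighted sum of the multi-token prediction, noise-robustness,
    and mutual-information objectives; alpha/beta are placeholders."""
    return l_mtp + alpha * l_nr + beta * l_mim

# e.g. total_loss(1.0, 2.0, 3.0) = 1.0 + 0.1*2.0 + 0.1*3.0 = 1.5
combined = total_loss(1.0, 2.0, 3.0)
```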

Experimental Results

Benchmarks and Baselines

Experiments span four Amazon datasets with varying scale and sparsity. FAIR is evaluated against classical sequential models (e.g., SASRec, GRU4Rec, BERT4Rec), semantic-enhanced Transformers, and state-of-the-art generative methods (e.g., VQ-Rec, TIGER, RPG, HSTU).

Quantitative Results

FAIR achieves consistent empirical gains over all baselines on Recall@5/10 and NDCG@5/10 across datasets. For instance, on the "Toys" dataset, FAIR yields a 13.2% improvement in Recall@5 and an 11% improvement in NDCG@5 compared to the strongest competitor. All improvements are statistically significant (p < 0.05).
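For reference, these metrics reduce to simple formulas when each sequence has a single ground-truth next item; the ranked list below is hypothetical.

```python
import numpy as np

def recall_at_k(ranked, target, k):
    """1 if the ground-truth next item appears in the top-k list, else 0."""
    return float(target in ranked[:k])

def ndcg_at_k(ranked, target, k):
    """With a single relevant item, NDCG@k reduces to 1/log2(rank + 2)
    for a 0-based hit rank inside the top-k, and 0 otherwise."""
    if target in ranked[:k]:
        return 1.0 / np.log2(ranked.index(target) + 2)
    return 0.0

ranked = [42, 7, 13, 5, 99]   # hypothetical model ranking of item ids
```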

Ablation and Component Analysis

Ablations demonstrate that each architectural and objective component—focused attention mechanism (FAM), noise-robustness task (NRT), and mutual information maximization (MIM)—contributes to the overall performance. Removal of any single element leads to significant degradation, confirming that mitigating attention noise requires joint architectural and objective-level intervention.

Hyperparameter Sensitivity

Sensitivity studies show monotonic improvement with increasing code sequence length L and embedding dimension d up to dataset-dependent thresholds, after which performance plateaus or degrades, indicating the effects of overcapacity and optimization complexity.

Figure 3: Performance increases with code sequence length L up to a threshold, after which overfitting or redundancy emerges.

Figure 4: Model performance with increasing embedding dimension d illustrates a saturation point balancing expressiveness and generalization.

Sweeps over the loss coefficients \alpha and \beta and the regularization parameters show that FAIR maintains robust performance across a wide range of values. Unlike standard Transformers, the explicit robustness modeling reduces reliance on generic regularizers such as dropout.

Qualitative Analysis

Case studies visualize attention allocation across ablation settings, confirming that the full FAIR model effectively suppresses irrelevant historical items and promotes sharp attention to decision-relevant context. The focused attention mechanism initially shifts attention towards relevant items but requires the auxiliary objectives (NRT and MIM) to enforce meaningful, robust, and predictive allocation.

Implications and Future Directions

The research demonstrates that Transformer-based generative recommendation can be considerably enhanced by explicitly addressing context noise at the architectural and objective level. The differential attention mechanism is theoretically motivated and empirically validated to directly counteract attention diffusion in long discrete code sequences typical of advanced generative recommenders.

From a practical standpoint, focused attention enables more effective scaling of item sequence discretization and supports deployment for industrial-scale catalogs. The self-supervised robustness and mutual information objectives reduce the need for post-hoc or architectural regularization strategies.

The approach also suggests broader implications: the differential or subtractive attention paradigm may generalize to other domains suffering from attention instability under long-range sequence expansion—potentially informing LLMs, multimodal representation learning, and robust sequential inference more generally.

Conclusion

FAIR establishes a robust and semantically principled framework for generative sequential recommendation via focused attention, robustness to input perturbations, and representation informativeness optimization. The method attains strong empirical results with moderate computational overhead, offering a scalable strategy for addressing noisy sequence modeling in generative recommenders and motivating advances in focused inductive bias for self-attention architectures (2512.11254).
