Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation

Published 14 Feb 2024 in cs.IR and cs.AI | (2402.08921v1)

Abstract: Session-based recommendation has gained increasing attention in recent years, with its aim to offer tailored suggestions based on users' historical behaviors within sessions. To advance this field, a variety of methods have been developed, with ID-based approaches typically demonstrating promising performance. However, these methods often face challenges with long-tail items and overlook other rich forms of information, notably valuable textual semantic information. To integrate text information, various methods have been introduced, mostly following a naive fusion framework. Surprisingly, we observe that fusing these two modalities does not consistently outperform the best single modality by following the naive fusion framework. Further investigation reveals a potential imbalance issue in naive fusion, where the ID dominates and the text modality is undertrained. This suggests that the unexpected observation may stem from naive fusion's failure to effectively balance the two modalities, often over-relying on the stronger ID modality. This insight suggests that naive fusion might not be as effective in combining ID and text as previously expected. To address this, we propose a novel alternative training strategy, AlterRec. It separates the training of ID and text, thereby avoiding the imbalance issue seen in naive fusion. Additionally, AlterRec designs a novel strategy to facilitate the interaction between the two modalities, enabling them to mutually learn from each other and integrate the text more effectively. Comprehensive experiments demonstrate the effectiveness of AlterRec in session-based recommendation. The implementation is available at https://github.com/Juanhui28/AlterRec.


Summary

  • The paper introduces AlterRec, an alternative training strategy that alternates the optimization of ID and text networks to overcome gradient imbalance.
  • It leverages mutual hard negative mining and positive augmentation from cross-modal predictions to enhance retrieval accuracy, particularly for long-tail items.
  • Experimental results demonstrate gains of up to 19% in Hits@10 across datasets such as Homedepot and Amazon-M2, underscoring the framework's robust performance.

Alternative Training for Effective ID and Text Fusion in Session-Based Recommendation

Introduction

The integration of ID-based and text-based representations in session-based recommendation has traditionally relied on naive fusion mechanisms, where embeddings from each modality are simply merged and jointly trained. However, empirical results demonstrate that these naive approaches are often suboptimal: fusion does not consistently outperform ID-only models and can even lead to a degradation in performance. This issue primarily stems from gradient imbalance, where ID embeddings dominate optimization while the text modality remains undertrained. The paper "Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation" (2402.08921) proposes and investigates AlterRec, an alternative training strategy where ID and text uni-modal networks are separately optimized in an alternating fashion, leveraging information from each other through prediction-driven sample mining and mutual hard negative selection. The following essay explores the technical contributions, experimental findings, and theoretical implications of this work.

Analysis of Naive Fusion and Imbalance Phenomena

The naive fusion paradigm is prevalent in recent multimodal sequential recommendation literature. Encoders produce representations from IDs and text, which are then merged by sum or concatenation and passed through subsequent layers for prediction and loss computation.

Figure 1: An illustration of a naive fusion framework.
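As a concrete illustration, the merge-then-score pipeline can be sketched as follows. This is a minimal sketch, not the paper's exact architecture: the embedding sizes, mean pooling of session items, and dot-product scoring head are all illustrative assumptions.

```python
import numpy as np

# Toy setup: hypothetical embedding tables for 100 items.
rng = np.random.default_rng(0)
n_items, d = 100, 16
id_emb = rng.normal(size=(n_items, d))    # learnable ID embeddings
text_emb = rng.normal(size=(n_items, d))  # encoded text embeddings

def naive_fusion_scores(session_items, fusion="sum"):
    """Score every candidate item for one session under naive fusion:
    pool each modality over the session, merge by sum or concatenation,
    and apply a single dot-product scoring head trained by one joint loss."""
    id_sess = id_emb[session_items].mean(axis=0)
    text_sess = text_emb[session_items].mean(axis=0)
    if fusion == "sum":
        sess, items = id_sess + text_sess, id_emb + text_emb
    else:  # "concat"
        sess = np.concatenate([id_sess, text_sess])
        items = np.concatenate([id_emb, text_emb], axis=1)
    return items @ sess  # one score per candidate item

scores = naive_fusion_scores([3, 7, 12])
top20 = np.argsort(-scores)[:20]  # ranked recommendations
```

Because a single loss backpropagates through the merged representation, nothing in this scheme prevents one modality's embeddings from dominating the fused scores, which is exactly the imbalance the paper diagnoses.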

Empirical evaluation reveals two consistent properties:

  1. No systematic improvement from fusion: Across benchmarks (e.g., Amazon-French), combining ID and text embeddings via naive fusion rarely outperforms the individually optimized ID-only model.

Figure 2: Session-based recommendation results (%) on the Amazon-French dataset show naive fusion does not consistently outperform ID-only models.

  2. ID modality dominates optimization: Detailed trajectory analysis of test Hits@20 and training loss shows the fused model essentially tracks the ID component, confirming that the text branch is heavily undertrained.

Figure 3: Test performance and training loss reveal the ID component dominates fused models on Amazon-French.

This behavior arises from modality dominance and optimization competition, and aligns with observations in recent multi-modal learning theory, where the modality with lower representational quality or weaker signal receives vanishing gradient updates, causing it to become stagnant while the stronger branch drives the predictions.

The AlterRec Framework

AlterRec mitigates these pathologies using an alternating training architecture. It decomposes the recommendation model into two independent uni-modal networks: one operating purely on IDs and the other on text. Crucially, each uni-modal network is iteratively trained not only with session data but also with hard negatives and positive augmentations mined from the other network's own prediction distribution, encouraging each branch to assimilate information from its counterpart.

Figure 4: An overview of AlterRec: separate ID and text networks learn from the other's predictions via alternative training with mutual hard sample mining.

Training alternates between optimizing the ID and text branches:

  • When optimizing the text network, hard negatives are sampled from top non-true predictions of the ID network, with positive augmentation from the ID network’s top scoring candidates.
  • Conversely, ID network optimization leverages hard negatives and augmented positives suggested by the text predictor.
  • The final output is a weighted combination of both scores: $y_{\mathbf{s},i} = \alpha\, y^{ID}_{\mathbf{s},i} + (1-\alpha)\, y^{text}_{\mathbf{s},i}$.

This structure ensures both branches remain actively trained with complementary signal and prevents either modality from collapsing to underuse.
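The alternation described above can be sketched in a few lines. The helper names (`mine_hard_negatives`, `augment_positives`), the toy score arrays, and the `k` values are hypothetical stand-ins for the paper's actual sampling procedure; only the overall flow (cross-modal hard negatives, positive augmentation, weighted score combination) follows the description.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 50

def mine_hard_negatives(other_scores, true_item, k):
    """Take the other network's top-k ranked items, excluding the
    ground-truth item, as hard negatives for the current network."""
    ranked = np.argsort(-other_scores)  # descending by score
    return [int(i) for i in ranked if i != true_item][:k]

def augment_positives(other_scores, k):
    """The other network's top-k candidates serve as extra pseudo-positive
    targets, letting its knowledge flow across modalities."""
    return [int(i) for i in np.argsort(-other_scores)[:k]]

# Toy prediction distributions from the two uni-modal networks.
id_scores = rng.normal(size=n_items)
text_scores = rng.normal(size=n_items)
true_item = 4

# Alternation step t: train the text network with ID-derived samples.
hard_negs_for_text = mine_hard_negatives(id_scores, true_item, k=5)
extra_pos_for_text = augment_positives(id_scores, k=2)

# Alternation step t+1: train the ID network with text-derived samples.
hard_negs_for_id = mine_hard_negatives(text_scores, true_item, k=5)

# Final inference combines both scores with weight alpha.
alpha = 0.5
final_scores = alpha * id_scores + (1 - alpha) * text_scores
```

Because each branch is optimized against samples the other branch ranks highly, neither can reduce the loss by ignoring its counterpart, which is what keeps the text network from stagnating.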

Experimental Results and Empirical Validation

Overall Performance

AlterRec and its augmented variant are evaluated on the Homedepot and Amazon-M2 multilingual session datasets. Baselines include UniSRec, FDSA, LLM2BERT4Rec, and other state-of-the-art shallow and LLM-based models. On all metrics (Hits@10/20, NDCG@10/20), AlterRec outperforms naive fusion models and achieves considerable gains:

  • Up to 10% improvement in NDCG@10/20 over UniSRec on Amazon-M2.
  • Up to 19% and 2% performance gains in Hits@10 and Hits@20 over FDSA on Homedepot and Amazon-M2.

Training Dynamics

Epoch-level analysis shows that, unlike naive fusion, training both the ID and text modalities in AlterRec progresses robustly: both exhibit improvement over time and neither becomes stagnant.

Figure 5: Test performance per epoch in alternative training demonstrates balanced co-optimization of ID and text.

Tail Performance

Text integration is particularly beneficial for long-tail items, where ID-only approaches underperform due to sparse interactions. AlterRec shows significant relative improvement for low-popularity segments.

Figure 6: Long-tail performance analysis highlights AlterRec's advantage on rarely-interacted items.

Ablation and Sensitivity

Hard negative mining is critical; substituting hard negatives with random ones reverts performance to that of independent training and eliminates the observed gains, demonstrating the necessity of mutual information transfer. Sensitivity studies reveal that a balanced combination of ID and text outputs (α ≈ 0.5) is optimal, indicating both modalities contribute non-trivially to final predictions.

Figure 7: Performance as a function of α and k₂ demonstrates robustness and the necessity of balancing both modalities.
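A sensitivity sweep over α can be sketched as below, assuming per-session score matrices from the two trained networks are available; the random toy scores, dataset sizes, and the `hits_at_k` helper are purely illustrative, not the paper's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(2)

def hits_at_k(scores, targets, k=20):
    """Fraction of sessions whose target item appears in the top-k ranking."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([t in row for t, row in zip(targets, topk)]))

# Toy per-session score matrices standing in for the two trained networks.
n_sessions, n_items = 200, 100
id_scores = rng.normal(size=(n_sessions, n_items))
text_scores = rng.normal(size=(n_sessions, n_items))
targets = rng.integers(0, n_items, size=n_sessions)

# Sweep the combination weight alpha used in the final score.
results = {}
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    combined = alpha * id_scores + (1 - alpha) * text_scores
    results[alpha] = hits_at_k(combined, targets)
```

The endpoints α = 0 and α = 1 correspond to text-only and ID-only inference, so a peak near the middle of such a sweep is what indicates that both modalities contribute.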

Robustness Across Datasets

Results on Amazon-French, Homedepot, and the other Amazon locales confirm that the imbalance issue with naive fusion—and the improvements of the AlterRec approach—persist regardless of item vocabulary size, session structure, or underlying text modeling capacity. Ablation (removing either modality) significantly reduces effectiveness, reaffirming the benefit of cross-modal alternation.

Theoretical and Practical Implications

This work provides strong empirical evidence that naive joint training for multi-modal fusion in sequential recommendation is vulnerable to trivial solutions due to optimization imbalance. Mutual hard negative/positive sample mining in alternative training delivers a more robust mechanism for leveraging complementary modalities, particularly where ID signal is sparse (i.e., cold-start and tail-item scenarios).

From a theoretical perspective, the results corroborate recent findings in multimodal optimization showing that imbalance is a dominant failure mode, and that offloading target signals or gradients across modalities is a viable remedy. On the practical side, the framework is modular and can be extended with more expressive text models such as instruction-following LLMs (e.g., LLaMA) or with other modalities, given its alternation-based paradigm requires no particular encoder architecture.

Conclusion

"Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation" exposes critical flaws in naive fusion for sequential recommendation. By employing an alternative training regime that enforces mutual hard negative mining and positive augmentation across ID and text uni-modal networks, AlterRec circumvents imbalance issues, leading to systematically improved retrieval accuracy, especially for long-tail items. The work suggests a path towards more effective and theoretically justified fusion of multiple data modalities in large-scale recommender systems, and future extensions could include the application to more powerful LLMs or adaptation for additional data types.
