Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation

Published 14 Feb 2024 in cs.IR and cs.AI | (2402.08921v1)

Abstract: Session-based recommendation has gained increasing attention in recent years, with its aim to offer tailored suggestions based on users' historical behaviors within sessions. To advance this field, a variety of methods have been developed, with ID-based approaches typically demonstrating promising performance. However, these methods often face challenges with long-tail items and overlook other rich forms of information, notably valuable textual semantic information. To integrate text information, various methods have been introduced, mostly following a naive fusion framework. Surprisingly, we observe that fusing these two modalities does not consistently outperform the best single modality by following the naive fusion framework. Further investigation reveals a potential imbalance issue in naive fusion, where the ID dominates and the text modality is undertrained. This suggests that the unexpected observation may stem from naive fusion's failure to effectively balance the two modalities, often over-relying on the stronger ID modality. This insight suggests that naive fusion might not be as effective in combining ID and text as previously expected. To address this, we propose a novel alternative training strategy, AlterRec. It separates the training of ID and text, thereby avoiding the imbalance issue seen in naive fusion. Additionally, AlterRec designs a novel strategy to facilitate the interaction between the two modalities, enabling them to mutually learn from each other and integrate the text more effectively. Comprehensive experiments demonstrate the effectiveness of AlterRec in session-based recommendation. The implementation is available at https://github.com/Juanhui28/AlterRec.


Summary

  • The paper introduces AlterRec, an alternative training strategy that alternates the optimization of ID and text networks to overcome gradient imbalance.
  • It leverages mutual hard negative mining and positive augmentation from cross-modal predictions to enhance retrieval accuracy, particularly for long-tail items.
  • Experimental results demonstrate gains of up to 19% in Hits@10 across datasets such as Homedepot and Amazon-M2, underscoring the framework's robust performance.

Alternative Training for Effective ID and Text Fusion in Session-Based Recommendation

Introduction

The integration of ID-based and text-based representations in session-based recommendation has traditionally relied on naive fusion mechanisms, where embeddings from each modality are simply merged and jointly trained. However, empirical results demonstrate that these naive approaches are often suboptimal: fusion does not consistently outperform ID-only models and can even lead to a degradation in performance. This issue primarily stems from gradient imbalance, where ID embeddings dominate optimization while the text modality remains undertrained. The paper "Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation" (2402.08921) proposes and investigates AlterRec, an alternative training strategy where ID and text uni-modal networks are separately optimized in an alternating fashion, leveraging information from each other through prediction-driven sample mining and mutual hard negative selection. The following essay explores the technical contributions, experimental findings, and theoretical implications of this work.

Analysis of Naive Fusion and Imbalance Phenomena

The naive fusion paradigm is prevalent in recent multimodal sequential recommendation literature. Encoders produce representations from IDs and text, which are then merged by sum or concatenation and passed through subsequent layers for prediction and loss computation.

Figure 1: An illustration of a naive fusion framework.
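As a concrete illustration, the merge-then-score pipeline can be sketched as follows. This is a minimal sketch, not the paper's exact architecture: the embedding sizes, mean pooling of session items, and dot-product scoring head are all illustrative assumptions.

```python
import numpy as np

# Toy setup: hypothetical embedding tables for 100 items.
rng = np.random.default_rng(0)
n_items, d = 100, 16
id_emb = rng.normal(size=(n_items, d))    # learnable ID embeddings
text_emb = rng.normal(size=(n_items, d))  # encoded text embeddings

def naive_fusion_scores(session_items, fusion="sum"):
    """Score every candidate item for one session under naive fusion:
    pool each modality over the session, merge by sum or concatenation,
    and apply a single dot-product scoring head trained by one joint loss."""
    id_sess = id_emb[session_items].mean(axis=0)
    text_sess = text_emb[session_items].mean(axis=0)
    if fusion == "sum":
        sess, items = id_sess + text_sess, id_emb + text_emb
    else:  # "concat"
        sess = np.concatenate([id_sess, text_sess])
        items = np.concatenate([id_emb, text_emb], axis=1)
    return items @ sess  # one score per candidate item

scores = naive_fusion_scores([3, 7, 12])
top20 = np.argsort(-scores)[:20]  # ranked recommendations
```

Because a single loss backpropagates through the merged representation, nothing in this scheme prevents one modality's embeddings from dominating the fused scores, which is exactly the imbalance the paper diagnoses.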

Empirical evaluation reveals two consistent properties:

  1. No systematic improvement from fusion: Across benchmarks (e.g., Amazon-French), combining ID and text embeddings via naive fusion rarely outperforms the individually optimized ID-only model.

Figure 2: Session-based recommendation results (%) on the Amazon-French dataset show naive fusion does not consistently outperform ID-only models.

  2. ID modality dominates optimization: Detailed trajectory analysis of test Hits@20 and training loss shows the fused model essentially tracks the ID component, confirming that the text branch is heavily undertrained.

Figure 3: Test performance and training loss reveal the ID component dominates fused models on Amazon-French.

This behavior arises from modality dominance and optimization competition, and aligns with observations in recent multi-modal learning theory, where the modality with lower representational quality or weaker signal receives vanishing gradient updates, causing it to become stagnant while the stronger branch drives the predictions.

The AlterRec Framework

AlterRec mitigates these pathologies using an alternating training architecture. It decomposes the recommendation model into two independent uni-modal networks: one operating purely on IDs and the other on text. Crucially, each uni-modal network is iteratively trained not only with session data but also with hard negatives and positive augmentations mined from the other network's own prediction distribution, encouraging each branch to assimilate information from its counterpart.

Figure 4: An overview of AlterRec: separate ID and text networks learn from the other's predictions via alternative training with mutual hard sample mining.

Training alternates between optimizing the ID and text branches:

  • When optimizing the text network, hard negatives are sampled from top non-true predictions of the ID network, with positive augmentation from the ID network’s top scoring candidates.
  • Conversely, ID network optimization leverages hard negatives and augmented positives suggested by the text predictor.
  • The final output is a weighted combination of both scores: $y_{\mathbf{s},i} = \alpha\, y^{ID}_{\mathbf{s},i} + (1-\alpha)\, y^{text}_{\mathbf{s},i}$.

This structure ensures both branches remain actively trained with complementary signal and prevents either modality from collapsing to underuse.
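The alternation described above can be sketched in a few lines. The helper names (`mine_hard_negatives`, `augment_positives`), the toy score arrays, and the `k` values are hypothetical stand-ins for the paper's actual sampling procedure; only the overall flow (cross-modal hard negatives, positive augmentation, weighted score combination) follows the description.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 50

def mine_hard_negatives(other_scores, true_item, k):
    """Take the other network's top-k ranked items, excluding the
    ground-truth item, as hard negatives for the current network."""
    ranked = np.argsort(-other_scores)  # descending by score
    return [int(i) for i in ranked if i != true_item][:k]

def augment_positives(other_scores, k):
    """The other network's top-k candidates serve as extra pseudo-positive
    targets, letting its knowledge flow across modalities."""
    return [int(i) for i in np.argsort(-other_scores)[:k]]

# Toy prediction distributions from the two uni-modal networks.
id_scores = rng.normal(size=n_items)
text_scores = rng.normal(size=n_items)
true_item = 4

# Alternation step t: train the text network with ID-derived samples.
hard_negs_for_text = mine_hard_negatives(id_scores, true_item, k=5)
extra_pos_for_text = augment_positives(id_scores, k=2)

# Alternation step t+1: train the ID network with text-derived samples.
hard_negs_for_id = mine_hard_negatives(text_scores, true_item, k=5)

# Final inference combines both scores with weight alpha.
alpha = 0.5
final_scores = alpha * id_scores + (1 - alpha) * text_scores
```

Because each branch is optimized against samples the other branch ranks highly, neither can reduce the loss by ignoring its counterpart, which is what keeps the text network from stagnating.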

Experimental Results and Empirical Validation

Overall Performance

AlterRec and its augmented variant are evaluated on the Homedepot and Amazon-M2 multilingual session datasets. Baselines include UniSRec, FDSA, LLM2BERT4Rec, and other state-of-the-art shallow and LLM-based models. On all metrics (Hits@10/20, NDCG@10/20), AlterRec outperforms naive fusion models and achieves considerable gains:

  • Up to 10% improvement in NDCG@10/20 over UniSRec on Amazon-M2.
  • Up to 19% and 2% performance gains in Hits@10 and Hits@20 over FDSA on Homedepot and Amazon-M2.

Training Dynamics

Epoch-level analysis shows that, unlike naive fusion, training both the ID and text modalities in AlterRec progresses robustly: both exhibit improvement over time and neither becomes stagnant.

Figure 5: Test performance per epoch in alternative training demonstrates balanced co-optimization of ID and text.

Tail Performance

Text integration is particularly beneficial for long-tail items, where ID-only approaches underperform due to sparse interactions. AlterRec shows significant relative improvement for low-popularity segments.

Figure 6: Long-tail performance analysis highlights AlterRec's advantage on rarely-interacted items.

Ablation and Sensitivity

Hard negative mining is critical; substituting hard negatives with random ones reverts performance to that of independent training and eliminates the observed gains, demonstrating the necessity of mutual information transfer. Sensitivity studies reveal that a balanced combination of ID and text outputs (α ≈ 0.5) is optimal, indicating both modalities contribute non-trivially to final predictions.

Figure 7: Performance as a function of α and k₂ demonstrates robustness and the necessity of balancing both modalities.
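A sensitivity sweep over α can be sketched as below, assuming per-session score matrices from the two trained networks are available; the random toy scores, dataset sizes, and the `hits_at_k` helper are purely illustrative, not the paper's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(2)

def hits_at_k(scores, targets, k=20):
    """Fraction of sessions whose target item appears in the top-k ranking."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([t in row for t, row in zip(targets, topk)]))

# Toy per-session score matrices standing in for the two trained networks.
n_sessions, n_items = 200, 100
id_scores = rng.normal(size=(n_sessions, n_items))
text_scores = rng.normal(size=(n_sessions, n_items))
targets = rng.integers(0, n_items, size=n_sessions)

# Sweep the combination weight alpha used in the final score.
results = {}
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    combined = alpha * id_scores + (1 - alpha) * text_scores
    results[alpha] = hits_at_k(combined, targets)
```

The endpoints α = 0 and α = 1 correspond to text-only and ID-only inference, so a peak near the middle of such a sweep is what indicates that both modalities contribute.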

Robustness Across Datasets

Results on Amazon-French, Homedepot, and the other Amazon locales confirm that the imbalance issue with naive fusion—and the improvements of the AlterRec approach—persist regardless of item vocabulary size, session structure, or underlying text modeling capacity. Ablation (removing either modality) significantly reduces effectiveness, reaffirming the benefit of cross-modal alternation.

Theoretical and Practical Implications

This work provides strong empirical evidence that naive joint training for multi-modal fusion in sequential recommendation is vulnerable to trivial solutions due to optimization imbalance. Mutual hard negative/positive sample mining in alternative training delivers a more robust mechanism for leveraging complementary modalities, particularly where ID signal is sparse (i.e., cold-start and tail-item scenarios).

From a theoretical perspective, the results corroborate recent findings in multimodal optimization showing that imbalance is a dominant failure mode, and that offloading target signals or gradients across modalities is a viable remedy. On the practical side, the framework is modular and can be extended with more expressive text models such as instruction-following LLMs (e.g., LLaMA) or with other modalities, given its alternation-based paradigm requires no particular encoder architecture.

Conclusion

"Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation" exposes critical flaws in naive fusion for sequential recommendation. By employing an alternative training regime that enforces mutual hard negative mining and positive augmentation across ID and text uni-modal networks, AlterRec circumvents imbalance issues, leading to systematically improved retrieval accuracy, especially for long-tail items. The work suggests a path towards more effective and theoretically justified fusion of multiple data modalities in large-scale recommender systems, and future extensions could include the application to more powerful LLMs or adaptation for additional data types.
