RecGOAT: Multimodal Recommendation Framework
- RecGOAT is a multimodal recommendation framework that integrates LLM/LVM semantic features with ID-based collaborative filtering using graph neural networks.
- It employs instance-level contrastive learning alongside distribution-level optimal adaptive transport to ensure semantic consistency and comprehensive signal fusion.
- Empirical results on benchmark datasets and industrial deployments confirm its scalability and enhanced recommendation accuracy.
RecGOAT is a multimodal recommendation framework designed to address the intrinsic misalignment between large language/vision model (LLM/LVM) embeddings and traditional collaborative filtering (CF) ID-based embeddings. By integrating high-order collaborative structure via graph neural networks and enforcing dual-granularity alignment through instance-level contrastive learning and distribution-level optimal adaptive transport, RecGOAT produces unified item and user representations exhibiting both semantic consistency and comprehensiveness. This enables large-scale, high-accuracy recommendations in settings where item/user modalities and CF signals are traditionally hard to fuse (Li et al., 31 Jan 2026).
1. Motivation and Problem Formulation
Modern multimodal recommender systems increasingly incorporate LLMs/LVMs to capture world-aware semantics from text and images. However, standard CF pipelines depend on highly sparse ID-based embeddings learned solely from user–item interactions. Simple fusion strategies between LLM-based and ID-based embeddings—concatenation or summation—are fundamentally flawed: the two representation spaces are semantically heterogeneous, so naïve fusion often degrades performance. This core representational divergence underlies RecGOAT’s methodology: it does not merely combine signal sources, but explicitly models and aligns their distinct semantic properties (Li et al., 31 Jan 2026).
2. Intra-Modal Graph Attention Architecture
RecGOAT constructs and refines modality-specific and ID-specific item and user graphs via attentive message passing:
- Item–Item Graphs: For both text ($m = t$) and visual ($m = v$) modalities, each item $i$ is embedded as $e_i^m$. These representations anchor $k$-NN graphs $G^m$, with edges weighted by cosine similarity $s_{ij}^m = \cos(e_i^m, e_j^m)$. Graph attention networks (GATs) with multiple heads propagate information, generating higher-order features $\tilde{e}_i^m$.
- User–Item Graph (LightGCN): A bipartite graph links users ($u \in \mathcal{U}$) and items ($i \in \mathcal{I}$). ID-based embeddings $e_u^{id}$, $e_i^{id}$ are iteratively updated via normalized neighborhood aggregation per LightGCN, resulting in $\tilde{e}_u^{id}$ and $\tilde{e}_i^{id}$.
- User–User Graph: Each user's interaction history is summarized into text, encoded with the text encoder, and used to form a $k$-NN similarity graph among users. A GAT is again used to propagate semantically enriched user features $\tilde{e}_u^t$.
This architecture enables high-order collaborative and multimodal reasoning over both observed interactions and LLM-extracted semantics.
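The two core graph operations above—cosine-similarity $k$-NN graph construction and LightGCN-style normalized propagation—can be sketched in plain NumPy. This is a minimal illustration under assumed conventions (undirected $k$-NN edges, similarities clipped to be nonnegative), not the paper's implementation:

```python
import numpy as np

def knn_graph(emb, k=2):
    """Build an undirected k-NN adjacency weighted by cosine similarity.

    Assumes nonnegative edge weights are desired; negative similarities
    are clipped to zero so degree normalization stays well-defined.
    """
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T             # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)      # exclude self-loops from neighbor search
    adj = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nbrs = np.argsort(sim[i])[-k:]  # indices of the k most similar items
        adj[i, nbrs] = np.maximum(sim[i, nbrs], 0.0)
    return np.maximum(adj, adj.T)       # symmetrize: undirected graph

def lightgcn_layer(adj, emb):
    """One LightGCN-style propagation step: D^{-1/2} A D^{-1/2} E,
    i.e. parameter-free symmetric-normalized neighborhood aggregation."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    return norm_adj @ emb
```

In the full model a multi-head GAT replaces the fixed normalized aggregation on the item–item and user–user graphs; the sketch shows only the shared graph machinery.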
3. Dual-Granularity Semantic Alignment
RecGOAT introduces two complementary alignment schemes:
- Instance-Level Cross-Modal Contrastive Learning (CMCL): Each item $i$ is viewed as a triple of text, visual, and ID embeddings $(\tilde{e}_i^t, \tilde{e}_i^v, \tilde{e}_i^{id})$. For each anchor–positive modality pair $(a, p)$, the InfoNCE loss encourages embeddings for the same item to be close, while repelling embeddings corresponding to different items:
$$\mathcal{L}(a, p) = -\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \log \frac{\exp\big(\cos(\tilde{e}_i^a, \tilde{e}_i^p)/\tau\big)}{\sum_{j \in \mathcal{I}} \exp\big(\cos(\tilde{e}_i^a, \tilde{e}_j^p)/\tau\big)},$$
with temperature $\tau$, summed over all anchor–positive modality pairs and averaged to form $\mathcal{L}_{\mathrm{CMCL}}$.
- Distribution-Level Optimal Adaptive Transport (OAT): For each modality $m \in \{t, v\}$, OAT computes the entropic optimal transport plan between the modality embedding distribution $P^m$ and the ID embedding distribution $P^{id}$, using a ground cost matrix $C^m$ of embedding distances. The Sinkhorn–Knopp algorithm computes a base transport plan $T^m_{\mathrm{base}}$, which is refined by a learnable residual $\Delta T^m$. Fused item embeddings are then generated as a weighted combination of the transported modality embeddings and the ID embeddings,
$$e_i^{\mathrm{fused}} = \alpha_{id}\, \tilde{e}_i^{id} + \sum_{m \in \{t, v\}} \alpha_m\, \hat{e}_i^{m},$$
where the weights $\alpha$ are user-tuned.
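The instance-level InfoNCE term above can be sketched in NumPy. The batch layout and temperature value here are illustrative assumptions; row $i$ of each matrix is the same item in two modalities, and all other rows serve as negatives:

```python
import numpy as np

def info_nce(anchor, positive, tau=0.1):
    """InfoNCE over a batch: pull together matching rows of `anchor` and
    `positive` (same item, two modalities), push apart mismatched rows."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / tau                       # cosine similarities / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # -log p(correct pairing)
```

A well-aligned pair of embedding tables (diagonal matches) yields a lower loss than a shuffled pairing, which is exactly the pressure that drives the cross-modal embeddings of each item together.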
The dual alignment strategy ensures both semantic consistency (instance-level) and coverage (distribution-level) in the final representation.
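The Sinkhorn–Knopp iteration that produces the OAT base plan can be sketched as follows. Uniform marginals and the entropic regularization strength `eps` are illustrative assumptions; the learnable residual refinement is omitted:

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iter=200):
    """Entropic optimal transport between two uniform discrete distributions.

    `cost` is the ground cost matrix between modality embeddings (rows)
    and ID embeddings (columns); returns a transport plan whose marginals
    match the uniform source/target weights.
    """
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)   # source marginal (modality side)
    nu = np.full(m, 1.0 / m)   # target marginal (ID side)
    K = np.exp(-cost / eps)    # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iter):    # alternating projections onto the marginals
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

Each iteration is two matrix–vector products, which is why the step scales to industrial batch sizes; smaller `eps` gives a sharper (closer-to-unregularized) plan at the cost of slower convergence.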
4. Theoretical Guarantees
RecGOAT offers formal semantic alignment guarantees under Lipschitz continuity of the user embedding norm and the ground-truth preference function:
- Instance-Level Distance Bound: minimizing the CMCL objective upper-bounds the distance between aligned modality embeddings of the same item, so instance-level views cannot drift arbitrarily far apart.
- Modality-to-Unified Error Bound: under the Lipschitz assumptions, the preference-prediction error incurred by replacing any single-modality embedding with the unified embedding is controlled by the alignment losses.
- Comprehensiveness and Alignment Consistency: By minimizing $\mathcal{L}_{\mathrm{CMCL}}$ and the Wasserstein-1 distances between each modality distribution and the ID distribution, the unified embedding distribution both approximates and outperforms each single modality, formalizing the intuition of “fusion comprehensiveness.”
These results provide non-asymptotic guarantees for downstream recommendation performance, conditional on minimizing the alignment objectives.
5. Empirical Validation
Experiments on three Amazon public benchmarks (Baby, Sports, Electronics) establish state-of-the-art performance in Recall@10 and NDCG@10. For the Electronics dataset:
- RecGOAT achieves Recall@10 = 0.0468 (vs. 0.0419 for IRLLRec, roughly +11.7% relative improvement)
- RecGOAT achieves NDCG@10 = 0.0271 (vs. 0.0248, roughly +9.3%)
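For reference, the two reported metrics can be computed per user as follows; the ranked lists and relevant-item sets here are toy data, not from the paper:

```python
import numpy as np

def recall_at_k(ranked, relevant, k=10):
    """Fraction of a user's relevant items that appear in the top-k ranking."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """DCG of the top-k list normalized by the ideal DCG (binary relevance)."""
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(ranked[:k])
              if item in relevant)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg
```

Recall@k ignores position within the top-k list, while NDCG@k rewards placing relevant items earlier, which is why the two metrics can move by different relative amounts.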
Ablation studies demonstrate:
- Naïve LLM–ID concatenation or summation can underperform a strong ID-only LightGCN baseline, highlighting the semantic conflict between the two embedding spaces.
- CMCL alone recovers most of the resulting performance gap; OAT alone is stronger still.
- Full dual alignment (CMCL + OAT) yields maximal gains.
- The fused embedding exhibits “alignment consistency” (performance stability over weight selection) and always outperforms each single modality after alignment (“fusion comprehensiveness”).
6. Scalability and Industrial Deployment
RecGOAT has been deployed at scale in a major online advertising platform, with significant improvements observed in A/B testing for key metrics (e.g., click-through and conversion rates). The Sinkhorn-based OT step is efficiently implemented, confirming practical scalability for industrial recommender systems. Detailed business metrics are not public, but industrial integration attests to RecGOAT’s operational viability (Li et al., 31 Jan 2026).
7. Significance and Broader Impact
RecGOAT bridges the semantic gap between LLM/LVM and CF ID signals in recommender systems, without requiring end-to-end retraining or modality-specific architectures. Its dual-level alignment advances the field of multimodal recommendation by providing theoretically grounded, empirically validated methods that outperform prior approaches. The methodology generalizes to any domain where discrete ID semantics and world-aware representations must be jointly leveraged, and its deployment demonstrates its applicability in high-throughput, production environments.