360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation

Published 27 Jan 2025 in cs.IR and cs.AI | (2501.16450v3)

Abstract: Ranking and recommendation systems are the foundation for numerous online experiences, ranging from search results to personalized content delivery. These systems have evolved into complex, multilayered architectures that leverage vast datasets and often incorporate thousands of predictive models. The maintenance and enhancement of these models is a labor intensive process that requires extensive feature engineering. This approach not only exacerbates technical debt but also hampers innovation in extending these systems to emerging problem domains. In this report, we present our research to address these challenges by utilizing a large foundation model with a textual interface for ranking and recommendation tasks. We illustrate several key advantages of our approach: (1) a single model can manage multiple predictive tasks involved in ranking and recommendation, (2) decoder models with textual interface due to their comprehension of reasoning capabilities, can generalize to new recommendation surfaces and out-of-domain problems, and (3) by employing natural language interfaces for task definitions and verbalizing member behaviors and their social connections, we eliminate the need for feature engineering and the maintenance of complex directed acyclic graphs of model dependencies. We introduce our research pre-production model, 360Brew V1.0, a 150B parameter, decoder-only model that has been trained and fine-tuned on LinkedIn's data and tasks. This model is capable of solving over 30 predictive tasks across various segments of the LinkedIn platform, achieving performance levels comparable to or exceeding those of current production systems based on offline metrics, without task-specific fine-tuning. Notably, each of these tasks is conventionally addressed by dedicated models that have been developed and maintained over multiple years by teams of a similar or larger size than our own.

Abstract PDF Upgrade to Chat

Summary

The paper presents a 150B parameter decoder-only model, 360Brew V1.0, that replaces complex feature engineering with a unified textual interface.
The methodology employs large-scale pre-training on diversified LinkedIn data to support over 30 predictive ranking and recommendation tasks without task-specific fine-tuning.
The study demonstrates improved handling of cold-start issues and interaction dynamics, complemented by scalable inference through optimized distributed training frameworks.

360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation

The paper "360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation" (2501.16450) introduces a novel approach to improving the performance of recommendation systems by utilizing a large-scale decoder-only LLM. This model, named 360Brew V1.0, is specifically designed for the LinkedIn platform to tackle a wide variety of predictive tasks related to ranking and recommendation. Unlike existing systems that predominantly employ ID-based models and complex feature engineering, 360Brew leverages a unified model architecture that uses natural language as a comprehensive interface for defining tasks, thus reducing the need for extensive feature engineering and complex model dependencies. *Figure 1: Overview of A Recommendation System (RS) %{Shall we make this fig a bit smaller? just asking}. *

Construction Challenges in Traditional Recommender Systems

The existing recommendation systems—comprising of the retrieval, ranking, and blending layers—are often based on ID and handcrafted features. These allow model specialization but lead to several challenges:

Cold-start Problem: Traditional systems tend to struggle with new items or members due to their reliance on specific feature embeddings.
Interaction Dynamics: With diverse user actions requiring complex feature engineering, traditional models were constrained in efficiently generalizing across different surfaces and tasks.

The paper discusses the use of a 150B parameter decoder-only foundation model that is trained and fine-tuned on LinkedIn’s proprietary data. This approach is said to support over 30 predictive tasks without task-specific fine-tuning, achieving performances that can match or exceed existing production systems based on offline metrics.

Figure 2: Performance of model compared to baselines as the test data gets temporally farther from the training data.

Decentralization via Decoder-Only LLMs

The paper proposes utilizing a large LLM architecture with a textual interface to capitalize on its comprehension and reasoning capabilities, allowing the model to perform manual feature engineering implicitly through contextual learning, previously a burden in traditional RS models.

Recent studies have demonstrated that LLMs can integrate (e.g., member profiles and content descriptions) in a textual interface, showcasing promising resolution to the cold-start problem (Figure 3).

Figure 3: Performance on 4 T2 tasks across 4 surfaces which were not part of the training.

The approach challenges the traditional reliance on with ID-based features by replacing them with centralized prompt engineering, using a deep, multilayer transformer model architecture that is adaptive to changing data and new tasks.

Continuous Pre-training via Mixtral 8x22B Model

The foundation model, termed V1.0, effectively leverages the Mixtral 8x22B MoE~(Esmaeili et al., 2023) architecture. This model accommodates more complex relationships, thus avoiding the need for vast numbers of manually crafted features, a common bottleneck in traditional RS.

Figure 4 in the original paper illustrates the improvement in model performance with an increase in the model's parameters.

Figure 4: Effects of model size (in parameters) on the performance across different tasks.

Data-centric Approach and Pre-training

The data used for pre-training on the LinkedIn Economic Graph involves a diverse range of interactions, profiles, job descriptions, and network data spanning 3-6 months and engaging approximately 45 million active monthly members.

Data diversification, especially through stratified and importance sampling, plays a significant role in enhancing model generalization. Through continuous pre-training by scaling the token count, 360Brew effectively closes the gap to production models on T1 tasks, as demonstrated in (Figure 5).

Figure 5: Closing the gap to production models on T1 tasks by scaling the token count during continuous pre-training.

Infrastructure Scaling and Technical Challenges

To cope with the computational demands of a 150B parameter model, the paper employs PyTorch Lightning and the PyTorch-native FSDP for distributed training. Efficient handling of large models was enabled by optimal checkpointing to ensure minimal disruption during the training loop. The Lightning Fully Sharded Data Parallel strategy was used to enhance training parallelism while optimizing resource utilization.

The team evaluated traditional 3D parallelism approaches as well as DeepSpeed ZeRO-2 and ZeRO-3 methods but ultimately chose to use PyTorch-native FSDP due to its compatibility and reduced operational burden. While FSDP was favored for training, vLLM was selected for inference due to its strong community support and the ability to maintain consistent inference across various use cases with efficient KV-caching for autoregressive generation, as shown in (Figure 6).

Figure 6: Effect of context length on batch inference throughput on Nvidia A100. Throughput decreases with increasing context length but not linearly with $\log_2{(\text{context length})}$ .

Despite these advantages, the implementation using vLLM faced challenges related to performance drops due to configuration assumptions and issues with multiprocessing workers in offline scripts. However, community support played a significant role in addressing many of these issues.

Successful Application and Generalization of ICL

The paper explores the power of ICL in creating personalized recommendations. Using a Many-Turn Chat (MTC) template for supervised fine-tuning, the model becomes adept at generating personalized prediction based on historical interactions within given contexts.

Despite the cost of additional CPT, significant improvements are observed in long-context generalization, maintaining competitive performance across both in-domain and out-of-domain tasks without additional fine-tuning. The model leverages its ICL capabilities to optimize the personalization of recommendations through the effective use of historical member interactions as context (Figure 3).

Figure 3: Performance on 4 T2 tasks across 4 surfaces which were not part of training.

Conclusion

The paper presents a hierarchical, decoder-only model architecture, "360Brew V1.0," for personalized recommendation and ranking tasks, effectively addressing the limitations of traditional ID-based RS models. By substituting feature engineering with prompt engineering, 360Brew highlights the advantages of scaling architectures and data on computational resources. The authors outline a comprehensive training pipeline that harnesses the scalable architecture of a 150B parameter, decoder-only model leveraging advanced pre-training techniques and gaze into the potential of such models in diverse recommendation tasks. While challenges in context length generalization and dependency on active community support for inference frameworks are acknowledged, the research indicates promising avenues for the adaptability and scalability of foundation models for RS tasks in the future.

Markdown Report Issue