Are Large Language Models Really Effective for Training-Free Cold-Start Recommendation?

Published 15 Dec 2025 in cs.IR | (2512.13001v1)

Abstract: Recommender systems usually rely on large-scale interaction data to learn from users' past behaviors and make accurate predictions. However, real-world applications often face situations where no training data is available, such as when launching new services or handling entirely new users. In such cases, conventional approaches cannot be applied. This study focuses on training-free recommendation, where no task-specific training is performed, and particularly on training-free cold-start recommendation (TFCSR), the more challenging case where the target user has no interactions. LLMs have recently been explored as a promising solution, and numerous studies have been proposed. As the ability of text embedding models (TEMs) increases, they are increasingly recognized as applicable to training-free recommendation, but no prior work has directly compared LLMs and TEMs under identical conditions. We present the first controlled experiments that systematically evaluate these two approaches in the same setting. The results show that TEMs outperform LLM rerankers, and this trend holds not only in cold-start settings but also in warm-start settings with rich interactions. These findings indicate that direct LLM ranking is not the only viable option, contrary to the commonly shared belief, and TEM-based approaches provide a stronger and more scalable basis for training-free recommendation.


Summary

The paper demonstrates that LLM-supervised TEMs consistently outperform LLM rerankers in accuracy and scalability for cold-start recommendations.
It systematically compares both methods across narrow and broad cold-start scenarios using multiple real-world datasets and statistical metrics.
The study highlights practical benefits of TEMs including lower inference cost, robust cross-domain performance, and effective semantic generalization.

Comparative Analysis of LLMs and TEMs for Training-Free Cold-Start Recommendation

Problem Formulation and Motivation

The study addresses one of the most challenging scenarios in recommender systems: the training-free cold-start recommendation (TFCSR) problem, where the system must generate recommendations for users with no prior interaction data and without any task-specific supervised training. This setting is crucial for deployment in practical environments such as new platform launches or user onboarding phases. The research specifically contrasts two paradigms: (1) direct use of LLMs as rerankers via zero-shot prompting, and (2) dense retrieval through text embedding models (TEMs), where textual user and item representations are encoded into an embedding space and similarity-based ranking is performed.
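As a minimal sketch of the second paradigm, the snippet below ranks candidate items by cosine similarity to a user vector. The vectors are toy values standing in for encoder outputs, and the names `cosine` and `rank_by_similarity` are illustrative assumptions, not the paper's code or any library's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def rank_by_similarity(user_vec, item_vecs, k=10):
    """Return the ids of the k items most similar to the user vector."""
    scored = [(cosine(user_vec, vec), item_id) for item_id, vec in item_vecs.items()]
    # Sort by descending similarity; break ties by item id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:k]]

# Toy vectors standing in for text-embedding outputs (hypothetical values).
user_vec = [1.0, 0.0, 0.5]
item_vecs = {
    "item_a": [0.9, 0.1, 0.4],
    "item_b": [0.0, 1.0, 0.0],
    "item_c": [0.5, 0.5, 0.5],
}
top_items = rank_by_similarity(user_vec, item_vecs, k=2)
```

In a real pipeline the user and item vectors would come from a text embedding model applied to profile and item text. Because item vectors can be computed once and reused, this kind of ranking tends to scale well with the size of the candidate set.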

The novelty of this work lies in its controlled, systematic comparison of state-of-the-art LLM reranking and LLM-supervised TEM methodologies under identical TFCSR and broader cold-start conditions, using real-world datasets and rigorous statistical benchmarks.

Experimental Design

Three datasets were used for evaluation: MovieLens-1M, the Job Recommendation dataset (with variations based on profile richness), and the Amazon Review dataset (split into Music, Movie, and Book domains). The study defined narrow cold-start (profile-only, $m=0$) and broad cold-start ($m>0$, a small set of interactions available) conditions. Recall@10 and nDCG@10 were measured on fixed candidate pools (typically 50 items), and differences between methods were assessed with user-level statistical tests.
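For reference, the two metrics can be sketched as follows. This assumes binary relevance and the common log2-discounted form of nDCG; the paper's exact formulation is not given here, so treat the definitions and the function names as assumptions.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG@k with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets 1/log2(2) = 1
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    if ideal_hits == 0:
        return 0.0
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg

# Hypothetical ranking over a small candidate pool with two held-out items.
ranked = ["i3", "i7", "i1", "i9"]
relevant = {"i7", "i4"}
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

Here one of the two relevant items is retrieved at rank 2, so recall is 0.5, while nDCG is lower because the hit is not at the top and the second relevant item is missed.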

The evaluated methods span:

Sparse baselines: BM25 for non-neural reference.
Dense TEMs: Both generalist and LLM-supervised, including multilingual-e5-large, bge-m3, gte-modernbert-base, gte-Qwen2 models, and Qwen3-Embedding (0.6B/8B).
LLM rerankers: OpenAI’s gpt-4.1, gpt-4.1-mini, and Qwen3-8B, all in inference-only settings with standardized prompts.

The LLM rerankers were required to process the entire candidate set and output a top-k ranking, which directly stresses their scalability and practicality for deployment.
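To make the reranking setup concrete, here is a rough sketch of how a candidate set might be turned into a prompt and how a model's reply might be parsed back into a ranking. The prompt wording, the numbered-list format, and both function names are hypothetical; the paper's standardized prompts are not reproduced here, and no model API is called.

```python
import re

def build_rerank_prompt(user_profile, candidates, k=10):
    """Assemble a zero-shot reranking prompt (illustrative wording only)."""
    lines = [
        "Given the user profile and the numbered candidate items,",
        f"return the {k} most relevant item numbers, best first, separated by commas.",
        "",
        f"User profile: {user_profile}",
        "",
        "Candidates:",
    ]
    for idx, title in enumerate(candidates, start=1):
        lines.append(f"[{idx}] {title}")
    return "\n".join(lines)

def parse_ranking(response, num_candidates, k=10):
    """Extract up to k valid, de-duplicated 1-based candidate indices from a reply."""
    ranking = []
    for token in re.findall(r"\d+", response):
        idx = int(token)
        # Skip out-of-range numbers and repeats, which free-form replies may contain.
        if 1 <= idx <= num_candidates and idx not in ranking:
            ranking.append(idx)
        if len(ranking) == k:
            break
    return ranking

prompt = build_rerank_prompt("Enjoys classic science fiction.", ["Alien", "Heat"], k=1)
parsed = parse_ranking("3, 1, 3, 99, 2", num_candidates=5, k=3)
```

The prompt grows linearly with the number of candidates, which is one reason this setup puts pressure on context length and inference cost as candidate pools get larger.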

Main Results

The primary findings can be summarized as follows:

TEMs, especially those trained with LLM supervision, consistently outperformed LLM rerankers in both narrow and broad cold-start settings. In many cases, TEMs such as gte-Qwen2-7B-Instruct and Qwen3-Embedding-8B produced significantly higher Recall@10 and nDCG@10 than LLMs, even as $m$ increased (past work had focused on much higher $m$ or small candidate sets).
LLM reranking was not competitive with strong TEMs. While gpt-4.1 sometimes surpassed sparse retrieval baselines on Recall@10 when interactions were available, it was consistently outperformed on nDCG@10 and failed to realize gains on realistic candidate set sizes.
Figure 1: Per-user win/loss distributions reveal that TEMs exceeded LLMs for the majority of the population; only a small minority of users exhibited improved LLM performance.
The error analysis at the user granularity indicated that advantages of LLM reranking were restricted to a small segment of the population, with TEMs dominating on both median and aggregate metrics.
When the number of user interactions was varied or the candidate pool was reduced (from 50 to 10), the superiority of TEMs persisted. LLM rerankers showed no crossover point at which they exceeded TEMs, suggesting that neither context length nor interaction sparsity was the primary limiting factor.
Figure 2: Mean Recall@10/nDCG@10 vs. interaction count $m$ confirms that higher $m$ boosts all models, with TEMs achieving and maintaining the leading accuracy.

Figure 3: On reduced candidate sets ($|L|=10$), the recall gap between LLMs and TEMs narrows under extreme sparsity, but overall trends remain unchanged.
In cross-domain cold-start scenarios, where user behaviors stem from a different domain than the target, TEMs maintained their edge over LLM rerankers (see Figure 4).
Figure 4: In cross-domain TFCSR, TEMs again outperform LLM-based reranking across Music-Movie, Movie-Book, and variant settings.
Hybrid user/item representation strategies, such as generative query expansion with LLMs at inference (i.e., creating synthetic queries per user/item and aggregating embeddings), yielded only minor improvements for weak base models and did not challenge the top-line performance of LLM-supervised TEMs.
Figure 5: Relative gains from hybrid or expansion methods over vanilla TEMs are generally minor and sometimes negative for strong base models.
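The per-user comparison described above can be sketched as a win/loss/tie count over paired per-user scores, followed by a simple significance check. The exact sign test used here is an assumption chosen for illustration; the statistical test actually used in the paper is not stated in this summary, and the scores are made-up values.

```python
from math import comb

def win_loss_tie(scores_a, scores_b):
    """Count users for whom method A beats, loses to, or ties with method B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("per-user score lists must have the same length")
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    losses = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    ties = len(scores_a) - wins - losses
    return wins, losses, ties

def sign_test_p_value(wins, losses):
    """Exact two-sided sign test; ties are ignored, null is P(win) = 0.5."""
    n = wins + losses
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-user nDCG@10 values for a TEM and an LLM reranker.
tem_scores = [0.5, 0.4, 0.3, 0.0, 0.6, 0.2]
llm_scores = [0.3, 0.4, 0.1, 0.1, 0.2, 0.1]
wins, losses, ties = win_loss_tie(tem_scores, llm_scores)
p_value = sign_test_p_value(wins, losses)
```

With only six toy users the test is far from significant even though the TEM wins more often, which illustrates why user-level testing needs a reasonably large user sample.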

Implications and Limitations

The empirical results support several conclusions:

TFCSR requires semantic generalization well beyond what LLM reranking offers, especially when strong domain-agnostic embeddings are available.
Even the most advanced LLMs, when used solely as inference-time rerankers without fine-tuning or large-scale context optimization, do not provide a practical alternative to modern TEMs for cold-start recommendation.
LLM-supervised embedding models (e.g., Qwen3-Embedding-8B) harness the knowledge and reasoning capacity of LLMs more effectively by distilling it into scalable vector spaces, providing both efficiency and accuracy at deployment time.
Query expansion and hybrid approaches offer marginal gains primarily for non-LLM-supervised embedding models and have an upper bound imposed by the quality of the base representations.
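The query-expansion idea discussed above (embedding synthetic queries and aggregating them with the original representation) can be sketched as follows. Mean pooling followed by L2 normalization is an assumed aggregation rule, since the summary does not specify one, and the vectors are toy values rather than real embeddings.

```python
import math

def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def expanded_embedding(base_vec, synthetic_vecs):
    """Combine the original embedding with embeddings of synthetic queries."""
    return l2_normalize(mean_pool([base_vec] + list(synthetic_vecs)))

# Toy 2-d vectors standing in for the base and LLM-generated query embeddings.
base = [1.0, 0.0]
synthetic = [[0.0, 1.0], [1.0, 1.0]]
expanded = expanded_embedding(base, synthetic)
```

Because the result is an average anchored on the base embedding, it stays close to the original representation, which fits the observation that expansion brings only limited gains when the base model is already strong.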

From a practical standpoint, TEMs offer lower inference cost, greater scalability (as candidate sets grow), and more favorable hardware throughput compared to LLMs, especially under realistic latency constraints (e.g., user-facing deployments).

On the other hand, there are cautionary notes regarding:

The use of synthetic pretraining data for LLM-supervised TEMs, which may induce domain bias or hallucination artifacts, particularly in niche or long-tail settings.
The need for further research into integration of structured side information (e.g., attribute-value profiles, hierarchical taxonomies) within LLM/TEM architectures to enhance their robustness and interpretability for enterprise-grade cold-start recommendation.

Theoretical and Future Directions

The findings call for a nuanced reassessment of LLMs’ role in TFCSR. Rather than direct reranking, LLMs are best leveraged as teachers in supervision pipelines, where their inductive biases can be distilled for downstream, scalable model variants. This paradigm may extend to reinforcement learning from LLM-based reward signals, dynamic user modelling, or continual learning scenarios where base embeddings are periodically refreshed with supervision from evolving LLMs (domain adaptation).

Furthermore, the extrapolation to supervised and hybrid recommendation architectures is promising. Many older collaborative–content fusion models used classic, non-LLM-based embeddings as initialization for cold-start, then layered in behavioral context as it accumulated. Plugging in powerful, LLM-supervised TEMs as the initialization layer in such architectures could provide a step-function increase in cold-start recommendation quality, both in zero-shot and warm-start phases.

Conclusion

The comparative evaluation indicates that TEMs trained with LLM supervision are more effective and more scalable than direct LLM reranking for training-free cold-start recommendation, across both in-domain and cross-domain use cases. While LLMs offer compelling few-shot and reasoning abilities, their utility is maximized as distillers of semantic information into embedding models, rather than as inference-time core recommenders. Future improvements should focus on integrating structured user/item information, addressing possible biases in synthetic supervision, and deploying TEMs as foundational layers within larger, multi-stage recommender pipelines.
