
Do Reviews Matter for Recommendations in the Era of Large Language Models?

Published 15 Dec 2025 in cs.IR | (2512.12978v1)

Abstract: With the advent of LLMs, the landscape of recommender systems is undergoing a significant transformation. Traditionally, user reviews have served as a critical source of rich, contextual information for enhancing recommendation quality. However, as LLMs demonstrate an unprecedented ability to understand and generate human-like text, this raises the question of whether explicit user reviews remain essential in the era of LLMs. In this paper, we provide a systematic investigation of the evolving role of text reviews in recommendation by comparing deep learning methods and LLM approaches. Particularly, we conduct extensive experiments on eight public datasets with LLMs and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We further introduce a benchmarking evaluation framework for review-aware recommender systems, RAREval, to comprehensively assess the contribution of textual reviews to the recommendation performance of review-aware recommender systems. Our framework examines various scenarios, including the removal of some or all textual reviews, random distortion, as well as recommendation performance in data sparsity and cold-start user settings. Our findings demonstrate that LLMs are capable of functioning as effective review-aware recommendation engines, generally outperforming traditional deep learning approaches, particularly in scenarios characterized by data sparsity and cold-start conditions. In addition, the removal of some or all textual reviews and random distortion does not necessarily lead to declines in recommendation accuracy. These findings motivate a rethinking of how user preference from text reviews can be more effectively leveraged. All code and supplementary materials are available at: https://github.com/zhytk/RAREval-data-processing.

Summary

  • The paper introduces the RAREval framework to systematically benchmark the marginal utility of textual reviews in LLM-powered recommendation systems.
  • LLM-based methods, especially REVLoRA fine-tuning, maintain strong performance even when review data is reduced or distorted.
  • The findings challenge traditional reliance on reviews, suggesting that refined rating signals and curated meta-data may offer greater predictive power.

Do Reviews Matter for Recommendations in the Era of LLMs?

Introduction

The integration of LLMs into recommender systems has prompted fundamental questions about the role of user-generated text reviews in recommendation pipelines. "Do Reviews Matter for Recommendations in the Era of LLMs?" (2512.12978) provides a systematic empirical analysis of the contribution of textual reviews by benchmarking classical deep learning, graph-enhanced, and LLM-based architectures under controlled review availability and quality scenarios. The work introduces the RAREval framework, enabling controlled ablations and distortions of review data, and highlights strong, sometimes counterintuitive findings regarding the informativeness, redundancy, and robustness of review signals in the presence of LLMs.

Figure 1: A comprehensive comparison of recommendation models evaluated using the RAREval framework, which reports average MAE across Amazon datasets under review removal, distortion, and reduction. RAREval further assesses model performance under data sparsity and cold-start scenarios.

Background and Methodological Framework

Historically, reviews have been core to review-aware recommender architectures (HFT, TopicMF, DeepCoNN, RGCL, DIRECT, etc.), which encode review semantics as user/item representations or graph augmentations. The informativeness of reviews is assumed to arise from their ability to describe fine-grained preferences and item facets that classical collaborative signals cannot recover. However, LLMs introduce strong pre-trained priors, few-shot/zero-shot compositional abilities, and high-capacity text understanding. Most recent LLM-based recommenders omit reviews or restrict their usage to meta-data summarization, leaving open the fundamental question of their marginal value in LLM-centric pipelines.

The authors develop RAREval, an evaluation protocol and benchmark that allows for principled comparison and ablation of review signal strength along five dimensions: complete removal, random reduction, synthetic distortion, interaction sparsity (via k-core filtering), and cold-start user regimes. Both classical and LLM-centric recommender backbones are included, and LLMs are evaluated in zero-shot prompt, few-shot prompt, and parameter-efficient fine-tuning (REVLoRA) modalities.

Figure 2: The RAREval framework evaluates review-aware recommender systems across five distinct settings, each derived from the original dataset.
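The interaction-sparsity setting is derived via k-core filtering, which iteratively removes users and items with fewer than k interactions until a fixed point is reached. A minimal sketch of this standard preprocessing step (function and variable names are illustrative, not taken from the paper's code):

```python
from collections import Counter

def k_core_filter(interactions, k):
    """Iteratively drop users and items with fewer than k interactions
    until every remaining user and item has at least k."""
    interactions = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in interactions)
        item_counts = Counter(i for _, i in interactions)
        kept = [(u, i) for u, i in interactions
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(interactions):  # fixed point: nothing more to drop
            return kept
        interactions = kept
```

Lowering k admits sparser interaction graphs, which is how the benchmark varies sparsity from the same source data.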

LLM-based Review-aware Recommendation Architectures

The study implements three LLM-driven settings:

  • Zero-shot: Predict the rating using prompt templates encoding user/item IDs and concatenated sets of reviews, without task-specific examples.
  • Few-shot: Prompt engineering includes sampled demonstrations from distinct user-item pairs, appended to the target prediction prompt.
  • Fine-tuning (REVLoRA): Leverages LoRA with a structured prompt encompassing extracted and LLM-summarized user and item features, alongside explicit rating sequences, and replaces the generation layer with a regression head.

Prompt formatting and workflow differ substantially among the three paradigms, as visualized in Figure 3.

Figure 3: Illustration of prompt formats and workflows for zero-shot, few-shot, and REVLoRA fine-tuning settings.
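To make the zero-shot and few-shot settings concrete, the templates below sketch how such prompts might be assembled; the wording and structure are illustrative assumptions, not the paper's exact prompts:

```python
def zero_shot_prompt(user_id, item_id, reviews):
    """Illustrative zero-shot template: user/item IDs plus concatenated
    reviews, asking for a 1-5 rating with no task-specific examples."""
    review_block = "\n".join(f"- {r}" for r in reviews)
    return (
        f"User {user_id} is considering item {item_id}.\n"
        f"Relevant reviews:\n{review_block}\n"
        "Predict the rating this user would give (1-5). Answer with a number."
    )

def few_shot_prompt(demos, user_id, item_id, reviews):
    """Few-shot variant: prepend solved demonstrations from distinct
    user-item pairs before the target prediction prompt."""
    demo_block = "\n\n".join(
        zero_shot_prompt(u, i, r) + f"\nRating: {y}" for u, i, r, y in demos
    )
    return demo_block + "\n\n" + zero_shot_prompt(user_id, item_id, reviews)
```

In the REVLoRA setting, by contrast, the prompt carries LLM-summarized user/item features and explicit rating sequences, and the generation layer is replaced with a regression head rather than parsing a textual answer.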

Experimental Results

Comparative Performance

Extensive experiments are conducted on Amazon 5-core datasets, assessing deep models (DeepCoNN, RGCL, DIRECT, LURP) and LLM-based methods (Llama 1B/3B, Qwen 0.5B/3B). The core findings are:

  • REVLoRA (LLM + LoRA fine-tuning) delivers the best MAE across most product categories.
  • Zero-shot and few-shot LLMs do not consistently outperform each other; few-shot prompting does not guarantee gains over zero-shot for the Llama series, though it is sometimes beneficial for Qwen models.
  • Scaling Llama from 1B to 3B consistently reduces MAE, whereas scaling Qwen yields inconsistent improvements.
  • Under MSE, RGCL and DIRECT sometimes outperform LLM-based solutions, indicating that graph reasoning retains an advantage in interaction-rich regimes.
  • The value of reviews is dataset- and sparsity-dependent; ratings alone already suffice for strong LLM-based performance.

    Figure 4: MAE on four Amazon datasets using Llama 3B (zero-shot/few-shot) and Llama 1B + LoRA (REVLoRA).
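For reference, the MAE and MSE metrics reported throughout can be computed directly from predicted and observed ratings; a plain-Python sketch:

```python
def mae(y_true, y_pred):
    """Mean absolute error: average |prediction - truth|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: squaring the residuals penalizes large
    rating errors far more heavily than MAE does."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because MSE emphasizes occasional large errors while MAE weights all errors linearly, the two metrics can rank the same models differently, which is consistent with RGCL and DIRECT faring better under MSE than under MAE.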

Ablation on Review Utility

Rigorous ablation isolates the added value of review texts relative to ratings alone. The principal observations are:

  • LLMs consistently perform worse using only review text compared to only ratings, even in large prompt contexts.
  • Combining ratings and reviews does not consistently yield improvements over the rating-only setting for LLMs, challenging assumptions about the inherent value of textual reviews when the model already observes explicit feedback.
  • Classical CNN-based models (DeepCoNN) and zero-shot LLMs are often distracted by the inclusion of review text, likely due to contradictory or uninformative reviews.

    Figure 5: MAE results on four datasets; dark bars indicate default settings (with review), light bars indicate review-ablated settings.

Review Reduction and Distortion

Systematic review reduction and distortion experiments yield the following insights:

  • Partially reducing the available reviews produces MAE between the full-review and no-review conditions, but for most LLM paradigms removing review text does not significantly degrade accuracy; sometimes it even slightly improves results.
  • Models show heterogeneous robustness to distorted reviews (random shuffling): DeepCoNN degrades substantially; RGCL and REVLoRA are highly robust.
  • LLM-based recommenders, especially REVLoRA, are remarkably insensitive to both review removal and textual noise, indicating stronger reliance on pre-trained priors and explicit rating data than on review semantics.

    Figure 6: MAE comparison under varying review data reduction levels across four datasets. Closer to center indicates lower MAE.

    Figure 7

    Figure 7: MAE comparison under varying review distortion levels across four datasets. Closer to center denotes lower MAE.
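The distortion setting can be approximated by shuffling review content at a chosen rate. The sketch below assumes word-level shuffling within each selected review, which destroys semantics while preserving length and vocabulary; the paper's exact distortion scheme may differ:

```python
import random

def distort_reviews(reviews, fraction, seed=0):
    """Randomly shuffle the words of a given fraction of reviews.
    Word-level shuffling is an assumption; only the selected fraction
    is touched, and a seed keeps the distortion reproducible."""
    rng = random.Random(seed)
    out = list(reviews)
    n_distort = int(len(out) * fraction)
    for idx in rng.sample(range(len(out)), n_distort):
        words = out[idx].split()
        rng.shuffle(words)
        out[idx] = " ".join(words)
    return out
```

Sweeping `fraction` from 0.0 to 1.0 reproduces the varying distortion levels along which Figure 7 compares models.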

Data Sparsity and Cold-Start

  • Under interaction sparsity ablation (k-core), RGCL, zero-shot/few-shot LLMs, and REVLoRA outperform DIRECT at low k.
  • As data becomes denser, RGCL eventually surpasses LLMs except for finetuned REVLoRA.
  • DeepCoNN is acutely sensitive to sparsity.
  • REVLoRA demonstrates consistent strength in highly sparse and cold-start regimes, with a particularly significant performance margin for users with limited historical interactions (f ≤ 5).

    Figure 8: MAE comparison under varying user-item interaction sparsity across four datasets.

    Figure 9

    Figure 9: Cold-start MAE comparison across four Amazon datasets under different user training frequencies.
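The cold-start analysis groups users by their number of training interactions (e.g., f ≤ 5). A hypothetical bucketing helper, with illustrative thresholds not taken from the paper:

```python
from collections import Counter, defaultdict

def bucket_users_by_frequency(train_interactions, thresholds=(5, 10, 20)):
    """Group users into cold-start buckets by training-interaction count.
    Thresholds are illustrative; the paper highlights users with f <= 5."""
    freq = Counter(u for u, _ in train_interactions)
    buckets = defaultdict(list)
    for user, f in freq.items():
        for t in thresholds:
            if f <= t:
                buckets[f"f<={t}"].append(user)
                break
        else:  # more interactions than the largest threshold
            buckets[f"f>{thresholds[-1]}"].append(user)
    return dict(buckets)
```

Reporting MAE per bucket, as in Figure 9, then isolates how each model degrades as user history shrinks.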

Implications and Theoretical Significance

The main empirical result is that the marginal utility of text reviews in LLM-based recommenders is strikingly limited: their removal or distortion rarely leads to accuracy drops, and in some architectures even triggers marginal improvements. This undermines the assumption that the semantic richness of reviews is vital once high-capacity, strongly pre-trained sequence models are deployed and given explicit feedback signals. Review text may contain informationally redundant, misaligned, or contradictory content (as previously reported in DIRECT [wu2024direct]), and LLMs appear to rely on priors plus rating distributions, making weak use of unstructured textual input unless explicit inductive bias or model adaptation is introduced. Notably, in cold-start and data-sparse regimes, fine-tuned LLMs (REVLoRA) maintain or even enhance performance, supporting their use for new-user adaptation.

From a systems perspective, this suggests that review-text summarization or dimensionality reduction, rather than naive concatenation, can yield computational efficiency without sacrificing predictive performance, and may occasionally improve it. It also prompts re-examination of the architectural inductive bias in LLM-based recommendation systems: unless textual data is processed with semantic alignment and aspect-oriented supervision, the model may default to ignoring, or even being distracted by, unstructured reviews.

Practical Recommendations and Future Directions

  • Operators of LLM-powered recommenders should reconsider the resource allocation to end-to-end review ingestion/processing and instead focus on rating curation, context-aware user history summarization, or enhanced meta-data extraction.
  • There is a demand for better prompt or architecture-level methods forcing LLMs to attend to and reason from high-signal review text, potentially combining aspect extraction, review usefulness filtering, or sequential attention.
  • New benchmarks should evaluate the effect of review quality and contradiction rates explicitly, correlating model performance with human notions of review informativeness, trustworthiness, and alignment with ratings.
  • There is significant unexplored space in sequential review-aware rating prediction, which may exhibit different behavior than static rating prediction under causal constraints.

Conclusion

This work argues that, in the era of LLM-based recommendation systems, textual user reviews deliver at best minimal marginal utility for rating prediction, a claim validated under extensive ablation and robustness analyses. These results recalibrate theoretical expectations about the function and value of user-generated text, emphasizing that conventional faith in the utility of textual reviews does not automatically extend to modern high-capacity neural architectures. Future research should seek architectures and training protocols that more selectively and robustly extract relevant signal from unstructured reviews, rather than assuming their utility as input for LLM-dominated recommenders.
