- The paper identifies that chain-of-thought reasoning induces a dominant general subspace prior that misaligns SID-grounded signals, leading to performance drops.
- The paper introduces a training-free, inference-time alignment framework that uses reasoning chain compression and bias-subtracted contrastive decoding to recalibrate recommendations.
- The paper validates its approach with strong empirical gains, achieving up to a 94.9% improvement in Recall@1 and highlighting the need for explicit subspace alignment.
Introduction
The integration of Chain-of-Thought (CoT) reasoning into foundation models with Semantic ID (SID) generation, typified by architectures like OpenOneRec, has been proposed to endow recommender systems with enhanced reasoning and explainability. However, empirical evidence reveals consistent and sometimes severe performance degradation when explicit multi-step reasoning is enabled in these systems, a phenomenon antithetical to expectations drawn from general-purpose LLMs. This work delivers a granular diagnosis of this “reasoning shift,” positing that CoT augments a General Subspace prior—an ungrounded textual inertia—at the expense of SID-grounded signals. The authors introduce a training-free, inference-time subspace alignment framework that systematically rectifies this misalignment, recovering and enhancing the performance of reasoning-enabled recommendation.
Analyzing the Reasoning Shift: Subspace Misalignment and Textual Inertia
The architecture of SID-based generative recommenders jointly embeds tokens from two only partially overlapping subspaces: one comprising SIDs (discrete item representations), the other comprising general text such as instructions, explanations, and rationales. Through empirical analysis, including principal component analysis of the embedding distributions, the paper demonstrates that these subspaces remain semantically distinct despite partial alignment induced by pretraining.
A key technical finding is that reasoning chains, generated largely in the textual subspace, induce a dominant “General Subspace prior” during decoding. The model consequently overweights linguistic fluency and verbose logic at the expense of evidence grounded in actual user-item interaction histories. Writing S for log-probability, the decomposition S(y∣x,c) = CPMI(y;x∣c) + S(y∣c) splits each candidate's score into a conditional pointwise mutual information (CPMI) term, which measures the evidence that the interaction history x contributes given the reasoning context c, and a context-only prior S(y∣c); a dominant General Subspace prior inflates the second term and can suppress the SID-consistent signal carried by the first.
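This decomposition can be sketched in code; the scoring function and the toy numbers below are illustrative, not values from the paper:

```python
def decompose_score(logp_full: float, logp_prior: float) -> tuple[float, float]:
    """Split a candidate's log-probability into grounded evidence and prior.

    logp_full  = log p(y | x, c): score given history x and reasoning chain c.
    logp_prior = log p(y | c):    score given the reasoning chain alone.
    Returns (cpmi, prior) with logp_full == cpmi + prior.
    """
    cpmi = logp_full - logp_prior  # CPMI(y; x | c): evidence contributed by x
    return cpmi, logp_prior

# Toy numbers (hypothetical): a fluent but ungrounded candidate can win on
# the prior term even though its CPMI (history-grounded evidence) is weaker.
cpmi_a, prior_a = decompose_score(-2.0, -1.2)   # fluent candidate: CPMI = -0.8
cpmi_b, prior_b = decompose_score(-2.3, -4.0)   # grounded candidate: CPMI = 1.7
```

Under a dominant General Subspace prior, ranking by the full score favors candidate A even though candidate B carries far more history-grounded evidence.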
Attention analysis further reveals that enabling CoT sharply increases the ratio of model attention on general text versus SID tokens, quantified via a Space Dominance Index (SDI) that rises under “thinking mode.” Longer reasoning chains also dilute the attention available per text token, exacerbating the loss of ID-specific evidence and producing significant recall and NDCG drops across benchmarks, a core numerical finding of the paper.
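A minimal sketch of such a dominance ratio, assuming the SDI is roughly the attention mass on general-text tokens divided by that on SID tokens (the paper's exact definition may differ):

```python
def space_dominance_index(attn_weights, is_general_token):
    """Ratio of attention mass on general-text tokens vs. SID tokens.

    attn_weights: per-source-token attention mass from the decoded position.
    is_general_token: boolean mask, True for general-text tokens, False for SIDs.
    One plausible form of the SDI; the paper's definition may differ.
    """
    general = sum(w for w, g in zip(attn_weights, is_general_token) if g)
    sid = sum(w for w, g in zip(attn_weights, is_general_token) if not g)
    return general / max(sid, 1e-9)

# Toy example: a long reasoning chain pulls mass toward general text,
# raising the SDI relative to a history-dominated context.
attn = [0.30, 0.25, 0.20, 0.15, 0.10]
mask = [True, True, True, False, False]   # first three tokens are general text
sdi = space_dominance_index(attn, mask)   # 0.75 / 0.25 = 3.0
```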
Training-Free Inference-Time Subspace Alignment
To mitigate the reasoning-induced subspace misalignment without retraining foundation models, the authors propose a two-component inference-time recalibration scheme:
- Reasoning Chain Compression: Instead of conditioning generation on free-form, verbose CoT, the model uses a deterministic transformation to produce a strictly templated, compressed summary of the core inferred user preferences. This projection onto a low-entropy, preference-focused control variable eliminates much of the textual inertia, maintaining only salient preference cues.
- Bias-Subtracted Contrastive Decoding: To explicitly correct for reasoning-induced distributional drift, contrastive inference is performed over three contexts: (E) history plus compressed CoT, (A) CoT only, and (B) history only. Candidate item scores are normalized and recalibrated with a bias-subtracted formula that penalizes only the CoT-induced excess beyond the evidence-supported SID likelihood. This targeted correction avoids naively suppressing genuinely grounded intermediate deductions.
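The recalibration in the second component can be sketched as follows. Subtracting only the positive part of the CoT-only excess, weighted by an assumed coefficient `alpha`, is one plausible instantiation of the description above, not the paper's exact equation:

```python
def bias_subtracted_score(s_E: float, s_A: float, s_B: float,
                          alpha: float = 1.0) -> float:
    """Recalibrate a candidate item's score at decoding time.

    s_E: log-score under history + compressed CoT (full evidence context).
    s_A: log-score under CoT only (captures the General Subspace prior).
    s_B: log-score under history only (evidence-supported SID likelihood).
    Only the CoT-induced *excess* over the history-grounded score is
    subtracted, so genuinely grounded deductions are left untouched.
    The normalization and alpha weighting are assumptions.
    """
    excess = max(0.0, s_A - s_B)  # reasoning-induced drift beyond evidence
    return s_E - alpha * excess

# A candidate inflated mainly by the CoT context is penalized...
penalized = bias_subtracted_score(-1.0, -0.5, -2.0)   # -1.0 - 1.5 = -2.5
# ...while one whose CoT score stays below its history score is not.
untouched = bias_subtracted_score(-1.0, -3.0, -2.0)   # -1.0 - 0.0 = -1.0
```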
This entire procedure operates at inference, obviating the need for fine-tuning or retraining, and leverages any small LLM as the chain compressor for maximal practicality.
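As a rough illustration of the compression step, the sketch below substitutes a keyword-based extractor for the small-LLM compressor; the regex, template, and field name are invented for illustration only:

```python
import re

def compress_chain(reasoning: str, max_prefs: int = 3) -> str:
    """Project a free-form reasoning chain onto a fixed, low-entropy template.

    A minimal stand-in for the paper's compressor (which may use a small LLM):
    lines flagged as preferences are extracted and emitted in a strict
    template, discarding verbose connective text.
    """
    prefs = re.findall(r"(?:prefers?|likes?|interested in)\s+([^.;\n]+)",
                       reasoning, flags=re.IGNORECASE)
    prefs = [p.strip() for p in prefs][:max_prefs]
    return "PREFS: " + "; ".join(prefs) if prefs else "PREFS: none"

chain = ("The user likes sci-fi thrillers. They also mention they are "
         "interested in budget headphones. Long digression about shipping...")
compressed = compress_chain(chain)
# compressed == "PREFS: sci-fi thrillers; budget headphones"
```

Conditioning decoding on `compressed` rather than `chain` keeps the salient preference cues while discarding the verbose text that feeds the General Subspace prior.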
Experimental Results
Comprehensive evaluations on OpenOneRec and Qwen (1.7B/8B) backbones, across multiple datasets, show strong quantitative recovery and improvement over both non-thinking and naive thinking modes. For instance, on challenging domains such as Product, standard CoT-enabled models show clear recall and NDCG drops, underscoring the reality and severity of reasoning-induced drift. The proposed alignment method not only corrects this degradation but, in multiple settings, outperforms all baselines (with up to 94.9% improvement on Recall@1 in certain settings).
Further, scaling the model (from 1.7B to 8B parameters) increases non-thinking baseline performance but does not resolve the instability under CoT reasoning; alignment remains indispensable across scales, suggesting that mere capacity increase amplifies, rather than mitigates, subspace dominance.
Implications and Future Directions
This work has significant implications for the deployment of LLM-based recommender systems, especially where explainability and reasoning are operational constraints. It foregrounds the risk of naive CoT integration in SID-based recommenders: unless subspace alignment is explicitly managed, models may systematically sacrifice predictive grounding for linguistic plausibility. The training-free, inference-time realignment paradigm offers a practical mitigation that is highly compatible with evolving LLM architectures.
Theoretically, these results motivate further research into representation-space geometry in multi-task, multi-modal LLMs, particularly cross-subspace transfer and dominance. More robust alignment mechanisms, including structured reasoning, hierarchical control variables, or explicit subspace disentanglement during training, are promising next steps. Extending these insights to other domains with competing latent subspaces (e.g., retrieval-augmented generation, task-oriented dialog) could yield broader architectural prescriptions.
Conclusion
This paper provides a rigorous diagnosis of why explicit reasoning can impair, rather than enhance, foundation recommender models: it identifies subspace misalignment and textual inertia as the root causes. The authors deliver a practical, training-free inference-time subspace alignment framework that enables models to leverage the benefits of reasoning without losing SID-grounded accuracy. The empirical results are robust and suggest that careful alignment, not naive CoT integration, is key to unlocking faithful, interpretable, and high-performance foundation model recommendations (2602.16587).