- The paper introduces Flower, a GFlowNet-based approach that replaces SFT to promote diversity and counter popularity bias in LLM recommenders.
- It details a prefix tree formulation in which state flows are computed recursively from item-level rewards, yielding token-level rewards that provide fine-grained process supervision via the SubTB objective.
- Experiments on Amazon datasets show that Flower achieves superior fairness and diversity while maintaining competitive accuracy compared to traditional SFT models.
This paper introduces Flower (Flow-guided fine-tuning recommender), a novel fine-tuning paradigm for LLM-based recommendation systems designed to overcome the limitations of traditional Supervised Fine-Tuning (SFT). The authors identify two primary issues with SFT in recommendation tasks:
- Limited Diversity: SFT's likelihood maximization objective encourages conventional, high-probability outputs, leading to homogeneous recommendations.
- Amplification of Popularity Bias: SFT reinforces biases present in training data, causing over-recommendation of popular items and reducing fairness towards niche items.
To address these issues, Flower replaces SFT with a framework based on Generative Flow Networks (GFlowNets). GFlowNets are designed to sample items with probabilities proportional to a given reward function, inherently promoting diversity.
Key Concepts and Implementation of Flower:
- Prefix Tree Formulation: The set of all possible item titles in the dataset is conceptualized as a prefix tree. Each path from the root to a leaf node represents a unique item title. States in the GFlowNet correspond to prefixes (partial token sequences), and actions correspond to appending the next valid token.
- State Flow Calculation: An item-level (outcome) reward $R_o(y)$ is assigned to each item $y$, typically based on its frequency in the training data so that sampling matches the empirical distribution. The flow $F(s)$ of any state (prefix) $s$ is the sum of the flows of its child states, computed recursively from the terminal states (complete items), where $F(y) = R_o(y)$. This calculation exploits the tree structure and requires no additional learning.
- Token-level Process Reward: Using the calculated state flows, a token-level (process) reward $R_p(s_t, a_t)$ is derived for the transition from state $s_t$ (prefix $y_{\le t}$) to state $s_{t+1}$ (prefix $y_{\le t+1}$) obtained by generating token $y_{t+1}$. This reward is defined as $R_p(y_{\le t}, y_{t+1}) = F(s_{t+1}) / F(s_t)$.
- Process Supervision: Flower utilizes the Subtrajectory Balance (SubTB) objective from GFlowNets. This objective aligns the LLM's predicted next-token distribution, $\pi_\theta(y_{t+1} \mid x, y_{\le t})$, with the derived token-level reward distribution $R_p(y_{\le t}, y_{t+1})$, providing fine-grained, process-level supervision during generation.
- Personalization: To incorporate user preferences, the token-level reward can be reshaped with a preference score $p_{ui}$ (the estimated likelihood that user $u$ likes item $i$) obtained from an auxiliary model (e.g., SASRec). Two variants are proposed: dividing the log reward by $p_{ui}$, i.e. $(\log R_p)/p_{ui}$, or scaling the log reward by $p_{ui}$, i.e. $p_{ui} \log R_p$.
- Combined Fine-tuning Loss: The final loss function for Flower combines the standard SFT cross-entropy loss with the GFlowNet SubTB objective, weighted by a hyperparameter λ:
$$\mathcal{L}_{\text{Flower}} = \mathcal{L}_{\text{SFT}} + \lambda \sum \mathcal{L}_R(\tau_{m,n})$$
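The prefix-tree machinery above can be sketched on a toy catalog. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization and frequency-based item rewards, uses a squared log-ratio as the finest-granularity per-token SubTB term, and all names (`build_flows`, `token_reward`, `subtb_token_loss`) are illustrative.

```python
import math
from collections import Counter, defaultdict

EOS = "</s>"  # terminal marker so every item title ends a trajectory


def build_flows(item_counts):
    """Compute state flows F(s) for every prefix s of every item title.

    F(terminal) = R_o(y) (here: item frequency); F(s) is the sum of the
    flows of s's children. On a prefix tree this is one accumulation
    pass with no learning: each item's reward is added to every prefix
    on its root-to-leaf path.
    """
    flows = defaultdict(float)
    for title, count in item_counts.items():
        reward = float(count)        # R_o(y): empirical frequency
        prefix = ()
        flows[prefix] += reward
        for tok in title.split() + [EOS]:
            prefix = prefix + (tok,)
            flows[prefix] += reward
    return flows


def token_reward(flows, prefix, token):
    """Process reward R_p(y<=t, y_{t+1}) = F(s_{t+1}) / F(s_t)."""
    return flows[prefix + (token,)] / flows[prefix]


def subtb_token_loss(log_pi, flows, prefix, token):
    """Per-token SubTB-style term (sketch): squared gap between the
    policy's log-prob and the log token-level reward."""
    return (log_pi - math.log(token_reward(flows, prefix, token))) ** 2


catalog = Counter({
    "star wars": 4,   # popular item
    "star trek": 1,   # niche item sharing the prefix "star"
    "alien": 1,
})
flows = build_flows(catalog)

# From the root, R_p over valid first tokens is proportional to the
# reward mass of the items under each branch.
print(token_reward(flows, (), "star"))        # (4+1)/6
print(token_reward(flows, (), "alien"))       # 1/6
print(token_reward(flows, ("star",), "wars")) # 4/5 = 0.8

# The per-token loss vanishes when the policy matches R_p exactly.
print(subtb_token_loss(math.log(5 / 6), flows, (), "star"))  # 0.0
```

Note how the flow ratios at each branch form a proper next-token distribution, which is exactly what lets them supervise the LLM's softmax token by token.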
Experiments and Results:
- Experiments were conducted on Amazon datasets (CDs, Video Games, Movies) for next-item recommendation.
- Baselines included SASRec, BIGRec (SFT), Temp, D3, and IFairLRS.
- Metrics covered Accuracy (NDCG, HR), Fairness (DGU, MGU), and Diversity (Entropy, TTR).
- Distribution Fitting: Flower demonstrated significantly better alignment with target distributions compared to SFT, DPO, and PPO in a history-free setting, quantitatively measured by KL and JS divergence.
- Recommendation Performance: Flower achieved the best fairness and diversity across datasets while maintaining competitive or improved accuracy compared to SFT-based methods. It outperformed fairness-focused methods like IFairLRS and balanced trade-offs better than methods like Temp or D3.
- Reference Policy: Using Flower as the base model for downstream preference alignment (PPO, DPO variants like S-DPO, RosePO, DMPO) generally yielded better results than using an SFT-based model (BIGRec) as the starting point.
- Ablation Studies: Finer granularity in token-level supervision (the SubTB loss computed per token) led to better results. The personalized reward formulation $p_{ui} \log R_p$ showed the best overall balance, and the hyperparameter λ effectively trades off the SFT and flow-guided objectives.
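The distribution-fitting comparison above rests on KL and JS divergence between the model's item distribution and the target. A minimal computation over aligned discrete distributions, with made-up numbers standing in for empirical frequencies:

```python
import math


def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


target = [0.5, 0.3, 0.2]  # e.g. target item distribution
model = [0.7, 0.2, 0.1]   # e.g. frequencies of sampled recommendations
print(kl(target, model), js(target, model))
```

A lower value on either metric indicates the fine-tuned policy samples items closer to the target distribution, which is the property the history-free experiment measures.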
Conclusion:
Flower presents a novel GFlowNet-based fine-tuning approach that effectively mitigates popularity bias and enhances diversity in LLM recommenders by providing process-level supervision through token-level rewards. It demonstrates superior performance in fairness and diversity compared to SFT, while maintaining accuracy and serving as a better foundation for further preference alignment.