
Process-Supervised LLM Recommenders via Flow-guided Tuning

Published 10 Mar 2025 in cs.IR | arXiv:2503.07377v3

Abstract: While LLMs are increasingly adapted for recommendation systems via supervised fine-tuning (SFT), this approach amplifies popularity bias due to its likelihood maximization objective, compromising recommendation diversity and fairness. To address this, we present Flow-guided fine-tuning recommender (Flower), which replaces SFT with a Generative Flow Network (GFlowNet) framework that enacts process supervision through token-level reward propagation. Flower's key innovation lies in decomposing item-level rewards into constituent token rewards, enabling direct alignment between token generation probabilities and their reward signals. This mechanism achieves three critical advancements: (1) popularity bias mitigation and fairness enhancement through empirical distribution matching, (2) preservation of diversity through GFlowNet's proportional sampling, and (3) flexible integration of personalized preferences via adaptable token rewards. Experiments demonstrate Flower's superior distribution-fitting capability and its significant advantages over traditional SFT in terms of accuracy, fairness, and diversity, highlighting its potential to improve LLM-based recommendation systems. The implementation is available via https://github.com/MrPeach0301/Flower

Summary

  • The paper introduces Flower, a GFlowNet-based approach that replaces SFT to promote diversity and counter popularity bias in LLM recommenders.
  • It details a prefix tree formulation with recursive state flow and token-level rewards derived via the SubTB objective for fine-grained process supervision.
  • Experiments on Amazon datasets show that Flower achieves superior fairness and diversity while maintaining competitive accuracy compared to traditional SFT models.

This paper introduces Flower (Flow-guided fine-tuning recommender), a novel fine-tuning paradigm for LLM-based recommendation systems designed to overcome the limitations of traditional Supervised Fine-Tuning (SFT). The authors identify two primary issues with SFT in recommendation tasks:

  1. Limited Diversity: SFT's likelihood maximization objective encourages conventional, high-probability outputs, leading to homogeneous recommendations.
  2. Amplification of Popularity Bias: SFT reinforces biases present in training data, causing over-recommendation of popular items and reducing fairness towards niche items.

To address these issues, Flower replaces SFT with a framework based on Generative Flow Networks (GFlowNets). GFlowNets are designed to sample items with probabilities proportional to a given reward function, inherently promoting diversity.
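As a minimal illustration of this distinction (a toy sketch, not the paper's implementation), contrast a likelihood-maximizing policy, which always emits the single most probable item, with GFlowNet-style reward-proportional sampling; the item names and reward values below are made up:

```python
import random

# Hypothetical item-level rewards, e.g. empirical frequencies in training data.
rewards = {"popular_album": 8.0, "indie_album": 1.5, "rare_album": 0.5}

def argmax_policy(rewards):
    """Likelihood maximization (SFT-like): always pick the highest-reward item."""
    return max(rewards, key=rewards.get)

def proportional_policy(rewards, rng):
    """GFlowNet-style sampling: pick items with probability proportional to reward."""
    items, weights = zip(*rewards.items())
    return rng.choices(items, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [proportional_policy(rewards, rng) for _ in range(10_000)]

print(argmax_policy(rewards))                       # always "popular_album"
# Empirical frequency approaches reward / total reward = 1.5 / 10.0:
print(samples.count("indie_album") / len(samples))  # ≈ 0.15
```

The argmax policy never surfaces the niche items, while proportional sampling recommends them at a rate matching their reward share, which is the diversity property Flower exploits.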

Key Concepts and Implementation of Flower:

  1. Prefix Tree Formulation: The set of all possible item titles in the dataset is conceptualized as a prefix tree. Each path from the root to a leaf node represents a unique item title. States in the GFlowNet correspond to prefixes (partial token sequences), and actions correspond to appending the next valid token.
  2. State Flow Calculation: An item-level (outcome) reward $R_o(y)$ is assigned to each item $y$, typically based on its frequency in the training data to match the empirical distribution. The flow $F(s)$ for any state (prefix) $s$ is calculated recursively as the sum of the flows of its child states, starting from the terminal states (complete items), where $F(y) = R_o(y)$. This calculation leverages the tree structure and requires no additional learning.
  3. Token-level Process Reward: Using the calculated state flows, a token-level (process) reward $R_p(s_t, a_t)$ is derived for transitioning from state $s_t$ (prefix $y_{\le t}$) to state $s_{t+1}$ (prefix $y_{\le t+1}$) by generating token $y_{t+1}$. This reward is defined as $R_p(y_{\le t}, y_{t+1}) = F(s_{t+1}) / F(s_t)$.
  4. Process Supervision: Flower utilizes the Subtrajectory Balance (SubTB) objective from GFlowNets. This objective aligns the LLM's predicted next-token distribution $\pi_\theta(y_{t+1} \mid \mathbf{x}, y_{\le t})$ with the derived token-level reward $R_p(y_{\le t}, y_{t+1})$, providing fine-grained, process-level supervision during generation.
  5. Personalization: To incorporate user preferences, the token-level reward can be modified using a preference score $p_{ui}$ (the likelihood of user $u$ liking item $i$) obtained from an auxiliary model (e.g., SASRec). Two variants are proposed: dividing the log reward by $p_{ui}$, or multiplying the reward by $p_{ui}$ before taking the logarithm.
  6. Combined Fine-tuning Loss: The final loss function for Flower combines the standard SFT cross-entropy loss with the GFlowNet SubTB objective, weighted by a hyperparameter $\lambda$:

     $$\mathcal{L}_{\rm Flower} = \mathcal{L}_{\rm SFT} + \lambda \sum L_{R}(\tau_{m,n})$$
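Because the flow of a prefix in a tree is simply the summed outcome reward of every item sharing that prefix, steps 1-3 require no learning at all. The following sketch illustrates this under toy assumptions (tokenized titles and frequency-based rewards are invented; this is not the authors' code):

```python
def build_flows(item_rewards):
    """Compute F(s) for every prefix s of every tokenized item title.

    For terminal states F(y) = R_o(y); for any internal prefix, F(s) is the
    sum of its children's flows, which in a prefix tree equals the summed
    reward of all items that start with that prefix.
    """
    flows = {}
    for tokens, reward in item_rewards.items():
        for t in range(len(tokens) + 1):      # every prefix, incl. the root ()
            prefix = tokens[:t]
            flows[prefix] = flows.get(prefix, 0.0) + reward
    return flows

def token_reward(flows, prefix, next_token):
    """Process reward R_p(y_<=t, y_{t+1}) = F(s_{t+1}) / F(s_t)."""
    return flows[prefix + (next_token,)] / flows[prefix]

# Toy catalog: tokenized titles with frequency-based outcome rewards R_o(y).
item_rewards = {
    ("dark", "side"): 6.0,
    ("dark", "star"): 2.0,
    ("blue", "train"): 2.0,
}
flows = build_flows(item_rewards)
print(flows[()])                               # 10.0: root flow = total reward
print(flows[("dark",)])                        # 8.0
print(token_reward(flows, (), "dark"))         # 0.8
print(token_reward(flows, ("dark",), "side"))  # 0.75
```

Note that the token rewards out of any state sum to 1, so they form the next-token distribution that the SubTB objective pushes $\pi_\theta$ toward.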

Experiments and Results:

  • Experiments were conducted on Amazon datasets (CDs, Video Games, Movies) for next-item recommendation.
  • Baselines included SASRec, BIGRec (SFT), Temp, D3, and IFairLRS.
  • Metrics covered Accuracy (NDCG, HR), Fairness (DGU, MGU), and Diversity (Entropy, TTR).
  • Distribution Fitting: Flower demonstrated significantly better alignment with target distributions compared to SFT, DPO, and PPO in a history-free setting, quantitatively measured by KL and JS divergence.
  • Recommendation Performance: Flower achieved the best fairness and diversity across datasets while maintaining competitive or improved accuracy compared to SFT-based methods. It outperformed fairness-focused methods like IFairLRS and balanced trade-offs better than methods like Temp or D3.
  • Reference Policy: Using Flower as the base model for downstream preference alignment (PPO, DPO variants like S-DPO, RosePO, DMPO) generally yielded better results than using an SFT-based model (BIGRec) as the starting point.
  • Ablation Studies: Finer granularity in token-level supervision (SubTB loss calculated per token) led to better results. The reward formulation incorporating personalization ($\frac{\log R_p}{p_{ui}}$) showed the best overall balance. The hyperparameter $\lambda$ effectively balances the SFT and flow-guided objectives.
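The two personalization variants compared in the ablation can be written out directly; the $R_p$ and $p_{ui}$ values below are illustrative, and the function names are ours, not the paper's:

```python
import math

def variant_divide(r_p, p_ui):
    """Divide the log reward by the preference score: log(R_p) / p_ui."""
    return math.log(r_p) / p_ui

def variant_multiply(r_p, p_ui):
    """Multiply the reward by the preference score before the log: log(R_p * p_ui)."""
    return math.log(r_p * p_ui)

r_p, p_ui = 0.8, 0.5  # toy token reward and SASRec-style preference score
print(variant_divide(r_p, p_ui))    # log(0.8)/0.5 ≈ -0.446
print(variant_multiply(r_p, p_ui))  # log(0.4)     ≈ -0.916
```

Since $\log R_p$ is negative for $R_p < 1$, dividing by a larger $p_{ui}$ shrinks the penalty on tokens leading to items the user is predicted to like; this is the variant the ablation found best balanced.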

Conclusion:

Flower presents a novel GFlowNet-based fine-tuning approach that effectively mitigates popularity bias and enhances diversity in LLM recommenders by providing process-level supervision through token-level rewards. It demonstrates superior performance in fairness and diversity compared to SFT, while maintaining accuracy and serving as a better foundation for further preference alignment.
