- The paper introduces Flower, a GFlowNet-based approach that replaces SFT to promote diversity and counter popularity bias in LLM recommenders.
- It details a prefix tree formulation in which state flows are computed recursively from item-level rewards, yielding token-level rewards that provide fine-grained process supervision via the SubTB objective.
- Experiments on Amazon datasets show that Flower achieves superior fairness and diversity while maintaining competitive accuracy compared to traditional SFT models.
This paper introduces Flower (Flow-guided fine-tuning recommender), a novel fine-tuning paradigm for LLM-based recommendation systems designed to overcome the limitations of traditional Supervised Fine-Tuning (SFT). The authors identify two primary issues with SFT in recommendation tasks:
- Limited Diversity: SFT's likelihood maximization objective encourages conventional, high-probability outputs, leading to homogeneous recommendations.
- Amplification of Popularity Bias: SFT reinforces biases present in training data, causing over-recommendation of popular items and reducing fairness towards niche items.
To address these issues, Flower replaces SFT with a framework based on Generative Flow Networks (GFlowNets). GFlowNets are designed to sample items with probabilities proportional to a given reward function, inherently promoting diversity.
Key Concepts and Implementation of Flower:
- Prefix Tree Formulation: The set of all possible item titles in the dataset is conceptualized as a prefix tree. Each path from the root to a leaf node represents a unique item title. States in the GFlowNet correspond to prefixes (partial token sequences), and actions correspond to appending the next valid token.
- State Flow Calculation: An item-level (outcome) reward $R_o(y)$ is assigned to each item $y$, typically based on its frequency in the training data so that sampling matches the empirical distribution. The flow $F(s)$ of any state (prefix) $s$ is the sum of the flows of its child states, computed recursively from the terminal states (complete items), where $F(y) = R_o(y)$. This calculation exploits the tree structure and requires no additional learning.
- Token-level Process Reward: Using the calculated state flows, a token-level (process) reward $R_p(s_t, a_t)$ is derived for the transition from state $s_t$ (prefix $y_{\le t}$) to state $s_{t+1}$ (prefix $y_{\le t+1}$) obtained by generating token $y_{t+1}$. This reward is defined as $R_p(y_{\le t}, y_{t+1}) = F(s_{t+1}) / F(s_t)$.
- Process Supervision: Flower utilizes the Subtrajectory Balance (SubTB) objective from GFlowNets. This objective aligns the LLM's predicted next-token distribution, $\pi_\theta(y_{t+1} \mid x, y_{\le t})$, with the derived token-level reward distribution $R_p(y_{\le t}, y_{t+1})$, providing fine-grained, process-level supervision during generation.
- Personalization: To incorporate user preferences, the token-level reward can be reshaped with a preference score $p_{ui}$ (the estimated likelihood that user $u$ likes item $i$) obtained from an auxiliary model (e.g., SASRec). Two variants are proposed: dividing the log reward by $p_{ui}$, i.e. $(\log R_p)/p_{ui}$, or scaling the log reward by $p_{ui}$, i.e. $p_{ui} \log R_p$.
- Combined Fine-tuning Loss: The final loss function for Flower combines the standard SFT cross-entropy loss with the GFlowNet SubTB objective, weighted by a hyperparameter λ:
$$\mathcal{L}_{\text{Flower}} = \mathcal{L}_{\text{SFT}} + \lambda \sum \mathcal{L}_R(\tau_{m,n})$$
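The prefix-tree machinery above can be sketched on a toy catalog. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization and frequency-based item rewards, uses a squared log-ratio as the finest-granularity per-token SubTB term, and all names (`build_flows`, `token_reward`, `subtb_token_loss`) are illustrative.

```python
import math
from collections import Counter, defaultdict

EOS = "</s>"  # terminal marker so every item title ends a trajectory


def build_flows(item_counts):
    """Compute state flows F(s) for every prefix s of every item title.

    F(terminal) = R_o(y) (here: item frequency); F(s) is the sum of the
    flows of s's children. On a prefix tree this is one accumulation
    pass with no learning: each item's reward is added to every prefix
    on its root-to-leaf path.
    """
    flows = defaultdict(float)
    for title, count in item_counts.items():
        reward = float(count)        # R_o(y): empirical frequency
        prefix = ()
        flows[prefix] += reward
        for tok in title.split() + [EOS]:
            prefix = prefix + (tok,)
            flows[prefix] += reward
    return flows


def token_reward(flows, prefix, token):
    """Process reward R_p(y<=t, y_{t+1}) = F(s_{t+1}) / F(s_t)."""
    return flows[prefix + (token,)] / flows[prefix]


def subtb_token_loss(log_pi, flows, prefix, token):
    """Per-token SubTB-style term (sketch): squared gap between the
    policy's log-prob and the log token-level reward."""
    return (log_pi - math.log(token_reward(flows, prefix, token))) ** 2


catalog = Counter({
    "star wars": 4,   # popular item
    "star trek": 1,   # niche item sharing the prefix "star"
    "alien": 1,
})
flows = build_flows(catalog)

# From the root, R_p over valid first tokens is proportional to the
# reward mass of the items under each branch.
print(token_reward(flows, (), "star"))        # (4+1)/6
print(token_reward(flows, (), "alien"))       # 1/6
print(token_reward(flows, ("star",), "wars")) # 4/5 = 0.8

# The per-token loss vanishes when the policy matches R_p exactly.
print(subtb_token_loss(math.log(5 / 6), flows, (), "star"))  # 0.0
```

Note how the flow ratios at each branch form a proper next-token distribution, which is exactly what lets them supervise the LLM's softmax token by token.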
Experiments and Results:
- Experiments were conducted on Amazon datasets (CDs, Video Games, Movies) for next-item recommendation.
- Baselines included SASRec, BIGRec (SFT), Temp, D3, and IFairLRS.
- Metrics covered Accuracy (NDCG, HR), Fairness (DGU, MGU), and Diversity (Entropy, TTR).
- Distribution Fitting: Flower demonstrated significantly better alignment with target distributions compared to SFT, DPO, and PPO in a history-free setting, quantitatively measured by KL and JS divergence.
- Recommendation Performance: Flower achieved the best fairness and diversity across datasets while maintaining competitive or improved accuracy compared to SFT-based methods. It outperformed fairness-focused methods like IFairLRS and balanced trade-offs better than methods like Temp or D3.
- Reference Policy: Using Flower as the base model for downstream preference alignment (PPO, DPO variants like S-DPO, RosePO, DMPO) generally yielded better results than using an SFT-based model (BIGRec) as the starting point.
- Ablation Studies: Finer granularity in token-level supervision (the SubTB loss computed per token) led to better results. The personalized reward formulation $p_{ui} \log R_p$ showed the best overall balance, and the hyperparameter λ effectively trades off the SFT and flow-guided objectives.
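The distribution-fitting comparison above rests on KL and JS divergence between the model's item distribution and the target. A minimal computation over aligned discrete distributions, with made-up numbers standing in for empirical frequencies:

```python
import math


def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


target = [0.5, 0.3, 0.2]  # e.g. target item distribution
model = [0.7, 0.2, 0.1]   # e.g. frequencies of sampled recommendations
print(kl(target, model), js(target, model))
```

A lower value on either metric indicates the fine-tuned policy samples items closer to the target distribution, which is the property the history-free experiment measures.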
Conclusion:
Flower presents a novel GFlowNet-based fine-tuning approach that effectively mitigates popularity bias and enhances diversity in LLM recommenders by providing process-level supervision through token-level rewards. It demonstrates superior performance in fairness and diversity compared to SFT, while maintaining accuracy and serving as a better foundation for further preference alignment.