UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution
Abstract: In recent years, there has been a growing interest in utilizing reinforcement learning (RL) to optimize long-term rewards in recommender systems. Since industrial recommender systems are typically designed as multi-stage systems, RL methods with a single agent face challenges when optimizing multiple stages simultaneously. The reason is that different stages have different observation spaces, and thus cannot be modeled by a single agent. To address this issue, we propose a novel UNidirectional-EXecution-based multi-agent Reinforcement Learning (UNEX-RL) framework to reinforce the long-term rewards in multi-stage recommender systems. We show that the unidirectional execution is a key feature of multi-stage recommender systems, bringing new challenges to the applications of multi-agent reinforcement learning (MARL), namely the observation dependency and the cascading effect. To tackle these challenges, we provide a cascading information chain (CIC) method to separate the independent observations from action-dependent observations and use CIC to train UNEX-RL effectively. We also discuss practical variance reduction techniques for UNEX-RL. Finally, we show the effectiveness of UNEX-RL on both public datasets and an online recommender system with over 100 million users. Specifically, UNEX-RL reveals a 0.558% increase in users' usage time compared with single-agent RL algorithms in online A/B experiments, highlighting the effectiveness of UNEX-RL in industrial recommender systems.
- Reinforcing User Retention in a Billion Scale Short Video Recommender System. arXiv preprint arXiv:2302.01724.
- Two-Stage Constrained Actor-Critic for Short Video Recommendation. In Proceedings of the ACM Web Conference 2023, 865–875.
- Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 456–464.
- End-to-end user behavior retrieval in click-through rateprediction model. arXiv preprint arXiv:2108.04468.
- A theoretical analysis of deep Q-learning. In Learning for dynamics and control, 486–489. PMLR.
- Learning to communicate with deep multi-agent reinforcement learning. Advances in neural information processing systems, 29.
- Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial intelligence, volume 32.
- Addressing function approximation error in actor-critic methods. In International conference on machine learning, 1587–1596. PMLR.
- KuaiRec: A fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 540–550.
- KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 3953–3957.
- DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
- Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, 2333–2338.
- xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 1754–1763.
- Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30.
- A concise introduction to decentralized POMDPs, volume 1. Springer.
- Facmac: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34: 12208–12221.
- Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2685–2692.
- Monotonic value function factorisation for deep multi-agent reinforcement learning. The Journal of Machine Learning Research, 21(1): 7234–7284.
- Rendle, S. 2010. Factorization machines. In 2010 IEEE International conference on data mining, 995–1000. IEEE.
- The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning, volume 133. Springer.
- An MDP-based recommender system. Journal of Machine Learning Research, 6(9).
- Learning multiagent communication with backpropagation. Advances in neural information processing systems, 29.
- Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the web conference 2021, 1785–1797.
- Dop: Off-policy multi-agent decomposed policy gradients. In International conference on learning representations.
- Surrogate for long-term user experience in recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4100–4109.
- Cold: Towards the next generation of pre-ranking system. arXiv preprint arXiv:2007.16122.
- PrefRec: Recommender Systems with Human Preferences for Reinforcing Long-term User Engagement.
- ResAct: Reinforcing Long-term Engagement in Sequential Recommendation with Residual Actor. arXiv preprint arXiv:2206.02620.
- Deconfounding Duration Bias in Watch-time Prediction for Video Recommendation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4472–4481.
- Rethinking the Role of Pre-ranking in Large-scale E-Commerce Searching System. arXiv preprint arXiv:2305.13647.
- Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 1040–1048.
- DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 world wide web conference, 167–176.
- Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, volume 33, 5941–5948.
- Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 1059–1068.
- Reinforcement learning to optimize long-term user engagement in recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2810–2818.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.