UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution

Published 12 Jan 2024 in cs.IR (arXiv:2401.06470v1)

Abstract: In recent years, there has been growing interest in using reinforcement learning (RL) to optimize long-term rewards in recommender systems. Since industrial recommender systems are typically designed as multi-stage systems, single-agent RL methods face challenges when optimizing multiple stages simultaneously: different stages have different observation spaces and thus cannot be modeled by a single agent. To address this issue, we propose a novel UNidirectional-EXecution-based multi-agent Reinforcement Learning (UNEX-RL) framework to reinforce the long-term rewards in multi-stage recommender systems. We show that unidirectional execution is a key feature of multi-stage recommender systems and that it brings new challenges to the application of multi-agent reinforcement learning (MARL), namely observation dependency and the cascading effect. To tackle these challenges, we provide a cascading information chain (CIC) method that separates the independent observations from the action-dependent observations, and we use CIC to train UNEX-RL effectively. We also discuss practical variance reduction techniques for UNEX-RL. Finally, we show the effectiveness of UNEX-RL on both public datasets and an online recommender system with over 100 million users. Specifically, UNEX-RL achieves a 0.558% increase in users' usage time compared with single-agent RL algorithms in online A/B experiments, highlighting the effectiveness of UNEX-RL in industrial recommender systems.
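The observation dependency described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the three-stage pipeline, the linear scoring rule, and the names `run_stage` and `unidirectional_pass` are all assumptions made for illustration. It only shows that, when stages execute in one direction, each downstream stage observes the candidate set left by upstream actions.

```python
import numpy as np

def run_stage(candidates, action, keep):
    """Hypothetical stage: score items with a linear action vector, keep the top `keep`."""
    scores = candidates @ action
    top = np.argsort(-scores)[:keep]
    return candidates[top]

def unidirectional_pass(items, actions, keeps):
    """Run stages in order; each stage observes only what upstream stages passed on."""
    observations = []
    current = items
    for action, keep in zip(actions, keeps):
        observations.append(current)  # this stage's observation (its candidate set)
        current = run_stage(current, action, keep)
    return observations, current

rng = np.random.default_rng(0)
items = rng.normal(size=(100, 4))                 # 100 toy items, 4 features each
actions = [rng.normal(size=4) for _ in range(3)]  # one action vector per stage (assumed form)
keeps = [50, 10, 3]                               # assumed funnel sizes

observations, final = unidirectional_pass(items, actions, keeps)
obs_shapes = [o.shape for o in observations]

# Flip only the first stage's action: the first observation is unchanged,
# but the second stage now observes a different candidate set.
alt_actions = [-actions[0]] + actions[1:]
alt_observations, _ = unidirectional_pass(items, alt_actions, keeps)
first_obs_same = np.array_equal(observations[0], alt_observations[0])
second_obs_same = np.array_equal(observations[1], alt_observations[1])
```

In this sketch the first stage's observation is independent of any action, while later observations are action-dependent, which is the distinction the abstract attributes to the CIC method. How CIC actually separates and uses the two kinds of observation is not specified here.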
