Structured Reinforcement Learning for Media Streaming at the Wireless Edge
Abstract: Media streaming is the dominant application over wireless edge (access) networks. The increasing softwarization of such networks has led to efforts at intelligent control, wherein application-specific actions may be taken dynamically to enhance the user experience. The goal of this work is to develop and demonstrate learning-based policies for optimal decision making about which clients to dynamically prioritize in a video streaming setting. We formulate the policy design question as a constrained Markov decision problem (CMDP), and observe that a Lagrangian relaxation lets us decompose it into single-client problems. Further, the optimal policy takes a threshold form in the video buffer length, which enables us to design an efficient constrained reinforcement learning (CRL) algorithm to learn it. Specifically, we show that a natural policy gradient (NPG) based algorithm derived using the structure of our problem converges to the globally optimal policy. We then develop a simulation environment for training, and a real-world intelligent controller attached to a WiFi access point for evaluation. We empirically show that the structured learning approach enables fast learning. Furthermore, such a structured policy is easy to deploy because of its low computational complexity: policy execution takes only about 15 $\mu$s. Using YouTube streaming experiments in a resource-constrained scenario, we demonstrate that the CRL approach can increase quality of experience (QoE) by over 30\%.
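To make the "threshold form in the video buffer length" concrete, here is a minimal Python sketch of what a single-client threshold rule could look like. It is an illustration under stated assumptions, not the paper's implementation: the direction of the rule (prioritize when the buffer is short), the sigmoid softening, the per-use price `lam` standing in for the Lagrange multiplier, and every function and parameter name are hypothetical.

```python
import math


def should_prioritize(buffer_s: float, threshold_s: float) -> bool:
    """Hard threshold rule (assumed direction): prioritize a client whose
    video buffer, in seconds, is below the threshold."""
    return buffer_s < threshold_s


def priority_probability(buffer_s: float, threshold_s: float,
                         sharpness: float = 2.0) -> float:
    """Soft version of the rule: a sigmoid that decreases in buffer length.

    A smooth parameterization like this is one way to make a threshold
    learnable by gradient methods; it is an assumption of this sketch.
    """
    z = sharpness * (threshold_s - buffer_s)
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def lagrangian_reward(qoe: float, prioritized: bool, lam: float) -> float:
    """Per-client reward after relaxing the shared constraint (assumed form):
    the client's QoE minus a price `lam` paid whenever it is prioritized."""
    return qoe - lam * (1.0 if prioritized else 0.0)


if __name__ == "__main__":
    for buf in (1.0, 5.0, 9.0):
        print(buf, should_prioritize(buf, 5.0),
              round(priority_probability(buf, 5.0), 3))
```

Because each client's decision depends only on its own buffer and a single scalar threshold, evaluating such a rule is a comparison or one sigmoid call, which is consistent with the low execution cost the abstract reports.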