Multi-agent cooperation through learning-aware policy gradients

Published 24 Oct 2024 in cs.AI (arXiv:2410.18636v2)

Abstract: Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.


Summary

  • The paper presents COALA-PG, a novel reinforcement learning algorithm that fosters cooperation by modeling agents' learning dynamics without higher-order derivatives.
  • It leverages sequence models and minibatch-friendly updates to outperform conventional methods in social dilemmas like the iterated prisoner's dilemma.
  • The approach reformulates multi-agent interactions as a meta-game, enabling agents to adapt strategies based on extended observation histories for scalable cooperation.

Multi-Agent Cooperation Through Learning-Aware Policy Gradients

The paper "Multi-agent cooperation through learning-aware policy gradients" (2410.18636) introduces a reinforcement learning algorithm for multi-agent environments that addresses the fundamental challenge of cooperation among self-interested, independently learning agents. The proposed policy gradient method requires no higher-order derivative computations and is designed to foster cooperation in general-sum games through learning awareness.

Key Contributions

Learning-Aware Policy Gradient

The paper introduces a novel algorithm termed Co-agent Learning-Aware Policy Gradients (COALA-PG), which diverges from traditional learning-aware approaches that require higher-order derivatives. COALA-PG uses a sequence model to condition agents' policies on long observation histories, effectively modeling the learning dynamics of other agents.

Figure 1: Experience data terminology illustrating game play episodes and meta-trajectories comprising meta-learning contexts.

Efficiency and Applicability

COALA-PG introduces several desirable properties:

  • Higher-derivative-free: Requires no computation of higher-order derivatives.
  • Unbiased: Provides an unbiased policy gradient estimator.
  • Minibatch-friendly: Accounts for co-players that learn from minibatches of noisy trials.
  • Scalable: Compatible with recurrent sequence policy models, extending to complex architectures and environments.
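
As an illustration of the structure these properties describe, the following is a minimal score-function (REINFORCE-style) sketch on an iterated prisoner's dilemma: the co-player's learning updates are treated as part of the environment dynamics, so estimating the meta-agent's gradient needs no higher-order derivatives. The payoff values, inner learning rule, and learning rates are hypothetical choices for the sketch, not the paper's implementation.

```python
import numpy as np

# Hypothetical IPD stage payoffs (T=5, R=3, P=1, S=0); entries are
# (meta-agent reward, naive co-player reward), keyed by
# (meta_cooperates, naive_cooperates).
PAYOFF = {
    (True, True): (3, 3), (True, False): (0, 5),
    (False, True): (5, 0), (False, False): (1, 1),
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def meta_episode(theta, rng, inner_steps=50, naive_lr=0.1):
    """One meta-episode: theta is fixed while the naive co-player keeps
    adapting its cooperation logit phi via its own per-step REINFORCE."""
    phi = 0.0
    score = 0.0        # accumulates d/dtheta log-prob of meta-agent actions
    meta_return = 0.0
    for _ in range(inner_steps):
        p_meta, p_naive = sigmoid(theta), sigmoid(phi)
        a = rng.random() < p_meta       # meta-agent cooperates?
        b = rng.random() < p_naive      # naive co-player cooperates?
        r_meta, r_naive = PAYOFF[(a, b)]
        meta_return += r_meta
        score += a - p_meta             # d/dtheta log Bernoulli(a; sigmoid(theta))
        # The co-player's learning step is just environment dynamics here;
        # no derivative is ever taken through it.
        phi += naive_lr * r_naive * (b - p_naive)
    return meta_return, score

def learning_aware_pg(theta, n_episodes=64, seed=0):
    """Score-function estimate of the gradient of the expected meta-return."""
    rng = np.random.default_rng(seed)
    samples = [meta_episode(theta, rng) for _ in range(n_episodes)]
    returns = np.array([r for r, _ in samples], dtype=float)
    scores = np.array([s for _, s in samples])
    baseline = returns.mean()           # simple baseline for variance reduction
    return float(np.mean((returns - baseline) * scores))
```

Because the gradient is built only from sampled returns and the meta-agent's own log-probabilities, the co-player's update rule can be arbitrary (and noisy) without complicating the estimator.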

Experimental Insights

COALA-PG significantly outperforms existing methods on sequential social dilemmas and the iterated prisoner's dilemma, establishing cooperation in environments that demand temporally extended coordination.

Figure 2: Learning-aware agents extort naive learners, transitioning into cooperation when playing against other learning-aware agents.

Framework and Methodology

The paper formulates a meta-game in which co-players' learning algorithms become part of the environment dynamics. This recasts the problem as a single-agent partially observable Markov decision process (POMDP), enabling the use of standard reinforcement learning techniques. The meta-agent, aware of co-player learning dynamics, makes strategic policy updates by observing meta-trajectories.

Figure 3: Policy update and credit assignment contrasting naive and meta agents.
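
The meta-game construction can be pictured with a minimal environment wrapper. This is a sketch under assumed ingredients (a two-action iterated prisoner's dilemma and a naive REINFORCE co-player; the class and its interface are illustrative, not the paper's code): the co-player's learning rule becomes part of the transition dynamics, and its parameters become hidden state.

```python
import math
import random

# Hypothetical IPD payoffs; entries are (meta-agent reward, co-player reward).
PAYOFF = {
    (True, True): (3, 3), (True, False): (0, 5),
    (False, True): (5, 0), (False, False): (1, 1),
}

class MetaGameEnv:
    """Single-agent view of a two-player game: the co-player's policy AND its
    learning rule live inside the environment. The co-player's parameters are
    hidden state, so the meta-agent faces a POMDP."""

    def __init__(self, naive_lr=0.1, seed=0):
        self.naive_lr = naive_lr
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.phi = 0.0          # co-player's logit: hidden from the meta-agent
        self.last_obs = (None, None)
        return self.last_obs

    def step(self, meta_cooperates):
        p = 1.0 / (1.0 + math.exp(-self.phi))
        co_cooperates = self.rng.random() < p
        r_meta, r_co = PAYOFF[(meta_cooperates, co_cooperates)]
        # The co-player's learning update is part of the transition dynamics:
        self.phi += self.naive_lr * r_co * (co_cooperates - p)
        # The observation exposes only actions, never phi.
        self.last_obs = (meta_cooperates, co_cooperates)
        return self.last_obs, r_meta
```

A long-context policy conditioned on the history of `(action, co-action)` observations can, in principle, infer the hidden learning state and shape it, which is exactly what the meta-trajectory view enables.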

The Iterated Prisoner's Dilemma Analysis

In an analytical treatment of the iterated prisoner's dilemma, COALA-PG reveals mechanisms through which learning-aware agents extort naive learners. In heterogeneous agent populations, learning-aware agents converge on cooperative strategies with one another, highlighting the role of population diversity in overcoming social dilemmas.
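
The underlying intuition can be checked with a back-of-the-envelope simulation (illustrative only, using the standard T=5, R=3, P=1, S=0 payoffs; this is not the paper's derivation): defection maximizes every stage payoff against a fixed partner, yet against a retaliatory strategy such as tit-for-tat, sustained cooperation yields the higher long-run average.

```python
# Standard IPD payoffs assumed: T=5, R=3, P=1, S=0; True means cooperate.
# Entries are (player A reward, player B reward).
PAYOFF = {
    (True, True): (3, 3), (True, False): (0, 5),
    (False, True): (5, 0), (False, False): (1, 1),
}

ALLC = lambda opp_last: True                                   # always cooperate
ALLD = lambda opp_last: False                                  # always defect
TFT = lambda opp_last: True if opp_last is None else opp_last  # tit-for-tat

def avg_return(strat_a, strat_b, rounds=200):
    """Average per-round payoff of strategy A when playing strategy B."""
    last_a = last_b = None
    total = 0.0
    for _ in range(rounds):
        a, b = strat_a(last_b), strat_b(last_a)
        total += PAYOFF[(a, b)][0]
        last_a, last_b = a, b
    return total / rounds
```

Here `avg_return(ALLD, ALLC)` = 5 beats `avg_return(ALLC, ALLC)` = 3, so defection dominates against a partner that never retaliates; but `avg_return(ALLD, TFT)` ≈ 1.02 falls far below `avg_return(TFT, TFT)` = 3, so against a partner whose future behavior depends on one's own play, cooperation is the better long-run policy.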

Practical Implications

The implications for practical AI development are substantial:

  • Autonomous Systems: Enhances interactions in multi-agent systems, applicable to robotics, drones, and automated negotiations.
  • Distributed AI Frameworks: Facilitates decentralized learning models for real-time cooperative tasks.
  • Economic and Game Theory: Contributes to the understanding of cooperative strategies in competitive environments.

Figure 4: COALA-PG-trained agents demonstrate superior shaping of naive opponents in the CleanUp environment, optimizing cooperation strategies.

Conclusion

"Multi-agent cooperation through learning-aware policy gradients" contributes a pivotal advancement in multi-agent reinforcement learning, promoting cooperation through learning-awareness mechanisms. Future research directions may explore scalable implementations in diverse real-world environments, assessing broader implications in AI and machine cooperation.

Figure 5: Agents trained with COALA-PG in CleanUp-lite exemplifying cooperative behavior and enhanced resource management.
