
Reinforcing Language Agents via Policy Optimization with Action Decomposition

Published 23 May 2024 in cs.AI and cs.LG | (2405.15821v1)

Abstract: LLMs as intelligent agents push the boundaries of sequential decision-making but struggle with limited knowledge of environmental dynamics and exponentially large action spaces. Recent efforts such as GLAM and TWOSOME manually constrain the action space to a restricted subset and employ reinforcement learning to align agents' knowledge with specific environments. However, they overlook fine-grained credit assignment for intra-action tokens, which is essential for efficient language agent optimization, and rely on human prior knowledge to restrict the action space. This paper proposes decomposing language agent optimization from the action level to the token level, offering finer supervision for each intra-action token and manageable optimization complexity in environments with unrestricted action spaces. Beginning with the simplification of flattening all actions, we theoretically explore the discrepancies between action-level optimization and this naive token-level optimization. We then derive the Bellman backup with Action Decomposition (BAD) to integrate credit assignment for both intra-action and inter-action tokens, effectively eliminating the discrepancies. Implementing BAD within the PPO algorithm, we introduce Policy Optimization with Action Decomposition (POAD). POAD benefits from a finer-grained credit assignment process and lower optimization complexity, leading to enhanced learning efficiency and generalization in aligning language agents with interactive environments. We validate POAD across diverse testbeds, with results affirming the advantages of our approach and the correctness of our theoretical analysis.


Summary

  • The paper introduces BAD to decompose action-level credit assignment into token-level credit assignment, providing finer supervision and improving training efficiency.
  • It integrates BAD with PPO to form POAD, yielding significant performance gains and robust learning in complex, interactive environments.
  • Experimental validations in Overcooked, VirtualHome, and DataSciCoding tasks demonstrate faster convergence, enhanced stability, and better generalization in unrestricted action spaces.


Introduction

The paper introduces a novel approach to optimizing LLMs as intelligent agents in interactive environments, focusing on misalignment with environmental dynamics and the difficulty of optimizing over huge, token-composed action spaces. Previous methods constrained the action space and used reinforcement learning (RL) to align agents with their environments, but they did not address precise credit assignment for intra-action tokens. This paper proposes decomposing policy optimization from the action level to the token level, offering finer supervision for each token, reducing optimization complexity, and removing the need for human-imposed action space restrictions.
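The decomposition rests on a standard identity: an agent's action is a token sequence, so its action-level log-probability factorizes into a sum of per-token log-probabilities. The sketch below illustrates this with hypothetical token log-probs (the values are illustrative, not from the paper):

```python
import math

# Hypothetical token log-probs from a language model policy for one action,
# e.g. the action "pick up the knife" tokenized into four tokens.
token_logprobs = [-0.5, -1.2, -0.3, -0.8]

# Action-level log-probability is the sum of intra-action token log-probs:
#   log pi(a | s) = sum_j log pi(w_j | s, w_<j)
action_logprob = sum(token_logprobs)

# Token-level optimization therefore faces a vocabulary-sized choice at each
# step instead of a single choice over an exponentially large action space.
print(action_logprob)  # -2.8
```

This factorization is what makes token-level optimization tractable, but as the paper shows, naively applying action-level RL updates to each token introduces discrepancies that BAD is designed to eliminate.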

Methodology: Action Decomposition and Bellman Backup

The authors propose a method termed Bellman backup with Action Decomposition (BAD), which integrates credit assignment for both intra-action tokens (within a single action's token sequence) and inter-action tokens (across environment steps). In doing so, BAD eliminates the discrepancies between naive token-level and action-level optimization: the theoretical analysis shows that token-level training under BAD remains consistent with maximizing action utilities, which is pivotal for aligning language agents with environments even when the action space is unrestricted.

Figure 1: Visual comparison of the differences between action-level Bellman backup (left) and BAD (right), showing equivalence when optimized with the Q-function for tokens.
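The core intuition of BAD can be sketched as follows: environment reward and discounting enter the backup only at an action's final token, while intra-action tokens simply pass the next token's value through. This is a minimal illustrative sketch, not the authors' implementation; the function name and inputs are assumptions:

```python
def bad_targets(next_token_values, rewards, gamma, is_action_end):
    """Token-level Bellman targets in the spirit of BAD (sketch).

    next_token_values: value estimate for the next token position at each step.
    rewards:           environment reward, nonzero only at action boundaries.
    gamma:             discount factor, applied only across actions.
    is_action_end:     True where a token completes a full action.
    """
    targets = []
    for v_next, r, end in zip(next_token_values, rewards, is_action_end):
        if end:
            # Inter-action backup: the ordinary action-level Bellman target.
            targets.append(r + gamma * v_next)
        else:
            # Intra-action backup: no reward, no discount; credit flows
            # through the token sequence undistorted.
            targets.append(v_next)
    return targets
```

For example, a three-token action with reward 1.0 at its final token and `gamma = 0.9` yields `bad_targets([0.5, 0.7, 1.0], [0.0, 0.0, 1.0], 0.9, [False, False, True])`, i.e. targets `[0.5, 0.7, 1.9]`: only the terminal token sees discounting and reward, matching the action-level backup.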

BAD's integration into Proximal Policy Optimization (PPO) results in Policy Optimization with Action Decomposition (POAD), which significantly enhances learning efficiency and generalization of language agents. The method benefits from a reduction in optimization complexity by transforming an intractable action space into manageable components, thus improving training efficiency.
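Concretely, POAD applies PPO's clipped surrogate objective at the token level, with per-token advantages derived from BAD-style credit assignment rather than a single action-level advantage broadcast to every token. A dependency-free sketch of that per-token loss (an assumption-laden illustration, not the paper's code):

```python
import math

def poad_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Token-level PPO clipped surrogate (sketch).

    Each list element corresponds to one token; `advantages` would come from
    BAD-style token-level credit assignment.
    """
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # per-token importance ratio
        unclipped = ratio * adv
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * adv
        # PPO takes the pessimistic (minimum) surrogate, negated for descent.
        losses.append(-min(unclipped, clipped))
    return sum(losses) / len(losses)
```

With identical old and new log-probs the ratio is 1 and the loss reduces to the negated mean advantage, e.g. `poad_policy_loss([0.0, 0.0], [0.0, 0.0], [1.0, -0.5])` returns `-0.25`, which is a quick sanity check on the clipping logic.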

Experimental Validation

The authors validate their approach in environments with both restricted and unrestricted action spaces: Overcooked, VirtualHome, and a newly constructed DataSciCoding task. Results demonstrate POAD's advantages over existing methods such as TWOSOME in learning efficiency and robustness.


Figure 2: Performance comparisons on Overcooked (first two) and VirtualHome (last two) indicating POAD's superior efficiency and robustness.

POAD significantly outperforms naive token-level policy optimization (NTPO) and existing baselines in convergence speed and stability, supporting both the theoretical analysis and the empirical benefits of BAD. The gains are most pronounced in environments with unrestricted action spaces, where traditional methods fall short.


Figure 3: Comparative performance analysis between POAD and baseline methods on DataSciCoding benchmarks, highlighting superior outcomes achieved by POAD.

Implications and Future Developments

Key implications of this study include the potential for POAD to improve generalization in unseen tasks without compromising the intrinsic linguistic capabilities of LLMs. The research paves the way for advancements in language agents, suggesting broader applications across diverse environments and interactive settings.

Furthermore, by focusing on token-level optimization while maintaining consistency with action utility maximization, LLMs can achieve refined decision-making capabilities in complex, dynamic environments. Future developments may explore incorporating techniques such as self-rewarding systems or hindsight relabeling to further enhance adaptability and learning efficiency in environments lacking predefined reward structures.

Conclusion

The paper presents an innovative approach to reinforcing language agents via policy optimization, demonstrating the efficacy and applicability of BAD and POAD in diverse interactive environments. Through comprehensive theoretical analysis and empirical validation, the work establishes a refined method for training LLMs, emphasizing the importance of precise credit assignment and action decomposition. These contributions offer significant insights into the development of sophisticated language agent models, poised for broader deployment in real-world tasks.
