
ReAct Meets ActRe: When Language Agents Enjoy Training Data Autonomy

Published 21 Mar 2024 in cs.AI, cs.CL, and cs.LG | arXiv:2403.14589v3

Abstract: Language agents have demonstrated autonomous decision-making abilities by reasoning with foundation models. Recently, efforts have been made to train language agents for performance improvement, with multi-step reasoning and action trajectories as the training data. However, collecting such trajectories still requires considerable human effort, by either artificial annotation or implementations of diverse prompting frameworks. In this work, we propose A$^3$T, a framework that enables the Autonomous Annotation of Agent Trajectories in the style of ReAct. The central role is an ActRe prompting agent, which explains the reason for an arbitrary action. When randomly sampling an external action, the ReAct-style agent could query the ActRe agent with the action to obtain its textual rationales. Novel trajectories are then synthesized by prepending the posterior reasoning from ActRe to the sampled action. In this way, the ReAct-style agent executes multiple trajectories for the failed tasks, and selects the successful ones to supplement its failed trajectory for contrastive self-training. Realized by policy gradient methods with binarized rewards, the contrastive self-training with accumulated trajectories facilitates a closed loop for multiple rounds of language agent self-improvement. We conduct experiments using QLoRA fine-tuning with the open-sourced Mistral-7B-Instruct-v0.2. In AlfWorld, the agent trained with A$^3$T obtains a 1-shot success rate of 96%, and 100% success with 4 iterative rounds. In WebShop, the 1-shot performance of the A$^3$T agent matches human average, and 4 rounds of iterative refinement lead to the performance approaching human experts. A$^3$T agents significantly outperform existing techniques, including prompting with GPT-4, advanced agent frameworks, and fully fine-tuned LLMs.

References (39)
  1. Rest meets react: Self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003, 2023.
  2. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023.
  3. Self-play fine-tuning converts weak language models to strong language models, 2024.
  4. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  5. Multimodal web navigation with instruction-finetuned foundation models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=efFmBWioSc.
  6. V-star: Training verifiers for self-taught reasoners. arXiv preprint arXiv:2402.06457, 2024.
  7. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  8. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  9. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
  10. Reason for future, act for now: A principled framework for autonomous llm agents with provable sample efficiency. arXiv preprint arXiv:2309.17382, 2023.
  11. Agentlite: A lightweight library for building and advancing task-oriented llm agent system. arXiv preprint arXiv:2402.15538, 2024.
  12. Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178, 2024.
  13. Language models are few-shot butlers. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  9312–9318, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.734. URL https://aclanthology.org/2021.emnlp-main.734.
  14. Large language models as general pattern machines. In Proceedings of the 7th Conference on Robot Learning (CoRL), 2023.
  15. OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
  16. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  17. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.
  18. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 2023.
  19. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vAElhFcKW6.
  20. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2010.03768.
  21. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023.
  22. Trial and error: Exploration-based trajectory optimization for llm agents. arXiv preprint arXiv:2403.02502, 2024.
  23. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  24. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  25. Openchat: Advancing open-source language models with mixed-quality data. In The Twelfth International Conference on Learning Representations, 2024.
  26. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
  27. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
  28. Os-copilot: Towards generalist computer agents with self-improvement, 2024.
  29. Towards unified alignment between agents, humans, and environment. arXiv preprint arXiv:2402.07744, 2024.
  30. Webshop: Towards scalable real-world web interaction with grounded language agents. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  20744–20757. Curran Associates, Inc., 2022.
  31. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X.
  32. Agent lumos: Unified and modular training for open-source language agents. arXiv preprint arXiv:2311.05657, 2023.
  33. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
  34. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023.
  35. Agentohana: Design unified data and training pipeline for effective agent learning. arXiv preprint arXiv:2402.15506, 2024.
  36. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.
  37. Language agent tree search unifies reasoning acting and planning in language models, 2023a.
  38. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2023b.
  39. Archer: Training language model agents via hierarchical multi-turn rl. arXiv preprint arXiv:2402.19446, 2024.

Summary

  • The paper introduces the A³T framework, which enables autonomous annotation of agent trajectories and reduces reliance on manual data collection.
  • The paper employs contrastive self-training, realized with policy gradient methods and binarized rewards, to refine language agent performance.
  • The trained agent reaches a 96% 1-shot success rate on unseen AlfWorld tasks, rising to 100% after four rounds of iterative refinement, and approaches human-expert performance on WebShop.


Introduction

The paper introduces A³T, a framework for the autonomous annotation of agent trajectories in the style of ReAct, enabling self-improvement with minimal human supervision. Traditional methods for collecting language-agent training data rely heavily on human annotation or on implementing many distinct prompting frameworks, which limits scalability and diversity. A³T instead leverages an ActRe prompting agent that supplies rationales for sampled actions, yielding a method for generating training trajectories autonomously.

Figure 1: Upper: Two common paradigms for collecting language-agent trajectories. (a) Trajectories are annotated manually as human demonstrations. (b) Trajectories are gathered by deploying policy agents that reason and act in language form. Both paradigms require considerable human effort, in either data annotation or separate implementations of agent frameworks, and thus lack scalability. Lower: (c) The A³T framework, which enables the Autonomous Annotation of Agent Trajectories in ReAct style via an ActRe agent, closing the loop for contrastive self-training.

A³T Framework

Autonomous Trajectory Annotation with ActRe

The framework uses an ActRe agent to generate post-hoc rationales for actions taken during exploration. Inverting the ReAct paradigm, ActRe produces a posterior explanation for a sampled action, enabling the creation of new, diverse trajectories without manual intervention. When the policy agent samples an external action, it queries the ActRe agent for a textual rationale; the rationale is prepended to the sampled action, and the composed reason-action pair conditions subsequent steps.

Figure 2: Trajectory comparison on task 518 of WebShop. (a) The failed trajectory produced by the trained agent at Round 0; (b) the trajectory composed with ActRe's assistance. The trained policy agent fails to choose the correct option on the item content page; success requires clicking "[white | primary]", and ActRe annotates the sampled action with a posterior rationale.
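The annotation procedure can be sketched as follows. This is a minimal illustration of the control flow only: `ToyEnv`, `actre_explain`, and `policy_step` are hypothetical stand-ins (in the paper, the policy and ActRe are prompted LLMs acting in AlfWorld/WebShop), showing how randomly sampled actions are paired with posterior rationales and mixed into a ReAct-style trajectory.

```python
import random

class ToyEnv:
    """Toy stand-in environment: an episode lasts two steps and
    succeeds only if action "b" is taken both times."""
    def __init__(self):
        self.steps, self.hits, self.done = 0, 0, False
    def valid_actions(self):
        return ["a", "b", "c"]
    def execute(self, action):
        self.steps += 1
        self.hits += int(action == "b")
        self.done = self.steps >= 2
        return f"obs_{self.steps}"
    @property
    def reward(self):
        # Binarized reward, as in the paper's training signal.
        return 1.0 if self.hits == 2 else 0.0

def actre_explain(trajectory, obs, action):
    # ActRe stand-in: produce a *posterior* rationale for an action
    # that was chosen before any reasoning happened.
    return f"Given {obs}, it makes sense to take '{action}'."

def policy_step(trajectory, obs):
    # ReAct-style policy stand-in: reason first, then act.
    return "My plan suggests 'a'.", "a"

def annotate_trajectory(env, epsilon, rng):
    """Compose a ReAct-style trajectory. With probability epsilon an
    external action is randomly sampled and its rationale is synthesized
    post hoc by ActRe; otherwise the policy reasons and acts as usual."""
    trajectory, obs = [], "obs_0"
    while not env.done:
        if rng.random() < epsilon:
            action = rng.choice(env.valid_actions())
            reason = actre_explain(trajectory, obs, action)  # posterior
        else:
            reason, action = policy_step(trajectory, obs)    # prior
        trajectory.append((reason, action))  # rationale prepended to action
        obs = env.execute(action)
    return trajectory, env.reward
```

In the paper, failed tasks are re-attempted with such exploration, and only trajectories whose binarized reward is 1 are kept to supplement the failed one for self-training.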

Contrastive Self-Training

The training process uses policy gradient methods with binarized rewards to fine-tune LLMs on the collected trajectories. By comparing failed and successful trajectories within the same task, the framework leverages the contrast to improve agent performance. The use of binarized rewards enhances learning stability and encourages better alignment with task requirements.
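The contrastive update can be illustrated with a toy REINFORCE step on a softmax policy. This is a minimal sketch of a policy gradient with binarized rewards and a mean-reward baseline, drastically simplified (single-action "trajectories" over three actions), not the paper's QLoRA fine-tuning code:

```python
import numpy as np

def reinforce_update(logits, actions, rewards, lr=0.1):
    """One REINFORCE pass with binarized rewards on a toy softmax policy.
    `logits` parameterizes pi(a) = softmax(logits); each (action, reward)
    pair stands in for a whole trajectory with reward in {0, 1}."""
    baseline = float(np.mean(rewards))  # the contrast: successes vs. failures
    for a, r in zip(actions, rewards):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad = -probs
        grad[a] += 1.0  # d log pi(a) / d logits for a softmax policy
        logits = logits + lr * (r - baseline) * grad
    return logits
```

A successful trajectory (reward 1, above the baseline) has its action log-probabilities pushed up, while a failed one on the same task is pushed down; this within-task contrast is what the self-training exploits.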

Experimental Evaluation

AlfWorld experiments demonstrate that the A³T-trained agent achieves a 96% success rate in a single trial on unseen tasks, and iterative refinement over four rounds raises this to 100%, significantly outperforming existing state-of-the-art methods. Training uses only 600 of AlfWorld's 3,553 tasks, showing the framework's efficiency with limited data.

Figure 3: Trajectory comparison on task 512 of WebShop. (a) The failed trajectory produced by the trained agent at Round 0; (b) the annotated trajectory composed with ActRe's assistance. The new trajectory explores the action of clicking the 7th item, "B09QCYNF9R", and the reason highlighted in green is generated by ActRe.

WebShop experiments further illustrate the framework's efficacy: the A³T agent matches average human performance in the 1-shot setting and approaches human expert performance after four rounds of iterative refinement. Training on 2,300 of WebShop's 11,587 tasks shows the framework generalizes efficiently from limited data.
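The multi-round closed loop behind these iterative gains can be sketched as follows. `attempt` and `train` are toy stand-ins for ActRe-assisted exploration and QLoRA fine-tuning, respectively; only the round structure (re-attempt failed tasks, keep successes, retrain on accumulated data) reflects the paper:

```python
import random

def self_improvement_loop(tasks, attempt, train, rounds=4):
    """Sketch of the closed loop: each round, re-attempt still-failed
    tasks with exploration, keep trajectories whose binarized reward is 1,
    and retrain on the accumulated dataset."""
    dataset, failed = [], set(tasks)
    for _ in range(rounds):
        solved = set()
        for task in failed:
            traj, reward = attempt(task)  # exploration + ActRe annotation
            if reward == 1.0:             # binarized reward
                dataset.append(traj)
                solved.add(task)
        failed -= solved
        train(dataset)  # stands in for contrastive self-training via QLoRA
    return dataset, failed

# Toy stand-ins so the loop runs end to end.
rng = random.Random(0)

def attempt(task):
    # Pretend exploration solves a task with probability 0.5.
    return f"traj_{task}", 1.0 if rng.random() < 0.5 else 0.0

def train(dataset):
    pass  # placeholder: fine-tune the policy on `dataset`
```

Because each round only re-attempts the remaining failed tasks, the accumulated dataset and the success rate grow monotonically across rounds, matching the reported improvement from the 1-shot result to the Round-4 result.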

Discussion

The A³T framework provides a scalable solution for training language agents with minimal human intervention. ActRe efficiently augments the diversity and quality of the training data, while policy gradient methods drive learning through contrastive self-training. Future work could integrate more sophisticated RL algorithms to push autonomous agent training further.

Conclusion

A³T fundamentally improves the scalability and effectiveness of language agent training by removing the dependency on human-crafted datasets. This work sets a new standard in the autonomous training of language agents, inviting further research to explore the integration of advanced reinforcement learning techniques in similar frameworks.
