
ReAct Meets ActRe: When Language Agents Enjoy Training Data Autonomy

Published 21 Mar 2024 in cs.AI, cs.CL, and cs.LG | arXiv:2403.14589v3

Abstract: Language agents have demonstrated autonomous decision-making abilities by reasoning with foundation models. Recently, efforts have been made to train language agents for performance improvement, with multi-step reasoning and action trajectories as the training data. However, collecting such trajectories still requires considerable human effort, by either artificial annotation or implementations of diverse prompting frameworks. In this work, we propose A$^3$T, a framework that enables the Autonomous Annotation of Agent Trajectories in the style of ReAct. The central role is an ActRe prompting agent, which explains the reason for an arbitrary action. When randomly sampling an external action, the ReAct-style agent could query the ActRe agent with the action to obtain its textual rationales. Novel trajectories are then synthesized by prepending the posterior reasoning from ActRe to the sampled action. In this way, the ReAct-style agent executes multiple trajectories for the failed tasks, and selects the successful ones to supplement its failed trajectory for contrastive self-training. Realized by policy gradient methods with binarized rewards, the contrastive self-training with accumulated trajectories facilitates a closed loop for multiple rounds of language agent self-improvement. We conduct experiments using QLoRA fine-tuning with the open-sourced Mistral-7B-Instruct-v0.2. In AlfWorld, the agent trained with A$^3$T obtains a 1-shot success rate of 96%, and 100% success with 4 iterative rounds. In WebShop, the 1-shot performance of the A$^3$T agent matches human average, and 4 rounds of iterative refinement lead to the performance approaching human experts. A$^3$T agents significantly outperform existing techniques, including prompting with GPT-4, advanced agent frameworks, and fully fine-tuned LLMs.

References (39)
  1. Rest meets react: Self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003, 2023.
  2. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023.
  3. Self-play fine-tuning converts weak language models to strong language models, 2024.
  4. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  5. Multimodal web navigation with instruction-finetuned foundation models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=efFmBWioSc.
  6. V-star: Training verifiers for self-taught reasoners. arXiv preprint arXiv:2402.06457, 2024.
  7. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  8. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  9. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
  10. Reason for future, act for now: A principled framework for autonomous llm agents with provable sample efficiency. arXiv preprint arXiv:2309.17382, 2023.
  11. Agentlite: A lightweight library for building and advancing task-oriented llm agent system. arXiv preprint arXiv:2402.15538, 2024.
  12. Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178, 2024.
  13. Language models are few-shot butlers. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  9312–9318, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.734. URL https://aclanthology.org/2021.emnlp-main.734.
  14. Large language models as general pattern machines. In Proceedings of the 7th Conference on Robot Learning (CoRL), 2023.
  15. OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
  16. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  17. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.
  18. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 2023.
  19. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vAElhFcKW6.
  20. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2010.03768.
  21. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023.
  22. Trial and error: Exploration-based trajectory optimization for llm agents. arXiv preprint arXiv:2403.02502, 2024.
  23. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  24. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  25. Openchat: Advancing open-source language models with mixed-quality data. In The Twelfth International Conference on Learning Representations, 2024.
  26. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
  27. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
  28. Os-copilot: Towards generalist computer agents with self-improvement, 2024.
  29. Towards unified alignment between agents, humans, and environment. arXiv preprint arXiv:2402.07744, 2024.
  30. Webshop: Towards scalable real-world web interaction with grounded language agents. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  20744–20757. Curran Associates, Inc., 2022.
  31. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X.
  32. Agent lumos: Unified and modular training for open-source language agents. arXiv preprint arXiv:2311.05657, 2023.
  33. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
  34. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023.
  35. Agentohana: Design unified data and training pipeline for effective agent learning. arXiv preprint arXiv:2402.15506, 2024.
  36. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.
  37. Language agent tree search unifies reasoning acting and planning in language models, 2023a.
  38. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2023b.
  39. Archer: Training language model agents via hierarchical multi-turn rl. arXiv preprint arXiv:2402.19446, 2024.

Summary

  • The paper introduces the A³T framework, which enables autonomous annotation of agent trajectories and reduces reliance on manual data collection.
  • The paper employs contrastive self-training, realized with policy gradient methods and binarized rewards, to refine language agent performance.
  • The trained agent reaches a 96% 1-shot success rate on unseen AlfWorld tasks, rising to 100% after four rounds of iterative refinement, and approaches human-expert performance on WebShop.


Introduction

The paper introduces A³T, a framework for the autonomous annotation of agent trajectories in the style of ReAct, enabling self-improvement with minimal human supervision. Traditional methods for collecting language-agent training data rely heavily on human annotation or on implementing many distinct prompting frameworks, which limits scalability and diversity. A³T instead leverages an ActRe prompting agent that supplies rationales for sampled actions, yielding a method for generating training trajectories autonomously.

Figure 1: Upper: Two common paradigms for collecting language-agent trajectories. (a) Trajectories are annotated manually as human demonstrations. (b) Trajectories are gathered by deploying policy agents that reason and act in language form. Both paradigms require considerable human effort, in either data annotation or separate implementations of agent frameworks, and thus lack scalability. Lower: (c) The A³T framework, which enables the Autonomous Annotation of Agent Trajectories in ReAct style via an ActRe agent, closing the loop for contrastive self-training.

A³T Framework

Autonomous Trajectory Annotation with ActRe

The framework uses an ActRe agent to generate post-hoc rationales for actions taken during exploration. Inverting the ReAct paradigm, ActRe produces a posterior explanation for a sampled action, enabling the creation of new, diverse trajectories without manual intervention. When the policy agent samples an external action, it queries the ActRe agent for a textual rationale; the rationale is prepended to the sampled action, and the composed reason-action pair conditions subsequent steps.

Figure 2: Trajectory comparison on task 518 of WebShop. (a) The failed trajectory produced by the trained agent at Round 0; (b) the trajectory composed with ActRe's assistance. The trained policy agent fails to choose the correct option on the item content page; success requires clicking "[white | primary]", and ActRe annotates the sampled action with a posterior rationale.
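The annotation procedure can be sketched as follows. This is a minimal illustration of the control flow only: `ToyEnv`, `actre_explain`, and `policy_step` are hypothetical stand-ins (in the paper, the policy and ActRe are prompted LLMs acting in AlfWorld/WebShop), showing how randomly sampled actions are paired with posterior rationales and mixed into a ReAct-style trajectory.

```python
import random

class ToyEnv:
    """Toy stand-in environment: an episode lasts two steps and
    succeeds only if action "b" is taken both times."""
    def __init__(self):
        self.steps, self.hits, self.done = 0, 0, False
    def valid_actions(self):
        return ["a", "b", "c"]
    def execute(self, action):
        self.steps += 1
        self.hits += int(action == "b")
        self.done = self.steps >= 2
        return f"obs_{self.steps}"
    @property
    def reward(self):
        # Binarized reward, as in the paper's training signal.
        return 1.0 if self.hits == 2 else 0.0

def actre_explain(trajectory, obs, action):
    # ActRe stand-in: produce a *posterior* rationale for an action
    # that was chosen before any reasoning happened.
    return f"Given {obs}, it makes sense to take '{action}'."

def policy_step(trajectory, obs):
    # ReAct-style policy stand-in: reason first, then act.
    return "My plan suggests 'a'.", "a"

def annotate_trajectory(env, epsilon, rng):
    """Compose a ReAct-style trajectory. With probability epsilon an
    external action is randomly sampled and its rationale is synthesized
    post hoc by ActRe; otherwise the policy reasons and acts as usual."""
    trajectory, obs = [], "obs_0"
    while not env.done:
        if rng.random() < epsilon:
            action = rng.choice(env.valid_actions())
            reason = actre_explain(trajectory, obs, action)  # posterior
        else:
            reason, action = policy_step(trajectory, obs)    # prior
        trajectory.append((reason, action))  # rationale prepended to action
        obs = env.execute(action)
    return trajectory, env.reward
```

In the paper, failed tasks are re-attempted with such exploration, and only trajectories whose binarized reward is 1 are kept to supplement the failed one for self-training.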

Contrastive Self-Training

The training process uses policy gradient methods with binarized rewards to fine-tune LLMs on the collected trajectories. By comparing failed and successful trajectories within the same task, the framework leverages the contrast to improve agent performance. The use of binarized rewards enhances learning stability and encourages better alignment with task requirements.
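The contrastive update can be illustrated with a toy REINFORCE step on a softmax policy. This is a minimal sketch of a policy gradient with binarized rewards and a mean-reward baseline, drastically simplified (single-action "trajectories" over three actions), not the paper's QLoRA fine-tuning code:

```python
import numpy as np

def reinforce_update(logits, actions, rewards, lr=0.1):
    """One REINFORCE pass with binarized rewards on a toy softmax policy.
    `logits` parameterizes pi(a) = softmax(logits); each (action, reward)
    pair stands in for a whole trajectory with reward in {0, 1}."""
    baseline = float(np.mean(rewards))  # the contrast: successes vs. failures
    for a, r in zip(actions, rewards):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad = -probs
        grad[a] += 1.0  # d log pi(a) / d logits for a softmax policy
        logits = logits + lr * (r - baseline) * grad
    return logits
```

A successful trajectory (reward 1, above the baseline) has its action log-probabilities pushed up, while a failed one on the same task is pushed down; this within-task contrast is what the self-training exploits.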

Experimental Evaluation

AlfWorld experiments demonstrate that the A³T-trained agent achieves a 96% success rate in a single trial on unseen tasks, and iterative refinement over four rounds raises this to 100%, significantly outperforming existing state-of-the-art methods. Training uses only 600 of AlfWorld's 3,553 tasks, showing the framework's efficiency with limited data.

Figure 3: Trajectory comparison on task 512 of WebShop. (a) The failed trajectory produced by the trained agent at Round 0; (b) the annotated trajectory composed with ActRe's assistance. The new trajectory explores the action of clicking the 7th item, "B09QCYNF9R", and the reason highlighted in green is generated by ActRe.

WebShop experiments further illustrate the framework's efficacy: the A³T agent matches average human performance in the 1-shot setting and approaches human expert performance after four rounds of iterative refinement. Training on 2,300 of WebShop's 11,587 tasks shows the framework generalizes efficiently from limited data.
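The multi-round closed loop behind these iterative gains can be sketched as follows. `attempt` and `train` are toy stand-ins for ActRe-assisted exploration and QLoRA fine-tuning, respectively; only the round structure (re-attempt failed tasks, keep successes, retrain on accumulated data) reflects the paper:

```python
import random

def self_improvement_loop(tasks, attempt, train, rounds=4):
    """Sketch of the closed loop: each round, re-attempt still-failed
    tasks with exploration, keep trajectories whose binarized reward is 1,
    and retrain on the accumulated dataset."""
    dataset, failed = [], set(tasks)
    for _ in range(rounds):
        solved = set()
        for task in failed:
            traj, reward = attempt(task)  # exploration + ActRe annotation
            if reward == 1.0:             # binarized reward
                dataset.append(traj)
                solved.add(task)
        failed -= solved
        train(dataset)  # stands in for contrastive self-training via QLoRA
    return dataset, failed

# Toy stand-ins so the loop runs end to end.
rng = random.Random(0)

def attempt(task):
    # Pretend exploration solves a task with probability 0.5.
    return f"traj_{task}", 1.0 if rng.random() < 0.5 else 0.0

def train(dataset):
    pass  # placeholder: fine-tune the policy on `dataset`
```

Because each round only re-attempts the remaining failed tasks, the accumulated dataset and the success rate grow monotonically across rounds, matching the reported improvement from the 1-shot result to the Round-4 result.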

Discussion

The A³T framework provides a scalable solution for training language agents with minimal human intervention. ActRe efficiently augments the diversity and quality of the training data, while policy gradient methods drive learning through contrastive self-training. Future work could integrate more sophisticated RL algorithms to push autonomous agent training further.

Conclusion

A³T fundamentally improves the scalability and effectiveness of language agent training by removing the dependency on human-crafted datasets. This work sets a new standard in the autonomous training of language agents, inviting further research to explore the integration of advanced reinforcement learning techniques in similar frameworks.
