ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents
Abstract: LLM-based agents are increasingly employed to interact with external environments (e.g., games, APIs, world models) to solve user-provided tasks. However, current frameworks often lack the ability to collaborate effectively with users in fully conversational settings. Conversations are essential for aligning on task details, achieving user-defined goals, and satisfying preferences. While existing agents address ambiguity through clarification questions, they underutilize the broader potential of an LLM's conversational capabilities. In this work, we introduce ReSpAct, an LLM-based agent designed to seamlessly integrate reasoning, decision-making, and dynamic dialogue for task-solving. Expanding on reasoning-first approaches like ReAct, ReSpAct employs active, free-flowing dialogues to interpret instructions, clarify goals, provide status updates, resolve subtask failures, and refine plans based on user inputs without any explicit dialogue schema. By alternating between task-solving actions and interactive conversations, ReSpAct demonstrates improved performance across diverse environments. We evaluate ReSpAct in user-interactive settings, including task-oriented dialogue systems (MultiWOZ) and decision-making tasks (ALFWorld, WebShop). ReSpAct outperforms ReAct with absolute success rate improvements of 6% and 4% in ALFWorld and WebShop, respectively, and achieves a 5.5% gain in Inform and a 3% gain in Success scores in MultiWOZ. These results highlight the value of integrating dynamic user-agent collaboration for more effective task resolution.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
- Multiwoz–a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.
- Collaborative effort towards common ground in situated human-robot dialogue. In Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, pages 33–40.
- Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588.
- Just ask: An interactive learning framework for vision and language navigation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 2459–2466.
- Textworld: A learning environment for text-based games. In Computer Games: 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers 7, pages 41–75. Springer.
- Antonia Creswell and Murray Shanahan. 2022. Faithful reasoning using large language models. arXiv preprint.
- Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint.
- Learning low-resource end-to-end goal-oriented dialog for fast and reliable system deployment. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 609–618.
- Think, act, and ask: Open-world interactive personalized robot navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 3296–3303. IEEE.
- Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36.
- On the multi-turn instruction following for conversational web agents. arXiv preprint arXiv:2402.15057.
- Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
- Towards LLM-driven dialogue state tracking. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 739–755, Singapore. Association for Computational Linguistics.
- Dialog acts for task-driven embodied agents. arXiv preprint arXiv:2209.12953.
- Long text generation via adversarial training with leaked information. In Proceedings of the AAAI conference on artificial intelligence, volume 32.
- InstructDial: Improving zero and few-shot generalization in dialogue through instruction tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 505–525, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- In-context learning for few-shot dialogue state tracking. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2627–2643, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.
- Vojtěch Hudeček and Ondrej Dusek. 2023. Are large language models all you need for task-oriented dialogue? In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 216–228, Prague, Czechia. Association for Computational Linguistics.
- Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
- Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688.
- Weblinx: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930.
- Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36.
- Aman Madaan and Amir Yazdanbakhsh. 2022. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint.
- A framework for learning to request rich and contextually useful information from humans. In International Conference on Machine Learning, pages 16553–16568. PMLR.
- Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.
- Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749.
- Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.
- Multi-task pre-training for plug-and-play task-oriented dialogue system. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4661–4676, Dublin, Ireland. Association for Computational Linguistics.
- Self-consistency improves chain of thought reasoning in language models. arXiv preprint.
- Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2748–2763, Bangkok, Thailand. Association for Computational Linguistics.
- Webshop: Towards scalable real-world web interaction with grounded language agents. arXiv preprint arXiv:2207.01206.
- React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
- SGP-TOD: Building task bots effortlessly via schema-guided LLM prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13348–13369, Singapore. Association for Computational Linguistics.
- Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.