
T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Published 22 May 2025 in cs.CL and cs.AI (arXiv:2505.16986v1)

Abstract: LLMs have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls-particularly in multi-turn conversations-remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning-such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source LLMs. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.

Summary

  • The paper introduces T1, a 13.5k dialogue dataset spanning nine domains to benchmark LLM agent performance in multi-turn conversations requiring complex, interdependent tool use.
  • T1 provides an evaluation framework focused on assessing agent abilities in information seeking, parameter extraction, and generating executable tool calls based on dialogue context.
  • Experiments showed that the T1 dataset effectively highlights agent strengths and weaknesses, with task-specific fine-tuning and larger models like LLaMA 3.3 70B demonstrating improved performance in planning.

Overview of T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

The paper introduces T1, a robust dataset and evaluation framework designed to assess the performance of LLM-based agents in multi-turn, tool-using conversational settings. The focus is on agents' ability to manage complex dependencies between tool calls over long conversational contexts. Despite advances in LLMs, effective planning involving API or tool dependencies across multi-turn dialogues remains challenging. The T1 dataset serves as both a benchmark for evaluating open-source LLMs and a research facilitation tool for multi-domain conversational agents.

Dataset Description

T1 comprises 13.5k dialogues and spans nine distinct domains, including flights, restaurants, hotels, attractions, and various combinations of these, making it a comprehensive multi-domain dataset. The dataset incorporates 14 tools, allowing a detailed assessment of tool-driven dialogue tasks. The framework emphasizes cross-domain tasks and interdependent tool calls, requiring agents to consider tool selection and execution order in context.

The dataset was constructed from conversation templates that were lexically filled with real-world data sourced from Wikipedia to maintain context and realism. Entities included data on airports, cities, and neighborhoods, along with synthetic information for attributes such as airline names and hotel star ratings.

Technical Contributions

The T1 evaluation framework focuses on three main tasks: information seeking, parameter extraction, and tool calling.

  • Information Seeking: Assessing the agent's ability to query and gather necessary parameters for successful tool execution.
  • Parameter Extraction: Evaluating the agent's proficiency in extracting relevant parameters from the user’s dialogue.
  • Tool Calling: Determining the capacity of agents to generate executable code using predefined tool calls based on extracted parameters.
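The tool-calling task described above can be sketched as follows. This is a minimal illustration, not T1's actual interface: the `search_flights` tool and its parameters are hypothetical stand-ins for the dataset's 14 predefined tools.

```python
import inspect
from typing import Any, Callable

# Hypothetical tool; T1 defines 14 tools across its nine domains.
def search_flights(origin: str, destination: str, date: str) -> list[dict]:
    return [{"flight": "XY123", "origin": origin,
             "destination": destination, "date": date}]

TOOLS: dict[str, Callable[..., Any]] = {"search_flights": search_flights}

def call_tool(name: str, params: dict[str, Any]) -> Any:
    """Validate extracted parameters against the tool signature, then execute."""
    fn = TOOLS[name]
    required = set(inspect.signature(fn).parameters)
    missing = required - params.keys()
    if missing:
        # In a dialogue agent, missing parameters would trigger an
        # information-seeking turn rather than an error.
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return fn(**params)
```

The point of the signature check is that tool calling depends on the two preceding tasks: a call is executable only once information seeking and parameter extraction have supplied every required argument.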

A caching mechanism is integrated to optimize performance by allowing agents to reuse previously retrieved information, thereby reducing computational costs and improving scalability.
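One way such a cache could work is sketched below, under the assumption that a result may be reused whenever the same tool is called with identical arguments; the paper's actual short- and long-term memory design may differ.

```python
import json
from typing import Any, Callable

class ToolCallCache:
    """Memoize tool results so an agent can reuse rather than recompute."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def _key(self, tool: str, params: dict) -> str:
        # Canonical key: tool name plus arguments serialized in sorted order,
        # so equivalent calls map to the same cache entry.
        return tool + ":" + json.dumps(params, sort_keys=True)

    def get_or_call(self, tool: str, params: dict, fn: Callable[..., Any]) -> Any:
        key = self._key(tool, params)
        if key in self._store:       # reuse the cached result
            return self._store[key]
        result = fn(**params)        # recompute and cache
        self._store[key] = result
        return result
```

Dynamic replanning would then amount to choosing between the two branches of `get_or_call`, for example invalidating an entry when the dialogue reveals that a cached result is stale.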

Experimental Evaluation

Experiments conducted with several LLMs, including domain-adapted LLaMA models, demonstrated that the T1 dataset effectively highlights strengths and weaknesses in handling multi-turn scenarios. T1-Agent showed strong performance improvements over baseline models, particularly when fine-tuned on the dataset.

A notable observation was the impressive performance of LLaMA 3.3 70B Instruct in complex planning tasks, outperforming smaller models. The fine-tuned LLaMA 3.1 8B Instruct model also demonstrated significant gains, suggesting that task-specific fine-tuning is beneficial for enhancing model performance in complex conversational tool usage settings.

Implications and Future Directions

The T1 dataset fills a critical gap by providing a rigorous benchmark tailored to evaluating LLM-based agents in challenging dialogue settings involving tool dependencies. It paves the way for future advances in conversational AI, especially work focused on effective use of toolsets within multi-turn, multi-domain contexts.

The findings support the need for more demanding evaluation frameworks that test AI systems on dynamic replanning and adaptive reasoning, both crucial for real-world applications. Furthermore, the public release of the dataset will facilitate future research aimed at refining agentic LLM capabilities.

In summary, the T1 dataset represents a significant step toward understanding and enhancing the planning and reasoning capacities of LLMs integrated with external tools. It will likely stimulate further innovation in intelligent agent design, potentially leading to more nuanced, context-aware AI systems.
