Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales

Published 5 Sep 2025 in cs.AI and cs.LG | (2509.04871v1)

Abstract: Recent advances in language and speech modelling have made it possible to build autonomous voice assistants that understand and generate human dialogue in real time. These systems are increasingly being deployed in domains such as customer service and healthcare, where they can automate repetitive tasks, reduce operational costs, and provide constant support around the clock. In this paper, we present a general methodology for cloning a conversational voice AI agent from a corpus of call recordings. Although the case study described in this paper uses telesales data to illustrate the approach, the underlying process generalizes to any domain where call transcripts are available. Our system listens to customers over the telephone, responds with a synthetic voice, and follows a structured playbook learned from top-performing human agents. We describe the domain selection, knowledge extraction, and prompt engineering used to construct the agent, integrating automatic speech recognition, an LLM-based dialogue manager, and text-to-speech synthesis into a streaming inference pipeline. The cloned agent is evaluated against human agents on a rubric of 22 criteria covering introduction, product communication, sales drive, objection handling, and closing. Blind tests show that the AI agent approaches human performance in routine aspects of the call while underperforming in persuasion and objection handling. We analyze these shortcomings and refine the prompt accordingly. The paper concludes with design lessons and avenues for future research, including large-scale simulation and automated evaluation.

Summary

The paper presents a dual-pipeline method that extracts knowledge from call recordings to build an Agent Playbook, a system prompt that encodes the sales strategies of top-performing human agents.
It shows that iterative prompt refinement improves objection handling and narrows the performance gap with human agents.
Evaluation against a 22-criterion rubric shows that the AI approaches human performance in routine interactions but still struggles with complex persuasion, which supports human-AI collaboration rather than replacement.

Introduction

The paper "Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales" (2509.04871) outlines a systematic approach to building an AI voice agent for telesales from an analysis of call transcripts. It uses recent advances in speech modelling and LLMs to automate routine interactions, with the aim of reducing labor costs and improving efficiency. The methodology centers on transforming recorded conversations into a structured system prompt, referred to as the Agent Playbook, which guides the AI agent by distilling the conversational strategies of top-performing human agents.

Methodology

The approach consists of two pipelines: one extracts knowledge from call recordings, and the other deploys the learned behaviors in real-time conversations. The cloning system samples and ranks call recordings, then transforms the high-quality interactions into a comprehensive system prompt. This prompt captures the agent’s identity, persona, objection-handling strategies, and product information, and serves as a blueprint for the AI to mimic effective sales strategies.

Figure 1: Overview of the cloning system. Call recordings are sampled and ranked to identify high-quality examples. Knowledge is extracted and organized by topic into a manual, while representative dialogues are drafted. These artifacts are then composed into a system prompt that defines the agent’s role, persona, and conversation strategy called the Agent Playbook.
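
The sample, rank, and compose steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Call` record, its `quality_score` field, the function names, and the section labels in the prompt are all hypothetical, and the knowledge-extraction step is represented only by a hand-written `manual` dictionary.

```python
from dataclasses import dataclass
import random


@dataclass
class Call:
    call_id: str
    transcript: str
    quality_score: float  # hypothetical score; the paper's ranking signal may differ


def sample_and_rank(calls, sample_size, top_k, seed=0):
    """Randomly sample calls, then keep the top_k by quality score."""
    rng = random.Random(seed)
    sampled = rng.sample(calls, min(sample_size, len(calls)))
    return sorted(sampled, key=lambda c: c.quality_score, reverse=True)[:top_k]


def compose_playbook(role, persona, manual, example_dialogues):
    """Assemble the extracted artifacts into one system prompt string."""
    sections = [f"ROLE\n{role}", f"PERSONA\n{persona}"]
    for topic, notes in manual.items():
        sections.append(f"KNOWLEDGE: {topic}\n{notes}")
    for i, dialogue in enumerate(example_dialogues, start=1):
        sections.append(f"EXAMPLE DIALOGUE {i}\n{dialogue}")
    return "\n\n".join(sections)


# Toy data standing in for a corpus of transcribed call recordings.
calls = [
    Call("c1", "Agent: Hello ...", 0.9),
    Call("c2", "Agent: Hi ...", 0.4),
    Call("c3", "Agent: Good morning ...", 0.7),
]
best = sample_and_rank(calls, sample_size=3, top_k=2)
playbook = compose_playbook(
    role="You are a telesales agent.",
    persona="Friendly and concise.",
    manual={"Objection handling": "Acknowledge the concern, then restate the benefit."},
    example_dialogues=[c.transcript for c in best],
)
```

Keeping the playbook as plain named sections makes it easy to revise one part, such as the objection-handling notes, without touching the rest of the prompt.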

The inference system uses the Gemini Live API for real-time dialogue generation, which integrates speech recognition and synthesis into the streaming pipeline.
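
To make the shape of a streaming pipeline concrete, the sketch below chains three stand-in stages (speech recognition, dialogue manager, speech synthesis) as Python generators, so each utterance flows through as soon as it is available. It does not use or mirror the Gemini Live API; every function here is a hypothetical stub, and a real system would replace each one with a call to an actual speech or language service.

```python
from typing import Iterable, Iterator


def recognize(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Stand-in ASR: pretend each audio chunk decodes to one utterance."""
    for chunk in audio_chunks:
        yield chunk.decode("utf-8")


def respond(utterances: Iterable[str], system_prompt: str) -> Iterator[str]:
    """Stand-in dialogue manager: a real system would call an LLM here."""
    for text in utterances:
        yield f"[{system_prompt}] reply to: {text}"


def synthesize(replies: Iterable[str]) -> Iterator[bytes]:
    """Stand-in TTS: encode text as bytes instead of producing audio."""
    for reply in replies:
        yield reply.encode("utf-8")


def run_pipeline(audio_chunks: Iterable[bytes], system_prompt: str) -> Iterator[bytes]:
    """Chain the three stages lazily; nothing runs until output is consumed."""
    return synthesize(respond(recognize(audio_chunks), system_prompt))


out = list(run_pipeline([b"hello", b"too expensive"], "playbook"))
```

Because the stages are lazy generators, the first reply is produced before later audio chunks are read, which is the property a real-time voice agent needs.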

Evaluation

The AI agent’s performance was compared with that of human agents in blind tests, using a rubric of 22 criteria across multiple scenarios. The rubric covers introduction, product communication, sales drive, objection handling, and closing. Initial results indicated that the AI was competent in the routine aspects of a call, approaching human agents on some criteria while lagging in objection handling and persuasion.

Figure 2: Initial evaluation results comparing the AI agent to human agents. Scores are averaged across seven evaluators for each scenario. The AI (blue) approaches human performance (grey) in introduction and product communication but underperforms in objection handling and closing in more challenging scenarios. Error bars indicate standard deviation.
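
The aggregation behind a chart like Figure 2 amounts to a per-category mean and standard deviation over evaluators. The snippet below is a small sketch of that calculation; the scores, the scoring scale, and the choice of sample (rather than population) standard deviation are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, stdev

# Hypothetical data: one score per evaluator (seven evaluators) per category.
scores = {
    "introduction": [4, 5, 4, 4, 5, 4, 5],
    "objection handling": [2, 3, 2, 3, 3, 2, 3],
}


def summarize(per_evaluator):
    """Return {category: (mean, sample standard deviation)}."""
    return {cat: (mean(vals), stdev(vals)) for cat, vals in per_evaluator.items()}


summary = summarize(scores)
for category, (avg, sd) in summary.items():
    print(f"{category}: mean={avg:.2f}, sd={sd:.2f}")
```

Running the same summary for the AI agent and for the human agents gives the paired bars and error bars that the figure compares.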

Improvements and Results

After analyzing these shortcomings, the authors refined the prompt to address the identified weaknesses. The refinement involved clarifying objectives, adjusting the language to focus on crucial conversational aspects, and improving the examples provided in the prompt. Subsequent evaluations showed marked improvement in objection handling and in the AI’s ability to guide conversations towards closing.

Figure 3: Evaluation results after prompt optimization and fine-tuning (AI agent V2). The AI’s scores (green) show marked improvement, particularly in objection handling and salesmanship, closing much of the gap to the human benchmarks.

Conclusion

The research demonstrates the feasibility of developing an effective AI telesales agent via targeted prompt engineering and strategic fine-tuning, without extensive training from scratch. Although the agent performs well in routine interactions, challenges remain in matching human proficiency in complex conversation dynamics such as persuasion and objection handling. The study suggests that AI voice agents should supplement human agents rather than fully replace them, so that the human element remains integral to customer interactions. Future research directions include large-scale simulation, retrieval-augmented generation, and improving the AI’s emotional responsiveness.