TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation

Published 18 Aug 2025 in cs.IR, cs.AI, cs.MM, cs.SD, and eess.AS | (2509.09685v4)

Abstract: We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In the proposed pipeline, multiple LLM agents are created under various roles with specialized prompts and access to different parts of information, and the chat data is acquired by logging the conversation between the Listener LLM and the Recsys LLM. To cover various conversation scenarios, for each conversation, the Listener LLM is conditioned on a finetuned conversation goal. Finally, all the LLMs are multimodal with audio and images, allowing a simulation of multimodal recommendation and conversation. In the LLM-as-a-judge and subjective evaluation experiments, TalkPlayData 2 achieved the proposed goal in various aspects related to training a generative recommendation model for music. TalkPlayData 2 and its generation code are released at https://talkpl.ai/talkplaydata2.


Summary

The paper introduces an agentic architecture using specialized LLMs to generate synthetic, multimodal conversational datasets for music recommendation.
It builds each conversation iteratively over 8 turns and evaluates the result on criteria such as goal relevancy and linguistic quality, using both LLM-as-a-judge and human ratings.
The dataset comprises 16.2k conversations intended for training generative music recommendation models and for future research on more complex conversational strategies.

Detailed Analysis of "TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation"

Introduction

"TalkPlayData 2" introduces a synthetic data pipeline that generates multimodal conversational datasets for music recommendation. The pipeline uses multiple LLMs, each with a specialized role, to cover diverse conversation scenarios and produce realistic multimodal dialogue data. The resulting data is intended for training recommender systems that converse in natural language while handling several modalities: text, audio, and images.

Data Generation Architecture

The core innovation of TalkPlayData 2 lies in its agentic architecture, where each component LLM assumes a specific role: the Listener Profile LLM, Conversation Goal LLM, Listener LLM, and Recsys LLM.

Multimodal LLM Agents

Figure 1: Overview of TalkPlayData 2 pipeline, consisting of four LLMs with specialized roles.

Listener Profile LLM: Analyzes user demographic information combined with a preselected set of music tracks to infer a listener's musical preferences. This rich user profile feeds into the subsequent conversation, tailoring the dialogue to user tastes.
Conversation Goal LLM: Prescribes conversation goals based on a pre-existing template, adapted to suit the available recommendation pool of music tracks. These goals encourage variety within generated dialogues, challenging recommendation systems with real-world conversational contexts.
Listener LLM and Recsys LLM: Hold the conversation that simulates the interaction between a user and a recommender. The Listener LLM, guided by its conversation goal, gives feedback on each recommendation, while the Recsys LLM matches the Listener's queries against a scoped music library, so the recommendation process stays tied to the goal.
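
The division of roles above can be sketched as a small table of which inputs each agent sees. This is a minimal illustration, not the released pipeline code: the AgentSpec class, the field names, and the exact input lists are assumptions inferred from the role descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical description of one agent: its visible inputs and its output."""
    name: str
    inputs: tuple
    output: str


# Assumed information scoping, inferred from the role descriptions above.
AGENTS = (
    AgentSpec("listener_profile_llm",
              ("demographics", "profile_tracks"),
              "listener_profile"),
    AgentSpec("conversation_goal_llm",
              ("goal_template", "recommendation_pool"),
              "conversation_goal"),
    AgentSpec("listener_llm",
              ("listener_profile", "conversation_goal", "chat_history"),
              "listener_message"),
    AgentSpec("recsys_llm",
              ("recommendation_pool", "chat_history"),
              "recsys_message"),
)


def visible_to(agent_name: str) -> tuple:
    """Return the inputs the named agent is allowed to see."""
    for agent in AGENTS:
        if agent.name == agent_name:
            return agent.inputs
    raise KeyError(agent_name)
```

In this sketch the Recsys LLM does not see the conversation goal and has to infer intent from the dialogue. Whether the actual pipeline scopes information exactly this way is an assumption.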

Data Creation Mechanism

A defining characteristic of the dataset is that it is multimodal throughout. The Recsys LLM has access to full track profiles, including audio and images, and its responses are checked against the listener's conversation goal while the dialogue remains realistic.

Profiling and Goal Generation: This phase leverages randomized sampling and LLM-driven customization to ensure user profiles and conversational goals are tailored, consistent, and contextually relevant.
Iterative Conversation Construction: The system simulates an 8-turn interaction with dynamic content exchange. The iterative refinement at each turn ensures recommendations build logically on the prior dialogue states.
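
The two phases above can be sketched as follows, with stub functions standing in for the Listener and Recsys LLMs. This is a rough sketch under stated assumptions: sample_session, simulate_conversation, the stub agents, and the log fields are all hypothetical names, and the real pipeline calls multimodal LLMs rather than string templates.

```python
import random

NUM_TURNS = 8  # turn count stated in the summary above


def sample_session(rng, catalog, goal_templates, pool_size):
    """Randomly pick a recommendation pool and a goal template (assumed scheme)."""
    pool = rng.sample(catalog, pool_size)
    goal = rng.choice(goal_templates)
    return goal, pool


def stub_listener(goal, history):
    """Stand-in for the Listener LLM: restates its goal and reacts to the last track."""
    if not history:
        return f"I'm looking for {goal}."
    last = history[-1]["recommended_track"]
    return f"Thanks for {last}. Anything else along the lines of {goal}?"


def stub_recsys(query, history, pool):
    """Stand-in for the Recsys LLM: picks the first track not yet recommended."""
    seen = {entry["recommended_track"] for entry in history}
    track = next(t for t in pool if t not in seen)
    return track, f"You might like {track}."


def simulate_conversation(listener, recsys, goal, pool, num_turns=NUM_TURNS):
    """Log a fixed-length dialogue; each turn conditions on the full history so far."""
    history = []
    for turn in range(1, num_turns + 1):
        query = listener(goal, history)
        track, reply = recsys(query, history, pool)
        history.append({
            "turn": turn,
            "listener": query,
            "recommended_track": track,
            "recsys": reply,
        })
    return history
```

A conversation is then produced by calling sample_session with a seeded random.Random instance and passing the resulting goal and pool to simulate_conversation. The pool must contain at least as many tracks as there are turns, since the stub never repeats a recommendation.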

Evaluation Metrics and Results

The dataset contains 16.2k conversations. Its quality is assessed on criteria such as goal relevancy and linguistic quality, using both LLM-as-a-judge evaluation and a subjective evaluation with human judges.

Statistical Alignment: The dataset exhibits high variability in specificity and categorization, providing comprehensive coverage of potential conversational scenarios.
Evaluation Findings: Human evaluators rated TalkPlayData 2 higher in naturalness and relevance than previous datasets, which supports the effectiveness of its multi-agent framework.
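
Per-criterion ratings like those described above are typically summarized by averaging over conversations. The sketch below shows one simple way to do that. The record layout, the aggregate_ratings function, and the numeric scale are assumptions for illustration, and the scores are made-up placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ratings(ratings):
    """Average the scores for each criterion across all rated conversations."""
    by_criterion = defaultdict(list)
    for rating in ratings:
        by_criterion[rating["criterion"]].append(rating["score"])
    return {criterion: mean(scores) for criterion, scores in by_criterion.items()}


# Placeholder ratings on an assumed numeric scale (not data from the paper).
example_ratings = [
    {"conversation_id": "c1", "criterion": "goal_relevancy", "score": 4},
    {"conversation_id": "c2", "criterion": "goal_relevancy", "score": 5},
    {"conversation_id": "c1", "criterion": "linguistic_quality", "score": 3},
]

summary = aggregate_ratings(example_ratings)
```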

Implications and Future Perspectives

TalkPlayData 2 addresses several limitations in current conversational recommendation datasets by offering nuanced, multimodal scenarios that mimic real-world user interaction patterns. The dataset's open sourcing facilitates broader adoption and experimentation in developing more refined and adaptable recommendation systems.

Future avenues for research include expanding the dataset's use cases to support longer audio snippets, more intricate modal interactions, and more complex conversational strategies. This expansion could significantly enhance the robustness and adaptability of AI in multi-context dialogue systems.

Conclusion

TalkPlayData 2 is a step toward realistic, varied synthetic conversational datasets for music recommendation. Its specialized LLM agents and multimodal inputs provide training data for conversational recommenders that must handle text, audio, and images.
