InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
Abstract: In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to solve complex problems end-to-end by interacting with an execution environment. The benchmark comprises DAEval, a dataset of 257 data analysis questions derived from 52 CSV files, and an agent framework that incorporates LLMs to serve as data analysis agents for both serving and evaluation. Since data analysis questions are often open-ended and hard to evaluate without human supervision, we adopt a format-prompting technique to convert each question into a closed-form format so that it can be evaluated automatically. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. In addition, building on our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent.
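The closed-form evaluation idea can be sketched as follows: each open-ended question is rewritten with format constraints so the agent emits tagged answers that can be checked by string or numeric comparison. The `@answer_name[value]` tag convention, the tolerance value, and the helper names below are assumptions for illustration, not the benchmark's exact implementation:

```python
import re

def parse_closed_form(response: str) -> dict:
    """Extract `@answer_name[value]` pairs from a model response.

    The `@name[value]` convention is an assumed closed-form template;
    adapt the pattern to whatever format prompt is actually used.
    """
    return {name: value for name, value in re.findall(r"@(\w+)\[([^\]]*)\]", response)}

def is_correct(predicted: str, expected: str, tol: float = 1e-2) -> bool:
    """Compare one predicted value against ground truth: allow a small
    numeric tolerance for rounded floats, else case-insensitive match."""
    try:
        return abs(float(predicted) - float(expected)) <= tol
    except ValueError:
        return predicted.strip().lower() == expected.strip().lower()

def score(response: str, gold: dict) -> float:
    """Fraction of gold sub-answers the response gets right."""
    pred = parse_closed_form(response)
    hits = sum(1 for k, v in gold.items() if k in pred and is_correct(pred[k], v))
    return hits / len(gold) if gold else 0.0
```

For example, `score("@mean_age[29.70] @skewness[positive]", {"mean_age": "29.699", "skewness": "Positive"})` returns `1.0`, since the numeric answer matches within tolerance and the categorical answer matches up to case.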