SmartPlay: A Benchmark for LLMs as Intelligent Agents
Abstract: Recent LLMs have demonstrated great potential toward intelligent agents and next-gen automation, but a systematic benchmark for evaluating LLMs' abilities as agents is currently lacking. We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, and Minecraft. Each game features a unique setting, providing up to 20 evaluation settings and infinite environment variations. Each game in SmartPlay uniquely challenges a subset of 9 important capabilities of an intelligent LLM agent, including reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness. The distinction between the sets of capabilities each game tests allows us to analyze each capability separately. SmartPlay serves not only as a rigorous testing ground for evaluating the overall performance of LLM agents, but also as a roadmap for identifying gaps in current methodologies. We release our benchmark at github.com/Microsoft/SmartPlay
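To make the evaluation methodology concrete, the sketch below shows a generic Gym-style agent-environment loop of the kind an LLM-agent benchmark implies. The environment here is a toy stand-in for the Rock-Paper-Scissors game mentioned in the abstract; the class names, the biased-opponent rule, and the `agent_fn` interface are illustrative assumptions, not SmartPlay's actual API.

```python
# Hedged sketch: a Gym-style evaluation loop for a text-based game agent.
# All names here (RockPaperScissorsEnv, evaluate, agent_fn) are hypothetical
# stand-ins; the real SmartPlay API may differ.
import random


class RockPaperScissorsEnv:
    """Toy environment: the agent plays against an opponent biased toward rock."""

    ACTIONS = ["rock", "paper", "scissors"]
    BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

    def __init__(self, episode_len=20, seed=0):
        self.episode_len = episode_len
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        return "New game of Rock-Paper-Scissors. Choose rock, paper, or scissors."

    def step(self, action):
        # Opponent plays rock half the time -- a learnable bias, echoing the
        # "learning from history" and "understanding randomness" capabilities.
        opponent = self.rng.choices(self.ACTIONS, weights=[0.5, 0.25, 0.25])[0]
        if self.BEATS[action] == opponent:
            reward = 1
        elif self.BEATS[opponent] == action:
            reward = -1
        else:
            reward = 0
        self.t += 1
        obs = f"You played {action}; opponent played {opponent}; reward {reward}."
        return obs, reward, self.t >= self.episode_len


def evaluate(agent_fn, env, episodes=5):
    """Average per-episode return for a text-in, action-out agent policy."""
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = agent_fn(obs)  # in a real benchmark, an LLM call
            obs, reward, done = env.step(action)
            total += reward
    return total / episodes


# A trivial scripted "agent" standing in for an LLM:
score = evaluate(lambda obs: "paper", RockPaperScissorsEnv())
```

Because the opponent favors rock, always playing paper should score above zero on average; an agent that exploits the observed history illustrates the kind of capability-specific behavior the benchmark is designed to isolate.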