PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

Published 6 Jun 2025 in cs.CL, cs.AI, and cs.CV | (2506.06211v1)

Abstract: Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.

Abstract PDF Upgrade to Chat

Summary

The paper introduces PuzzleWorld, a benchmark with 667 real-world puzzles for evaluating multimodal, open-ended reasoning in puzzlehunts.
It employs a dual-axis taxonomy and comprehensive human annotations to capture intermediate reasoning steps and diagnose model performance.
Findings reveal low final solution accuracy, with fine-tuning enhancing reasoning steps, highlighting challenges in visual and spatial tasks.

PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

PuzzleWorld is introduced as an innovative benchmark designed to evaluate LLMs and multimodal systems in open-ended, multimodal reasoning tasks derived from puzzlehunts. This paper addresses the gap in AI benchmarking by providing a robust dataset that encompasses multimodal reasoning and the identification of implicit problem structures, challenging the limits of current state-of-the-art models.

PuzzleWorld Benchmark Design

PuzzleWorld consists of 667 real-world puzzles curated from the Puzzled Pint event series, specifically designed to test open-ended, multimodal reasoning capabilities. The puzzles demand solvers, both human and AI, to discern and interpret intricate cues distributed across multiple modalities, including text, visual, and structured data.

Figure 1: Overview of PuzzleWorld detailing the dataset from PuzzledPint and annotations by human annotators.

Each puzzle is annotated with detailed solution traces and cognitive skill labels, providing a comprehensive framework for evaluating AI models' step-by-step reasoning processes. This structure facilitates both holistic benchmarking and meticulous diagnostic assessments of various models' intermediate reasoning abilities.

Taxonomy and Dataset Construction

The paper proposes a dual-axis taxonomy system for puzzle categories based on input modalities and reasoning mechanisms. This framework enables a detailed assessment of models’ reasoning performance across different cognitive dimensions, including logical deduction, wordplay, and spatial reasoning.

Figure 2: Sample annotations illustrating the modalities and reasoning skills required for different PuzzleWorld puzzles.

The dataset construction process involves sourcing puzzles and human annotating reasoning steps from solution documents. This annotation allows for tracing the logical derivations and transformations required to arrive at puzzle solutions, offering insights into potential AI failure points and improvement areas.

Evaluation of AI Models on PuzzleWorld

The authors evaluate several state-of-the-art closed and open-source models, such as GPT-4o and Qwen QVQ, on the PuzzleWorld dataset. Findings reveal that existing models achieve low accuracy on final solutions, with the majority falling between 1-2%. The best-performing model, GPT-o3, solved only 14% of puzzles, highlighting the significant challenges posed by PuzzleWorld.

Figure 3: Dataset construction process and statistics showing the complexity and annotation details of puzzles.

Despite poor final answer performance, the detailed stepwise reasoning evaluation illuminates intermediate capabilities where models can still demonstrate partial understanding and logical follow-through, albeit failing to conclude correctly. Results indicate a pronounced dependence on textual strategies for reasoning, often a bottleneck in puzzles requiring visual and spatial manipulations.

Error Analysis and Improvement via Fine-tuning

Detailed error analyses reveal critical areas where models falter, such as visual misinterpretation and lack of sketching ability—key for tasks demanding manipulative and spatial reasoning skills. It is evident that models tend to exhibit myopic reasoning, failing to employ backtracking strategies when initial hypotheses are disproven.

Figure 4: Example puzzle errors illustrating myopic reasoning, language bottlenecks, and sketching errors.

Fine-tuning models using intermediate reasoning data from PuzzleWorld significantly enhances stepwise accuracy, demonstrating that improvements in detailed annotations and structured reasoning training can lead to better performance despite unchanged final answer accuracy.

Conclusion

PuzzleWorld represents a crucial advancement in creating more comprehensive benchmarks for evaluating complex, multimodal reasoning in AI. By challenging models with scenarios that mimic open-ended and creative human problem-solving, PuzzleWorld underscores the limitations of current AI capabilities and highlights pathways for developing more adaptive, generalist AI systems.

PuzzleWorld provides a rigorous resource for the ongoing development and benchmarking of multimodal reasoning systems, marking a step towards achieving more advanced, general-purpose AI applicable in real-world problem-solving contexts. The dataset and its annotations are accessible for further research and development at the provided repository.

Markdown