Missed Connections: Lateral Thinking Puzzles for Large Language Models
Abstract: The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. Solving the puzzle requires both common linguistic knowledge (i.e., definitions and typical usage) and, in many cases, lateral or abstract thinking. This is because the four categories ascend in complexity, with the most challenging category often requiring thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity of automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and as a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern LLMs. We report their accuracy on the task, measure the impact of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.
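To make the setup concrete, below is a minimal sketch of what a sentence-embedding baseline for this task could look like: embed the sixteen words, then greedily select groups of four by mean pairwise cosine similarity. The model name (all-mpnet-base-v2), the greedy heuristic, and the example word bank are illustrative assumptions, not the paper's reported configuration.

```python
# Hypothetical sentence-embedding baseline for Connections: embed all
# sixteen words, then greedily pick the four-word subset with the highest
# mean pairwise cosine similarity until the bank is exhausted.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer


def solve_connections(words, model_name="all-mpnet-base-v2"):
    """Greedily partition 16 puzzle words into four candidate groups of four."""
    assert len(words) == 16, "Connections uses a bank of exactly sixteen words"
    model = SentenceTransformer(model_name)
    # Unit-normalized embeddings, so the dot product is cosine similarity.
    emb = model.encode(words, normalize_embeddings=True)
    sim = emb @ emb.T

    remaining = set(range(len(words)))
    groups = []
    while remaining:
        # Choose the 4-subset of remaining words with the highest mean
        # pairwise similarity (brute force; C(16, 4) = 1820 at worst).
        best = max(
            combinations(sorted(remaining), 4),
            key=lambda g: np.mean([sim[i, j] for i, j in combinations(g, 2)]),
        )
        groups.append([words[i] for i in best])
        remaining -= set(best)
    return groups


if __name__ == "__main__":
    # Illustrative word bank in the style of a Connections puzzle
    # (not an actual NYT puzzle).
    bank = ["BASS", "FLOUNDER", "SOLE", "PIKE",    # fish
            "KISS", "QUEEN", "RUSH", "HEART",      # rock bands
            "CLOG", "LOAFER", "MULE", "PUMP",      # shoes
            "HAMMER", "SAW", "DRILL", "WRENCH"]    # tools
    for group in solve_connections(bank):
        print(group)
```

A greedy heuristic like this cannot recover from an early wrong grouping, which suggests one reason purely embedding-based solvers would struggle on the puzzle's more lateral categories, where surface similarity (e.g., BASS and SOLE as fish) can mask the intended theme.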