Game of 24: Arithmetic Puzzle Analysis

Updated 23 January 2026

Game of 24 is a canonical arithmetic puzzle where four numbers are combined using basic operations to reach the target of 24.
The puzzle serves as a benchmark for evaluating planning and reasoning in computational agents, despite the underlying NP-completeness of the Countdown problem.
Empirical and theoretical studies reveal statistical phase transitions, diverse instance generation methods, and algorithmic challenges, highlighting its value in testing symbolic and learning-based models.

The Game of 24 is a canonical arithmetic puzzle that serves as both a recreational challenge and a benchmark for evaluating planning and reasoning competencies in computational agents. It occupies a special status as the smallest nontrivial member of a broader family known as the Countdown problem, and has been studied in the context of computational complexity, statistical phase transitions, algorithmic hardness, and as a substrate for evaluating symbolic planners and learning-based models including LLMs and generative flow networks (GFlowNets).

1. Formal Definition and Computational Structure

The Game of 24 is precisely formulated as follows: given a multiset of four non-negative integers $\{a, b, c, d\}$ , and access to the standard set of binary operations $O = \{+, -, \times, /\}$ , the objective is to combine the input numbers via a sequence of $n-1=3$ operations. At each step, two elements from the current multiset are selected, a binary operation is applied, and the result is inserted (the pair removed), yielding a new multiset. The task is solved if the final singleton equals the target value 24:

$\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$

A solution is a valid reduction sequence $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ such that the final state is $\{24\}$ . Division uses floating-point arithmetic or is allowed only if integer results are required, according to the variant considered. The state space is defined as all multisets of sizes 4, 3, 2, and 1; transitions correspond to binary operations on unordered pairs.

2. Complexity and Theoretical Properties

The Game of 24 is a special case of the general Countdown problem, which is NP-complete. This complexity follows via several classical reductions:

The Partition Problem (PP), a Karp NP-complete problem, can be reduced to the Subtraction-Addition Problem (SAP), which asks whether signs can be assigned to sum to a target.
An encoding is constructed sending SAP to Countdown using exponentials: let the input be exponents, and use only multiplication/division, as the unique algebraic properties of exponentials block unintended solutions.
By these reductions, the Countdown Decision Problem (CDP)—and hence Game of 24—are in NP, with the certificate being the sequence $\Theta$ of reduction operations (Katz et al., 4 Aug 2025).

Algorithmic analysis reveals exponential state-space complexity:

$b_k = k \cdot (k-1) \cdot 3,\qquad L_j = \frac{3^{j-1} n! (n-1)!}{(n-j)! (n+1-j)!}$

The total number of states is bounded by $\sum_{j=1}^n L_j$ , yielding exponential growth with $n$ . In practice for $O = \{+, -, \times, /\}$ 0, the search is tractable but rapidly becomes intractable for large $O = \{+, -, \times, /\}$ 1.

3. Statistical Properties and Phase Transitions

The Game of 24 exposes a statistical phase transition phenomenon:

For an integer pool $O = \{+, -, \times, /\}$ 2 and sample size $O = \{+, -, \times, /\}$ 3, the probability $O = \{+, -, \times, /\}$ 4 that a randomly drawn four-tuple can be reduced to target $O = \{+, -, \times, /\}$ 5 (e.g., 24) exhibits a sharp threshold as a function of $O = \{+, -, \times, /\}$ 6.
The critical size $O = \{+, -, \times, /\}$ 7 for transition is given empirically by $O = \{+, -, \times, /\}$ 8 (Lacasa et al., 2012).
For $O = \{+, -, \times, /\}$ 9 typical in 24 Game contexts, $n-1=3$ 0, so $n-1=3$ 1 is well above threshold; solvability for random $n-1=3$ 2 is near certain ( $n-1=3$ 3). For $n-1=3$ 4, solvability remains $n-1=3$ 5.

This transition also manifests as an easy–hard–easy pattern: for $n-1=3$ 6, instances are mostly unsolvable (easy-no); $n-1=3$ 7, solutions abound (easy-yes); near $n-1=3$ 8, maximal search difficulty is encountered (maximal algorithmic hardness and time).

4. Instance Generation and Hardness Engineering

Several methods exist for generating Game of 24 instances and more generally Countdown problems:

Generation Method	Principle	Solution Count and Hardness
RG (Reasoning-Gym)	Sample numbers and random ops chain; target set to result if in range	Tends to produce many-solution, easy puzzles
SoS (Stream-of-Search)	BFS from target applying inverse ops	Rarely hard; computationally expensive for $n-1=3$ 9
CD (Candidate Diversity, Editor's term)	Sample inputs; generate many random reduction paths; retro-assign rarest resulting target as $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 0	Produces orders-of-magnitude fewer solutions; creates much harder instances

Empirical evaluation via DFS enumeration up to $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 1 demonstrates that the “CD” method yields instance distributions with one or two orders of magnitude fewer solutions than RG or SoS (Katz et al., 4 Aug 2025). This suggests high instance diversity and consistent hardness, crucial for benchmarking planning agents with minimal risk of memorization.

5. Algorithmic Approaches and Symbolic/LLM Planning

For the canonical Game of 24 and Countdown, both symbolic and learning-based agents are evaluated:

Symbolic Planners (e.g., ENHSP, AutoToS): If encoded in PDDL or directly synthesized into search code, planners can achieve 100% success for small $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 2. ENHSP ceases to scale beyond $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 3, while AutoToS code (blind DFS) can scale to $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 4. An easy–hard–easy phase transition in task difficulty is observed near $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 5, coinciding with the statistical analysis.
LLM-Assisted Methods:
- Planning modes: IO (direct prompting), CoT (chain-of-thought), ToT (tree-of-thought).
- On a widely used “24 Game” public benchmark, ToT/Llama 405B achieves $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 6 accuracy@5, ToT/Qwen 72B $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 7, CoT $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 8, and IO $\text{Game of 24} = \text{Countdown}(I_1 = \{a, b, c, d\}, O, \tau = 24)$ 9.
- On newly generated CD[4] puzzles (disjoint from training), performance dramatically drops: ToT/Llama to $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 0, ToT/Qwen to $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 1, CoT to $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 2, and IO to $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 3– $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 4. The significant performance drop is attributed to memorization effects on public datasets and the increased difficulty of “rare target” generation (Katz et al., 4 Aug 2025).

Notably, error analysis reveals LLMs frequently violate formatting, invent intermediate values, or miss the target; symbolic planners do not exhibit these errors.

6. GFlowNet Methods and Diversity in Solution Sampling

Recent work applies GFlowNets to the Game of 24, focusing on sampling diverse solution sequences:

The state space is represented as a DAG, transitions as valid binary reductions; terminal states with value 24 are rewarded, all others minimally so.
The trajectory balance objective is the principal training criterion:

$\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 5

There is no explicit entropy regularizer or subtrajectory balance penalty.

Test metrics include success rate (fraction of problems with at least one solution sampled in 20 tries) and trajectory count (distinct valid solutions per problem).
Empirically, GFlowNet fine-tuning on LLaMA-1B increases SR from $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 6 and trajectory count per solved problem to $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 7; for LLaMA-3B, SR moves from $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 8, trajectory count from $\Theta = \langle (x_1, o_1, y_1),\ldots,(x_3, o_3, y_3)\rangle$ 9 (Gupta et al., 3 Mar 2025).

A key limitation is that GFlowNet generalization to out-of-distribution puzzles (e.g., the Game of 42) is weak, with brittle dependency on sampling hyperparameters. The absence of explicit subtrajectory balancing can precipitate policy collapse onto dominant solution modes, limiting coverage. The introduction of entropy bonuses or larger, more diverse training sets is recommended.

7. Broader Context, Misconceptions, and Research Implications

The Game of 24 is frequently perceived as “simple” due to its small input size. However, the NP-completeness of Countdown embeds the 24 Game into a maximally challenging computational regime. Public 24 Game benchmarks have been contaminated by exposure in training data, markedly inflating performance of LLMs on “seen” instances while newly synthesized, hard instances expose dramatic accuracy drops. This calls for rigorous, diversity-focused instance generation (e.g., via CD) for meaningful planning benchmarks.

Empirical investigations corroborate analytic predictions from phase transition theory, with maximal search complexity and agent efficiency tightly peaking at the critical input size $\{24\}$ 0. For $\{24\}$ 1, real-time human or agent play, especially under time constraints or when faced with rare-target instances, remains computationally challenging (Lacasa et al., 2012, Katz et al., 4 Aug 2025).

Plausible implications are that the Game of 24 and its generalizations remain vital for evaluating both symbolic algorithms and emerging LLM/generative planner families, highlighting both the limits of current models and the need for robust, hard-instance generation in evaluation.