Hanoi-World: Puzzles & Autonomous Driving

Updated 11 January 2026

Hanoi-World is a dual-concept framework integrating elaborate Tower of Hanoi extensions and advanced reinforcement learning models for autonomous driving.
In puzzles, it generalizes the classic Tower of Hanoi by introducing colored disks, flipping rules, and recursive strategies that yield non-binary growth rates.
For autonomous driving, it employs a joint embedding predictive architecture with recurrent planning and reward shaping to enhance safety and efficiency.

Hanoi-World encompasses two distinct yet thematically linked threads in recent research: (1) the combinatorial generalization of Tower of Hanoi puzzles under additional constraints and colorings (especially the Magnetic Tower of Hanoi and its "Hanoi-World" state spaces), and (2) the application of advanced joint embedding architectures for autonomous vehicle world modeling and planning, as embodied by the HanoiWorld world model for reinforcement learning-based driving. This article reviews the principles, algorithmic mechanics, mathematical properties, and evaluation protocols governing these developments, with precise reference to (Levy, 2010) and (Dat et al., 4 Jan 2026).

1. Definition and Conceptual Scope

The term "Hanoi-World" denotes both a mathematically enriched state space structure for combinatorial puzzles such as the Magnetic Tower of Hanoi (MToH) (Levy, 2010) and, independently, a Joint Embedding Predictive Architecture (JEPA)-based world model for sample-efficient, safety-aware autonomous driving (Dat et al., 4 Jan 2026). In combinatorics, "Hanoi-World" refers to an extension of the classic 3-peg Tower of Hanoi (solved in $2^N-1$ moves) to "base-3" puzzles whose move counts grow as $3^N$, via added disk colorations and flipping rules, leading to a family of algorithms with varying move complexities. In machine learning, HanoiWorld is a model that leverages self-supervised visual embedding and recurrent planning for action selection in autonomous vehicles, with safety addressed through reward design in simulated traffic scenarios.

2. Combinatorial Hanoi-World: Magnetic Tower of Hanoi Extensions

The Magnetic Tower of Hanoi puzzle generalizes the Tower of Hanoi by granting each disk two colored faces and prescribing that no like colors may touch; disks are flipped with every move. Three major "flavors," or algorithmic regimes, are defined:

Fully Colored ("CMToH," '100%'): Each post is permanently colored (e.g., Red, Blue, Blue), restricting disk placement strictly by color mismatch.
Semi-Free ("SF"): Two posts are colored, the third initially free; enables hybrid strategies.
Free (Dynamically Colored): Post colors evolve as the disks move (posts may become Red, Blue, or Neutral).

Key constraints are the Size Rule (disks cannot be placed atop strictly smaller disks) and the Magnet Rule (never place a disk so that like colors are contiguous).
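The two rules can be expressed as a single legality check. The following is a minimal sketch, not the formulation of (Levy, 2010): the `Disk` class, the `is_legal_move` helper, and the conventions that an empty colored post behaves like a colored surface and that a neutral post (`None`) accepts any face are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

RED, BLUE = "R", "B"


def opposite(color: str) -> str:
    """Return the other face color of a two-colored disk."""
    return BLUE if color == RED else RED


@dataclass(frozen=True)
class Disk:
    size: int
    up: str  # color of the upper face; the lower face is the opposite color


def flipped(disk: Disk) -> Disk:
    """Every move turns the disk over, so the upper face changes color."""
    return Disk(disk.size, opposite(disk.up))


def is_legal_move(posts: List[List[Disk]],
                  post_colors: List[Optional[str]],
                  src: int, dst: int) -> bool:
    """Check the Size Rule and the Magnet Rule for moving the top disk src -> dst.

    posts[i] lists the disks on post i from bottom to top.
    post_colors[i] is RED, BLUE, or None (assumed to mean a neutral post).
    """
    if src == dst or not posts[src]:
        return False
    moving = flipped(posts[src][-1])      # the disk is flipped while it moves
    down_face = opposite(moving.up)       # face that will touch the destination
    if posts[dst]:
        top = posts[dst][-1]
        if moving.size > top.size:        # Size Rule: never on a smaller disk
            return False
        surface = top.up
    else:
        surface = post_colors[dst]        # assumption: empty post acts as a surface
    # Magnet Rule: like colors must not touch; a neutral surface accepts anything.
    return surface is None or surface != down_face


# Fully colored example with posts colored Red, Blue, Blue.
colors = [RED, BLUE, BLUE]
start = [[Disk(2, RED), Disk(1, RED)], [], []]
print(is_legal_move(start, colors, 0, 1))   # Red post -> Blue post
after = [[Disk(2, RED)], [Disk(1, BLUE)], []]
print(is_legal_move(after, colors, 1, 2))   # Blue post -> Blue post
```

Under these assumptions, a move between two posts of the same color is always rejected, because the flip leaves the disk's lower face matching the destination surface.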

Recursive relations and closed-form solutions describe minimal-move sequences in each regime. For instance, the fully colored variant has move count $S_{100}(N) = (3^N-1)/2$ , while optimized strategies exploiting neutrality reduce the leading constant (e.g., $S_{62}(N)\sim (67/108)\,S_{100}(N)$ , roughly 62% of the fully colored count).
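The closed form for the fully colored variant can be checked by brute force on small instances. The sketch below rests on a simplifying assumption: with posts permanently colored Red, Blue, Blue and a flip on every move, a disk can only travel between posts of different color, so the puzzle reduces to a three-post search with Blue-to-Blue moves forbidden. The function names and the choice of the Red post as source are illustrative.

```python
from collections import deque


def min_moves_colored(n, post_colors=("R", "B", "B"), src=0, dst=1):
    """Breadth-first search for the shortest solution with n disks.

    A state is a tuple whose entry i is the post holding disk i (disk 0 is
    the smallest). Moves between posts of the same color are skipped, which
    is the assumed consequence of flipping on permanently colored posts.
    """
    start, goal = (src,) * n, (dst,) * n
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return dist[state]
        for a in range(3):
            # The smallest disk on a post is the one on top.
            top_a = next((i for i, p in enumerate(state) if p == a), None)
            if top_a is None:
                continue
            for b in range(3):
                if post_colors[a] == post_colors[b]:
                    continue  # same color (or same post): not allowed
                top_b = next((i for i, p in enumerate(state) if p == b), None)
                if top_b is not None and top_b < top_a:
                    continue  # Size Rule: cannot sit on a smaller disk
                nxt = state[:top_a] + (b,) + state[top_a + 1:]
                if nxt not in dist:
                    dist[nxt] = dist[state] + 1
                    queue.append(nxt)
    return None


def closed_form(n):
    """S_100(N) = (3^N - 1) / 2 as quoted in the text."""
    return (3 ** n - 1) // 2


for n in range(1, 6):
    print(n, min_moves_colored(n), closed_form(n), 2 ** n - 1)
```

The last column prints the classic $2^N-1$ count for comparison, which makes the base-3 versus base-2 growth visible even at small $N$.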

Each flavor's solution exploits different decompositions of the underlying state-graph, which is a subgraph of the ternary cube $\{R,B,\varnothing\}^N$ , exhibiting $S_3$ symmetry. This combinatorial "Hanoi-World" illustrates how disk flipping and color-constraint management lead to richer classes of recursive algorithms with non-binary growth rates. Potential generalizations include more than two colors per disk, greater post counts, or weighted moves (Levy, 2010).

3. The HanoiWorld World Model for Autonomous Driving

The HanoiWorld architecture (Dat et al., 4 Jan 2026) consists of an encoder, joint embedding space, world-model dynamics, and actor–critic policy head:

Encoder: Utilizes a frozen, pretrained V-JEPA v2 (Vision Transformer, student–teacher) as a feature extractor to process BEV RGB frames (64×64). A lightweight MLP projects outputs to latent embeddings $z_t\in\mathbb{R}^{128}$ , with a 2D-CNN for masked patch prediction and VICReg regularization.
Joint Embedding Predictive Architecture: Embeddings $z_t$ are shared across RSSM world-model (posterior/prior) and actor–critic modules.
World Model: Integrates a Recurrent State Space Model (RSSM) with RNN-based deterministic states ( $h_t$ ), stochastic latents ( $z_t$ ), and learned reward $p_\phi(r_t|h_t,z_t)$ and continuation $p_\phi(c_t|h_t,z_t)$ predictors.
Policy: The actor $\pi_\theta(a_t|\ell_t)$ and critic $v_\psi(\ell_t)$ operate in latent space ( $\ell_t=(h_t,z_t)$ ), outputting continuous steering and acceleration.

A distinguishing feature is the abandonment of pixel-level reconstruction for world-model rollout, using embedding-level patch prediction as the self-supervised learning target.

4. Mathematical Formalism and Training Losses

JEPA Encoder Losses

Alignment loss: $L_{\text{align}} = \Vert P_\phi(\Delta, E_\theta(x)) - \text{sg}[E_{\bar\theta}(y)]\Vert_1$
Variance/Invariance/Covariance losses: As per VICReg, preventing collapse in embedding space.
Combined loss: $L_{\text{enc}} = \alpha L_{\text{align}} + \beta L_{\text{var}} + \gamma L_{\text{cov}}$
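The combined encoder objective can be illustrated on plain Python lists. This is a sketch in the spirit of VICReg, not the authors' implementation: the hinge target of 1.0 on the per-dimension standard deviation, the `eps` constant, and the weights `alpha`, `beta`, `gamma` are placeholder assumptions, and the stop-gradient on the target is implicit because no gradients are computed here.

```python
import math


def l1_alignment(pred, target):
    """Batch mean of the L1 norm between predicted and target embeddings."""
    total = sum(abs(p - t)
                for row_p, row_t in zip(pred, target)
                for p, t in zip(row_p, row_t))
    return total / len(pred)


def _centered(batch):
    """Subtract the per-dimension batch mean from every embedding."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    return n, d, [[row[j] - means[j] for j in range(d)] for row in batch]


def variance_loss(batch, target_std=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation (penalizes collapse)."""
    n, d, centered = _centered(batch)
    loss = 0.0
    for j in range(d):
        var = sum(row[j] ** 2 for row in centered) / (n - 1)
        loss += max(0.0, target_std - math.sqrt(var + eps))
    return loss / d


def covariance_loss(batch):
    """Sum of squared off-diagonal covariances, scaled by the dimension."""
    n, d, centered = _centered(batch)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(row[i] * row[j] for row in centered) / (n - 1)
                loss += cov ** 2
    return loss / d


def encoder_loss(pred, target, embeddings, alpha=1.0, beta=1.0, gamma=1.0):
    """L_enc = alpha * L_align + beta * L_var + gamma * L_cov (placeholder weights)."""
    return (alpha * l1_alignment(pred, target)
            + beta * variance_loss(embeddings)
            + gamma * covariance_loss(embeddings))


collapsed = [[0.5, -0.5]] * 4                       # every embedding identical
spread = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
print(variance_loss(collapsed), variance_loss(spread))
```

A collapsed batch incurs a large variance penalty while a spread, decorrelated batch incurs none, which is the mechanism that keeps the embedding space from degenerating.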

World-Model Losses

Predictive: $L_{\text{pred}} = -\mathbb{E}[\,\log p_\phi(x_t|h_t,z_t) + \log p_\phi(r_t|h_t,z_t) + \log p_\phi(c_t|h_t,z_t)\,]$
Dynamics KL: $L_{\text{dyn}}$ and $L_{\text{rep}}$ as KL divergences with stabilizing max.
Total: $L_{\text{world}} = L_{\text{pred}} + \lambda_{\text{dyn}}L_{\text{dyn}} + \lambda_{\text{rep}}L_{\text{rep}}$
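The role of the "stabilizing max" can be made concrete with a small numeric sketch. It assumes a DreamerV3-style formulation in which both KL terms are clipped from below at a free-nats threshold, and it uses categorical distributions for the prior and posterior; the threshold of 1.0 and the weights `lam_dyn` and `lam_rep` are illustrative placeholders rather than values reported for HanoiWorld.

```python
import math


def categorical_kl(q, p):
    """KL(q || p) for two categorical distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def world_model_loss(nll_pred, posterior, prior,
                     lam_dyn=0.5, lam_rep=0.1, free_nats=1.0):
    """L_world = L_pred + lam_dyn * L_dyn + lam_rep * L_rep.

    nll_pred is the already-computed negative log-likelihood term L_pred.
    In an autodiff framework L_dyn and L_rep differ only in where the
    stop-gradient sits (on the posterior or on the prior); without
    gradients the two values coincide, so one KL is computed and reused.
    """
    kl = categorical_kl(posterior, prior)
    l_dyn = max(free_nats, kl)   # clipping removes the incentive to push KL below the threshold
    l_rep = max(free_nats, kl)
    return nll_pred + lam_dyn * l_dyn + lam_rep * l_rep


# Identical prior and posterior: KL is 0, so both terms sit at the threshold.
print(world_model_loss(2.0, [0.5, 0.5], [0.5, 0.5]))
# Strongly mismatched distributions: the KL exceeds the threshold and passes through.
print(world_model_loss(2.0, [0.99, 0.01], [0.01, 0.99]))
```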

Actor–Critic Losses

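Value: $L_{\text{value}} = -\mathbb{E}[\log p_\psi(G^\lambda_t|\ell_t)]$
Actor: $L_{\text{actor}} = -\mathbb{E}[\log\pi_\theta(a_t|\ell_t)A_t + \beta_H H[\pi_\theta(\cdot|\ell_t)]]$ , with $A_t = \text{sg}[G^\lambda_t - v_\psi(\ell_t)]$ , where $G^\lambda_t$ is the $\lambda$-return and $\ell_t=(h_t,z_t)$ is the latent state on which the policy operates.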
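The target $G^\lambda_t$ is not spelled out above. A common choice in Dreamer-style agents is the bootstrapped $\lambda$-return, and the sketch below assumes that definition; the recursion, the discount `gamma`, and the mixing parameter `lam` are assumptions rather than details confirmed for HanoiWorld.

```python
def lambda_returns(rewards, continues, values, gamma=0.997, lam=0.95):
    """Bootstrapped lambda-returns over an imagined trajectory.

    rewards[t] and continues[t] cover steps 0..H-1; values has H+1 entries
    so that values[H] bootstraps the tail. Assumed recursion:
        G_t = r_t + gamma * c_t * ((1 - lam) * v_{t+1} + lam * G_{t+1}),
        G_H = v_H.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[-1]
    for t in reversed(range(horizon)):
        next_return = rewards[t] + gamma * continues[t] * (
            (1.0 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns


def advantages(returns, values):
    """A_t = G_t - v(l_t); treated as a constant (stop-gradient) in the actor loss."""
    return [g - v for g, v in zip(returns, values)]


rewards = [1.0, 1.0, 1.0]
continues = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 5.0]
print(lambda_returns(rewards, continues, values, gamma=1.0, lam=1.0))  # Monte Carlo limit
print(lambda_returns(rewards, continues, values, gamma=1.0, lam=0.0))  # one-step TD limit
```

Setting `lam=1.0` recovers the full discounted sum plus the bootstrap value, while `lam=0.0` reduces to one-step temporal-difference targets; intermediate values trade bias against variance.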

5. Planning Dynamics and Inference

During training, imagined planning (rollouts over a finite horizon in latent space) is performed using the RSSM recurrent structure. Real states obtained from the environment seed each rollout, and at every planning step the policy action, next latent state, reward, and continuation flag are predicted in latent space. At deployment, inference directly encodes each new observation, updates the recurrent state, and applies the policy, achieving sub-millisecond step time with no beam search.
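The imagination loop has a simple generic shape, sketched here with toy stand-ins. The `imagine` function and the four callables are hypothetical placeholders for the learned policy, RSSM prior, reward head, and continuation head; the scalar "latent state" is used only to keep the example self-contained.

```python
def imagine(start_state, policy, dynamics, reward_fn, continue_fn, horizon):
    """Roll a policy forward inside a learned model, never touching the environment.

    start_state is a latent state obtained by encoding a real observation.
    Returns a list of (state, action, reward, continuation) tuples.
    """
    trajectory = []
    state = start_state
    for _ in range(horizon):
        action = policy(state)              # actor acts on the latent state
        state = dynamics(state, action)     # model predicts the next latent state
        reward = reward_fn(state)           # learned reward head
        cont = continue_fn(state)           # learned continuation head
        trajectory.append((state, action, reward, cont))
    return trajectory


# Toy stand-ins: a scalar "latent", a constant policy, and hand-written heads.
toy_trajectory = imagine(
    start_state=0.0,
    policy=lambda s: 1.0,
    dynamics=lambda s, a: s + a,
    reward_fn=lambda s: -abs(s - 3.0),
    continue_fn=lambda s: 0.0 if s >= 3.0 else 1.0,
    horizon=4,
)
for step in toy_trajectory:
    print(step)
```

The rewards and continuation flags collected this way are what a return estimator and the actor and critic losses would consume during training.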

6. Safety-Awareness and Reward Shaping

Safety is incorporated into HanoiWorld at the model-reward design level:

Reward shaping: Penalizes collisions (e.g., highway: $-5.0$ ), tailgating, and unsafe behaviors; rewards lane-alignment, survival, and success.
Continuation predictor: Learns to signal episode termination post-collision.
No explicit constraint loss: Safety emerges from incentivized policy shaping via $r_t$ .
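A shaped reward of this kind might look like the following sketch. Only the highway collision penalty of $-5.0$ comes from the text; every other coefficient, the headway threshold, and the `collision_overrides` switch are hypothetical values chosen for illustration.

```python
def shaped_reward(collided, headway_s, lane_offset_m, reached_goal, *,
                  collision_penalty=-5.0,    # highway value quoted in the text
                  tailgate_penalty=-0.5,     # hypothetical
                  min_headway_s=1.0,         # hypothetical threshold in seconds
                  lane_bonus=0.2,            # hypothetical
                  survival_bonus=0.1,        # hypothetical
                  success_bonus=1.0,         # hypothetical
                  collision_overrides=False):
    """Combine safety penalties and progress bonuses into one scalar reward."""
    reward = 0.0
    if headway_s < min_headway_s:
        reward += tailgate_penalty                     # following too closely
    reward += lane_bonus * max(0.0, 1.0 - abs(lane_offset_m))  # lane alignment
    if reached_goal:
        reward += success_bonus
    if collided:
        if collision_overrides:
            return collision_penalty                   # the crash wipes out all bonuses
        return reward + collision_penalty              # bonuses partly offset the crash
    return reward + survival_bonus


print(shaped_reward(True, 2.0, 0.0, False))
print(shaped_reward(True, 2.0, 0.0, False, collision_overrides=True))
```

When bonuses are still paid on a collision step, the effective penalty is diluted, which is one way shaped rewards can leave room for exploitation by the policy.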

This approach is intended to counteract tendencies of RL agents to exploit unmodeled loopholes and to ground driving policy in physically and ethically salient criteria (Dat et al., 4 Jan 2026).

7. Training, Evaluation, and Empirical Results

Training proceeds via curriculum learning over Gymnasium environments (highway-v0, merge-v0, roundabout-v0). The encoder is first pretrained on nuScenes BEV data, followed by concurrent model and actor–critic updates with uniform sampling from recent episode buffers.

Evaluation results, reported as mean ± standard deviation, are:

Method	Highway Collision Rate	Merge Collision Rate	Roundabout Collision Rate	Highway Avg. Reward	Merge Avg. Reward	Roundabout Avg. Reward
DreamerV3	0.550 ± 0.497	0.030 ± 0.171	0.500 ± 0.500	51.065 ± 52.700	41.973 ± 2.896	9.423 ± 6.171
VQ-VAE	1.000 ± 0.000	0.290 ± 0.454	0.570 ± 0.495	3.121 ± 12.047	30.114 ± 11.664	3.826 ± 4.643
HanoiWorld	0.200 ± 0.400	0.970 ± 0.170	0.340 ± 0.473	13.163 ± 23.277	13.480 ± 5.703	9.818 ± 6.252
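The ± values attached to the collision rates appear consistent with the per-episode standard deviation of a binary collision indicator, $\sqrt{p(1-p)}$, rather than a standard error. The check below recomputes that quantity from the tabulated means; the dictionary layout and helper names are illustrative.

```python
import math

# (mean collision rate, reported spread) for highway, merge, roundabout.
collision_rates = {
    "DreamerV3": [(0.550, 0.497), (0.030, 0.171), (0.500, 0.500)],
    "VQ-VAE": [(1.000, 0.000), (0.290, 0.454), (0.570, 0.495)],
    "HanoiWorld": [(0.200, 0.400), (0.970, 0.170), (0.340, 0.473)],
}


def bernoulli_std(p):
    """Standard deviation of a 0/1 outcome that equals 1 with probability p."""
    return math.sqrt(p * (1.0 - p))


def max_abs_gap(table):
    """Largest difference between the reported spread and sqrt(p * (1 - p))."""
    return max(abs(bernoulli_std(p) - reported)
               for rows in table.values()
               for p, reported in rows)


for method, rows in collision_rates.items():
    print(method, [round(bernoulli_std(p), 3) for p, _ in rows])
print("largest gap:", round(max_abs_gap(collision_rates), 4))
```

Every recomputed value lands within about 0.001 of the tabulated spread, so the large ± figures reflect the binary nature of the metric rather than unusually noisy evaluation.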

HanoiWorld achieves the lowest collision rates in the highway (0.200) and roundabout (0.340) environments, and its roundabout average reward (9.818) is the highest of the three methods, although within one standard deviation of DreamerV3 (9.423). Its highway average reward (13.163) remains well below that of DreamerV3 (51.065). Its inferior merge-v0 performance (collision rate 0.970) is attributed to "reward hacking", where collisions retain some shaped reward. The entropy regularization parameter $\beta_H$ directly modulates the conservativeness of the driving policy, with lower values yielding riskier behavior and higher values resulting in overly cautious (potentially deadlocked) behavior (Dat et al., 4 Jan 2026).

8. Connections and Generalizations

The combinatorial "Hanoi-World" provides a mathematical metaphor for complex state-space traversals with non-trivial constraints, analogous to higher-dimensional lattice walks and colored Gray codes. The world model “HanoiWorld” in the RL context shares this spirit: rather than modeling raw pixel transitions, it operates in a compositional latent geometry, using JEPA objectives and sequential reasoning akin to combinatorial search through constrained configuration spaces.

Possible research directions include the extension to more complex action/state spaces, alternative reward structures, or generalizations of the magnetic-Hanoi puzzle to higher "bases" or further symmetry groups, as alluded to in (Levy, 2010) and structurally echoed by the latent space design in (Dat et al., 4 Jan 2026).

References

"The Magnetic Tower of Hanoi" (Levy, 2010)
"HanoiWorld : A Joint Embedding Predictive Architecture Based World Model for Autonomous Vehicle Controller" (Dat et al., 4 Jan 2026)
