
MiniMax-M1: Verified Search & Hybrid Reasoning

Updated 27 January 2026
  • The paper introduces a formally verified Marsland-style minimax algorithm that uses transposition tables and fail-soft pruning to ensure correct alpha–beta search results.
  • MiniMax-M1 is a hybrid reasoning model that integrates Mixture-of-Experts with Lightning Attention to support native 1 million-token contexts and scalable RL training.
  • Experimental results indicate that MiniMax-M1 excels in long-context reasoning and tool use, outperforming similar models in efficiency and benchmark performance.

MiniMax-M1 refers to two distinct but influential systems: a formally verified Marsland-style minimax variant in the study of search and verification, and a large-scale, open-weight, hybrid-attention reasoning model in deep learning. The former, detailed in the context of the Dafny verification system, provides a rigorous foundation for understanding depth-limited, transposition-table-based minimax search. The latter, developed as MiniMax-M1 by MiniMax-AI, is a state-of-the-art LLM emphasizing compute efficiency, long-context reasoning, and reinforcement learning (RL) scalability through hybrid Mixture-of-Experts architectures and Lightning Attention. Both represent foundational advances in their respective fields, emphasizing correctness, efficiency, and scalability (Wesselink et al., 24 Sep 2025, MiniMax et al., 16 Jun 2025).

1. Definition and Historical Positioning

In algorithmic game search, MiniMax-M1 describes a depth-limited, transposition-table (TT) variant of the recursive minimax (negamax) framework, corresponding to the original Marsland-style M-variant with fail-soft window narrowing (Wesselink et al., 24 Sep 2025). This variant extends basic minimax by: enforcing a fixed search depth; using hash-mapped transposition tables for result caching; employing previously computed bounds to refine the α–β search window; and enabling both transposition- and α–β-based pruning.
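As a concrete (unverified) illustration of the scheme just described, the following Python sketch implements depth-limited, fail-soft negamax with a transposition table. The `children` and `evaluate` callbacks and the lower/upper/exact bound-flag encoding are assumptions of this sketch, not the paper's Dafny code.

```python
# Sketch: Marsland-style depth-limited fail-soft negamax with a
# transposition table. `children`/`evaluate` are hypothetical stand-ins
# for a real game interface.
from typing import Callable, Dict

INF = 10**9

def negamax_tt(node, depth: int, alpha: int, beta: int,
               children: Callable, evaluate: Callable,
               table: Dict) -> int:
    key = (node, depth)
    # Transposition-table lookup: reuse a cached bound to narrow the window.
    if key in table:
        flag, value = table[key]
        if flag == "exact":
            return value
        if flag == "lower":
            alpha = max(alpha, value)
        elif flag == "upper":
            beta = min(beta, value)
        if alpha >= beta:
            return value                  # table-based cutoff
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    best = -INF                           # fail-soft: may fall outside [alpha, beta]
    a = alpha
    for child in kids:
        best = max(best, -negamax_tt(child, depth - 1, -beta, -a,
                                     children, evaluate, table))
        a = max(a, best)
        if a >= beta:
            break                         # alpha-beta cutoff
    # Store with the bound type implied by the fail-soft result.
    if best <= alpha:
        table[key] = ("upper", best)
    elif best >= beta:
        table[key] = ("lower", best)
    else:
        table[key] = ("exact", best)
    return best
```

Note that this sketch performs exactly the lower-bound table reuse whose soundness the verification effort scrutinizes; it illustrates the mechanism, not a proof of correctness.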

In large-scale machine learning, MiniMax-M1 is the first open-weight, hybrid-attention reasoning model to efficiently scale test-time compute, enabling native 1 million-token context and extensible “thinking budgets” (maximum token generation per inference or rollout), meeting the demands of long-context reasoning and complex tool use (MiniMax et al., 16 Jun 2025). Built upon MiniMax-Text-01, this model combines Mixture-of-Experts (MoE) with Lightning Attention to balance parameter scale and inference efficiency.

2. Algorithmic Architecture and Formal Specification

Game Search: Marsland-Style Minimax (M1)

MiniMax-M1 is specified as a depth-limited negamax algorithm incorporating transposition-table lookups and Fishburn’s fail-soft window narrowing. Its formal Dafny signature is:

method MinimaxM1(u: Node,
                 alpha0: bounded_int,
                 beta0: bounded_int,
                 depth: nat)
  returns (result: bounded_int)
  modifies this.T
  requires alpha0 < beta0
  requires turn_based()
  requires is_valid_table(T)
  ensures  is_negamax_tt_result(result, u, alpha0, beta0, depth)
  ensures  is_valid_table(T)

Loop invariants establish the correctness of the recursive traversal, value updates, and α–β window management. The witness-based postconditions require that, for a given call, the returned value corresponds to an expansion of the game tree that satisfies the “negamax with transposition-table” semantics. The pseudocode matches Marsland’s NegamaxTTM and annotates table lookup, fail-soft updates, and pruning at both lookup and child iteration.

Large-Scale Reasoning Model: Hybrid MoE and Lightning Attention

MiniMax-M1’s neural architecture comprises:

  • 456 billion total parameters split into 32 experts with per-token activation of 45.9 billion parameters (top-2 gating).
  • Hybrid block structure: Seven TransNormer blocks (implementing Lightning Attention) are interleaved with one Transformer block (softmax attention). Lightning Attention provides an I/O-aware linear-attention mechanism with O(nd^2) time and O(nd) memory, a significant scaling improvement over standard O(n^2 d) attention.
  • MoE auxiliary loss with a scaled load-balancing term, adjusted in continual pretraining to support large micro-batch sizes.
  • Native 1 million-token context support, achieved by incremental sequence-length scaling and streaming I/O to prevent memory explosion.
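The linear-attention idea underlying Lightning Attention can be sketched in NumPy: by associating (φ(Q)φ(K)ᵀ)V as φ(Q)(φ(K)ᵀV), causal attention is computed with a running d×d state in O(nd^2) time, rather than materializing the O(n^2) score matrix. The feature map, normalization, and shapes below are illustrative assumptions; the real kernel is blocked and I/O-aware.

```python
# Illustrative linear attention with a causal running state (not the
# actual Lightning Attention kernel). phi is an assumed positive
# feature map.
import numpy as np

def linear_attention_causal(Q, K, V):
    """Causal linear attention via a running state: O(n d^2) time."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # keep weights positive
    Qf, Kf = phi(Q), phi(K)
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of outer(k_t, v_t)
    z = np.zeros(d)                 # running sum of k_t, for normalization
    out = np.empty_like(V)
    for t in range(n):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z)
    return out
```

For each position the state update is constant-size, which is what makes million-token contexts tractable in memory compared with quadratic softmax attention.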

3. Training Methodologies and Optimization

Marsland-Style Minimax Verification

The algorithm’s correctness is established by the Dafny system, which mechanizes witness-based proof obligations:

  • The postcondition is_negamax_tt_result(r, u, α, β, d) requires the existence of a node expansion and corresponding negamax alpha–beta result.
  • Table-entry validity ensures each cached value matches the required lower/upper/exact semantics for admissible pruning.
  • Key lemmas (e.g., TableLookupReturnLemma, LoopBreakLemma, TableUpdateLemma) are instantiated to verify table reuse, pruning, and recursive invariants automatically.

In practice, M1’s algorithm is formally proved except in witness-violating scenarios where lower-bound reuse leads to counterexamples (Wesselink et al., 24 Sep 2025).

Hybrid Reasoning Model RL and CISPO

MiniMax-M1 employs large-scale RL over ~161,000 demanding tasks, including mathematical competition problems, logic puzzles, competitive programming, and real-world software engineering (with execution-based rewards in containerized sandboxes).

  • CISPO (Clipped IS-weight Policy Optimization): Instead of token-level clipping (as in PPO), CISPO clips importance-sampling weights within REINFORCE, providing unbiased gradients and ensuring no tokens are dropped—even for rare “fork” cases. The CISPO objective is:

$$J_{\mathrm{CISPO}}(\theta) = \mathbb{E}_{q, o} \left[ \frac{1}{\sum |o|} \sum_{i, t} \operatorname{sg}(\hat{r}_{i, t}) \cdot \widehat{A}_{i, t} \cdot \log \pi_\theta(o_{i, t} \mid \cdot) \right]$$

where $\hat{r}_{i, t} = \operatorname{clip}(r_{i, t},\, 1-\varepsilon_{\mathrm{low}}^{IS},\, 1+\varepsilon_{\mathrm{high}}^{IS})$ and $\widehat{A}_{i, t}$ is a group-relative advantage.

  • Curriculum learning schedules intensive reasoning followed by general-domain RL to avoid catastrophic forgetting.
  • Hardware/cost efficiency: Training ran on 512 H800 GPUs for three weeks (US\$534,700 in rental cost), with Lightning Attention enabling more than a 4× gain in rollout efficiency at 100K-token generation lengths.
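The CISPO objective above can be sketched as a forward-pass computation in NumPy. This is a hedged illustration with made-up token probabilities and advantages; in a real autodiff implementation, sg(·) would be a stop-gradient applied to the clipped importance-sampling weight.

```python
# Forward-pass sketch of the CISPO objective (illustrative values only).
import numpy as np

def cispo_objective(logp_new, logp_old, advantages,
                    eps_low=0.2, eps_high=0.2):
    """J = (1/sum|o|) * sum_t sg(r_hat_t) * A_t * log pi_theta(o_t)."""
    r = np.exp(logp_new - logp_old)                  # token-level IS weights
    r_hat = np.clip(r, 1.0 - eps_low, 1.0 + eps_high)
    # Unlike PPO-style ratio clipping, no token's gradient is zeroed out:
    # clipping only bounds the weight multiplying each log-prob term.
    return (r_hat * advantages * logp_new).sum() / logp_new.size
```

The key contrast with PPO is visible in the last line: every token contributes a log-probability term to the objective, so rare but important "fork" tokens are never dropped.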

4. Experimental Results and Benchmark Performance

Marsland-Style Minimax (M1)

Worst-case time complexity remains O(b^d), where b is the branching factor and d the search depth. However, tree size can shrink dramatically due to table and α–β pruning. The Dafny artifacts comprise ~600 lines of main code and 250 lines of proofs, verifying that M1 correctly prunes only when the witness criterion is met, and pinpointing specific lower-bound table reuse that violates soundness (Wesselink et al., 24 Sep 2025).
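The gap between the O(b^d) worst case and the pruned tree can be made concrete with a toy node count. The uniform tree, leaf values, and counters below are assumptions of this sketch, not measurements from the paper.

```python
# Toy comparison: nodes expanded by full minimax vs. alpha-beta pruning
# on a uniform tree of branching factor b and depth d.
def count_plain(b: int, d: int) -> int:
    """Full minimax expands every node: 1 + b + b^2 + ... + b^d."""
    return sum(b**i for i in range(d + 1))

def count_alphabeta(b: int, d: int, leaf_values) -> int:
    """Nodes expanded by fail-soft negamax with alpha-beta cutoffs."""
    leaves = iter(leaf_values)
    count = 0
    def search(depth, alpha, beta):
        nonlocal count
        count += 1
        if depth == 0:
            return next(leaves)
        best = float("-inf")
        for _ in range(b):
            best = max(best, -search(depth - 1, -beta, -alpha))
            alpha = max(alpha, best)
            if alpha >= beta:
                break            # remaining siblings are never expanded
        return best
    search(d, float("-inf"), float("inf"))
    return count
```

For example, with b = 2, d = 2 and leaf values [3, 5, 1, 9], full minimax expands 7 nodes while alpha-beta expands 6, cutting off the final subtree; with good move ordering the savings grow rapidly with depth.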

MiniMax-M1 Reasoning Model

MiniMax-M1-80k demonstrates strong open-weight performance across diverse long-context and agentic tool use benchmarks:

| Task/Benchmark | Metric | MiniMax-M1-80k Result |
| --- | --- | --- |
| AIME 2024 | pass@N | 86.0% (2nd among open-weight models) |
| SWE-bench Verified | execution rate | 56.0% |
| OpenAI-MRCR (128K) | accuracy | 73.4% |
| MRCR (1M) | accuracy | 56.2% |
| TAU-bench (airline) | success rate | 62.0% |

MiniMax-M1 consistently outperforms or matches DeepSeek-R1 and Qwen3-235B models, particularly excelling in software engineering, tool use, and multi-step reasoning at large context scales (MiniMax et al., 16 Jun 2025). Ablations confirm performance increases as the thinking budget scales from 40,000 to 80,000 tokens.

5. Public Releases, Deployment, and Engineering Considerations

MiniMax-M1 is released in two variants corresponding to their thinking budgets:

  • MiniMax-M1-40k: max 40,000 tokens per inference/generation (intermediate RL checkpoint).
  • MiniMax-M1-80k: max 80,000 tokens (full RL-trained model).

“Thinking budget” designates the token ceiling for chained reasoning at inference or RL rollout. Native framework integration is provided for vLLM and Transformers, including custom Lightning Attention kernels and streaming I/O-aware interfaces. Both code and weights are available on GitHub and Hugging Face, with commercial deployments accessible via API (https://minimax.io). Optimization strategies, such as FP32 LM head for reward-probability alignment, AdamW tuning, and early truncation for pathological sequence control, enhance stability and alignment.

6. Impact, Verification, and Open Questions

MiniMax-M1 in formal algorithmic search provides a verified reference for understanding subtle transposition table interactions in fail-soft minimax algorithms. It demonstrates the concrete limits of lower-bound table reuse by isolating witness violations, offering a template for mechanized verification of more complex search variants (Wesselink et al., 24 Sep 2025).

In neural reasoning, MiniMax-M1 showcases the viability of hybrid MoE-Lightning Attention architectures for tractable test-time compute at unprecedented context scales. The model’s efficiency, extensibility, and open-weight release establish a new baseline for long-context and tool-reasoning models. The conceptual integration of large-scale machine-verified correctness and deployable deep reasoning agents remains a significant direction for future research.

References

  • Wesselink et al., 24 Sep 2025.
  • MiniMax et al., 16 Jun 2025.