
MiniMax-M1: Verified Search & Hybrid Reasoning

Updated 27 January 2026
  • The paper introduces a formally verified Marsland-style minimax algorithm that uses transposition tables and fail-soft pruning to ensure correct alpha–beta search results.
  • MiniMax-M1 is a hybrid reasoning model that integrates Mixture-of-Experts with Lightning Attention to support native 1 million-token contexts and scalable RL training.
  • Experimental results indicate that MiniMax-M1 excels in long-context reasoning and tool use, outperforming similar models in efficiency and benchmark performance.

MiniMax-M1 refers to two distinct but influential systems: a formally verified Marsland-style minimax variant in the study of search and verification, and a large-scale, open-weight, hybrid-attention reasoning model in deep learning. The former, detailed in the context of the Dafny verification system, provides a rigorous foundation for understanding depth-limited, transposition-table-based minimax search. The latter, developed as MiniMax-M1 by MiniMax-AI, is a state-of-the-art LLM emphasizing compute efficiency, long-context reasoning, and reinforcement learning (RL) scalability through hybrid Mixture-of-Experts architectures and Lightning Attention. Both represent foundational advances in their respective fields, emphasizing correctness, efficiency, and scalability (Wesselink et al., 24 Sep 2025, MiniMax et al., 16 Jun 2025).

1. Definition and Historical Positioning

In algorithmic game search, MiniMax-M1 describes a depth-limited, transposition-table (TT) variant of the recursive minimax (negamax) framework, corresponding to the original Marsland-style M-variant with fail-soft window narrowing (Wesselink et al., 24 Sep 2025). This variant extends basic minimax by: enforcing a fixed search depth; using hash-mapped transposition tables for result caching; employing previously computed bounds to refine the α–β search window; and enabling both transposition- and α–β-based pruning.
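As a concrete (unverified) illustration of the scheme just described, the following Python sketch implements depth-limited, fail-soft negamax with a transposition table. The `children` and `evaluate` callbacks and the lower/upper/exact bound-flag encoding are assumptions of this sketch, not the paper's Dafny code.

```python
# Sketch: Marsland-style depth-limited fail-soft negamax with a
# transposition table. `children`/`evaluate` are hypothetical stand-ins
# for a real game interface.
from typing import Callable, Dict

INF = 10**9

def negamax_tt(node, depth: int, alpha: int, beta: int,
               children: Callable, evaluate: Callable,
               table: Dict) -> int:
    key = (node, depth)
    # Transposition-table lookup: reuse a cached bound to narrow the window.
    if key in table:
        flag, value = table[key]
        if flag == "exact":
            return value
        if flag == "lower":
            alpha = max(alpha, value)
        elif flag == "upper":
            beta = min(beta, value)
        if alpha >= beta:
            return value                  # table-based cutoff
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    best = -INF                           # fail-soft: may fall outside [alpha, beta]
    a = alpha
    for child in kids:
        best = max(best, -negamax_tt(child, depth - 1, -beta, -a,
                                     children, evaluate, table))
        a = max(a, best)
        if a >= beta:
            break                         # alpha-beta cutoff
    # Store with the bound type implied by the fail-soft result.
    if best <= alpha:
        table[key] = ("upper", best)
    elif best >= beta:
        table[key] = ("lower", best)
    else:
        table[key] = ("exact", best)
    return best
```

Note that this sketch performs exactly the lower-bound table reuse whose soundness the verification effort scrutinizes; it illustrates the mechanism, not a proof of correctness.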

In large-scale machine learning, MiniMax-M1 is the first open-weight, hybrid-attention reasoning model to efficiently scale test-time compute, enabling native 1 million-token context and extensible “thinking budgets” (maximum token generation per inference or rollout), meeting the demands of long-context reasoning and complex tool use (MiniMax et al., 16 Jun 2025). Built upon MiniMax-Text-01, this model combines Mixture-of-Experts (MoE) with Lightning Attention to balance parameter scale and inference efficiency.

2. Algorithmic Architecture and Formal Specification

Game Search: Marsland-Style Minimax (M1)

MiniMax-M1 is specified as a depth-limited negamax algorithm incorporating transposition-table lookups and Fishburn’s fail-soft window narrowing. Its formal Dafny signature is:

method MinimaxM1(u: Node,
                 alpha0: bounded_int,
                 beta0: bounded_int,
                 depth: nat)
  returns (result: bounded_int)
  modifies this.T
  requires alpha0 < beta0
  requires turn_based()
  requires is_valid_table(T)
  ensures  is_negamax_tt_result(result, u, alpha0, beta0, depth)
  ensures  is_valid_table(T)

Loop invariants establish the correctness of the recursive traversal, value updates, and α–β window management. The witness-based postconditions require that, for a given call, the returned value corresponds to an expansion of the game tree that satisfies the “negamax with transposition-table” semantics. The pseudocode matches Marsland’s NegamaxTTM and annotates table lookup, fail-soft updates, and pruning at both lookup and child iteration.

Large-Scale Reasoning Model: Hybrid MoE and Lightning Attention

MiniMax-M1’s neural architecture comprises:

  • 456 billion total parameters split into 32 experts with per-token activation of 45.9 billion parameters (top-2 gating).
  • Hybrid block structure: Seven TransNormer blocks (implementing Lightning Attention) are interleaved with one Transformer block (softmax attention). Lightning Attention provides an I/O-aware linear-attention mechanism with O(nd^2) time and O(nd) memory, a significant scaling improvement over standard O(n^2 d) attention.
  • MoE auxiliary loss with a scaled load-balancing term, adjusted in continual pretraining to support large micro-batch sizes.
  • Native 1 million-token context support, achieved by incremental sequence-length scaling and streaming I/O to prevent memory explosion.
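The linear-attention idea underlying Lightning Attention can be sketched in NumPy: by associating (φ(Q)φ(K)ᵀ)V as φ(Q)(φ(K)ᵀV), causal attention is computed with a running d×d state in O(nd^2) time, rather than materializing the O(n^2) score matrix. The feature map, normalization, and shapes below are illustrative assumptions; the real kernel is blocked and I/O-aware.

```python
# Illustrative linear attention with a causal running state (not the
# actual Lightning Attention kernel). phi is an assumed positive
# feature map.
import numpy as np

def linear_attention_causal(Q, K, V):
    """Causal linear attention via a running state: O(n d^2) time."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # keep weights positive
    Qf, Kf = phi(Q), phi(K)
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of outer(k_t, v_t)
    z = np.zeros(d)                 # running sum of k_t, for normalization
    out = np.empty_like(V)
    for t in range(n):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z)
    return out
```

For each position the state update is constant-size, which is what makes million-token contexts tractable in memory compared with quadratic softmax attention.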

3. Training Methodologies and Optimization

Marsland-Style Minimax Verification

The algorithm’s correctness is established by the Dafny system, which mechanizes witness-based proof obligations:

  • The postcondition is_negamax_tt_result(r, u, α, β, d) requires the existence of a node expansion and corresponding negamax alpha–beta result.
  • Table-entry validity ensures each cached value matches the required lower/upper/exact semantics for admissible pruning.
  • Key lemmas (e.g., TableLookupReturnLemma, LoopBreakLemma, TableUpdateLemma) are instantiated to verify table reuse, pruning, and recursive invariants automatically.

In practice, M1’s algorithm is formally proved except in witness-violating scenarios where lower-bound reuse leads to counterexamples (Wesselink et al., 24 Sep 2025).

Hybrid Reasoning Model RL and CISPO

MiniMax-M1 employs large-scale RL over ~161,000 demanding tasks, including mathematical competition problems, logic puzzles, competitive programming, and real-world software engineering (with execution-based rewards in containerized sandboxes).

  • CISPO (Clipped IS-weight Policy Optimization): Instead of token-level clipping (as in PPO), CISPO clips importance-sampling weights within REINFORCE, providing unbiased gradients and ensuring no tokens are dropped—even for rare “fork” cases. The CISPO objective is:

$$J_{\mathrm{CISPO}}(\theta) = \mathbb{E}_{q, o} \left[ \frac{1}{\sum |o|} \sum_{i, t} \operatorname{sg}(\hat{r}_{i, t}) \cdot \widehat{A}_{i, t} \cdot \log \pi_\theta(o_{i, t} \mid \cdot) \right]$$

where $\hat{r}_{i, t} = \operatorname{clip}(r_{i, t},\, 1-\varepsilon_{\mathrm{low}}^{IS},\, 1+\varepsilon_{\mathrm{high}}^{IS})$ and $\widehat{A}_{i, t}$ is a group-relative advantage.

  • Curriculum learning schedules intensive reasoning followed by general-domain RL to avoid catastrophic forgetting.
  • Hardware/cost efficiency: Training ran on 512 H800 GPUs for three weeks (US\$534,700 in rental cost), with Lightning Attention enabling more than a 4× gain in rollout efficiency at 100K-token generation lengths.
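The CISPO objective above can be sketched as a forward-pass computation in NumPy. This is a hedged illustration with made-up token probabilities and advantages; in a real autodiff implementation, sg(·) would be a stop-gradient applied to the clipped importance-sampling weight.

```python
# Forward-pass sketch of the CISPO objective (illustrative values only).
import numpy as np

def cispo_objective(logp_new, logp_old, advantages,
                    eps_low=0.2, eps_high=0.2):
    """J = (1/sum|o|) * sum_t sg(r_hat_t) * A_t * log pi_theta(o_t)."""
    r = np.exp(logp_new - logp_old)                  # token-level IS weights
    r_hat = np.clip(r, 1.0 - eps_low, 1.0 + eps_high)
    # Unlike PPO-style ratio clipping, no token's gradient is zeroed out:
    # clipping only bounds the weight multiplying each log-prob term.
    return (r_hat * advantages * logp_new).sum() / logp_new.size
```

The key contrast with PPO is visible in the last line: every token contributes a log-probability term to the objective, so rare but important "fork" tokens are never dropped.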

4. Experimental Results and Benchmark Performance

Marsland-Style Minimax (M1)

Worst-case time complexity remains O(b^d), where b is the branching factor and d the search depth. However, tree size can shrink dramatically due to table and α–β pruning. The Dafny artifacts comprise ~600 lines of main code and 250 lines of proofs, verifying that M1 correctly prunes only when the witness criterion is met, and pinpointing specific lower-bound table reuse that violates soundness (Wesselink et al., 24 Sep 2025).
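The gap between the O(b^d) worst case and the pruned tree can be made concrete with a toy node count. The uniform tree, leaf values, and counters below are assumptions of this sketch, not measurements from the paper.

```python
# Toy comparison: nodes expanded by full minimax vs. alpha-beta pruning
# on a uniform tree of branching factor b and depth d.
def count_plain(b: int, d: int) -> int:
    """Full minimax expands every node: 1 + b + b^2 + ... + b^d."""
    return sum(b**i for i in range(d + 1))

def count_alphabeta(b: int, d: int, leaf_values) -> int:
    """Nodes expanded by fail-soft negamax with alpha-beta cutoffs."""
    leaves = iter(leaf_values)
    count = 0
    def search(depth, alpha, beta):
        nonlocal count
        count += 1
        if depth == 0:
            return next(leaves)
        best = float("-inf")
        for _ in range(b):
            best = max(best, -search(depth - 1, -beta, -alpha))
            alpha = max(alpha, best)
            if alpha >= beta:
                break            # remaining siblings are never expanded
        return best
    search(d, float("-inf"), float("inf"))
    return count
```

For example, with b = 2, d = 2 and leaf values [3, 5, 1, 9], full minimax expands 7 nodes while alpha-beta expands 6, cutting off the final subtree; with good move ordering the savings grow rapidly with depth.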

MiniMax-M1 Reasoning Model

MiniMax-M1-80k demonstrates strong open-weight performance across diverse long-context and agentic tool use benchmarks:

| Task/Benchmark | Metric | MiniMax-M1-80k Result |
| --- | --- | --- |
| AIME 2024 | pass@N | 86.0% (2nd among open-weight models) |
| SWE-bench Verified | execution rate | 56.0% |
| OpenAI-MRCR (128K) | accuracy | 73.4% |
| MRCR (1M) | accuracy | 56.2% |
| TAU-bench (airline) | success rate | 62.0% |

MiniMax-M1 consistently outperforms or matches DeepSeek-R1 and Qwen3-235B models, particularly excelling in software engineering, tool use, and multi-step reasoning at large context scales (MiniMax et al., 16 Jun 2025). Ablations confirm performance increases as the thinking budget scales from 40,000 to 80,000 tokens.

5. Public Releases, Deployment, and Engineering Considerations

MiniMax-M1 is released in two variants corresponding to their thinking budgets:

  • MiniMax-M1-40k: max 40,000 tokens per inference/generation (intermediate RL checkpoint).
  • MiniMax-M1-80k: max 80,000 tokens (full RL-trained model).

“Thinking budget” designates the token ceiling for chained reasoning at inference or RL rollout. Native framework integration is provided for vLLM and Transformers, including custom Lightning Attention kernels and streaming I/O-aware interfaces. Both code and weights are available on GitHub and Hugging Face, with commercial deployments accessible via API (https://minimax.io). Optimization strategies, such as FP32 LM head for reward-probability alignment, AdamW tuning, and early truncation for pathological sequence control, enhance stability and alignment.

6. Impact, Verification, and Open Questions

MiniMax-M1 in formal algorithmic search provides a verified reference for understanding subtle transposition table interactions in fail-soft minimax algorithms. It demonstrates the concrete limits of lower-bound table reuse by isolating witness violations, offering a template for mechanized verification of more complex search variants (Wesselink et al., 24 Sep 2025).

In neural reasoning, MiniMax-M1 showcases the viability of hybrid MoE-Lightning Attention architectures for tractable test-time compute at unprecedented context scales. The model’s efficiency, extensibility, and open-weight release establish a new baseline for long-context and tool-reasoning models. The conceptual integration of large-scale machine-verified correctness and deployable deep reasoning agents remains a significant direction for future research.

References

  • Wesselink et al., 24 Sep 2025.
  • MiniMax et al., 16 Jun 2025.