LongCat Flash Thinking 2601
- The paper introduces a 560-billion-parameter MoE design that combines domain-parallel training with expert fusion to achieve dense-model-like inference efficiency.
- It integrates asynchronous reinforcement learning and a Heavy Thinking mode, enhancing robust agentic reasoning across diverse tasks and benchmarks.
- The unified training framework and advanced noise modeling techniques substantially improve generalization and resilience in real-world tool-integrated AI environments.
LongCat-Flash-Thinking-2601 is a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model that attains state-of-the-art performance in agentic search, tool-use, and tool-integrated reasoning tasks. The model’s superior generalization, robustness, and agentic capabilities are achieved through an integrated architecture and training pipeline—combining domain-parallel expert training with expert fusion, asynchronous reinforcement learning at extreme scale, explicit noise modeling, advanced environment construction, and a specialized Heavy Thinking mode for test-time scaling. These innovations collectively deliver both performance and efficiency, positioning LongCat-Flash-Thinking-2601 as a leading open-source agentic reasoning model (Team et al., 23 Jan 2026).
1. Mixture-of-Experts Architecture and Computational Design
LongCat-Flash-Thinking-2601 is implemented as a sparse MoE Transformer comprising 560 billion total parameters, with ~27 billion parameters activated per token on average (Sec 1). At each MoE block, a gating network projects the token representation $x$ with a learned matrix $W_g$ to compute logits $z = W_g x$, producing routing probabilities $p_i = \mathrm{softmax}(z)_i$ for $i = 1, \dots, E$. Each token is routed to its top-$k$ experts, with expert outputs $E_i(x)$. The final output for the layer is $y = \sum_{i \in \mathrm{TopK}(p)} p_i \, E_i(x)$.
A load-balancing regularizer, $\mathcal{L}_{\mathrm{bal}} = \alpha \sum_{i=1}^{E} f_i \bar{p}_i$ (with $f_i$ the fraction of tokens dispatched to expert $i$, $\bar{p}_i$ its mean routing probability, and $\alpha > 0$ a small coefficient), encourages uniform expert usage. By tying gates across several MoE layers and interleaving them with standard transformer layers, the model achieves large capacity at inference costs comparable to a dense ~27B model.
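The routing computation above can be sketched as follows. This is a minimal single-token illustration using standard MoE conventions; the shapes, the NumPy formulation, and the auxiliary-loss scaling are assumptions for clarity, not the paper's implementation.

```python
import numpy as np

def moe_layer(x, W_g, experts, k=2):
    """Route one token through a top-k MoE block (illustrative sketch).

    x: (d,) token representation; W_g: (E, d) gating projection;
    experts: list of E callables mapping (d,) -> (d,).
    """
    z = W_g @ x                                  # gating logits z = W_g x
    p = np.exp(z - z.max()); p /= p.sum()        # softmax routing probabilities
    top = np.argsort(p)[-k:]                     # indices of the top-k experts
    y = sum(p[i] * experts[i](x) for i in top)   # y = sum_i p_i * E_i(x)
    return y, p, top

def load_balance_loss(route_fracs, mean_probs, alpha=0.01):
    """Auxiliary loss alpha * E * sum_i f_i * p_bar_i, minimized when
    tokens are spread uniformly across the E experts."""
    E = len(route_fracs)
    return alpha * E * float(np.dot(route_fracs, mean_probs))
```

Because `load_balance_loss` is minimized under a uniform routing distribution, adding it to the task loss discourages the gate from collapsing onto a few experts.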
Zero-computation experts and shortcut-connected MoE (ScMoE) as pioneered in previous LongCat works (Team et al., 1 Sep 2025, Team et al., 23 Sep 2025) further optimize resource usage. Zero-computation experts pass tokens through at negligible FLOPs, while ScMoE overlaps dense feed-forward and expert routing for improved computation–communication efficiency.
2. Unified Training Framework: Domain Parallelism and Co-Design
The training procedure comprises: (i) generic pre-training, (ii) agentic mid-training with structured tool-use data, (iii) multi-domain post-training via reinforcement learning (Sec 2, Sec 3). During post-training, domain-parallel expert training assigns specialized expert sets to each domain (e.g. coding, web, database), such that within each RL stage, only relevant expert subsets are updated (Eq (1)). After domain-wise RL, experts are fused into a unified MoE model and gating parameters are re-optimized for seamless routing.
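The domain-parallel update and subsequent fusion step can be sketched as below. The domain names, the dict-based parameter layout, and the disjoint expert ownership are assumptions for illustration, not the paper's actual partitioning.

```python
import numpy as np

# Hypothetical mapping from RL domain to the expert indices it may update.
DOMAIN_EXPERTS = {"coding": [0, 1], "web": [2, 3], "database": [4, 5]}

def domain_update(params, grads, domain, lr=1e-2):
    """Apply a gradient step only to the expert subset owned by `domain`,
    leaving all other experts frozen during that domain's RL stage."""
    for i in DOMAIN_EXPERTS[domain]:
        params[i] = params[i] - lr * grads[i]
    return params

def fuse(per_domain_params):
    """Assemble one unified expert table from the domain-specialized
    copies; gating parameters would then be re-optimized over this
    fused table for seamless routing."""
    fused = {}
    for domain, indices in DOMAIN_EXPERTS.items():
        for i in indices:
            fused[i] = per_domain_params[domain][i]
    return fused
```

The key property is that each domain's RL stage touches a disjoint slice of the expert table, so fusion is a simple re-assembly rather than a weight-averaging step.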
Training is conducted via an explicit co-design loop linking dataset construction, environment scaling, and RL algorithms. Hybrid trajectory synthesis (combining text, environment, and planning augmentations) and controlled environment expansion ensure coverage across 10,000+ environments and 20+ domains (Sec 3.1.1). The environment scaling process samples tool-chains, expands dependency graphs, and allocates rollouts using complexity and solver-difficulty measures (Eq (2)–(4)), yielding tasks averaging 20–80 nodes with varying graph densities.
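A difficulty-guided rollout allocation of the kind described above might look like the following sketch. The product of complexity and solver difficulty is an assumed stand-in for the paper's Eq (2)–(4), which are not reproduced here.

```python
def allocate_rollouts(tasks, budget):
    """Split a fixed rollout budget across tasks in proportion to a
    combined complexity * solver-difficulty score, so harder and more
    structurally complex tasks receive more rollouts. The scoring
    function is illustrative, not the paper's exact formula."""
    scores = [t["complexity"] * t["difficulty"] for t in tasks]
    total = sum(scores)
    # Guarantee at least one rollout per task so no environment starves.
    return [max(1, round(budget * s / total)) for s in scores]
```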
3. Large-Scale Asynchronous Reinforcement Learning with DORA
LongCat-Flash-Thinking-2601 extends the Dynamic ORchestration for Asynchronous rollout (DORA) system (Team et al., 23 Sep 2025) to support 32,000 concurrent environments (Fig 5) with multi-version rollouts and streaming pipelines. This is achieved through producer–consumer decomposition, streaming generation (prefill and decode), asynchronous KV-cache handling, and staleness control via importance clipping in GSPO.
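The consumer side of the producer–consumer decomposition with staleness control can be sketched as below. The queue layout and field names are illustrative assumptions; in the paper, residual off-policyness of accepted samples is further handled by importance clipping in GSPO.

```python
import queue

def drain_fresh(out_q, learner_version, max_staleness=1):
    """Drain completed rollouts from the producers' queue, keeping only
    trajectories whose generating policy version is within
    `max_staleness` steps of the learner's current version; anything
    staler is discarded rather than trained on."""
    fresh = []
    while True:
        try:
            item = out_q.get_nowait()
        except queue.Empty:
            break
        if learner_version - item["version"] <= max_staleness:
            fresh.append(item)
    return fresh
```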
At each optimization step, the policy generates samples using potentially multiple model versions. For each group of $G$ rollouts $\{y_i\}_{i=1}^{G}$, the loss is computed as $\mathcal{L}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \min\big(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}(s_i(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)$, where $s_i(\theta) = \big(\pi_\theta(y_i \mid x)/\pi_{\theta_{\mathrm{old}}}(y_i \mid x)\big)^{1/|y_i|}$ is the length-normalized sequence-level importance ratio and $\hat{A}_i$ is the advantage estimate.
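The clipped sequence-level objective can be written compactly as follows. This is a sketch in the style of GSPO under the definitions above; the array layout and the value of `eps` are assumptions.

```python
import numpy as np

def gspo_loss(logp_new, logp_old, lengths, advantages, eps=0.2):
    """Clipped, sequence-level policy loss in the style of GSPO.

    logp_new / logp_old: (G,) total sequence log-probabilities under the
    current and behavior policies; lengths: (G,) token counts used for
    length normalization; advantages: (G,) advantage estimates.
    """
    s = np.exp((logp_new - logp_old) / lengths)   # length-normalized ratio s_i
    clipped = np.clip(s, 1 - eps, 1 + eps)        # importance clipping
    return -float(np.mean(np.minimum(s * advantages, clipped * advantages)))
```

On-policy data (identical log-probs) reduces the loss to the negative mean advantage, which is a quick sanity check when wiring this into an asynchronous trainer.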
This asynchronous architecture yields wall-clock speedups exceeding 3× over conventional synchronous PPO on MoE models spanning tens of thousands of GPUs (Sec 3.1.2), plus 1.5× faster rollouts through kernel fusion and graph optimizations.
4. Environment Scaling, Robustness, and Noise Modeling
The automatic environment scaling pipeline converts abstract domain specifications into tool-graph environments, leveraging structural complexity and difficulty-guided sampling to generate tasks with broad tool diversity. Task construction includes instruction synthesis, user profiles, and verification rubrics aligned with each tool-chain (Sec 3.1.2).
Robustness to real-world noise is addressed via noise-injection curricula that paraphrase instructions, add ambiguity, and simulate tool failures, while ensuring every task remains solvable. Noise severity is increased adaptively based on validation accuracy, and a robustness-gap metric tracks progress. Empirical ablation yields improvements of +7.8/+8.3 points on VitaBench-Noise and τ²-Bench-Noise respectively (Table 3).
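An adaptive severity schedule of the kind described above can be sketched as a simple feedback rule. The target accuracy, step size, and hysteresis band here are illustrative values, not the paper's settings.

```python
def update_noise_severity(severity, val_acc, target=0.7, step=0.05, band=0.1):
    """Adaptive noise curriculum: ramp injection severity up while
    validation accuracy stays at or above the target, back off when it
    falls below target - band, and hold steady in between. Severity is
    kept in [0, 1]."""
    if val_acc >= target:
        return min(1.0, severity + step)
    if val_acc < target - band:
        return max(0.0, severity - step)
    return severity
```

The hysteresis band between the two thresholds prevents the schedule from oscillating when validation accuracy hovers near the target.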
5. Heavy Thinking Mode: Parallel and Reflective Reasoning
To optimize inference on complex reasoning, LongCat-Flash-Thinking-2601 introduces Heavy Thinking mode (Sec 4). This consists of parallel generation of $n$ distinct reasoning chains, each to depth $d$, followed by reflective summarization with a summary model that selects the final answer. Total inference cost scales as $O(n \cdot d)$, with $n$ and $d$ controlling the breadth and depth of reasoning. RL fine-tuning of the summarization phase yields performance gains, especially under agentic or noisy conditions.
Sweeps over the number of parallel chains and their depth indicate that moderate settings already achieve notable improvements (Fig 9), with Heavy Thinking mode contributing an additional +4–6 points on demanding agentic reasoning benchmarks compared to self-consistency or single-chain generation.
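The parallel-then-summarize control flow can be sketched in a few lines. `generate` and `summarize` are stand-ins for calls to the reasoning and summary models; the function names and signatures are assumptions, not the paper's API.

```python
def heavy_thinking(prompt, generate, summarize, n=8):
    """Heavy Thinking sketch: sample n independent reasoning chains
    (seeded differently so they diverge), then defer to a reflective
    summary model that reads all chains and emits the final answer."""
    chains = [generate(prompt, seed=i) for i in range(n)]
    return summarize(prompt, chains)
```

Unlike plain self-consistency (majority vote over final answers), the summarizer here sees the full chains, so it can weigh reasoning quality rather than only answer frequency.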
6. Empirical Results, Benchmarks, and Analysis
LongCat-Flash-Thinking-2601 outperforms all open-source competitors across 40+ agentic, reasoning, coding, and robustness benchmarks (Table 2):
- Agentic Search (BrowseComp Pass@1): 56.6% → 73.1% with context management.
- τ²-Bench (Avg@4): 80.6% → 88.2%; VitaBench: 24.0% → 29.3%.
- Noisy-setting comparison: τ²-Bench-Noise 64.1 → 67.1; VitaBench-Noise 14.0 → 20.5.
- Mathematical reasoning (AIME-25 Avg@16): 93.5% → 99.6% with Heavy Thinking mode.
- Coding: OJBench Pass@1 41.8% → 44.6%; SWE-bench Verified Avg@5 73.1% → 73.8%.
Ablation studies indicate that mid-training with structured agentic trajectories improves pass@k by up to +12 points, context management adds +17.5, dynamic rollout allocation adds +4.3, robust training closes noise gaps by +7.2, and Heavy Thinking mode outperforms naive ensemble methods (Fig 9).
7. Significance and Implications for Open-Source Agentic AI
LongCat-Flash-Thinking-2601 demonstrates the feasibility and practicality of training extremely large-scale MoE reasoning models with robust agentic capabilities, under realistic and noisy conditions, and across heterogeneous tools and environments. The synthesis of scalable MoE architecture, sophisticated RL infrastructure, adaptive data and environment construction, and test-time parallel reasoning—together with fully open weights—provides a reference platform for future research in agentic intelligence and efficient, high-utility reasoning models.
A plausible implication is that principled environment construction and tool integration, coupled with robust RL and scalable inference, will be foundational for subsequent advances in agentic AI (Team et al., 23 Jan 2026).