
AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale

Published 13 May 2025 in cs.CL (arXiv:2505.08311v2)

Abstract: We present AM-Thinking-v1, a 32B dense LLM that advances the frontier of reasoning, embodying the collaborative spirit of open-source innovation. Outperforming DeepSeek-R1 and rivaling leading Mixture-of-Experts (MoE) models like Qwen3-235B-A22B and Seed1.5-Thinking, AM-Thinking-v1 achieves impressive scores of 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, showcasing state-of-the-art mathematical and coding capabilities among open-source models of similar scale. Built entirely from the open-source Qwen2.5-32B base model and publicly available queries, AM-Thinking-v1 leverages a meticulously crafted post-training pipeline - combining supervised fine-tuning and reinforcement learning - to deliver exceptional reasoning capabilities. This work demonstrates that the open-source community can achieve high performance at the 32B scale, a practical sweet spot for deployment and fine-tuning. By striking a balance between top-tier performance and real-world usability, we hope AM-Thinking-v1 inspires further collaborative efforts to harness mid-scale models, pushing reasoning boundaries while keeping accessibility at the core of innovation. We have open-sourced our model on \href{https://huggingface.co/a-m-team/AM-Thinking-v1}{Hugging Face}.

Summary

Analysis of AM-Thinking-v1: Advancing Reasoning at Moderate Scale

The paper "AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale" details the development and capabilities of AM-Thinking-v1, a 32B dense LLM that sets new reasoning benchmarks among models of similar scale. The work by Yunjie Ji et al. makes a compelling case that mid-scale models can achieve robust performance without resorting to much larger and computationally expensive Mixture-of-Experts (MoE) architectures.

To start, AM-Thinking-v1's performance metrics are revealing. It scores 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, outperforming DeepSeek-R1 and performing comparably to far larger MoE models such as Qwen3-235B-A22B and Seed1.5-Thinking. Particularly noteworthy is its ability to rival models with a significantly higher number of active parameters, demonstrating the efficacy of its training approach and underlying architecture.

The paper emphasizes the model's training regime, which combines supervised fine-tuning (SFT) with reinforcement learning (RL). AM-Thinking-v1 builds upon the open-source Qwen2.5-32B base model and uses publicly available queries covering a comprehensive range of tasks, including mathematical reasoning, code generation, and scientific understanding. The two-stage RL method dynamically adjusts the difficulty of training queries and uses Group Relative Policy Optimization (GRPO) as the training algorithm. This attention to post-training detail underpins the model's strong reasoning capabilities.
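The core idea behind GRPO, as described above, is to replace a learned value baseline with a group-relative one: several responses are sampled per query, and each response's reward is normalized against the group's statistics. A minimal sketch of that advantage computation follows; the function name, the group size, and the 0/1 correctness rewards are illustrative assumptions, not details taken from the paper.

```python
# Sketch of group-relative advantage computation in the spirit of GRPO.
# Names and values here are illustrative, not from the paper.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std.

    `rewards` holds the scalar rewards for the G responses sampled
    from one query; GRPO uses these normalized values as per-response
    advantages instead of a learned value-function baseline.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled responses to one query, scored 0/1 for correctness.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are centered within each group, correct responses are pushed up exactly as much as incorrect ones are pushed down, which is what makes verifiable 0/1 rewards usable without a critic.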

The implications of this research are multidimensional. Practically, it highlights that moderate-sized dense models can achieve advanced reasoning skills without the substantial infrastructure overhead associated with large-scale MoE systems. Theoretically, it suggests that careful post-training design, including difficulty-aware query selection and structured response generation, can bridge existing performance gaps between moderately-sized dense models and expansive MoE architectures.
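The difficulty-aware query selection mentioned above can be pictured as filtering the RL query pool by the model's current pass rate, keeping problems that are neither trivially solved nor hopeless. The sketch below is a minimal illustration of that idea under assumed thresholds; the band (0.1, 0.9), the helper name, and the pass-rate estimates are hypothetical, not values reported in the paper.

```python
# Illustrative difficulty-aware query selection: retain only queries
# whose estimated pass rate falls in a mid-range band, so training
# focuses on problems the model can still learn from.
# Thresholds and names are assumptions, not from the paper.

def select_queries(pass_rates, low=0.1, high=0.9):
    """Return indices of queries with pass rate strictly inside (low, high)."""
    return [i for i, p in enumerate(pass_rates) if low < p < high]

# Example: pass rates estimated from k sampled rollouts per query.
kept = select_queries([0.0, 0.25, 0.5, 1.0, 0.95])
```

In practice such pass rates would be re-estimated as training progresses, so the retained set shifts toward harder queries as the model improves.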

Future developments in AI could see further exploration of optimizing mid-scale models for various tasks or extending similar training methodologies to smaller or more domain-specific models, enhancing accessibility and maintainability. Additionally, addressing limitations such as supporting structured function-calling, tool use, and multimodal inputs might broaden the applicability of such models across diverse contexts.

In conclusion, AM-Thinking-v1 marks a pivotal step towards harnessing mid-scale LLMs to push the boundaries of reasoning performance while balancing efficiency and deployability. As the community considers future directions, this paper provides a valuable reference point on leveraging careful post-training design to maximize model capabilities at moderate scales.
