
AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale

Published 13 May 2025 in cs.CL (arXiv:2505.08311v2)

Abstract: We present AM-Thinking-v1, a 32B dense LLM that advances the frontier of reasoning, embodying the collaborative spirit of open-source innovation. Outperforming DeepSeek-R1 and rivaling leading Mixture-of-Experts (MoE) models like Qwen3-235B-A22B and Seed1.5-Thinking, AM-Thinking-v1 achieves impressive scores of 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, showcasing state-of-the-art mathematical and coding capabilities among open-source models of similar scale. Built entirely from the open-source Qwen2.5-32B base model and publicly available queries, AM-Thinking-v1 leverages a meticulously crafted post-training pipeline - combining supervised fine-tuning and reinforcement learning - to deliver exceptional reasoning capabilities. This work demonstrates that the open-source community can achieve high performance at the 32B scale, a practical sweet spot for deployment and fine-tuning. By striking a balance between top-tier performance and real-world usability, we hope AM-Thinking-v1 inspires further collaborative efforts to harness mid-scale models, pushing reasoning boundaries while keeping accessibility at the core of innovation. We have open-sourced our model on \href{https://huggingface.co/a-m-team/AM-Thinking-v1}{Hugging Face}.

Summary

Analysis of AM-Thinking-v1: Advancing Reasoning at Moderate Scale

The paper "AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale" details the development and capabilities of AM-Thinking-v1, a 32B dense LLM that sets new reasoning benchmarks among models of similar scale. The work by Yunjie Ji et al. makes a compelling case that mid-scale models can achieve robust performance without resorting to much larger and computationally expensive Mixture-of-Experts (MoE) architectures.

To start, AM-Thinking-v1's performance metrics are revealing. It scores 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, outperforming DeepSeek-R1 and performing comparably to far larger MoE models such as Qwen3-235B-A22B and Seed1.5-Thinking. Particularly noteworthy is its ability to rival models with a significantly higher number of active parameters, demonstrating the efficacy of its training approach and underlying architecture.

The paper emphasizes the model's training regime, which combines supervised fine-tuning (SFT) with reinforcement learning (RL). AM-Thinking-v1 builds upon the open-source Qwen2.5-32B base model and uses publicly available queries covering a comprehensive range of tasks, including mathematical reasoning, code generation, and scientific understanding. The two-stage RL method dynamically adjusts the difficulty of training queries and uses Group Relative Policy Optimization (GRPO) as the training algorithm. This attention to post-training detail underpins the model's strong reasoning capabilities.
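The core idea behind GRPO, as described above, is to replace a learned value baseline with a group-relative one: several responses are sampled per query, and each response's reward is normalized against the group's statistics. A minimal sketch of that advantage computation follows; the function name, the group size, and the 0/1 correctness rewards are illustrative assumptions, not details taken from the paper.

```python
# Sketch of group-relative advantage computation in the spirit of GRPO.
# Names and values here are illustrative, not from the paper.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std.

    `rewards` holds the scalar rewards for the G responses sampled
    from one query; GRPO uses these normalized values as per-response
    advantages instead of a learned value-function baseline.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled responses to one query, scored 0/1 for correctness.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are centered within each group, correct responses are pushed up exactly as much as incorrect ones are pushed down, which is what makes verifiable 0/1 rewards usable without a critic.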

The implications of this research are multidimensional. Practically, it highlights that moderate-sized dense models can achieve advanced reasoning skills without the substantial infrastructure overhead associated with large-scale MoE systems. Theoretically, it suggests that careful post-training design, including difficulty-aware query selection and structured response generation, can bridge existing performance gaps between moderately-sized dense models and expansive MoE architectures.
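The difficulty-aware query selection mentioned above can be pictured as filtering the RL query pool by the model's current pass rate, keeping problems that are neither trivially solved nor hopeless. The sketch below is a minimal illustration of that idea under assumed thresholds; the band (0.1, 0.9), the helper name, and the pass-rate estimates are hypothetical, not values reported in the paper.

```python
# Illustrative difficulty-aware query selection: retain only queries
# whose estimated pass rate falls in a mid-range band, so training
# focuses on problems the model can still learn from.
# Thresholds and names are assumptions, not from the paper.

def select_queries(pass_rates, low=0.1, high=0.9):
    """Return indices of queries with pass rate strictly inside (low, high)."""
    return [i for i, p in enumerate(pass_rates) if low < p < high]

# Example: pass rates estimated from k sampled rollouts per query.
kept = select_queries([0.0, 0.25, 0.5, 1.0, 0.95])
```

In practice such pass rates would be re-estimated as training progresses, so the retained set shifts toward harder queries as the model improves.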

Future developments in AI could see further exploration of optimizing mid-scale models for various tasks or extending similar training methodologies to smaller or more domain-specific models, enhancing accessibility and maintainability. Additionally, addressing limitations such as supporting structured function-calling, tool use, and multimodal inputs might broaden the applicability of such models across diverse contexts.

In conclusion, AM-Thinking-v1 marks a pivotal step towards harnessing mid-scale LLMs to push the boundaries of reasoning performance while balancing efficiency and deployability. As the community considers future directions, this paper provides a valuable reference point on leveraging careful post-training design to maximize model capabilities at moderate scales.
