
SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

Published 28 Jul 2025 in cs.LG and cs.AI | arXiv:2507.20984v2

Abstract: While frontier LLMs continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.

Summary

  • The paper introduces a new family of LLMs designed from scratch for local deployment by leveraging dual-level structural sparsity and mixture-of-experts architectures.
  • The models utilize innovations such as pre-attention routing, expert offloading, and ReGLU-based sparsity to optimize computational resources and achieve competitive MMLU scores.
  • The study demonstrates that on-device large language models can attain state-of-the-art performance, paving the way for broader AI accessibility in resource-constrained settings.

SmallThinker: A Family of Efficient LLMs Natively Trained for Local Deployment

Introduction

The paper introduces SmallThinker, a series of LLMs specifically designed for deployment on local devices with limited computational power and storage capabilities. Unlike traditional models that are compressed for local use, SmallThinker is architected from scratch with a deployment-aware design, transforming resource constraints into fundamental design principles. The models achieve state-of-the-art (SOTA) performance, demonstrating the feasibility of local deployment without reliance on GPU-accelerated cloud infrastructure.

Figure 1: A comparison of inference performance and MMLU scores. SmallThinker achieves SOTA performance, outperforming comparable models in both speed and accuracy.

Model Architecture

The SmallThinker models integrate two-level structural sparsity combining Mixture-of-Experts (MoE) architecture with sparse feed-forward networks. This design drastically reduces computational requirements while maintaining model capacity.

  • Fine-Grained Mixture of Experts: Employing fine-grained MoE architecture enhances parameter efficiency, with configurations of 32 and 64 experts for 4B and 21B models, respectively.
  • Sparse ReGLU-Based FFN: Utilizes ReGLU activation functions for inducing neuron-level sparsity within experts, lowering computational and I/O demands.
  • Pre-Attention Router: Positioned before the attention block to prefetch necessary expert parameters, it effectively hides storage latency during on-device inference.
  • DP-Groups Global Load Balance Loss: Facilitates expert specialization, aiding efficient expert-caching strategies while preserving training stability.

    Figure 2: SmallThinker model architecture. Residual connections are omitted for clarity.
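The interplay of the pre-attention router, top-k expert selection, and ReGLU-induced neuron sparsity can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: all dimensions, the top-k value, and the softmax gating are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def reglu_ffn(x, w_gate, w_up, w_down):
    """ReGLU expert FFN: relu(x @ w_gate) zeroes many neurons, so only
    the surviving rows of w_down are actually needed at inference time."""
    h = np.maximum(x @ w_gate, 0.0) * (x @ w_up)
    return h @ w_down

def pre_attention_moe(x_pre_attn, x_post_attn, w_router, experts, top_k=2):
    """Two-level sparse block (sketch): the router scores experts from the
    *pre-attention* hidden state, so their weights can be prefetched from
    storage while attention runs; only the top-k experts then compute."""
    logits = x_pre_attn @ w_router                 # (n_experts,)
    top = np.argsort(logits)[-top_k:]              # chosen experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                           # renormalized gate weights
    out = np.zeros_like(x_post_attn)
    for g, e in zip(gates, top):
        out += g * reglu_ffn(x_post_attn, *experts[e])
    return out

# Toy dimensions, illustrative only (not the paper's actual sizes).
d_model, d_ff, n_experts = 16, 32, 4
experts = [(rng.standard_normal((d_model, d_ff)),
            rng.standard_normal((d_model, d_ff)),
            rng.standard_normal((d_ff, d_model))) for _ in range(n_experts)]
w_router = rng.standard_normal((d_model, n_experts))
x = rng.standard_normal(d_model)
y = pre_attention_moe(x, x, w_router, experts)
```

The key design point is that routing depends only on the pre-attention hidden state, which opens a prefetch window: the inference engine can begin loading the selected experts' weights from storage before the attention computation finishes.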

Pre-Training

SmallThinker models are pre-trained using a three-stage curriculum on a comprehensive, high-quality dataset mix, encompassing diverse domains such as General Knowledge, Mathematics, and Code. The training involves progressive adjustment of data composition and integration of both open-source datasets and synthetically generated data.

Pretraining Details

  • Data Collection: Sources include FineWeb-Edu, MegaMath, and StackV2 for comprehensive domain knowledge inclusion.
  • Synthetic Augmentation: Use of MGA-style and persona-driven methodologies to enrich Mathematics and Code domains.
  • Training Setup: SmallThinker-4B-A0.6B is trained over a 2.5-trillion-token horizon, while SmallThinker-21B-A3B uses 7.2 trillion tokens, with structured sequence-length adjustments and dynamic learning-rate decay.
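The staged curriculum above can be represented as a simple schedule that shifts the data mixture across stages. The proportions and per-stage token counts below are hypothetical placeholders for illustration (only the 2.5T total matches the 4B model's stated horizon); the paper's actual mixture is not reproduced here.

```python
import random

# Hypothetical three-stage curriculum: the mixture shifts toward
# Mathematics and Code in later stages. All proportions are illustrative.
stages = [
    {"name": "general",   "tokens_T": 1.5, "mix": {"web": 0.80, "math": 0.10, "code": 0.10}},
    {"name": "rebalance", "tokens_T": 0.7, "mix": {"web": 0.50, "math": 0.25, "code": 0.25}},
    {"name": "anneal",    "tokens_T": 0.3, "mix": {"web": 0.30, "math": 0.35, "code": 0.35}},
]

def sample_domain(stage, rng):
    """Pick the source domain for the next training document in a stage."""
    domains, probs = zip(*stage["mix"].items())
    return rng.choices(domains, weights=probs, k=1)[0]

domain = sample_domain(stages[0], random.Random(0))
```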

Model Evaluation

Evaluation on multiple benchmarks indicates SmallThinker models achieve comparable or superior results relative to larger baselines, showcasing excellent parameter utilization.

Figure 3: Learning curve of SmallThinker-21B-A3B-Base on the MMLU benchmark, showing 5-shot accuracy vs. training tokens (in billions).

Performance Metrics

Comparison against models such as Gemma3, Qwen3, and Phi4 demonstrates competitive scores across diverse tasks including MMLU, GPQA-Diamond, and HumanEval. Notably, the SmallThinker models deliver strong results relative to their activated parameter counts and adapt well across task types.

Figure 4: Expert activation frequency heatmaps of SmallThinker-21B-A3B.

Inference Framework for Local Devices

The co-designed inference framework emphasizes sparse computation and memory-efficient execution to achieve optimal performance on resource-constrained devices.

Memory-Efficient Inference

  • Expert Offloading: Implements a parameter-offloading mechanism that uses SSD storage, guided by expert activation locality.
  • Prefetching Techniques: Employs a pipelined design that interleaves expert I/O with computation, hiding storage latency.
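The overlap of expert I/O with computation can be sketched with a thread pool: because the pre-attention router selects experts before attention runs, the weight reads can proceed concurrently with the attention computation. The functions below are stand-ins with simulated latencies, not the engine's actual API.

```python
import concurrent.futures as cf
import time

def fetch_expert(expert_id):
    """Stand-in for reading one expert's weights from SSD."""
    time.sleep(0.01)                      # simulated storage latency
    return f"weights[{expert_id}]"

def attention(x):
    """Stand-in for one layer's attention computation."""
    time.sleep(0.01)
    return x

def layer_step(x, routed_experts, pool):
    """Pre-attention routing already chose the experts, so their weight
    reads are issued first and overlap with the attention computation."""
    futures = [pool.submit(fetch_expert, e) for e in routed_experts]
    x = attention(x)                      # compute while I/O is in flight
    weights = [f.result() for f in futures]
    return x, weights

with cf.ThreadPoolExecutor(max_workers=4) as pool:
    hidden, weights = layer_step("hidden", [3, 7], pool)
```

When the simulated I/O and compute latencies are similar, the storage reads are effectively free; this is the latency-hiding effect the co-designed engine relies on.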

Sparse Inference

  • ReGLU Sparsity Optimization: Computes only the neurons left active by the ReGLU gate, paired with SIMD vectorization.
  • LM Head Sparsity: A predictor module selectively computes logits, minimizing computation at the output layer.
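The ReGLU sparsity optimization can be sketched as follows: since relu zeroes the gate for many neurons, the up- and down-projections only need to touch the rows belonging to active neurons. This is a minimal NumPy sketch of the idea (the real engine uses quantized SIMD kernels, not dense NumPy matmuls).

```python
import numpy as np

def sparse_reglu_ffn(x, w_gate, w_up, w_down):
    """Inference-time ReGLU sketch: evaluate the gate first, then compute
    only the columns of w_up and rows of w_down whose neurons survived the
    ReLU, skipping most of the FFN's memory traffic and FLOPs."""
    gate = np.maximum(x @ w_gate, 0.0)
    active = np.nonzero(gate)[0]              # neuron-level sparsity
    h = gate[active] * (x @ w_up[:, active])
    return h @ w_down[active, :]

# Demo: the sparse path matches the dense ReGLU FFN exactly.
rng = np.random.default_rng(1)
d_model, d_ff = 8, 32
w_gate = rng.standard_normal((d_model, d_ff))
w_up   = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
x = rng.standard_normal(d_model)
y_sparse = sparse_reglu_ffn(x, w_gate, w_up, w_down)
y_dense = (np.maximum(x @ w_gate, 0.0) * (x @ w_up)) @ w_down
```

With roughly half the neurons inactive in expectation, this halves the up- and down-projection work; the engine additionally exploits this to reduce I/O, since inactive rows never need to be read from storage.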

Limitations and Future Work

While effective, the scale of the pretraining data constrains SmallThinker's broader applicability. Future efforts include expanding the dataset and applying RLHF to improve model alignment and response quality.


Figure 5: Expert activation frequency heatmaps of SmallThinker-4B-A0.6B.

Figure 6: The neuron-level sparsity across layers for SmallThinker model family.

Conclusion

SmallThinker showcases a novel approach to AI deployment, providing capable models for local devices that offer significant computational and memory efficiency gains, setting a new pathway for democratizing AI across consumer devices globally. Further improvements will aim to expand its knowledge base and refine its alignment for robust, real-world applications.
