Ascend-CoT: Optimized CoT Reasoning on NPUs
- The paper demonstrates that low-bit post-training quantization on Huawei Ascend NPUs retains over 90% accuracy while reducing memory usage by 20–40% and accelerating inference by up to 1.5×.
- Ascend-CoT introduces adaptable Chain-of-Thought modes (no_think, auto_think, slow_think) that balance reasoning detail and computational resources for diverse task complexities.
- The study details integer-based quantization methods and NPU-tailored optimizations, such as kernel fusion and on-chip dequantization, to enable efficient on-device multi-step reasoning.
Ascend-CoT refers to a recipe and workflow for optimizing Chain-of-Thought (CoT) reasoning in LLMs—specifically, Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B—through low-bit post-training quantization and tailored inference on Huawei’s Ascend NPUs (notably Atlas A2). This approach enables efficient deployment of advanced CoT prompting paradigms that would otherwise incur prohibitive memory and latency costs, allowing practical on-device reasoning without retraining or significant loss in accuracy (Luo et al., 29 Dec 2025).
1. Chain-of-Thought Reasoning Paradigms on OpenPangu-Embedded
The openPangu-Embedded models implement three prompt-controlled CoT paradigms:
- no_think: Emits only the final answer, generating minimal intermediate trace. This mode minimizes both memory and latency but can degrade performance on reasoning tasks with high complexity.
- slow_think: Produces explicit, step-wise intermediate reasoning. It delivers the most detailed trace, maximizing problem-solving transparency and effectiveness for complex multi-step tasks, at the expense of significant activation memory and increased inference latency.
- auto_think: Dynamically modulates between no_think and slow_think based on input complexity. Average trace lengths and computational cost fall between the extremes, aiming to optimize the quality-resource tradeoff.
The qualitative impact on resource utilization is monotonic: activation memory and end-to-end latency scale as slow_think ≫ auto_think ≫ no_think. This establishes the central challenge: detailed CoT reasoning, while essential for solving harder tasks, severely taxes NPU memory and compute due to the accumulation of intermediate activations.
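The mode routing above can be sketched as a toy dispatcher. The complexity heuristic and keyword list below are illustrative assumptions, not the model's actual control logic (which is driven by prompt-level mode tags):

```python
# Toy sketch of auto_think-style routing: easy queries skip the trace,
# hard ones get full step-wise reasoning. Heuristic is hypothetical.
def select_cot_mode(prompt: str, hard_keywords=("prove", "step", "algorithm")) -> str:
    """Return one of the three openPangu CoT modes for a given prompt."""
    # Crude complexity score: prompt length plus reasoning-keyword hits.
    score = len(prompt.split()) / 50 + sum(k in prompt.lower() for k in hard_keywords)
    if score < 0.5:
        return "no_think"
    return "slow_think" if score >= 1.0 else "auto_think"
```

In a real deployment the routing decision is made by the model itself in auto_think mode; this sketch only illustrates the cost-aware tradeoff the mode embodies.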
2. Low-Bit Post-Training Quantization: Methodology and Algorithms
To address inference bottlenecks, Ascend-CoT employs integer-based post-training quantization. The framework utilizes symmetric, layer-wise quantization of both weights and activations:
- Weight Quantization (per-tensor):
  - For floating-point weights $W$, compute the scale $s_W = \max|W| / (2^{b-1} - 1)$, where $b$ is the bit width (8 or 4).
  - Quantized integer weights $W_q = \mathrm{clip}(\mathrm{round}(W / s_W),\, -(2^{b-1}-1),\, 2^{b-1}-1)$, with $W \approx s_W W_q$.
- Activation Quantization (per-layer):
  - For activations $X$ at runtime, the scale $s_X$ is calibrated on representative data (e.g., $s_X = \max|X| / (2^{b-1} - 1)$ over a calibration set).
  - Quantized activations $X_q = \mathrm{clip}(\mathrm{round}(X / s_X),\, -(2^{b-1}-1),\, 2^{b-1}-1)$.
- Dequantization (mixed precision):
  - For output tensors: $Y_q = X_q W_q^{\top}$ accumulated in INT32, then $Y = s_W s_X Y_q$ rescaled to FP16.
These procedures support both INT8 (W8A8) and W4A8 quantization formats. W4A8 improves memory efficiency further but can incur a moderate accuracy penalty without advanced calibration.
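A minimal NumPy sketch of the symmetric per-tensor scheme above (function names are illustrative; the paper's pipeline runs on NPU kernels, not NumPy):

```python
import numpy as np

def symmetric_quantize(x: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantization: return integer tensor and scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map integers back to floating point: x ≈ scale * q."""
    return q.astype(np.float32) * scale

# W8A8-style round trip: rounding error is bounded by half a quantization step.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w_q, s_w = symmetric_quantize(w, bits=8)
err = np.max(np.abs(dequantize(w_q, s_w) - w))
```

The same routine with `bits=4` yields the W4A8 weight path; as the results below show, that narrower range is where calibration quality starts to matter.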
3. Unified Low-Bit Inference Framework on Atlas A2 NPU
Ascend-CoT leverages the CATLASS operator library native to the Atlas A2, supporting direct INT8 and INT4 computation:
- Weight layout transformation: Quantized weights are stored in a blocked format optimized for CATLASS.
- On-chip dequantization: Integer matrices are scaled to FP16 where needed (e.g., for non-quantized downstream operations).
- Integer GEMM: Matrix multiplications in Transformer layers proceed natively in INT8 or INT4×INT8.
- NPU-specific kernel fusion: Scaling/unscaling is fused with softmax and layer normalization; per-token activation scaling is employed to constrain dynamic range and avert overflow.
- Double-buffered SRAM: Quantized GEMMs are overlapped with activation quantization to maximize throughput.
The overall inference workflow for a token or block consists of loading the quantized weights $W_q$, quantizing input activations $X \to X_q$, computing the integer GEMM $Y_q = X_q W_q^{\top}$, dequantizing to FP16 via $Y = s_W s_X Y_q$ where needed, and continuing through the attention/FFN stacks.
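This per-layer flow can be simulated end to end in NumPy (a sketch only; the actual CATLASS kernels fuse these steps on-chip):

```python
import numpy as np

def w8a8_linear(x, w_q, s_w, s_x):
    """Simulate one W8A8 linear layer: quantize activations, run the
    GEMM with INT32 accumulation, then dequantize the output to FP16."""
    x_q = np.clip(np.round(x / s_x), -127, 127).astype(np.int8)
    y_int32 = x_q.astype(np.int32) @ w_q.astype(np.int32).T  # integer GEMM
    return (y_int32.astype(np.float32) * (s_w * s_x)).astype(np.float16)

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 128)).astype(np.float32)
x = rng.normal(size=(4, 128)).astype(np.float32)
s_w = np.max(np.abs(w)) / 127
s_x = np.max(np.abs(x)) / 127   # offline-calibrated in the real pipeline
w_q = np.clip(np.round(w / s_w), -127, 127).astype(np.int8)

y_quant = w8a8_linear(x, w_q, s_w, s_x)
y_ref = (x @ w.T).astype(np.float16)  # FP reference for comparison
```

On the NPU, the activation-quantize and dequantize steps are fused with adjacent softmax/LayerNorm kernels rather than materialized as separate passes.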
4. Empirical Results: Accuracy, Speed, and Memory Trade-offs
Comprehensive evaluations on the HumanEval and MBPP code-generation benchmarks for the 7B openPangu-Embedded model (with similar trends for 1B) demonstrate:
Accuracy Retention (INT8):
| CoT Mode | FP16 | INT8 |
|---|---|---|
| no_think | 85.37 | 85.37 |
| auto_think | 95.12 | 97.56 |
| slow_think | 95.12 | 95.73 |
INT8 achieves ≥90% of FP16 baseline accuracy across all modes. In some cases (auto_think), INT8 marginally outperforms FP16.
W4A8 Trade-offs (slow_think):
| Variant | Accuracy |
|---|---|
| FP16 | 95.12 |
| W4A8-baseline | 79.88 |
| W4A8+SmoothQuant | 93.29 |
| W4A8+Hadamard | 94.51 |
Baseline W4A8 imposes ~15pp accuracy drop, but enhanced calibrations (SmoothQuant, Hadamard rotation) recover most of the loss.
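The SmoothQuant-style recovery rests on a simple identity: dividing activations per-channel by a factor $s_j$ and multiplying the matching weight column by $s_j$ leaves the GEMM output unchanged while shifting activation outliers into the easier-to-quantize weights. A sketch under that assumption (the $\alpha$ balance parameter and scale formula follow the standard SmoothQuant recipe, not values from this paper):

```python
import numpy as np

def smooth_scales(x_absmax, w_absmax, alpha=0.5):
    """Per-channel smoothing factors s_j = x_max^alpha / w_max^(1-alpha)."""
    return x_absmax ** alpha / w_absmax ** (1 - alpha)

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 32)).astype(np.float32)
x[:, 0] *= 50.0                                   # synthetic outlier channel
w = rng.normal(size=(16, 32)).astype(np.float32)

s = smooth_scales(np.abs(x).max(axis=0), np.abs(w).max(axis=0))
x_s, w_s = x / s, w * s   # mathematically equivalent re-factorization
```

After smoothing, the activation dynamic range shrinks, so a shared 4-bit weight scale wastes fewer levels on outliers, which is the mechanism behind the recovery from 79.88 to 93.29 above.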
Latency and Memory (batch size 32):
| Precision | Latency (ms) | Memory (GB) |
|---|---|---|
| FP16 | 45.31 | 30.13 |
| INT8 | 30.21 (1.5×) | 23.83 (−21%) |
INT8 yields a 1.5× inference speedup and 21–37% memory reduction (greater savings at small batch sizes).
5. Mitigating CoT Memory and Latency Overhead
The primary value proposition of Ascend-CoT lies in extending the feasibility of slow_think and auto_think modes within stringent on-device hardware budgets:
- Memory: slow_think can triple activation storage; W8A8 quantization roughly halves it, permitting longer and richer reasoning traces.
- Latency: each additional reasoning step incurs a Transformer forward pass. With INT8 GEMM, per-step compute time is reduced by 30–40% relative to FP16, making long CoT chains tractable.
- Dynamic benefit: auto_think gains the most, since its dynamic fallback to slow_think no longer exacts a severe resource penalty.
For deployment scenarios with batch size 1, INT8 still achieves ~1.2× throughput and 30–35% memory savings, enabling practical, interactive CoT inference on-device.
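Because each reasoning step is one forward pass, the end-to-end benefit for a long CoT chain follows directly from the per-step figures in Section 4 (a back-of-envelope model, ignoring prefill and KV-cache effects):

```python
# Per-step latencies from the Section 4 table (batch size 32, ms).
FP16_STEP_MS, INT8_STEP_MS = 45.31, 30.21

def chain_latency_ms(steps: int, step_ms: float) -> float:
    """Total decode latency: one Transformer forward pass per CoT step."""
    return steps * step_ms

# A 200-step slow_think trace inherits the full per-step speedup.
speedup = chain_latency_ms(200, FP16_STEP_MS) / chain_latency_ms(200, INT8_STEP_MS)
```

Since the speedup is multiplicative per step, longer traces save proportionally more wall-clock time, which is why quantization matters most for slow_think.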
6. Limitations and Prospective Directions
- Accuracy-Complexity Tradeoff: W4A8 baseline presents noticeable accuracy reductions; mitigation via advanced calibration increases implementation complexity.
- Scale Calibration: Quantization scale factors are currently calibrated offline; integrating online, per-sequence calibration could stabilize outputs during shifting runtime distributions.
- Layer Precision Heterogeneity: The method does not yet employ mixed-precision assignment (e.g., INT8 for the majority of layers, FP16 for layers sensitive to quantization noise); automated mixed-precision search tailored to CoT workloads is an open area.
- Hardware Co-design: Developing INT4×INT4 kernel support may lower memory/compute even further, particularly relevant for emerging multi-modal or ultra-lightweight edge deployments.
7. Summary and Significance
By encapsulating a suite of prompt-level CoT controls, post-training low-bit quantization (INT8/W4A8), and NPU-oriented kernel optimizations, Ascend-CoT facilitates high-fidelity, resource-constrained, multi-step reasoning on Ascend NPUs. The approach maintains over 90% of the baseline accuracy across all CoT paradigms, delivers up to 1.5× speedup, and reduces memory consumption by 20–40%. This positions Ascend-CoT as a practical and robust solution for deploying advanced LLM reasoning capabilities in on-device and memory-limited environments (Luo et al., 29 Dec 2025).