Ascend-CoT: Optimized CoT Reasoning on NPUs

Updated 19 January 2026
  • The paper demonstrates that low-bit post-training quantization on Huawei Ascend NPUs retains over 90% accuracy while reducing memory usage by 20–40% and accelerating inference by up to 1.5×.
  • Ascend-CoT introduces adaptable Chain-of-Thought modes (no_think, auto_think, slow_think) that balance reasoning detail and computational resources for diverse task complexities.
  • The study details integer-based quantization methods and NPU-tailored optimizations, such as kernel fusion and on-chip dequantization, to enable efficient on-device multi-step reasoning.

Ascend-CoT refers to a recipe and workflow for optimizing Chain-of-Thought (CoT) reasoning in LLMs—specifically, Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B—through low-bit post-training quantization and tailored inference on Ascend NPUs (notably the Atlas A2). This approach enables efficient deployment of advanced CoT prompting paradigms that would otherwise incur prohibitive memory and latency costs, allowing practical on-device reasoning without retraining or significant loss in accuracy (Luo et al., 29 Dec 2025).

1. Chain-of-Thought Reasoning Paradigms on OpenPangu-Embedded

The openPangu-Embedded models implement three prompt-controlled CoT paradigms:

  • no_think: Emits only the final answer, generating minimal intermediate trace. This mode minimizes both memory and latency but can degrade performance on reasoning tasks with high complexity.
  • slow_think: Produces explicit, step-wise intermediate reasoning. It delivers the most detailed trace, maximizing problem-solving transparency and effectiveness for complex multi-step tasks, at the expense of significant activation memory and increased inference latency.
  • auto_think: Dynamically modulates between no_think and slow_think based on input complexity. Average trace lengths and computational cost fall between the extremes, aiming to optimize the quality-resource tradeoff.

The qualitative impact on resource utilization is monotonic: activation memory and end-to-end latency scale as slow_think ≫ auto_think ≫ no_think. This establishes the central challenge: detailed CoT reasoning, while essential for solving harder tasks, severely taxes NPU memory and compute due to the accumulation of intermediate activations.
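The three modes above can be thought of as a prompt-level switch. The following is an illustrative sketch, not the openPangu API: the prompt templates and the word-count complexity heuristic are assumptions introduced here for clarity; only the mode names come from the paper.

```python
def build_prompt(question: str, mode: str = "auto_think") -> str:
    """Illustrative prompt construction for the three CoT modes.

    Mode names follow the paper; templates and the complexity
    heuristic are assumptions for illustration only.
    """
    assert mode in {"no_think", "slow_think", "auto_think"}
    if mode == "auto_think":
        # Toy complexity heuristic: longer questions get the full CoT trace.
        mode = "slow_think" if len(question.split()) > 20 else "no_think"
    if mode == "no_think":
        return f"{question}\nAnswer directly without showing reasoning."
    return f"{question}\nLet's think step by step."
```

In a real deployment the auto_think decision would be made by the model or a learned router rather than a word count; the sketch only shows where the mode switch sits relative to the prompt.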

2. Low-Bit Post-Training Quantization: Methodology and Algorithms

To address inference bottlenecks, Ascend-CoT employs integer-based post-training quantization. The framework utilizes symmetric, layer-wise quantization of both weights and activations:

  • Weight Quantization (per-tensor):
    • For floating-point weights W_f, compute the scale s_w = 127 / max|W_f|.
    • Quantized integer weights W_q = round(W_f · s_w), with W_q ∈ [−127, 127].
  • Activation Quantization (per-layer):
    • For activations A_f at runtime, s_a = 127 / max|A_f|, calibrated on representative data.
    • Quantized activations A_q = clip(round(A_f · s_a), −127, 127).
  • Dequantization (mixed precision):
    • For output tensors: W_f ≈ W_q / s_w, A_f ≈ A_q / s_a.

These procedures support both INT8 (W8A8) and W4A8 quantization formats. W4A8 improves memory efficiency further but can incur a moderate accuracy penalty without advanced calibration.
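The symmetric quantize/dequantize procedure above can be sketched in a few lines of numpy. This is a minimal illustration of the formulas in this section, not the paper's implementation; the bit-width parameter generalizes the same scheme to the W4A8 weight path (qmax = 7 for INT4).

```python
import numpy as np

def quantize_sym(x: np.ndarray, n_bits: int = 8):
    """Symmetric quantization as in Section 2: s = qmax / max|x|,
    x_q = clip(round(x * s), -qmax, qmax). Returns (x_q, s)."""
    qmax = 2 ** (n_bits - 1) - 1          # 127 for INT8, 7 for INT4
    s = qmax / np.abs(x).max()
    x_q = np.clip(np.round(x * s), -qmax, qmax).astype(np.int32)
    return x_q, s

def dequantize(x_q: np.ndarray, s: float) -> np.ndarray:
    """Approximate reconstruction: x ≈ x_q / s."""
    return x_q / s
```

The same routine serves both weights (scale computed offline per tensor) and activations (scale calibrated per layer on representative data).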

3. Unified Low-Bit Inference Framework on Atlas A2 NPU

Ascend-CoT leverages the CATLASS operator library native to the Atlas A2, supporting direct INT8 and INT4 computation:

  • Weight layout transformation: Quantized weights are stored in a blocked format optimized for CATLASS.
  • On-chip dequantization: Integer matrices are scaled to FP16 where needed (e.g., for non-quantized downstream operations).
  • Integer GEMM: Matrix multiplications in Transformer layers proceed natively in INT8 or INT4×INT8.
  • NPU-specific kernel fusion: Scaling/unscaling is fused with softmax and layer normalization; per-token activation scaling is employed to constrain dynamic range and avert overflow.
  • Double-buffered SRAM: Quantized GEMMs are overlapped with activation quantization to maximize throughput.

The overall inference workflow for a token or block consists of loading W_q, quantizing input activations A_f → A_q, computing the integer GEMM A_q × W_q → O_int, dequantizing to O_fp16 where needed, and continuing through the attention/FFN stacks.
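The per-layer workflow above can be sketched as a pure-numpy stand-in for the integer-GEMM path (this is a conceptual model of the dataflow, not the CATLASS API; kernel fusion and double buffering are omitted):

```python
import numpy as np

def int8_linear(a_f: np.ndarray, w_q: np.ndarray, s_w: float) -> np.ndarray:
    """W8A8 linear layer sketch mirroring the workflow in Section 3."""
    # 1) Quantize input activations at runtime (per-layer symmetric scale).
    s_a = 127.0 / np.abs(a_f).max()
    a_q = np.clip(np.round(a_f * s_a), -127, 127).astype(np.int32)
    # 2) Integer GEMM with wide (INT32) accumulation.
    o_int = a_q @ w_q
    # 3) Dequantize the output back to floating point: O ≈ O_int / (s_a * s_w).
    return o_int / (s_a * s_w)

# Offline weight quantization, as in Section 2:
w_f = np.array([[0.5, 1.0], [-1.0, 0.25]])
s_w = 127.0 / np.abs(w_f).max()
w_q = np.round(w_f * s_w).astype(np.int32)
```

On the Atlas A2, step 3 corresponds to on-chip dequantization fused into the downstream kernel rather than a separate pass.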

4. Empirical Results: Accuracy, Speed, and Memory Trade-offs

Comprehensive evaluations on the HumanEval and MBPP code-generation benchmarks for the 7B openPangu-Embedded model (with similar trends for 1B) demonstrate:

Accuracy Retention (INT8):

CoT Mode     FP16 (%)   INT8 (%)
no_think     85.37      85.37
auto_think   95.12      97.56
slow_think   95.12      95.73

INT8 achieves ≥90% of FP16 baseline accuracy across all modes. In some cases (auto_think), INT8 marginally outperforms FP16.

W4A8 Trade-offs (slow_think):

Variant            Accuracy (%)
FP16               95.12
W4A8-baseline      79.88
W4A8+SmoothQuant   93.29
W4A8+Hadamard      94.51

Baseline W4A8 incurs an accuracy drop of roughly 15 percentage points, but enhanced calibration methods (SmoothQuant, Hadamard rotation) recover most of the loss.

Latency and Memory (batch size 32):

Precision   Latency (ms)    Memory (GB)
FP16        45.31           30.13
INT8        30.21 (1.5×)    23.83 (−21%)

INT8 yields a 1.5× inference speedup and 21–37% memory reduction (greater savings at small batch sizes).

5. Mitigating CoT Memory and Latency Overhead

The primary value proposition of Ascend-CoT lies in extending the feasibility of slow_think and auto_think modes within stringent on-device hardware budgets:

  • Memory: slow_think can triple activation storage relative to no_think; W8A8 quantization roughly halves it, permitting longer and richer reasoning traces.
  • Latency: Each additional reasoning step incurs a full Transformer forward pass. With INT8 GEMM, per-step compute time is reduced by 30–40% relative to FP16, making long CoT chains tractable.
  • Dynamic benefit: The auto_think mode gains the most, since its dynamic fallback to slow_think no longer exacts a severe resource penalty.

For deployment scenarios with batch size 1, INT8 still achieves ~1.2× throughput and 30–35% memory savings, enabling practical, interactive CoT inference on-device.
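For rough capacity planning, the per-step latencies reported in Section 4 imply a near-linear relationship between trace length and end-to-end latency. A back-of-envelope sketch, illustrative only (real decode cost also grows with KV-cache length, and the 200-token trace length is an assumed example):

```python
def trace_latency_ms(n_tokens: int, per_token_ms: float) -> float:
    """Rough end-to-end latency of a reasoning trace, assuming a constant
    per-token decode cost (ignores KV-cache growth and prefill)."""
    return n_tokens * per_token_ms

# Batch-32 per-step figures from Section 4, applied to a 200-token trace:
fp16_ms = trace_latency_ms(200, 45.31)   # FP16 baseline
int8_ms = trace_latency_ms(200, 30.21)   # INT8 quantized
speedup = fp16_ms / int8_ms              # ≈ 1.5×
```

Because the speedup multiplies every generated token, the absolute savings grow with trace length, which is why slow_think benefits most.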

6. Limitations and Prospective Directions

  • Accuracy-Complexity Tradeoff: W4A8 baseline presents noticeable accuracy reductions; mitigation via advanced calibration increases implementation complexity.
  • Scale Calibration: Quantization scale factors are currently calibrated offline; integrating online, per-sequence calibration could stabilize outputs during shifting runtime distributions.
  • Layer Precision Heterogeneity: The method does not yet employ mixed-precision assignment (e.g., INT8 for the majority of layers, FP16 for layers sensitive to quantization noise); automated mixed-precision search tailored to CoT workloads is an open area.
  • Hardware Co-design: Developing INT4×INT4 kernel support may lower memory/compute even further, particularly relevant for emerging multi-modal or ultra-lightweight edge deployments.

7. Summary and Significance

By encapsulating a suite of prompt-level CoT controls, post-training low-bit quantization (INT8/W4A8), and NPU-oriented kernel optimizations, Ascend-CoT facilitates high-fidelity, resource-constrained, multi-step reasoning on Ascend NPUs. The approach maintains over 90% of the baseline accuracy across all CoT paradigms, delivers up to 1.5× speedup, and reduces memory consumption by 20–40%. This positions Ascend-CoT as a practical and robust solution for deploying advanced LLM reasoning capabilities in on-device and memory-limited environments (Luo et al., 29 Dec 2025).
