openPangu-Embedded Models on Ascend NPUs
- openPangu-Embedded models are specialized large language models optimized for low-bit, on-device inference with integrated chain-of-thought reasoning modes.
- They employ decoder-only Transformer architectures alongside INT8 and W4A8 quantization techniques to enhance speed, reduce memory footprint, and ensure hardware efficiency on Atlas A2 NPUs.
- Systematic co-design of quantization strategies, model architecture, and inference pipelines preserves over 90% of FP16 accuracy while addressing practical deployment challenges.
openPangu-Embedded models are variants of the openPangu LLM family specifically engineered for efficient, high-fidelity on-device inference, particularly on Huawei’s Ascend NPUs (notably the Atlas A2 accelerator). These models (openPangu-Embedded-1B and openPangu-Embedded-7B) incorporate decoder-only Transformer architectures with built-in support for three Chain-of-Thought (CoT) reasoning paradigms and are optimized for low-bit (INT8 and W4A8) quantized inference utilizing the CATLASS operator library. Through systematic co-design of quantization strategies, model architecture, and inference pipeline, openPangu-Embedded models enable practical deployment of advanced LLMs on resource-constrained hardware, preserving >90% of the floating-point baseline accuracy and achieving marked improvements in speed and memory footprint (Luo et al., 29 Dec 2025).
1. Model Architecture and Chain-of-Thought Paradigms
openPangu-Embedded-1B and openPangu-Embedded-7B are decoder-only Transformer variants differentiated primarily by parameter count and depth: the 1B model comprises approximately 1 billion parameters across 24 layers, while the 7B model comprises approximately 7 billion parameters across 32 layers. Both models utilize standard multi-head attention, feed-forward blocks, GELU activations, and layer normalization; architectural differences are limited to hidden width and depth.
A central feature is the explicit integration of three CoT reasoning paradigms:
- no_think: Direct generation of answers with minimal intermediate steps, resulting in the lowest compute and memory requirements.
- slow_think: Emission of detailed, step-by-step reasoning traces, generating longer output sequences, with up to 2–3× larger activation buffers due to storage of intermediate chains and a corresponding increase in computational cost.
- auto_think: Dynamic, token-wise control in which the model decides at each step whether to emit additional reasoning tokens ("think slowly") or proceed directly to the answer ("think fast"), so resource overhead adapts to prompt complexity.
These CoT modes directly inform compute and memory requirements by modulating intermediate token generation and activation storage during inference.
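The resource implications of the three modes can be made concrete with a small sketch. The function below maps a mode to a maximum reasoning-token budget; the thresholds, budgets, and the prompt-level complexity proxy are illustrative assumptions (the real model applies the auto_think decision token-wise during generation), not part of the openPangu-Embedded API.

```python
def select_reasoning_budget(mode: str, complexity: float) -> int:
    """Return a max reasoning-token budget for a prompt.

    Illustrative only: thresholds and budgets are assumptions, not
    values from the openPangu-Embedded release.
    """
    if mode == "no_think":
        return 0                # answer directly: minimal activations
    if mode == "slow_think":
        return 2048             # full step-by-step reasoning trace
    if mode == "auto_think":
        # token-wise in the real model; here a prompt-level proxy
        return 0 if complexity < 0.5 else 2048
    raise ValueError(f"unknown mode: {mode}")
```

A budget of 0 corresponds to the smallest activation buffers, while the slow_think budget reflects the 2–3× larger buffers noted above.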
2. Low-Bit Quantization and Unified Inference Framework
To address the computational overhead of extended CoT traces on resource-constrained NPUs, openPangu-Embedded models adopt a unified low-bit quantization framework within Huawei’s CATLASS operator library, supporting both INT8 (W8A8: 8-bit weights and activations) and mixed W4A8 (4-bit weights, 8-bit activations) inference. Key optimizations include:
- NPU-friendly tensor block layouts (e.g., 16×64 blocks) to maximize arithmetic density and datapath utilization.
- On-chip caching, minimizing DRAM traffic by staging quantized tiles in the L1/L2 memory hierarchy.
- Single-pass dequantization and compute: for INT8, scale factors are fused into GEMM operations; for W4A8, weights are unpacked, dequantized to INT8 on-the-fly, and passed through INT8 GEMM kernels without conversion to floating point.
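The single-pass W4A8 path can be sketched in NumPy: weights are unpacked from 4-bit two's-complement pairs into INT8, multiplied in an INT8/INT32 GEMM, and rescaled once at the end, so no floating-point intermediate tensor is ever materialized. The packing layout (two values per byte, low nibble first) is an assumption for illustration; the CATLASS kernels operate on their own blocked layouts.

```python
import numpy as np

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Unpack two signed 4-bit values per byte into int8.

    Sketch of on-the-fly W4A8 decode; the low-nibble-first layout
    is an assumption, not the CATLASS tile format.
    """
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    # sign-extend 4-bit two's complement into int8
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    return np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)

def w4a8_gemm(a_int8, a_scale, w_packed, w_scale):
    """INT8 GEMM on decoded 4-bit weights; scale factors are fused
    into a single rescale after integer accumulation."""
    w_int8 = unpack_int4(w_packed)                       # int4 -> int8
    acc = a_int8.astype(np.int32) @ w_int8.astype(np.int32).T
    return acc * (a_scale * w_scale)                     # one fused rescale
```

The key property mirrored here is that the only floating-point work is the final epilogue multiply, matching the "dequantized to INT8 on-the-fly" description above.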
All quantization is symmetric, using

$$Q(x) = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right),\; -2^{b-1},\; 2^{b-1}-1\right),$$

with scale $s = \max(|x|) / (2^{b-1}-1)$ for bit-width $b$. For W4A8, two calibration-aware preprocessing steps are deployed:
- SmoothQuant (with blending exponent $\alpha$), which interpolates per-channel scaling between activations and weights.
- Hadamard rotation, applying orthogonal integer transforms to flatten outliers and stabilize quantization.
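Both the symmetric quantizer and the SmoothQuant migration can be sketched compactly. The code below is a minimal NumPy rendering of the standard formulations, not the CATLASS implementation: `sym_quant` derives the scale from the max absolute value, and `smooth` moves quantization difficulty from activations to weights with per-channel scales while leaving the matrix product mathematically unchanged.

```python
import numpy as np

def sym_quant(x, bits=8):
    """Symmetric quantization: scale from the max absolute value."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax
    q = np.clip(np.round(x / s), -qmax - 1, qmax).astype(np.int32)
    return q, s

def smooth(x, w, alpha=0.5):
    """SmoothQuant-style migration (alpha is the blending exponent).

    Per input channel j: s_j = max|x_j|^alpha / max|w_j|^(1-alpha).
    The product is preserved exactly: (x/s) @ (w*s).T == x @ w.T.
    """
    s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=0) ** (1 - alpha)
    return x / s, w * s
```

With `alpha = 0` all scaling stays on the weights and with `alpha = 1` on the activations; intermediate values interpolate between the two, which is the blending behavior described above.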
3. Performance, Accuracy, and Trade-Offs
Systematic benchmarking on the HumanEval and MBPP code generation benchmarks demonstrates that INT8 quantization preserves over 90% of FP16 baseline accuracy. For openPangu-Embedded-7B, INT8 quantization yields no drop in HumanEval or MBPP accuracy in no_think mode (85.37/77.04 → 85.37/78.21), and in some cases the auto_think/slow_think modes show marginal gains (+2.44 HumanEval, +3.11 MBPP for auto_think). openPangu-Embedded-1B experiences modest degradation (e.g., −4.27 on HumanEval in no_think mode) but also shows cases of improvement under auto_think.
W4A8 quantization results in greater accuracy loss (e.g., −4.27 HumanEval and −7.00 MBPP for 7B/no_think), but SmoothQuant and Hadamard rotation recover a significant fraction of the MBPP loss (≈5 points).
Quantization reduces memory footprint and increases speed. For the 7B model, INT8 yields 1.16–1.60× prefill speedup and 13.9–37.3% memory reduction depending on batch size, with the greatest gains at smaller batches. W4A8 variants yield even larger memory savings at the cost of a moderate accuracy trade-off.
| Model | Mode | FP16 Acc. (HE/MBPP) | INT8 Acc. (HE/MBPP) | W4A8 Acc. (HE/MBPP) |
|---|---|---|---|---|
| 7B | no_think | 85.37 / 77.04 | 85.37 / 78.21 | 81.10 / 70.04 |
| 7B | auto_think | 92.68 / 80.16 | 95.12 / 83.27 | – |
| 7B | slow_think | 95.12 / 77.43 | 95.73 / 79.77 | – |
| 1B | no_think | 70.73 / 58.75 | 66.46 / 58.37 | – |
| 1B | auto_think | 67.07 / 60.70 | 70.12 / 65.37 | – |
| 1B | slow_think | 65.24 / 61.87 | 66.46 / 62.26 | – |
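The ">90% of FP16 accuracy" claim can be checked directly against the worst case in the table, the 7B/no_think W4A8 column:

```python
# Accuracy retention for the 7B no_think row of the table above
# (W4A8 relative to FP16); values taken verbatim from the table.
fp16 = {"HumanEval": 85.37, "MBPP": 77.04}
w4a8 = {"HumanEval": 81.10, "MBPP": 70.04}
retention = {k: w4a8[k] / fp16[k] for k in fp16}
# HumanEval ≈ 0.950, MBPP ≈ 0.909 — both above the 90% threshold
```

Even this most aggressive configuration stays above 90% retention, while every INT8 entry matches or exceeds its FP16 baseline except the 1B no_think row.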
4. Atlas A2 Hardware Optimization and Deployment
Efficient deployment on the Ascend Atlas A2 NPU leverages several hardware-aware design considerations:
- Integer ALUs and DMA engines are utilized via tightly blocked GEMM tiles; generic GPU-style quantization libraries are suboptimal due to low utilization and requisite data format conversions.
- CATLASS templates have been extended to decode W4A8 in hardware, eliminating host-side overheads.
- Model export now includes per-tensor (bit-width, scale) annotations that are parsed by the inference runtime loader to configure low-bit caches and dynamically dispatch GEMM variants.
- Activation quantization scales are computed per token at inference and fused into self-attention and FFN kernels.
Pipeline modifications enable the models to run fully in low-precision integer arithmetic, minimizing FP16 conversion overhead and maximizing NPU throughput.
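The per-token activation quantization mentioned above can be sketched as follows: each token (row) gets its own scale from its max absolute value, computed on the fly, so the scale can later be fused into the GEMM epilogue of the attention and FFN kernels. This is a NumPy illustration of the standard dynamic per-token scheme, not the fused Ascend kernel itself.

```python
import numpy as np

def per_token_quant(acts: np.ndarray):
    """Per-token dynamic activation quantization: one symmetric scale
    per token (row), computed at inference time."""
    qmax = 127
    scales = np.abs(acts).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(acts / scales), -128, 127).astype(np.int8)
    return q, scales
```

Keeping one scale per token (rather than per tensor) limits the damage from outlier tokens while still allowing the scale to be applied as a cheap row-wise multiply after integer accumulation.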
5. Practical Deployment Challenges and Calibration
W4A8 quantization necessitates post-training scale calibration on real data; omitting this step results in accuracy losses exceeding 10–15 points. Because slow_think mode produces extended reasoning traces, the resulting memory pressure may exceed the NPU's on-device memory (48 GB on the Atlas A2); careful scheduling and buffer management are required to keep inference within hardware constraints.
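A minimal sketch of such post-training calibration, assuming a max-abs statistic accumulated over real calibration batches (the paper's exact estimator may differ, e.g. it could use percentile clipping):

```python
import numpy as np

def calibrate_scales(calib_batches, qmax=127):
    """Post-training calibration sketch: track the running max-abs of
    each channel over calibration data, then derive symmetric scales.

    Max-abs tracking is an assumption for illustration; production
    calibrators often clip outliers instead.
    """
    running_max = None
    for batch in calib_batches:
        m = np.abs(batch).max(axis=0)           # per-channel max-abs
        running_max = m if running_max is None else np.maximum(running_max, m)
    return running_max / qmax                   # one scale per channel
```

The calibrated scales are then baked into the exported per-tensor (bit-width, scale) annotations described in Section 4, which is why skipping this pass degrades accuracy so sharply.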
A plausible implication is that further scaling of “slow_think” sequence lengths or more aggressive quantization would require both hardware and algorithmic adaptations in calibration, memory layout, and operator design.
6. Implications and Summary of Findings
By co-designing low-bit quantization, efficient integer GEMM kernels, and memory-efficient layouts within the CATLASS library, the openPangu-Embedded-1B/7B models can perform end-to-end low-precision inference on the Atlas A2 while retaining all three CoT reasoning modes. This approach maintains more than 90% of FP16 model accuracy (with selected cases showing improved task accuracy), achieves a 1.16–1.60× speedup in prefill throughput, and attains up to roughly 37% reduction in memory consumption at small batch sizes. The retention of robust no_think, auto_think, and slow_think capabilities under quantization establishes openPangu-Embedded as a practical foundation for advanced, efficient on-device LLM deployment (Luo et al., 29 Dec 2025).