
DeepSeek-V3 Technical Report

Published 27 Dec 2024 in cs.CL and cs.AI (arXiv:2412.19437v2)

Abstract: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) LLM with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Summary

  • The paper introduces a 671B-parameter MoE LLM featuring 37B active parameters per token with an innovative auxiliary-loss-free load balancing mechanism.
  • It leverages Multi-head Latent Attention (MLA) and Multi-token Prediction (MTP) to reduce memory usage and densify the training signal.
  • The report demonstrates state-of-the-art results on knowledge, math, and code benchmarks using FP8 mixed precision training and robust long-context modeling.

DeepSeek-V3: Architecture, Training, and Evaluation of a 671B-Parameter MoE LLM

Introduction and Motivation

DeepSeek-V3 represents a significant scaling and architectural advance in open-source LLMs, featuring a 671B-parameter Mixture-of-Experts (MoE) design with 37B active parameters per token. The model is trained on 14.8T tokens and incorporates several architectural and systems-level innovations to achieve high performance, cost efficiency, and robust long-context capabilities. The report details the model's architecture, training pipeline, infrastructure optimizations, and comprehensive evaluation, positioning DeepSeek-V3 as a leading open-source alternative competitive with closed-source models (Figure 1).

Figure 1: Benchmark performance of DeepSeek-V3 and its counterparts.

Model Architecture

Core Design: MLA and DeepSeekMoE

DeepSeek-V3 builds upon the Transformer backbone, integrating Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient inference and economical training, respectively. MLA reduces the key-value (KV) cache size during inference by compressing keys and values into low-rank representations, which are then up-projected as needed. This design maintains comparable performance to standard Multi-Head Attention while significantly reducing memory and bandwidth requirements.
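As a rough illustration of the low-rank idea, the sketch below (in PyTorch) caches a single compressed latent per token and up-projects keys and values on demand. All module names and dimensions here are invented for the example rather than taken from the paper, and RoPE handling and the attention computation itself are omitted:

```python
import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    """Minimal sketch of MLA-style KV compression: keys and values are
    stored as one low-rank latent per token and up-projected per head
    when needed. Dimensions are illustrative, not the paper's."""
    def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)          # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand V

    def forward(self, h):
        # Cache only the latent (d_latent floats per token) instead of the
        # full keys and values (2 * n_heads * d_head floats per token).
        latent = self.down(h)
        return self.up_k(latent), self.up_v(latent), latent

m = LowRankKV()
h = torch.randn(2, 16, 1024)           # (batch, seq, d_model)
k, v, latent = m(h)
assert latent.shape == (2, 16, 128)    # cached state: 128 floats per token
```

Here the cache holds 128 floats per token rather than the 1,024 (2 × 8 × 64) a standard KV cache of the same head configuration would need, which is the source of the memory and bandwidth savings described above.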

DeepSeekMoE employs fine-grained expert routing, with each MoE layer comprising 1 shared and 256 routed experts. For each token, 8 experts are activated, and routing is constrained to a maximum of 4 nodes to minimize communication overhead. Notably, DeepSeek-V3 introduces an auxiliary-loss-free load balancing strategy, replacing traditional auxiliary loss with a dynamic bias-based mechanism that adjusts expert selection to maintain balanced loads without degrading model performance (Figure 2).
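The selection step can be sketched as follows (PyTorch; a minimal sketch that omits the node-limited routing constraint, with illustrative names and sizes). The key detail is that the per-expert bias influences which experts are chosen but not the gating weights applied to their outputs:

```python
import torch

def route_tokens(scores, bias, k=8):
    """Sketch of aux-loss-free routing: a per-expert bias is added to
    the affinity scores only for top-k selection; the gating weights
    that scale expert outputs come from the original, unbiased scores.
    scores: (n_tokens, n_experts) affinities in (0, 1); bias: (n_experts,)."""
    topk = torch.topk(scores + bias, k, dim=-1).indices   # biased selection
    gates = torch.gather(scores, -1, topk)                # unbiased weights
    gates = gates / gates.sum(dim=-1, keepdim=True)       # normalize per token
    return topk, gates

scores = torch.rand(4, 256).sigmoid()   # 4 tokens, 256 routed experts
bias = torch.zeros(256)                 # adjusted between steps (see below)
experts, gates = route_tokens(scores, bias)
assert experts.shape == (4, 8)
```

Because the bias never enters the gating weights, balancing pressure changes which experts fire without distorting how much each chosen expert contributes.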

Figure 2: Illustration of the basic architecture of DeepSeek-V3, highlighting MLA and DeepSeekMoE for efficient inference and training.

Multi-Token Prediction (MTP)

DeepSeek-V3 incorporates a Multi-Token Prediction (MTP) objective, extending the training signal by predicting multiple future tokens at each position. Unlike prior approaches that use parallel output heads, DeepSeek-V3's MTP implementation maintains the full causal chain for each prediction depth, using sequential modules that share embeddings and output heads with the main model. This densifies the training signal and enables speculative decoding for inference acceleration (Figure 3).
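A hedged sketch of what such an objective can look like is below (PyTorch). The sequential transformer modules themselves are abstracted away as precomputed logits, and the depth weight is an illustrative hyperparameter, not the paper's exact value:

```python
import torch
import torch.nn.functional as F

def mtp_loss(logits_by_depth, tokens, weight=0.3):
    """Sketch of a multi-token-prediction loss: at depth d, position i
    predicts token i + d + 1, so the targets are the input sequence
    shifted by d + 1. logits_by_depth[d]: (batch, seq, vocab)."""
    total = 0.0
    for d, logits in enumerate(logits_by_depth):
        shift = d + 1
        pred = logits[:, :-shift].reshape(-1, logits.size(-1))
        tgt = tokens[:, shift:].reshape(-1)
        loss = F.cross_entropy(pred, tgt)
        # Full weight on the main next-token head, reduced weight on
        # the extra prediction depths.
        total = total + (1.0 if d == 0 else weight) * loss
    return total

vocab, B, T = 100, 2, 16
tokens = torch.randint(0, vocab, (B, T))
logits = [torch.randn(B, T, vocab) for _ in range(2)]  # main head + 1 MTP depth
loss = mtp_loss(logits, tokens)
assert loss.item() > 0
```

Each position thus receives several supervised signals per training step instead of one, which is what "densifying the training signal" refers to.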

Figure 3: Illustration of the Multi-Token Prediction (MTP) implementation, maintaining the complete causal chain for each token at each depth.

Load Balancing and Expert Specialization

The auxiliary-loss-free load balancing strategy dynamically adjusts per-expert biases based on observed load, ensuring batch-wise balance without imposing strong sequence-wise constraints. This approach enables greater expert specialization, as evidenced by domain-specific load patterns, and consistently yields superior performance compared to auxiliary-loss-based methods (Figure 4).
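The bias update can be sketched as follows (PyTorch; the simple sign-based rule mirrors the general description above, and `gamma`, the update speed, is an illustrative value):

```python
import torch

def update_bias(bias, expert_load, gamma=0.001):
    """Sketch of the bias update for aux-loss-free balancing: after each
    step, lower the bias of overloaded experts and raise it for
    underloaded ones, so routing drifts back toward balance."""
    target = expert_load.mean()
    # sign(): +1 for underloaded experts (raise bias), -1 for overloaded.
    return bias + gamma * torch.sign(target - expert_load)

bias = torch.zeros(4)
load = torch.tensor([10., 30., 20., 20.])   # tokens routed to each expert
bias = update_bias(bias, load)
# Overloaded expert 1 is biased down; underloaded expert 0 is biased up.
assert bias[1] < 0 < bias[0]
```

Because no gradient-based penalty term is involved, this steering mechanism cannot pull the model's weights away from the task loss, which is the intuition behind the performance advantage over auxiliary-loss methods.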

Figure 4: Expert load comparison between auxiliary-loss-free and auxiliary-loss-based models, showing greater specialization in the former.

Training Infrastructure and Systems Optimizations

Distributed Training and DualPipe

DeepSeek-V3 is trained on a cluster of 2,048 NVIDIA H800 GPUs using the custom HAI-LLM framework. The training pipeline employs 16-way pipeline parallelism (PP), 64-way expert parallelism (EP) spanning 8 nodes, and ZeRO-1 data parallelism. The DualPipe algorithm is introduced to maximize computation-communication overlap, reducing pipeline bubbles and hiding all-to-all and PP communication behind computation (Figure 5).

Figure 5: Overlapping strategy for forward and backward chunks, fully hiding all-to-all and PP communication.


Figure 6: Example DualPipe scheduling for 8 PP ranks and 20 micro-batches, illustrating overlapped computation and communication.

FP8 Mixed Precision Training

A fine-grained FP8 mixed precision framework is developed and validated at scale. Most GEMM operations are performed in FP8, with critical components (e.g., embeddings, output heads, normalization) retained in higher precision. Fine-grained quantization (tile-wise for activations, block-wise for weights) and high-precision accumulation (promotion to CUDA cores at 128-element intervals) are employed to mitigate quantization errors and maintain training stability. The framework also compresses cached activations and optimizer states to reduce memory and communication overhead (Figure 7).
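To make the tile-wise idea concrete, here is a minimal emulation in PyTorch. Real FP8 uses hardware E4M3/E5M2 formats with mantissa rounding; the clamp-and-round below only mimics the limited per-tile dynamic range, and all sizes are illustrative:

```python
import torch

def quantize_tiles(x, tile=128):
    """Sketch of fine-grained (tile-wise) quantization: each 1x128 tile
    of activations gets its own scale, so one outlier cannot blow up
    the dynamic range of the whole tensor. 448 is the max magnitude
    representable in FP8 E4M3."""
    n, d = x.shape
    t = x.reshape(n, d // tile, tile)
    scale = (t.abs().amax(dim=-1, keepdim=True) / 448.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(t / scale), -448, 448)  # emulated FP8 values
    return q.reshape(n, d), scale.squeeze(-1)           # values + per-tile scales

x = torch.randn(4, 256)
q, scale = quantize_tiles(x)                            # 2 tiles per row
assert q.shape == x.shape and scale.shape == (4, 2)
```

With one scale per 128-element tile, an activation outlier only degrades precision within its own tile, rather than forcing a coarse scale on the entire tensor as per-tensor quantization would.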

Figure 7: Mixed precision framework with FP8 data format, showing the Linear operator.


Figure 8: (a) Fine-grained quantization to mitigate outlier-induced errors; (b) Improved FP8 GEMM precision via high-precision accumulation.

Loss curves demonstrate that FP8 training achieves a relative error below 0.25% compared to BF16, validating the approach for large-scale LLMs.

Data, Pre-Training, and Long-Context Extension

The pre-training corpus is constructed to enhance mathematical, programming, and multilingual content, with document packing and Fill-in-Middle (FIM) strategies to improve data integrity and modeling capabilities. The tokenizer uses byte-level BPE with a 128K vocabulary, optimized for multilingual compression.

Long-context capability is achieved via YaRN-based context extension, with two post-pretraining phases expanding the context window from 4K to 32K and then to 128K. DeepSeek-V3 maintains robust performance on the "Needle In A Haystack" (NIAH) benchmark across all context lengths (Figure 9).

Figure 9: NIAH evaluation results, demonstrating robust performance up to 128K context length.

Post-Training: Alignment and Distillation

Supervised fine-tuning (SFT) and reinforcement learning (RL) are applied post-pretraining. Reasoning data is distilled from DeepSeek-R1 models, incorporating verification and reflection patterns to enhance reasoning performance while controlling output style and length. RL employs Group Relative Policy Optimization (GRPO), using both rule-based and model-based reward models, and leverages self-rewarding via LLM-based voting for open-ended tasks.
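The group-relative advantage at the heart of GRPO can be sketched as follows (PyTorch; a minimal sketch of the normalization step only, omitting the clipped policy-gradient objective and the KL regularization term):

```python
import torch

def grpo_advantages(rewards):
    """Sketch of GRPO's group-relative advantage: sample a group of
    responses per prompt, then normalize each response's reward by the
    group mean and std. The group baseline replaces a separate value
    (critic) model. rewards: (n_prompts, group_size)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True).clamp_min(1e-8)
    return (rewards - mean) / std

rewards = torch.tensor([[1.0, 0.0, 0.5, 0.5]])  # 4 sampled responses, scored
adv = grpo_advantages(rewards)
# Above-average responses get positive advantage, below-average negative.
assert adv[0, 0] > 0 > adv[0, 1]
```

Responses are thus reinforced or penalized relative to their own sampling group, which is what makes rule-based and model-based reward signals interchangeable in this setup.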

Evaluation and Results

Comprehensive evaluation across knowledge, code, math, reasoning, and multilingual benchmarks demonstrates that DeepSeek-V3 outperforms other open-source models and is competitive with leading closed-source models such as GPT-4o and Claude-3.5-Sonnet. Notably, DeepSeek-V3 achieves:

  • 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA, matching or surpassing closed-source models in educational and factual knowledge.
  • State-of-the-art results on math (e.g., MATH-500, AIME, CNMO) and code (e.g., HumanEval, LiveCodeBench) benchmarks, with substantial margins over prior open-source models.
  • Robust long-context performance and strong expert specialization, enabled by architectural and training innovations.

Implications and Future Directions

DeepSeek-V3 demonstrates that large-scale MoE LLMs can achieve high performance and efficiency through architectural, algorithmic, and systems co-design. The auxiliary-loss-free load balancing and MTP objectives are shown to be effective at scale, and the FP8 training framework sets a precedent for low-precision training in trillion-token regimes. The report also provides actionable hardware design recommendations, including support for fine-grained quantization, higher-precision accumulation, and communication offloading.

The model's open-source release, combined with its strong performance and cost efficiency (2.788M H800 GPU hours for full training), is expected to accelerate research and deployment of large-scale LLMs. Future work will focus on further architectural improvements, data scaling, deep reasoning capabilities, and more comprehensive evaluation methodologies.

Conclusion

DeepSeek-V3 establishes a new standard for open-source LLMs, combining a scalable MoE architecture, efficient training and inference, and robust alignment strategies. The innovations in load balancing, multi-token prediction, and FP8 training are validated at unprecedented scale, yielding a model that is both performant and accessible. The work provides a blueprint for future LLM development, emphasizing the importance of holistic optimization across model, algorithm, and hardware layers.


Explain it Like I'm 14

What this paper is about (in simple terms)

This paper introduces DeepSeek‑V3, a very large language model (LLM). Think of it like a super smart writing and problem‑solving assistant. The team’s goal was to make it both powerful and affordable to train and use. They did this with clever design ideas that let the model work faster, use less memory, and cost less—without losing accuracy.

The main goals and questions

The researchers focused on a few clear goals:

  • Can we build a huge open‑source model that rivals top closed models while costing much less to train?
  • Can we make “mixture‑of‑experts” (a model made of many specialist mini‑models) work smoothly without wasting time or power?
  • Can we train using lower‑precision numbers (FP8) to go faster and use less memory, but still keep accuracy?
  • Can we teach the model to “plan ahead” by predicting several upcoming words at once to boost performance?
  • Can we scale up training to massive data (14.8 trillion tokens) and super long inputs (up to 128,000 tokens) while keeping training stable?

How they did it (methods explained simply)

To reach those goals, they combined several ideas. Here’s what each means in everyday language:

1) Mixture‑of‑Experts (MoE): a team of specialists

Instead of one giant brain doing everything, the model has many “experts,” each good at certain kinds of tasks. For every piece of text (a “token”), the system picks a few experts to handle it, like asking the right specialist for help. This saves time and makes the model smarter.

  • Load balancing: They made sure no expert gets overloaded or ignored. Instead of punishing the model with extra “auxiliary losses” (which can hurt accuracy), they adjust a small “bias” for each expert to gently steer traffic where it’s needed—like a smart traffic light keeping cars moving evenly across lanes.
  • Node‑limited routing: During training, tokens are allowed to travel to only a few machines (nodes). This keeps communication costs low and training fast.
  • No token dropping: Because the traffic is well balanced, they don’t have to throw away tokens when things get busy.

2) Multi‑head Latent Attention (MLA): keeping smaller “notes”

When the model pays attention to earlier words, it has to store “keys” and “values,” which are like notes about the past words. MLA compresses those notes so they’re much smaller—but still useful. That makes the model faster and cheaper to run, especially for long texts.

3) Multi‑Token Prediction (MTP): planning ahead

Usually, models learn to predict just the next word. This model also practices predicting the next several words in sequence. It’s like chess: don’t just think about the next move—think a few moves ahead. This gives the model stronger learning signals and can boost performance. These extra prediction modules can be turned off at runtime (or reused to speed up generation).

4) FP8 mixed‑precision training: lower‑resolution math, same picture

Computers store numbers with a certain number of bits. Using fewer bits (FP8) makes things faster and uses less memory—like lowering video resolution so it streams smoothly—but it can blur details. To keep the “picture” sharp:

  • They use fine‑grained scaling, which treats small groups of numbers separately so outliers don’t mess things up.
  • They do careful, higher‑precision “adding up” steps in the background so calculations stay accurate.
  • Some sensitive parts (like the embedding layer, output head, and attention math) still use higher precision to stay stable.

5) Training at scale with smart scheduling (DualPipe)

Training a model across thousands of GPUs is like running a factory assembly line. DualPipe is their improved schedule that overlaps “thinking,” “talking to other machines,” and “learning from mistakes” so almost nothing sits idle. It hides slow communication by doing useful computing at the same time—like cooking the pasta while the sauce is simmering—so dinner is ready faster.

  • They also wrote custom communication code to use their hardware’s fast connections (NVLink inside a machine and InfiniBand between machines) as efficiently as possible.

6) Data and post‑training

  • They pre‑trained on 14.8 trillion tokens of high‑quality, diverse text.
  • They extended the model’s memory (context length) in two steps: first to 32K tokens, then to 128K.
  • After pre‑training, they did supervised fine‑tuning and reinforcement learning to better match human preferences.
  • They distilled “reasoning” styles (like verify and reflect) from their DeepSeek‑R1 models into DeepSeek‑V3 without making the answers unnecessarily long.

What they found (key results and why they matter)

Here are the main takeaways:

  • Strong performance: The base model is the strongest open‑source base model they tested, especially in coding and math.
  • Competitive with top closed models: The chat version performs similarly to leading closed‑source models like GPT‑4o and Claude 3.5 Sonnet on many benchmarks.
  • Great at knowledge tasks:
    • High scores on academic tests like MMLU and MMLU‑Pro.
    • Very strong factual knowledge, especially in Chinese.
  • Excellent at coding and math:
    • Top results on math benchmarks (even beating certain specialized models on some tests).
    • Best on coding competition benchmarks like LiveCodeBench, and very strong on engineering tasks.
  • Efficient and stable training:
  • Full training took about 2.788 million H800 GPU hours. At $2 per GPU hour, that's roughly 2.788M × $2 ≈ $5.6 million, cheap for a model this size.
    • Training was smooth: no big crashes or resets.
    • FP8 training worked well at huge scale, with accuracy staying within normal randomness.

Why this matters (impact and what’s next)

  • Cheaper, faster AI: The techniques (MLA, MoE load balancing without extra penalties, MTP, FP8 training, DualPipe scheduling) show we can train giant models more efficiently. That opens the door to more researchers and companies building powerful models without massive budgets.
  • Better open‑source options: DeepSeek‑V3 narrows the gap with top closed models, giving the community strong, transparent tools for research and real‑world use.
  • Smarter long‑context understanding: Handling up to 128K tokens means the model can read and reason over long documents, codebases, or transcripts more effectively.
  • Practical reasoning: Distilling “verify and reflect” behaviors from a reasoning model helps V3 think more carefully without always producing very long answers.
  • Hardware‑software co‑design: Their communication tricks and FP8 methods could inspire future GPU designs and training frameworks, making big models even more accessible.

In short, this paper shows how to build a huge, smart, and efficient AI model by combining many clever engineering and training ideas—making powerful AI more affordable and widely available.

