DeepSeek-V2-Lite-Chat: Efficient Chat Model

Updated 9 February 2026
  • DeepSeek-V2-Lite-Chat is a chat-optimized large language model that integrates Mixture-of-Experts and Multi-head Latent Attention to enhance computational efficiency and accuracy.
  • It employs a 27-layer Transformer with 15.7B parameters (2.4B active per token) to achieve up to an 86% reduction in key-value cache memory compared to dense models.
  • Through rigorous supervised fine-tuning and reinforcement learning alignment, the model outperforms similar-scale dense and MoE baselines in reasoning, multilingual, and code tasks.

DeepSeek-V2-Lite-Chat is a chat-optimized LLM based on the DeepSeek-V2 architecture, incorporating Mixture-of-Experts (MoE) sparsity and Multi-head Latent Attention (MLA) mechanisms to achieve high accuracy, efficiency, and economical scaling. It is designed as a resource-efficient alternative to very large dense models, maintaining strong performance in multitask and multilingual benchmarks with a substantially reduced inference footprint and memory requirements. The model represents a derivative of the broader DeepSeek-V2 system, tailored for practical chat and reasoning applications through supervised fine-tuning (SFT) and reinforcement learning (RL) alignment (DeepSeek-AI et al., 2024).

1. Model Architecture and Parameterization

DeepSeek-V2-Lite-Chat employs a 27-layer Transformer backbone with hidden dimension $d = 2048$. The model’s total parameter count is 15.7 B, with only 2.4 B parameters activated per token. This yields an activation ratio $P_{\mathrm{act}} = 2.4\,\mathrm{B} / 15.7\,\mathrm{B} \approx 15.3\%$, leveraging a sparsely activated MoE feedforward design for computational efficiency.

MoE Configuration:

  • Each Transformer layer (except the first) deploys DeepSeekMoE, consisting of $N_s = 2$ shared experts and $N_r = 64$ routed experts.
  • The FFN intermediate dimension is $d_{\mathrm{ff}} = 1408$, and for each token the top $K_r = 6$ routed experts are selected.
  • Expert selection uses a softmax over linear projections of token-wise activations:

$s_{i,t} = \mathrm{Softmax}_i(u_t^{T} e_i), \qquad g_{i,t} = \begin{cases} s_{i,t} & \text{if } s_{i,t} \in \mathrm{TopK}(\{s_{j,t}\},\, K_r) \\ 0 & \text{otherwise} \end{cases}$

  • The per-token FFN update:

$h'_t = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(u_t) + \sum_{i=1}^{N_r} g_{i,t}\, \mathrm{FFN}_i^{(r)}(u_t)$

  • Context length is 4K tokens initially and is extended up to 32K tokens via the YaRN scaling scheme.
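As a concreteness check, the shared-plus-routed update above can be sketched in NumPy; the toy expert functions and the centroid matrix `expert_centroids` (rows $e_i$) are hypothetical stand-ins for trained FFNs:

```python
import numpy as np

def moe_forward(u_t, shared_experts, routed_experts, expert_centroids, top_k=6):
    """Sketch of the DeepSeekMoE per-token update: every shared expert fires,
    plus the top-k routed experts chosen by softmax affinity scores."""
    # Affinity scores s_{i,t} = softmax_i(u_t^T e_i) over routed experts.
    logits = expert_centroids @ u_t                    # shape (N_r,)
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()
    # Gate g_{i,t}: keep the score for the top-k experts, zero elsewhere.
    top_idx = np.argsort(scores)[-top_k:]
    # Residual update: u_t + sum of shared FFNs + gated sum of routed FFNs.
    h_t = u_t + sum(ffn(u_t) for ffn in shared_experts)
    for i in top_idx:
        h_t = h_t + scores[i] * routed_experts[i](u_t)
    return h_t
```

With $N_s = 2$, $N_r = 64$, and `top_k=6` this mirrors the Lite configuration: only 8 of 66 expert FFNs run per token, which is where the low activation ratio comes from.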

MLA and Attention:

  • Standard multi-head attention is replaced by MLA, with $n_h = 16$ heads (per-head dim $d_h = 128$), compressing keys and values into a $d_c = 512$-dimensional latent space.
  • A decoupled rotary position encoding is carried by additional per-head keys of dimension $d_h^R = 64$, enabling robust position representation during context extension.

2. Multi-Head Latent Attention and KV Cache Compression

MLA compresses key-value (KV) memory per layer and per token into compact latent vectors, facilitating substantial reductions in inference memory.

Compression Pipeline:

  • For each token $t$, the input $u_t$ is first down-projected into a joint KV latent:

$c_t^{KV} = W^{DKV} u_t$

where $c_t^{KV} \in \mathbb{R}^{d_c}$, $d_c = 512$.

  • Keys and values are reconstructed from the latent via up-projections:

$k_t^{C} = W^{UK} c_t^{KV}, \qquad v_t^{C} = W^{UV} c_t^{KV}$

with a decoupled RoPE key $k_t^{R} = \mathrm{RoPE}(W^{KR} u_t) \in \mathbb{R}^{d_h^R}$, $d_h^R = 64$.

  • The per-head queries and keys are formed by concatenation, $k_{t,i} = [\,k_{t,i}^{C};\, k_t^{R}\,]$.
  • KV cache memory cost per layer:
    • MLA: $d_c + d_h^R = 576$ elements/token (only the latent and the shared RoPE key are cached)
    • Standard MHA: $2\, n_h d_h = 4096$ elements/token
    • This yields an 86% reduction in KV cache requirements.

A plausible implication is that MLA’s targeted low-rank compression, with decoupled rotary embedding, is integral to scaling to long contexts and high-throughput deployment under constrained memory budgets.
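The 86% figure follows directly from the attention geometry stated above; a quick arithmetic check:

```python
def kv_elems_per_token(n_h=16, d_h=128, d_c=512, d_rope=64):
    """Per-token, per-layer KV cache elements: standard MHA caches full keys
    and values for every head, while MLA caches one compressed latent plus
    one shared decoupled-RoPE key."""
    mha = 2 * n_h * d_h    # keys + values, all heads
    mla = d_c + d_rope     # latent vector + RoPE key
    return mha, mla, 1 - mla / mha

mha, mla, reduction = kv_elems_per_token()
print(mha, mla, round(reduction * 100, 1))  # 4096 576 85.9
```

The default arguments follow the Lite dimensions given in Section 1; the ~86% reduction quoted in the text is exactly $1 - 576/4096$.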

3. Training Regimen and Alignment

Pretraining utilizes the DeepSeek-V2 corpus, an 8.1 T token collection, with the Lite variant trained from scratch on a subset of 5.7 T tokens to avoid SFT contamination.

Optimization Details:

  • AdamW is used ($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.1).
  • Learning rate schedule: 2k warmup steps to a peak of $4.2 \times 10^{-4}$, with decays of ×0.316 at 80% and 90% of training.
  • Batch: 4,608 sequences of up to 4,096 tokens.
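The warmup-then-step-decay schedule described above can be sketched as follows (a minimal illustration; `peak_lr` and the decay points follow the bullets above, and 0.316 ≈ $1/\sqrt{10}$ so two decays leave roughly a tenth of the peak):

```python
def lr_at(step, total_steps, peak_lr=4.2e-4, warmup=2000, decay=0.316):
    """Linear warmup to peak_lr, then multiply by `decay` once at 80%
    and again at 90% of total training steps."""
    if step < warmup:
        return peak_lr * step / warmup
    lr = peak_lr
    if step >= 0.8 * total_steps:
        lr *= decay
    if step >= 0.9 * total_steps:
        lr *= decay
    return lr
```

For a 100k-step run this gives the peak rate through step 80k, ≈1.33 × 10⁻⁴ until 90k, and ≈4.2 × 10⁻⁵ thereafter.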

Supervised Fine-Tuning (SFT):

  • 1.5 M instruction–response pairs (1.2 M for helpfulness, 0.3 M for safety)
  • 2 epochs at a learning rate of $5 \times 10^{-6}$, max sequence length 2,048 tokens

Reinforcement Learning (GRPO):

  • Two stages: (1) alignment to code/math reasoning incentives, (2) preference-based alignment to multi-reward signals (helpfulness, safety, rule compliance).
  • Objective:

$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\ \mathrm{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_i\right) - \beta\, \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right]$

with $G$ responses $\{o_i\}$ sampled per question $q$ from $\pi_{\theta_{\mathrm{old}}}$, and $A_i = (r_i - \mathrm{mean}(r)) / \mathrm{std}(r)$ as the normalized, group-relative reward.
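The group-relative advantage that replaces a learned value baseline in GRPO can be sketched as follows (a minimal illustration using population standard deviation; the exact normalization convention may differ):

```python
import statistics

def grpo_advantages(rewards):
    """Normalize each sampled response's reward against the mean and std of
    its own group, so no separate value network is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard degenerate all-equal groups
    return [(r - mu) / sigma for r in rewards]
```

Responses that beat their group's average get positive advantages and are reinforced; below-average responses get negative ones, regardless of the absolute reward scale.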

4. Inference Efficiency and Memory Footprint

DeepSeek-V2-Lite-Chat achieves high throughput and low resource consumption during inference.

Key Properties:

  • Model weights use FP8 quantization; KV cache is quantized to ≈6 bits/element.
  • On 8×H800 GPUs:
    • Full DeepSeek-V2: ≈50,000 tokens/s
    • Lite-Chat: ≈130,000 tokens/s on same hardware (laboratory benchmark).
  • Peak GPU memory use is ≈100 GB (including all KV caches and activations).
  • Lite-Chat’s total off-GPU KV cache footprint is ≈1.2 MB/request; full V2 is ≈8.5 MB/request.
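Under the stated ≈6-bit cache quantization and the Lite geometry from earlier sections, a rough per-request KV footprint estimator looks like this (illustrative only; real deployments add paging and alignment overhead):

```python
def kv_cache_bytes(tokens, layers=27, latent=512, rope=64, bits=6):
    """Estimate the quantized MLA KV cache size for one request: each token
    stores (latent + rope) elements per layer at `bits` bits each."""
    elems_per_token = layers * (latent + rope)
    return tokens * elems_per_token * bits / 8
```

For example, each cached token costs 27 × 576 × 6/8 ≈ 11.4 KB, so footprint grows linearly and modestly with context length, which is what makes long-context, many-concurrent-request serving feasible.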

These characteristics indicate that Lite-Chat can be deployed for high-volume, low-latency chat applications and context-rich tasks, benefiting from MLA-driven cache compressibility and MoE-induced computational savings.

5. Empirical Performance and Benchmark Metrics

Lite-Chat matches or exceeds prior open-source models of 7B–16B scale across a breadth of benchmarks in zero/few-shot settings after SFT+RL alignment. The following tables summarize results (task setup as in DeepSeek-AI et al., 2024):

Table 1. DeepSeek-V2-Lite-Chat versus 7B/16B Baselines

| Benchmark | Shots | 7B Chat | 16B-MoE Chat | Lite-Chat (15.7B) |
| --- | --- | --- | --- | --- |
| MMLU (accuracy) | 5 | 49.7 | 47.2 | 55.7 |
| BBH (EM) | 3 | 43.1 | 42.2 | 48.1 |
| TriviaQA (EM) | 5 | 59.5 | 63.3 | 65.2 |
| NaturalQuestions | 5 | 32.7 | 35.1 | 35.5 |
| ARC-Easy | 25 | 70.2 | 69.9 | 74.3 |
| ARC-Challenge | 25 | 50.2 | 50.0 | 51.5 |
| AGIEval (acc) | 0 | 17.6 | 19.7 | 42.8 |
| HumanEval (pass@1) | 0 | 45.1 | 45.7 | 57.3 |
| MBPP (pass@1) | 3 | 39.0 | 46.2 | 45.8 |
| GSM8K (EM) | 8 | 62.6 | 62.2 | 72.0 |
| MATH (EM) | 4 | 14.7 | 15.2 | 27.9 |
| CMath (EM) | 0 | 66.4 | 67.9 | 71.7 |
| CLUEWSC (acc) | 5 | 66.2 | 68.2 | 80.0 |
| C-Eval (acc) | 5 | 44.7 | 40.0 | 60.1 |
| CMMLU (acc) | 5 | 51.2 | 49.3 | 62.5 |

A plausible implication is that the combined MLA+MoE architecture substantially boosts accuracy for certain knowledge-intensive and reasoning tasks at fixed inference budgets.

6. Comparative Analysis and Ablation

Ablation data without SFT/RL alignment indicate the base model (“Lite-Base (MLA+MoE)”) already outperforms dense and MoE chat baselines of similar scale in MMLU, ARC, and code/math reasoning benchmarks. Performance after SFT/RL alignment further improves robustness and safety, suggesting the training pipeline is effective in eliciting broad capabilities from the compact architecture.

Table 2. Base-model ablation (no SFT/RL)

| Benchmark | Shots | 7B-Dense | 16B-MoE | Lite-Base (MLA+MoE) |
| --- | --- | --- | --- | --- |
| MMLU | 5 | 48.2 | 45.0 | 58.3 |
| BBH | 3 | 39.5 | 38.9 | 44.1 |
| TriviaQA | 5 | 59.7 | 64.8 | 64.2 |
| HumanEval | 0 | 26.2 | 26.8 | 29.9 |
| GSM8K | 8 | 17.4 | 18.8 | 41.1 |
| MATH | 4 | 3.3 | 4.3 | 17.1 |
| C-Eval | 5 | 45.0 | 40.6 | 60.3 |
| CMMLU | 5 | 47.2 | 42.5 | 64.3 |

This suggests the MLA+MoE architectural innovations are critical even prior to downstream alignment.

7. Summary and Significance

DeepSeek-V2-Lite-Chat exemplifies the impact of integrating MLA and DeepSeekMoE to create models with strong task accuracy, memory efficiency, and computational economy. With only 15.7 B parameters and 2.4 B activated per token, Lite-Chat achieves up to 86% smaller KV cache, over 2× the throughput of dense 67 B models, and supports 32K context lengths, while delivering top-tier performance across language, code, reasoning, and multilingual benchmarks (DeepSeek-AI et al., 2024). This positions Lite-Chat as a preferred solution for resource-conscious deployment in production chat and complex reasoning pipelines.
