DeepSeek-V2-Lite-Chat: Efficient Chat Model
- DeepSeek-V2-Lite-Chat is a chat-optimized large language model that integrates Mixture-of-Experts and Multi-head Latent Attention to enhance computational efficiency and accuracy.
- It employs a 27-layer Transformer with 15.7B parameters (2.4B active per token) to achieve up to an 86% reduction in key-value cache memory compared to dense models.
- Through rigorous supervised fine-tuning and reinforcement learning alignment, the model outperforms similar-scale dense and MoE baselines in reasoning, multilingual, and code tasks.
DeepSeek-V2-Lite-Chat is a chat-optimized LLM based on the DeepSeek-V2 architecture, incorporating Mixture-of-Experts (MoE) sparsity and Multi-head Latent Attention (MLA) mechanisms to achieve high accuracy, efficiency, and economical scaling. It is designed as a resource-efficient alternative to very large dense models, maintaining strong performance in multitask and multilingual benchmarks with a substantially reduced inference footprint and memory requirements. The model represents a derivative of the broader DeepSeek-V2 system, tailored for practical chat and reasoning applications through supervised fine-tuning (SFT) and reinforcement learning (RL) alignment (DeepSeek-AI et al., 2024).
1. Model Architecture and Parameterization
DeepSeek-V2-Lite-Chat employs a 27-layer Transformer backbone with a hidden dimension of 2048. The model’s total parameter count is 15.7 B, with only 2.4 B parameters activated per token, an activation ratio of roughly 15%, achieved through a sparsely activated MoE feedforward design for computational efficiency.
MoE Configuration:
- Each Transformer layer (except the first, which keeps a dense FFN) deploys DeepSeekMoE, consisting of 2 shared experts and 64 routed experts.
- Each expert’s FFN intermediate dimension is 1408, and for each token, 6 routed experts are selected.
- Expert selection uses a softmax over linear projections of token-wise activations:
$s_{i,t} = \mathrm{Softmax}_i(u_t^{T} e_i), \qquad g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{TopK}(\{s_{j,t} \mid 1 \le j \le N_r\},\, K_r) \\ 0, & \text{otherwise} \end{cases}$
- The per-token FFN update combines the shared experts with the gated routed experts:
$h_t' = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(u_t) + \sum_{i=1}^{N_r} g_{i,t}\, \mathrm{FFN}_i^{(r)}(u_t)$
- Context length is 4K tokens initially and is extended up to 32K tokens via the YaRN scaling scheme.
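The routing rule above can be sketched in NumPy. This is a minimal illustration, not the training implementation; the 64 routed experts and $K_r = 6$ match DeepSeek-V2-Lite’s configuration, while the function and variable names are illustrative.

```python
import numpy as np

def moe_gate(u_t, experts, k_r=6):
    """Top-K routing as in DeepSeekMoE: softmax affinities s_{i,t},
    keep the K_r largest gates, zero out the rest."""
    # Affinity scores s_{i,t} = Softmax_i(u_t^T e_i)
    logits = experts @ u_t                      # shape (N_r,)
    s = np.exp(logits - logits.max())
    s /= s.sum()
    # Keep only the K_r highest-affinity experts
    g = np.zeros_like(s)
    top = np.argsort(s)[-k_r:]
    g[top] = s[top]
    return g

rng = np.random.default_rng(0)
d, n_routed = 2048, 64                          # hidden dim / routed experts (Lite config)
u = rng.standard_normal(d)
e = rng.standard_normal((n_routed, d))          # rows play the role of expert centroids e_i
gates = moe_gate(u, e, k_r=6)
print((gates > 0).sum())                        # exactly K_r = 6 gates are nonzero
```

Since softmax scores are strictly positive, exactly $K_r$ gates survive; each token’s FFN output then sums the shared experts plus the 6 gated routed experts.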
MLA and Attention:
- Standard multi-head attention is replaced by MLA, with $n_h = 16$ heads (per-head dimension $d_h = 128$), compressing key-value memories into a $d_c = 512$-dimensional latent space.
- The decoupled rotary position encoding adds an extra per-head dimension $d_h^R = 64$, with a single decoupled key shared across heads, enabling robust position representation during context extension.
2. Multi-Head Latent Attention and KV Cache Compression
MLA compresses key-value (KV) memory per layer and per token into compact latent vectors, facilitating substantial reductions in inference memory.
Compression Pipeline:
- For each token $t$ with attention input $u_t$, keys and values are generated from a shared latent:
$c_t^{KV} = W^{DKV} u_t, \qquad k_t^{C} = W^{UK} c_t^{KV}, \qquad v_t^{C} = W^{UV} c_t^{KV},$
where $c_t^{KV} \in \mathbb{R}^{d_c}$ and $d_c \ll n_h d_h$.
- Decoupled rotary encoding for keys and queries:
$k_t^{R} = \mathrm{RoPE}(W^{KR} u_t), \qquad q_{t,i}^{R} = \mathrm{RoPE}(W^{QR}_i u_t),$
with $k_t^{R} \in \mathbb{R}^{d_h^R}$ shared across all heads.
- The per-head queries and keys are concatenated as $q_{t,i} = [q_{t,i}^{C}; q_{t,i}^{R}]$ and $k_{t,i} = [k_{t,i}^{C}; k_t^{R}]$.
- KV cache memory cost per token, per layer:
- MLA: $d_c + d_h^R = 512 + 64 = 576$ elements (only the latent and the shared rotary key are cached)
- Standard MHA: $2 n_h d_h = 2 \times 16 \times 128 = 4096$ elements
- This yields an ≈86% reduction in KV cache requirements ($1 - 576/4096 \approx 0.86$).
A plausible implication is that MLA’s targeted low-rank compression, with decoupled rotary embedding, is integral to scaling to long contexts and high-throughput deployment under constrained memory budgets.
3. Training Regimen and Alignment
Pretraining draws on the DeepSeek-V2 corpus, an 8.1 T-token collection; the Lite variant is trained from scratch on 5.7 T tokens of it, with no SFT data included in pretraining so as to avoid contamination.
Optimization Details:
- AdamW is used ($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.1).
- Learning rate schedule: 2k warmup steps to the peak rate, which is then multiplied by 0.316 at 80% and again at 90% of training completion.
- Batch: 4,608 sequences of up to 4,096 tokens.
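The step-decay schedule above can be written as a small function of training progress. This is a sketch of the described shape only; expressing warmup as a fraction of total steps (`warmup_frac`) rather than a fixed 2k-step count is an assumption for illustration.

```python
def lr_multiplier(progress: float, warmup_frac: float = 0.01) -> float:
    """Multiplier on the peak learning rate, given `progress` in [0, 1]:
    linear warmup, constant plateau, then x0.316 at 80% and again at 90%."""
    if progress < warmup_frac:
        return progress / warmup_frac   # linear warmup to the peak
    if progress < 0.8:
        return 1.0                      # plateau at peak LR
    if progress < 0.9:
        return 0.316                    # first decay at 80% completion
    return 0.316 ** 2                   # second decay at 90% completion

print(lr_multiplier(0.5), lr_multiplier(0.85), lr_multiplier(0.95))
```

Two successive ×0.316 decays leave roughly 10% of the peak learning rate for the final 10% of training, since $0.316^2 \approx 0.1$.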
Supervised Fine-Tuning (SFT):
- 1.5 M instruction–response pairs (1.2 M for helpfulness, 0.3 M for safety)
- 2 epochs with a constant learning rate and a maximum sequence length of 2,048 tokens (exact learning rate and batch size as in DeepSeek-AI et al., 2024)
Reinforcement Learning (GRPO):
- Two stages: (1) alignment to code/math reasoning incentives, (2) preference-based alignment to multi-reward signals (helpfulness, safety, rule compliance).
- Objective (clipped group-relative surrogate with a KL penalty toward the reference policy):
$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\; \mathrm{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\left( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \right) \right) \right]$
with $A_i = \dfrac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}$ as the group-normalized reward (the advantage), so no learned value critic is needed.
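A minimal NumPy sketch of the group-relative advantage and the clipped surrogate follows. The reward values, `eps`, and `beta` are illustrative, and the KL term is passed in precomputed rather than estimated from samples.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: each sampled output's reward is
    normalized by the mean/std of its own group (no value critic)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_loss(logp_new, logp_old, advantages, eps=0.2, beta=0.01, kl=0.0):
    """Clipped surrogate over a group of G outputs, minus a KL penalty
    toward the reference policy (`kl` supplied precomputed here)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Negate: maximizing the objective = minimizing this loss
    return -(np.minimum(unclipped, clipped).mean() - beta * kl)

# Group of G = 4 sampled responses to one prompt, with scalar rewards
adv = grpo_advantages([1.0, 0.0, 0.5, 0.0])
print(adv.round(3))
```

Because advantages are computed relative to the group, the best response in a group gets a positive signal even when all absolute rewards are low, which is what lets GRPO dispense with a critic.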
4. Inference Efficiency and Memory Footprint
DeepSeek-V2-Lite-Chat achieves high throughput and low resource consumption during inference.
Key Properties:
- Model weights use FP8 quantization; KV cache is quantized to ≈6 bits/element.
- On 8×H800 GPUs:
- Full DeepSeek-V2: ≈50,000 tokens/s
- Lite-Chat: ≈130,000 tokens/s on the same hardware (laboratory benchmark).
- Peak GPU memory use is ≈100 GB (including all KV caches and activations).
- Lite-Chat’s per-request KV cache footprint is ≈1.2 MB, versus ≈8.5 MB for full V2.
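A back-of-envelope calculator makes the cache savings concrete. The MLA element counts and 6-bit quantization come from the sections above; assuming an FP16 baseline for the standard-MHA comparison is my addition.

```python
def kv_bytes_per_token(n_layers, elems_per_layer, bits_per_elem):
    """Per-token KV cache size in bytes: layers x cached elements x bit width."""
    return n_layers * elems_per_layer * bits_per_elem / 8

# DeepSeek-V2-Lite: 27 layers, 576 cached elements/layer (MLA) at ~6 bits,
# versus 4096 elements/layer for standard MHA at FP16 (assumed baseline).
mla = kv_bytes_per_token(27, 576, 6)
mha = kv_bytes_per_token(27, 4096, 16)
print(f"MLA: {mla/1024:.1f} KiB/token, MHA: {mha/1024:.1f} KiB/token")
```

Roughly 11 KiB per token versus 216 KiB per token under these assumptions, which is why long contexts and many concurrent requests fit in a constrained memory budget.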
These characteristics indicate that Lite-Chat can be deployed for high-volume, low-latency chat applications and context-rich tasks, benefiting from MLA-driven cache compressibility and MoE-induced computational savings.
5. Empirical Performance and Benchmark Metrics
Lite-Chat matches or exceeds prior open-source models of 7B–16B scale across a broad set of benchmarks in zero/few-shot settings after SFT+RL alignment. The following tables summarize results (tasks as in DeepSeek-AI et al., 2024):
Table 1. DeepSeek-V2-Lite-Chat versus 7B/16B Baselines
| Benchmark | Shots | 7B Chat | 16B-MoE Chat | Lite-Chat (15.7B) |
|---|---|---|---|---|
| MMLU (accuracy) | 5 | 49.7 | 47.2 | 55.7 |
| BBH (EM) | 3 | 43.1 | 42.2 | 48.1 |
| TriviaQA (EM) | 5 | 59.5 | 63.3 | 65.2 |
| NaturalQuestions | 5 | 32.7 | 35.1 | 35.5 |
| ARC-Easy | 25 | 70.2 | 69.9 | 74.3 |
| ARC-Challenge | 25 | 50.2 | 50.0 | 51.5 |
| AGIEval (acc) | 0 | 17.6 | 19.7 | 42.8 |
| HumanEval (pass@1) | 0 | 45.1 | 45.7 | 57.3 |
| MBPP (pass@1) | 3 | 39.0 | 46.2 | 45.8 |
| GSM8K (EM) | 8 | 62.6 | 62.2 | 72.0 |
| MATH (EM) | 4 | 14.7 | 15.2 | 27.9 |
| CMath (EM) | 0 | 66.4 | 67.9 | 71.7 |
| CLUEWSC (acc) | 5 | 66.2 | 68.2 | 80.0 |
| C-Eval (acc) | 5 | 44.7 | 40.0 | 60.1 |
| CMMLU (acc) | 5 | 51.2 | 49.3 | 62.5 |
A plausible implication is that the combined MLA+MoE architecture substantially boosts accuracy for certain knowledge-intensive and reasoning tasks at fixed inference budgets.
6. Comparative Analysis and Ablation
Ablation data without SFT/RL alignment indicate the base model (“Lite-Base (MLA+MoE)”) already outperforms dense and MoE chat baselines of similar scale in MMLU, ARC, and code/math reasoning benchmarks. Performance after SFT/RL alignment further improves robustness and safety, suggesting the training pipeline is effective in eliciting broad capabilities from the compact architecture.
Table 2. Base-model ablation (no SFT/RL)
| Benchmark | Shots | 7B-Dense | 16B-MoE | Lite-Base (MLA+MoE) |
|---|---|---|---|---|
| MMLU | 5 | 48.2 | 45.0 | 58.3 |
| BBH | 3 | 39.5 | 38.9 | 44.1 |
| TriviaQA | 5 | 59.7 | 64.8 | 64.2 |
| HumanEval | 0 | 26.2 | 26.8 | 29.9 |
| GSM8K | 8 | 17.4 | 18.8 | 41.1 |
| MATH | 4 | 3.3 | 4.3 | 17.1 |
| C-Eval | 5 | 45.0 | 40.6 | 60.3 |
| CMMLU | 5 | 47.2 | 42.5 | 64.3 |
This suggests the MLA+MoE architectural innovations are critical even prior to downstream alignment.
7. Summary and Significance
DeepSeek-V2-Lite-Chat exemplifies the impact of integrating MLA and DeepSeekMoE to create models with strong task accuracy, memory efficiency, and computational economy. With only 15.7 B parameters and 2.4 B activated per token, Lite-Chat achieves up to 86% smaller KV cache, over 2× the throughput of dense 67 B models, and supports 32K context lengths, while delivering top-tier performance across language, code, reasoning, and multilingual benchmarks (DeepSeek-AI et al., 2024). This positions Lite-Chat as a preferred solution for resource-conscious deployment in production chat and complex reasoning pipelines.