DeepSeek-V2: 236B MoE LLM Delivers Higher Performance While Cutting Training Cost by 42%
DeepSeek‑V2 is a 236‑billion‑parameter mixture‑of‑experts language model that reduces training cost by 42.5%, cuts KV‑cache usage by 93.3%, and boosts generation throughput 5.76×, while achieving top‑tier scores among open‑source models on benchmarks such as MMLU, C‑Eval, BBH, HumanEval, and GSM8K for both base and chat variants.
Introduction
DeepSeek‑V2 is a powerful Mixture‑of‑Experts (MoE) language model with 236 billion total parameters, of which 21 billion are activated for each token. It achieves stronger performance than the dense DeepSeek 67B baseline while saving 42.5% of training cost, reducing KV‑cache memory by 93.3%, and increasing maximum generation throughput by 5.76×.
Model Architecture
The model adopts two key innovations to keep training economical and inference efficient:
MLA (Multi‑head Latent Attention): a low‑rank key‑value joint compression technique that removes the KV‑cache bottleneck during inference and enables long context windows (a minimal PyTorch sketch of the idea follows this list).
DeepSeekMoE: a sparse MoE feed‑forward architecture that scales model capacity while cutting training cost, because only a few experts are activated for each token (also sketched below).
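The KV‑cache savings come from caching only a small shared latent per token instead of full per‑head keys and values. Below is a minimal PyTorch sketch of that low‑rank joint compression idea; the layer names, the toy dimensions (e.g. a 64‑dimensional latent), and the omission of RoPE handling and causal masking are simplifying assumptions, not the model's actual attention implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankKVAttention(nn.Module):
    # Attention layer that caches a compressed latent instead of full K/V tensors.
    def __init__(self, d_model=1024, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # joint down-projection (this is what gets cached)
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # reconstruct per-head keys from the latent
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # reconstruct per-head values from the latent
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                                  # (b, t, d_latent), tiny compared with full K/V
        if kv_cache is not None:
            latent = torch.cat([kv_cache, latent], dim=1)         # append new tokens to the compressed cache
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)             # causal masking omitted for brevity
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), latent                         # the latent doubles as the new KV cache

In this toy configuration the per‑token cache shrinks from 2 × 1024 values (keys plus values) to 64, which illustrates where savings of the magnitude of the reported 93.3% KV‑cache reduction can come from.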
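On the feed‑forward side, only a handful of experts run for any given token, which is how 236 billion total parameters translate into roughly 21 billion activated ones. The sketch below shows a shared‑plus‑routed‑experts pattern in PyTorch; the expert counts, hidden sizes, and top‑k value are illustrative assumptions, and the load‑balancing objectives and expert parallelism used in real training are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    # Shared experts process every token; routed experts are activated top-k per token.
    def __init__(self, d_model=1024, d_expert=256, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        self.top_k = top_k
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.gate = nn.Linear(d_model, n_routed, bias=False)      # router producing expert scores

    def forward(self, x):
        b, t, d = x.shape
        tokens = x.reshape(-1, d)                                 # route each token independently
        out = sum(expert(tokens) for expert in self.shared)       # shared experts see every token
        scores = F.softmax(self.gate(tokens), dim=-1)             # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)            # keep only the top-k experts per token
        for slot in range(self.top_k):
            for e_id, expert in enumerate(self.routed):
                mask = idx[:, slot] == e_id                       # tokens sent to this expert
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out.reshape(b, t, d)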
Training and Efficiency
The model was pretrained on a diverse 8.1‑trillion‑token corpus and subsequently refined with supervised fine‑tuning (SFT) and reinforcement learning (RL). The training pipeline demonstrates that MoE scaling can deliver stronger models with substantially lower compute and memory requirements.
Evaluation Results – Base Model
DeepSeek‑V2 (base) was benchmarked on a wide range of tasks. Representative scores include:
MMLU (English): 78.5
BBH (English): 78.9
C‑Eval (Chinese): 81.7
CMMLU (Chinese): 84.0
HumanEval (code): 40.9
GSM8K (math): 79.2
MATH (math): 43.6
These results surpass the dense DeepSeek 67B baseline and are competitive with other strong open‑source models such as LLaMA‑3 70B and Mixtral 8×22B.
Evaluation Results – Chat Model
The chat‑tuned variant (DeepSeek‑V2‑Chat RL) was evaluated on AlpacaEval 2.0, MT‑Bench, and several open‑source benchmarks. Key scores are:
MMLU (English): 77.8
BBH (English): 79.7
C‑Eval (Chinese): 78.0
CMMLU (Chinese): 81.6
HumanEval (code): 81.1
GSM8K (math): 92.2
MATH (math): 53.9
In the Needle‑In‑A‑Haystack (NIAH) test, the model maintains strong retrieval performance up to a 128K‑token context window.
Chinese Open‑Ended Generation
When compared with other open‑source and closed‑source models on a Chinese open‑ended generation benchmark, DeepSeek‑V2‑Chat (RL) achieved an overall score of 7.91, with 7.45 on Chinese reasoning and 8.36 on Chinese language generation, closely matching GPT‑4‑1106‑preview (8.01 overall).
LiveCodeBench Coding Benchmark
On the LiveCodeBench (0901–0401) benchmark, which is built from recently published coding problems, DeepSeek‑V2 attained a Pass@1 score of 66.6, outperforming many comparable models.
Usage and Licensing
The model can be run locally with BF16 precision on eight 80 GB GPUs. Hugging Face Transformers provides a ready‑to‑use inference pipeline. Example code for text completion and chat completion is provided below.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
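# Reserve headroom below the 80 GB per-GPU capacity so device_map="auto" shards the weights across all eight GPUs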
max_memory = {i: "75GB" for i in range(8)}
model = AutoModelForCausalLM.from_pretrained(
model_name,
trust_remote_code=True,
device_map="auto",
torch_dtype=torch.bfloat16,
max_memory=max_memory,
)
model.generation_config = GenerationConfig.from_pretrained(model_name)
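# The tokenizer defines no dedicated pad token, so reuse the end-of-sequence token for padding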
model.generation_config.pad_token_id = model.generation_config.eos_token_id
text = "An attention function can be described as mapping a query and a set of key‑value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
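Chat completion follows the same loading pattern. The sketch below assumes the chat checkpoint (e.g. "deepseek-ai/DeepSeek-V2-Chat") has been loaded as model and tokenizer in place of the base model above, and that its tokenizer ships a chat template; the prompt is only an example.

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"},
]
# Render the conversation with the model's chat template and append the assistant prompt
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids.to(model.device), max_new_tokens=100)
# Decode only the newly generated tokens
reply = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(reply)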
The codebase is released under the MIT license, and both the base and chat models are permitted for commercial use.
