DeepSeek-V2: 236B MoE LLM Delivers Higher Performance While Cutting Training Cost by 42%

DeepSeek‑V2 is a 236‑billion‑parameter mixture‑of‑experts language model that reduces training cost by 42.5 %, cuts KV‑cache usage by 93.3 %, and boosts generation throughput 5.76×, while achieving state‑of‑the‑art scores on benchmarks such as MMLU, C‑Eval, BBH, HumanEval, and GSM8K for both base and chat variants.

Baobao Algorithm Notes
Baobao Algorithm Notes
Baobao Algorithm Notes
DeepSeek-V2: 236B MoE LLM Delivers Higher Performance While Cutting Training Cost by 42%

Introduction

DeepSeek‑V2 is a powerful expert mixture‑of‑experts (MoE) language model containing 236 billion parameters. It activates 21 billion tokens per inference step and achieves stronger performance than the 67 billion‑parameter Dense‑67B baseline while saving 42.5 % of training cost, reducing KV‑cache memory by 93.3 %, and increasing maximum generation throughput by 5.76×.

Model overview
Model overview

Model Architecture

The model adopts two key innovations to keep training economical and inference efficient:

IEAttn: a low‑rank key‑value joint compression technique that eliminates the KV‑cache bottleneck during inference, enabling long context windows.

DeepSeekMoE: a high‑performance MoE feed‑forward network that reduces training cost while scaling model capacity.

Architecture diagram
Architecture diagram

Training and Efficiency

The model was pretrained on a diverse 8.1‑trillion‑token corpus and subsequently refined with supervised fine‑tuning (SFT) and reinforcement learning (RL). The training pipeline demonstrates that MoE scaling can deliver stronger models with substantially lower compute and memory requirements.

Evaluation Results – Base Model

DeepSeek‑V2 (base) was benchmarked on a wide range of tasks. Representative scores include:

MMLU (English): 78.5

BBH (English): 78.9

C‑Eval (Chinese): 81.7

CMMLU (Chinese): 84.0

HumanEval (code): 40.9

GSM8K (math): 79.2

Math (general): 43.6

These results surpass the Dense‑67B baseline and are competitive with larger proprietary models such as LLaMA‑3 70B and Mixtral 8×22B.

Evaluation Results – Chat Model

The chat‑tuned variant (DeepSeek‑V2‑Chat RL) was evaluated on AlpacaEval 2.0, MTBench, and several open‑source benchmarks. Key scores are:

MMLU (English): 77.8

BBH (English): 79.7

C‑Eval (Chinese): 78.0

CMMLU (Chinese): 81.6

HumanEval (code): 81.1

GSM8K (math): 92.2

Math (general): 53.9

In the Needle‑In‑A‑Haystack (NIAH) test, the model maintains strong performance up to a 128 k token context window.

Chinese Open‑Ended Generation

When compared with other open‑source and closed‑source models on a Chinese open‑ended generation benchmark, DeepSeek‑V2‑Chat (RL) achieved an overall score of 7.91, with 7.45 on Chinese reasoning and 8.36 on Chinese language generation, closely matching GPT‑4‑1106‑preview (8.01 overall).

LiveCodeBench Coding Benchmark

On the LiveCodeBench (0901‑0401) real‑time coding benchmark, DeepSeek‑V2 attained a Pass@1 score of 66.6, outperforming many comparable models and demonstrating effective handling of live coding tasks.

Usage and Licensing

The model can be run locally with BF16 precision on eight 80 GB GPUs. Hugging Face Transformers provides a ready‑to‑use inference pipeline. Example code for text completion and chat completion is provided below.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2"
 tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
 max_memory = {i: "75GB" for i in range(8)}
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
     trust_remote_code=True,
     device_map="auto",
     torch_dtype=torch.bfloat16,
     max_memory=max_memory,
 )
 model.generation_config = GenerationConfig.from_pretrained(model_name)
 model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key‑value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
 inputs = tokenizer(text, return_tensors="pt")
 outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
 result = tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(result)

The codebase is released under the MIT license, and both the base and chat models are permitted for commercial use.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

AIMixture of Expertslarge language modelDeepSeek-V2
Baobao Algorithm Notes
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.