Building a Mini‑DeepSeek‑V3: Transformer Block and MTP Implementation on Limited Compute
This article walks through the design and implementation of a Mini‑DeepSeek‑V3 language model, detailing how to assemble the core Transformer block, integrate Multi‑Token Prediction (MTP) modules, construct the overall architecture, and compute the combined loss—all using modest GPU resources and a single‑card or DDP training setup.
Overview
Mini‑DeepSeek‑V3 reproduces the core architecture of DeepSeek‑V3 (MLA, DeepSeekMoE, MTP, auxiliary‑loss‑free load‑balancing and sequence‑level auxiliary loss) while fitting on a single GPU or a small DDP cluster. It uses the original RoPE positional encoding and does not employ YaRN for context‑length extension.
Prerequisites
Python ≥ 3.8, PyTorch ≥ 2.0, CUDA‑compatible GPU.
DeepSeekMoE (Mixture‑of‑Experts) and MLA modules, both available in the repository.
Training data prepared as token sequences of length T.
Transformer Block
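The block uses the standard pre‑norm order: each sub‑layer (attention, then MLP) reads a normalized copy of its input and adds its output back onto the residual stream. A final RMSNorm is applied to the block output.

import torch.nn as nn

# RMSNorm, MultiHeadSelfAttention and MLP are assumed to be defined elsewhere in the repository.
class TransformerBlock(nn.Module):
    def __init__(self, dim, n_head, mlp_ratio):  # further arguments elided in the source
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = MultiHeadSelfAttention(dim, n_head)
        self.norm2 = RMSNorm(dim)
        self.mlp = MLP(dim, int(dim * mlp_ratio))
        self.output_norm = RMSNorm(dim)  # added to match source code

    def forward(self, x, mask=None):
        # Attention sub-layer: normalize, attend, add the residual.
        x = x + self.attn(self.norm1(x), mask=mask)
        # Feed-forward sub-layer: normalize, apply the MLP, add the residual.
        x = x + self.mlp(self.norm2(x))
        return self.output_norm(x)

Multi‑Token Prediction (MTP)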
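MTP extends prediction beyond the main model's next‑token target. At position i, the module at depth k fuses the hidden state from depth k‑1 with the embedding of token i+k and predicts token i+k+1. The fusion is a concatenation [h_{i}^{k‑1}; e_{i+k}]; in the DeepSeek‑V3 report both parts are RMS‑normalized first and the concatenation is projected back to the model dimension before entering the MTP Transformer block. All MTP modules share the main model’s embedding matrix and output head.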
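A minimal sketch of the fusion step at depth k = 1 is shown below. The normalization and projection follow the formulation in the DeepSeek‑V3 report; the class name MTPFusion, the local RMSNorm, and all tensor sizes are illustrative assumptions, not the repository's code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm, defined locally so the sketch is self-contained."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

class MTPFusion(nn.Module):
    """Hypothetical helper: fuse h_i^{k-1} with the embedding of token i+k."""
    def __init__(self, dim):
        super().__init__()
        self.norm_h = RMSNorm(dim)
        self.norm_e = RMSNorm(dim)
        self.proj = nn.Linear(2 * dim, dim, bias=False)  # back to model dimension

    def forward(self, h_prev, emb_next):
        fused = torch.cat([self.norm_h(h_prev), self.norm_e(emb_next)], dim=-1)
        return self.proj(fused)

# Toy sizes (assumptions): batch 2, sequence length 8, model dim 16, vocab 32.
embed = nn.Embedding(32, 16)            # shared with the main model
input_ids = torch.randint(0, 32, (2, 8))
h_prev = torch.randn(2, 8, 16)          # stand-in for the main model's hidden states

fusion = MTPFusion(16)
# Depth 1: pair h_i with the embedding of token i+1, so both sides have length T-1.
fused = fusion(h_prev[:, :-1], embed(input_ids[:, 1:]))
print(fused.shape)  # torch.Size([2, 7, 16])
```

The fused tensor is what the MTP Transformer block would consume; its sequence length is already T‑1 because the last hidden state has no following token to pair with.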
For a given prediction depth k:
Input length for MTP is mtp_seq_len = T‑k.
During training a mask hides padded positions.
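The loss for depth k is the cross‑entropy between the MTP logits at position i and the ground‑truth token i+k+1, which is entry i+k of the targets tensor because targets is already shifted by one relative to input_ids.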
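The alignment for an arbitrary depth k can be sketched as follows. The helper name mtp_slices and the toy token ids are assumptions for illustration, and targets is assumed to be input_ids shifted left by one position.

```python
import torch

def mtp_slices(hidden, input_ids, targets, k):
    """Align the three tensors an MTP module at depth k consumes."""
    mtp_seq_len = input_ids.size(1) - k   # T - k usable positions
    h_in = hidden[:, :mtp_seq_len]        # h_i for i = 0 .. T-k-1
    ids_in = input_ids[:, k:]             # token i+k, embedded and fused with h_i
    tgt = targets[:, k:]                  # token i+k+1, the label at position i
    return h_in, ids_in, tgt

# Toy sequence of token ids 10..15; targets is input_ids shifted left by one.
tokens = torch.arange(10, 16).unsqueeze(0)       # shape (1, 6)
input_ids, targets = tokens[:, :-1], tokens[:, 1:]
hidden = torch.zeros(1, input_ids.size(1), 4)    # placeholder hidden states, dim 4

h_in, ids_in, tgt = mtp_slices(hidden, input_ids, targets, k=1)
print(ids_in.tolist())  # [[11, 12, 13, 14]]
print(tgt.tolist())     # [[12, 13, 14, 15]]
```

Each label is one token past the fused input token, so the module at depth 1 predicts two tokens ahead of the hidden state it starts from.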
When the depth is fixed to 1 (the setting used in the repository), the input and target slices are:
# input to MTP (skip first token)
mtp_input_ids = input_ids[:, 1:]
# target for MTP (skip first token)
mtp_targets = targets[:, 1:]

Model Assembly
The full model combines the main transformer stack, the MTP modules and the loss computation:
Run each transformer block, collecting sequence‑level auxiliary loss and expert load statistics.
Store the hidden state h_for_mtp for MTP.
Apply the final RMSNorm and the output head to the last hidden state to get the main logits, then compute the main next‑token loss.
For each MTP module (training only) compute logits, apply the mask, and calculate cross‑entropy loss.
Aggregate total loss as loss = main_loss + Σ auxiliary_loss + Σ mtp_loss.
The forward method returns logits, the total loss, and optional monitoring data (expert load, auxiliary losses).
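The control flow of these steps can be sketched with a toy model. Everything here is a stand-in under stated assumptions: ToyBlock replaces the real MLA/MoE block and returns a zero auxiliary loss, nn.LayerNorm replaces RMSNorm, padding masks are omitted, and a single MTP module at depth 1 is used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBlock(nn.Module):
    """Stand-in for a transformer block; returns (hidden, aux_loss)."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Linear(dim, dim)

    def forward(self, x):
        aux_loss = x.new_zeros(())  # a real MoE block would return its auxiliary loss
        return x + self.ff(x), aux_loss

class ToyModel(nn.Module):
    def __init__(self, vocab=32, dim=16, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList([ToyBlock(dim) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(dim)               # stand-in for RMSNorm
        self.head = nn.Linear(dim, vocab, bias=False)
        self.mtp_proj = nn.Linear(2 * dim, dim, bias=False)
        self.mtp_block = ToyBlock(dim)

    def forward(self, input_ids, targets=None):
        h = self.embed(input_ids)
        aux_losses = []
        for block in self.blocks:                   # step 1: run blocks, collect aux losses
            h, aux = block(h)
            aux_losses.append(aux)
        h_for_mtp = h                               # step 2: keep hidden state for MTP
        logits = self.head(self.norm(h))            # step 3: main logits
        if targets is None:
            return logits, None, {}
        main_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        mtp_losses = []
        if self.training:                           # step 4: MTP runs in training only
            fused = torch.cat([h_for_mtp[:, :-1], self.embed(input_ids[:, 1:])], dim=-1)
            h_mtp, _ = self.mtp_block(self.mtp_proj(fused))
            mtp_logits = self.head(self.norm(h_mtp))  # shared output head
            mtp_losses.append(
                F.cross_entropy(mtp_logits.flatten(0, 1), targets[:, 1:].flatten())
            )
        loss = main_loss + sum(aux_losses) + sum(mtp_losses)  # step 5
        return logits, loss, {"aux_losses": aux_losses, "mtp_losses": mtp_losses}

tokens = torch.randint(0, 32, (2, 9))
input_ids, targets = tokens[:, :-1], tokens[:, 1:]
model = ToyModel()
model.train()
logits, loss, info = model(input_ids, targets)
print(logits.shape, len(info["mtp_losses"]))  # torch.Size([2, 8, 32]) 1
```

Switching the model to eval mode skips the MTP branch, so the returned loss then contains only the main and auxiliary terms.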
Key Implementation Details
The paper's description of MTP does not include a normalization layer after each MTP block; self.output_norm is added here to keep the implementation consistent with the source code.
MTP uses a simple MLP for the feed‑forward part; the MoE variant is not required for the fixed depth = 1 setting.
MTP modules run only during training; within them, the mask keeps padded positions out of the loss.
When seqlen = 1 (inference with KV‑Cache) causal masking is unnecessary; otherwise a causal mask is applied.
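The causal-mask rule from the last point can be sketched as an additive mask. The function name build_causal_mask is an assumption; the repository may construct its mask differently.

```python
import torch

def build_causal_mask(seqlen, device=None):
    """Return None for single-token decoding, else an additive causal mask."""
    if seqlen == 1:
        return None  # with a KV cache, the single query may attend to every cached key
    mask = torch.full((seqlen, seqlen), float("-inf"), device=device)
    return torch.triu(mask, diagonal=1)  # -inf strictly above the diagonal, 0 elsewhere

mask = build_causal_mask(3)
scores = torch.zeros(3, 3)                     # toy attention scores
weights = torch.softmax(scores + mask, dim=-1)
print(weights[0].tolist())  # [1.0, 0.0, 0.0]
```

Adding the mask before the softmax drives the weight on future positions to zero, so row i attends only to positions 0 through i.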
Loss Formulation
For each depth k the loss is
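loss_k = CrossEntropyLoss(logits_k, targets[:, k:])
where logits_k are produced by the k‑th MTP module and targets is the next‑token target sequence, already shifted by one relative to input_ids, so the depth‑1 slice targets[:, 1:] shown earlier is the k = 1 case. The total loss combines the main next‑token loss (k = 0) with all MTP losses and the auxiliary losses.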
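A small sketch of the per-depth loss with padded positions excluded is given below. PAD_ID and the function name depth_loss are hypothetical, and using ignore_index is one possible way to apply the padding mask, not necessarily the repository's.

```python
import math
import torch
import torch.nn.functional as F

PAD_ID = 0  # hypothetical padding id; use whatever the tokenizer defines

def depth_loss(logits_k, targets, k, pad_id=PAD_ID):
    """Cross-entropy for depth k (k = 0 is the main next-token loss)."""
    tgt_k = targets[:, k:]                        # labels aligned with logits_k
    return F.cross_entropy(
        logits_k.reshape(-1, logits_k.size(-1)),  # (batch * (T - k), vocab)
        tgt_k.reshape(-1),                        # (batch * (T - k),)
        ignore_index=pad_id,                      # padded positions are skipped
    )

vocab = 5
targets = torch.tensor([[3, 1, 4, 2, PAD_ID, PAD_ID]])  # T = 6, last two padded
logits_0 = torch.zeros(1, 6, vocab)   # uniform predictions from the main head
logits_1 = torch.zeros(1, 5, vocab)   # uniform predictions from the depth-1 MTP head

main_loss = depth_loss(logits_0, targets, k=0)
mtp_loss = depth_loss(logits_1, targets, k=1)
total = main_loss + mtp_loss          # auxiliary losses omitted in this sketch
```

With uniform logits every unpadded position contributes log(vocab), so both terms equal log 5 here regardless of how many positions are padded.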
Repository
Full training scripts, model checkpoints and the reference implementation are available at:
https://github.com/WKQ9411/Mini-LLM
References
DeepSeek‑V3 technical report: http://arxiv.org/abs/2412.19437
DeepSeek‑V2: http://arxiv.org/abs/2405.04434
DeepSeekMoE: http://arxiv.org/abs/2401.06066
Auxiliary‑loss‑free load balancing: http://arxiv.org/abs/2408.15664
Elegant multi‑head self‑attention with einsum: https://www.cnblogs.com/qftie/p/16245124.html
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
