Building a Mini‑DeepSeek‑V3: Transformer Block and MTP Implementation on Limited Compute

This article walks through the design and implementation of a Mini‑DeepSeek‑V3 language model, detailing how to assemble the core Transformer block, integrate Multi‑Token Prediction (MTP) modules, construct the overall architecture, and compute the combined loss—all using modest GPU resources and a single‑card or DDP training setup.

Data Party THU

Overview

Mini‑DeepSeek‑V3 reproduces the core architecture of DeepSeek‑V3 (MLA, DeepSeekMoE, MTP, auxiliary‑loss‑free load‑balancing and sequence‑level auxiliary loss) while fitting on a single GPU or a small DDP cluster. It uses the original RoPE positional encoding and does not employ YaRN for context‑length extension.

Prerequisites

Python ≥ 3.8, PyTorch ≥ 2.0, CUDA‑compatible GPU.

DeepSeekMoE (Mixture‑of‑Experts) and MLA modules, as provided in the repository.

Training data prepared as token sequences of length T.

Transformer Block

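The block follows the standard pre‑norm order: normalize, attend, add the residual; then normalize, apply the feed‑forward layer, and add the residual again.

import torch.nn as nn

# RMSNorm, MultiHeadSelfAttention and MLP are the repository's own modules.
class TransformerBlock(nn.Module):
    def __init__(self, dim, n_head, mlp_ratio):  # further arguments are elided in the source
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = MultiHeadSelfAttention(dim, n_head)
        self.norm2 = RMSNorm(dim)
        self.mlp = MLP(dim, int(dim * mlp_ratio))
        self.output_norm = RMSNorm(dim)   # added to match source code

    def forward(self, x, mask=None):
        # Attention sub-layer: the residual is taken from the un-normalized input
        x = x + self.attn(self.norm1(x), mask=mask)
        # Feed-forward sub-layer: again add the un-normalized residual
        x = x + self.mlp(self.norm2(x))
        return self.output_norm(x)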
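
The block above relies on RMSNorm, which is not shown in this article. A minimal sketch of the usual formulation follows; the eps default and the parameter name weight are assumptions and may differ from the repository's version.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by 1/RMS(x) without subtracting the mean."""

    def __init__(self, dim: int, eps: float = 1e-6):  # eps value is an assumption
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize over the last (feature) dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

x = torch.randn(2, 5, 16)
y = RMSNorm(16)(x)  # same shape as x, each feature vector has RMS close to 1
```

Unlike LayerNorm, this variant has no bias and no mean subtraction, so it needs one reduction per vector instead of two.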

Multi‑Token Prediction (MTP)

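The k‑th MTP module extends prediction one token further: at position i it fuses the hidden state from depth k‑1, h_{i}^{k‑1}, with the embedding of token i+k, e_{i+k}, and predicts token i+k+1. Following the DeepSeek‑V3 report, both inputs are normalized with RMSNorm, concatenated as [h_{i}^{k‑1}; e_{i+k}], and projected back to the model dimension by a linear layer. All MTP modules share the main model's embedding matrix and output head.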
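
The fusion step can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: MTPFusion is a hypothetical helper, a parameter-free RMS normalization stands in for RMSNorm, and h_main is random data standing in for the main stack's output.

```python
import torch
import torch.nn as nn

class MTPFusion(nn.Module):
    """Hypothetical helper: merge h^{k-1}_i with the embedding of token i+k."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.proj = nn.Linear(2 * dim, dim, bias=False)  # maps [h; e] back to dim

    def _rms(self, x: torch.Tensor) -> torch.Tensor:
        # Parameter-free RMS normalization, a stand-in for the real RMSNorm
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, h_prev: torch.Tensor, emb_next: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, T - k, dim), aligned position by position
        return self.proj(torch.cat([self._rms(h_prev), self._rms(emb_next)], dim=-1))

batch, T, dim, vocab, k = 2, 8, 16, 100, 1
embedding = nn.Embedding(vocab, dim)        # shared with the main model
input_ids = torch.randint(0, vocab, (batch, T))
h_main = torch.randn(batch, T, dim)         # stand-in for depth k-1 hidden states

h_prev = h_main[:, :-k]                     # positions i = 0 .. T-k-1
emb_next = embedding(input_ids[:, k:])      # embeddings of tokens i+k
fused = MTPFusion(dim)(h_prev, emb_next)    # (batch, T - k, dim)
```

Dropping the last k hidden states and the first k embeddings is what keeps position i paired with token i+k.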

For a given prediction depth k:

Input length for MTP is mtp_seq_len = T‑k.

During training a mask hides padded positions.

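The loss for depth k is the cross‑entropy between the MTP logits at position i and the ground‑truth token i+k+1. If targets already holds next‑token labels (targets[i] is token i+1), as the slices below imply, that token sits at targets[i+k].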

When the depth is fixed to 1 (the setting used in the repository), the input and target slices are:

# input to MTP (skip first token)
mtp_input_ids = input_ids[:, 1:]
# target for MTP (skip first token)
mtp_targets = targets[:, 1:]
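
To see how these slices line up, here is a small sketch on a toy sequence. The helper mtp_slices and its generalization to an arbitrary depth k are assumptions extrapolated from the depth-1 slices above; targets is assumed to be the token sequence shifted left by one.

```python
import torch

def mtp_slices(input_ids: torch.Tensor, targets: torch.Tensor, k: int):
    """Hypothetical helper: (mtp_input_ids, mtp_targets) for depth k >= 1."""
    return input_ids[:, k:], targets[:, k:]

tokens = torch.arange(10).unsqueeze(0)  # one toy sequence: t_0 .. t_9
input_ids = tokens[:, :-1]              # t_0 .. t_8, so T = 9
targets = tokens[:, 1:]                 # t_1 .. t_9 (next-token labels)

mtp_input_ids, mtp_targets = mtp_slices(input_ids, targets, k=1)
print(mtp_input_ids[0].tolist())  # [1, 2, 3, 4, 5, 6, 7, 8]
print(mtp_targets[0].tolist())    # [2, 3, 4, 5, 6, 7, 8, 9]
```

At slot i the MTP module sees the embedding of token i+1 and is trained to predict token i+2, one step beyond the main model's target.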

Model Assembly

The full model combines the main transformer stack, the MTP modules and the loss computation:

Run each transformer block, collecting sequence‑level auxiliary loss and expert load statistics.

Store the hidden state h_for_mtp for MTP.

Apply the final RMSNorm and the output head to obtain the main logits, then compute the main next‑token loss.

For each MTP module (training only) compute logits, apply the mask, and calculate cross‑entropy loss.

Aggregate total loss as loss = main_loss + Σ auxiliary_loss + Σ mtp_loss.

The forward method returns logits, the total loss, and optional monitoring data (expert load, auxiliary losses).
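
The steps above can be sketched end to end with a toy model. Everything here is a simplified stand-in rather than the repository's implementation: ToyBlock replaces the MLA/MoE block and returns a dummy auxiliary loss of zero, nn.LayerNorm replaces RMSNorm, the depth-1 fusion skips the input norms, expert load statistics are omitted, and pad_id is an assumed padding token id.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBlock(nn.Module):
    """Stand-in block: returns hidden states plus a dummy auxiliary loss."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.ff(x), x.new_zeros(())  # (hidden, aux_loss)

class ToyModel(nn.Module):
    def __init__(self, vocab, dim, n_layers=2, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList([ToyBlock(dim) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(dim)                  # stand-in for RMSNorm
        self.head = nn.Linear(dim, vocab, bias=False)  # shared output head
        self.mtp_proj = nn.Linear(2 * dim, dim)        # depth-1 fusion
        self.mtp_block = ToyBlock(dim)

    def _ce(self, logits, targets):
        # Padded target positions are masked out through ignore_index
        return F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                               ignore_index=self.pad_id)

    def forward(self, input_ids, targets=None):
        h = self.embed(input_ids)
        aux_losses = []
        for block in self.blocks:                      # run blocks, collect aux losses
            h, aux = block(h)
            aux_losses.append(aux)
        h_for_mtp = h                                  # keep hidden state for MTP
        logits = self.head(self.norm(h))               # main logits
        if targets is None:
            return logits, None, {}
        main_loss = self._ce(logits, targets)
        mtp_loss = h.new_zeros(())
        if self.training:                              # MTP runs in training only
            emb_next = self.embed(input_ids[:, 1:])
            fused = self.mtp_proj(torch.cat([h_for_mtp[:, :-1], emb_next], dim=-1))
            h_mtp, _ = self.mtp_block(fused)
            mtp_loss = self._ce(self.head(self.norm(h_mtp)), targets[:, 1:])
        loss = main_loss + sum(aux_losses) + mtp_loss  # aggregate
        return logits, loss, {"aux_losses": aux_losses, "mtp_loss": mtp_loss}

torch.manual_seed(0)
model = ToyModel(vocab=50, dim=16).train()
tokens = torch.randint(1, 50, (2, 9))                  # no padding in this toy batch
logits, loss, stats = model(tokens[:, :-1], tokens[:, 1:])
loss.backward()
```

Calling model.eval() before the forward pass skips the MTP branch, so the returned loss reduces to the main loss plus the auxiliary terms.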

Key Implementation Details

The original paper omitted self.output_norm after each MTP block; it is added to keep the implementation consistent with the source code.

MTP uses a simple MLP for the feed‑forward part; the MoE variant is not required for the fixed depth = 1 setting.

MTP modules are active only during training; the mask applied to their logits hides padded positions.

When seqlen = 1 (inference with KV‑Cache) causal masking is unnecessary; otherwise a causal mask is applied.
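
The masking rule can be sketched as below. The function name build_causal_mask and the additive -inf convention are assumptions; the repository may build its mask differently.

```python
import torch

def build_causal_mask(seqlen: int, device=None):
    """Return an additive (seqlen, seqlen) mask, or None when seqlen == 1."""
    if seqlen == 1:
        return None  # a single query may attend to every cached key
    mask = torch.full((seqlen, seqlen), float("-inf"), device=device)
    # Keep -inf strictly above the diagonal; everything else becomes 0
    return torch.triu(mask, diagonal=1)

mask = build_causal_mask(4)
scores = torch.zeros(4, 4) + mask        # uniform scores plus the mask
weights = torch.softmax(scores, dim=-1)  # row i spreads weight over keys 0..i
```

Adding the mask before the softmax drives the weights of future positions to exactly zero, so row i attends only to positions 0 through i.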

Loss Formulation

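For each depth k the loss is

loss_k = CrossEntropyLoss(logits_k, targets[:, k:])

where logits_k are produced by the k‑th MTP module and have length T‑k. Because targets holds next‑token labels, targets[:, k:] has the same length and matches the depth‑1 slice targets[:, 1:] shown earlier. The total loss combines the main next‑token loss (k = 0) with all MTP losses.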
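
A minimal sketch of one per-depth loss with padded positions masked out. The helper depth_loss, the pad_id value of 0, and the use of ignore_index as the masking mechanism are assumptions; the function takes logits and targets that are already sliced to the same length.

```python
import torch
import torch.nn.functional as F

def depth_loss(logits_k: torch.Tensor, targets_k: torch.Tensor, pad_id: int = 0):
    """Cross-entropy for one depth; positions whose target is pad_id are ignored."""
    # logits_k: (batch, L, vocab), targets_k: (batch, L)
    return F.cross_entropy(
        logits_k.reshape(-1, logits_k.size(-1)),
        targets_k.reshape(-1),
        ignore_index=pad_id,
    )

torch.manual_seed(0)
vocab = 11
logits_k = torch.randn(2, 6, vocab)
targets_k = torch.randint(1, vocab, (2, 6))
targets_k[1, 4:] = 0                     # pad the tail of the second sequence

loss_k = depth_loss(logits_k, targets_k)

# Cross-check: average the per-token losses over non-pad positions only
per_token = F.cross_entropy(
    logits_k.reshape(-1, vocab), targets_k.reshape(-1), reduction="none"
)
keep = targets_k.reshape(-1) != 0
manual = per_token[keep].mean()
```

Both routes give the same value, which shows that ignore_index averages over real tokens only and not over the full padded length.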

Repository

Full training scripts, model checkpoints and the reference implementation are available at:

https://github.com/WKQ9411/Mini-LLM

References

DeepSeek‑V3 technical report: http://arxiv.org/abs/2412.19437

DeepSeek‑V2: http://arxiv.org/abs/2405.04434

DeepSeekMoE: http://arxiv.org/abs/2401.06066

Auxiliary‑loss‑free load balancing: http://arxiv.org/abs/2408.15664

Elegant multi‑head self‑attention with einsum: https://www.cnblogs.com/qftie/p/16245124.html

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
