Evolution of DeepSeek Mixture‑of‑Experts (MoE) Architecture from V1 to V3
This article reviews the development of DeepSeek's Mixture-of-Experts (MoE) sparse models, tracing their evolution from the original DeepSeekMoE V1 through V2 to V3 with reference to the underlying papers and source code. It details architectural innovations such as fine-grained expert segmentation, shared-expert isolation, load-balancing losses, device-limited routing, and the shift from softmax to sigmoid gating.
MoE originated with the 1991 paper "Adaptive Mixtures of Local Experts" and was later scaled by Google in the 2020 GShard work, which introduced a Transformer‑based MoE layer that replaces the FFN. An MoE layer consists of three parts: an expert network (a feed‑forward sub‑network), a gating network (producing expert weights), and a selector that chooses top‑k experts per token.
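The three parts described above can be sketched in a few lines of PyTorch. This is an illustrative toy layer, not DeepSeek's or GShard's actual code; all names and dimensions are invented for the example:

```python
import torch
import torch.nn as nn

class SimpleMoELayer(nn.Module):
    """Toy MoE layer: a gating network scores experts, a top-k selector
    picks experts per token, and the chosen expert FFN outputs are
    combined using the gate weights."""
    def __init__(self, d_model, d_ff, n_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(                           # expert FFNs
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)   # expert weights per token
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # dispatch each token to its selected experts, weighted by the gate
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_w[mask, slot, None] * self.experts[e](x[mask])
        return out
```

Real systems replace the dispatch loop with batched scatter/gather and all-to-all communication, but the structure (gate, selector, experts) is the same.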
Because naive token‑to‑expert routing can lead to highly imbalanced workloads, auxiliary load‑balancing losses are added. These losses encourage a uniform distribution of tokens across experts by penalising deviations from the ideal token‑per‑expert ratio.
DeepSeek V1 identified two problems in existing MoE models: knowledge mixing (limited experts handling diverse knowledge) and knowledge redundancy (different experts learning overlapping information). To address these, DeepSeek introduced fine‑grained expert segmentation (splitting the hidden dimension to create more, smaller experts) and shared‑expert isolation (designating a set of always‑active shared experts for common knowledge). V1 also added expert‑level and device‑level auxiliary losses for load balancing.
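The two V1 ideas can be sketched together: many narrow routed experts (fine-grained segmentation) plus a set of always-active shared experts. This is a simplified illustration with invented names, not DeepSeek's implementation:

```python
import torch
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Sketch of fine-grained segmentation (each expert gets only a
    fraction of the standard FFN width, so there are more, smaller
    experts) plus shared-expert isolation (shared experts process
    every token, capturing common knowledge)."""
    def __init__(self, d_model, d_ff, n_routed, n_shared, top_k, segments=4):
        super().__init__()
        d_small = d_ff // segments   # fine-grained: 1/segments of full FFN width
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_small), nn.GELU(), nn.Linear(d_small, d_model))
        self.routed = nn.ModuleList(make_ffn() for _ in range(n_routed))
        self.shared = nn.ModuleList(make_ffn() for _ in range(n_shared))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        out = sum(e(x) for e in self.shared)       # shared experts: every token
        w, idx = self.gate(x).softmax(-1).topk(self.top_k, dim=-1)
        for slot in range(self.top_k):             # routed experts: top-k per token
            for e_id, expert in enumerate(self.routed):
                mask = idx[:, slot] == e_id
                if mask.any():
                    out[mask] += w[mask, slot, None] * expert(x[mask])
        return out
```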
DeepSeek V2 focused on communication efficiency. It added a device‑limited routing mechanism that restricts the number of devices a token's activated experts can span, reducing inter‑device traffic. A communication load‑balancing loss further equalises the amount of data each device receives. Additionally, a token‑dropping strategy discards excess tokens on overloaded devices while keeping the residual connection, improving compute balance without harming inference.
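A rough sketch of the device-limited routing idea, assuming experts are laid out contiguously by device: first keep the M devices with the highest affinity, then run top-k only over experts on those devices. This is my interpretation of the mechanism, not the paper's exact algorithm:

```python
import torch

def device_limited_topk(scores, experts_per_device, m_devices, top_k):
    """Restrict each token's top-k expert choice to at most m_devices
    devices (illustrative sketch; assumes expert i lives on device
    i // experts_per_device)."""
    n_tokens, n_experts = scores.shape
    n_dev = n_experts // experts_per_device
    per_dev = scores.view(n_tokens, n_dev, experts_per_device)
    # device affinity = best expert score on that device
    dev_affinity = per_dev.max(dim=-1).values
    top_dev = dev_affinity.topk(m_devices, dim=-1).indices
    allowed = torch.zeros(n_tokens, n_dev, dtype=torch.bool)
    allowed.scatter_(1, top_dev, True)             # mark the chosen devices
    # mask out experts on all other devices, then take the usual top-k
    expert_mask = allowed.repeat_interleave(experts_per_device, dim=1)
    masked = scores.masked_fill(~expert_mask, float("-inf"))
    return masked.topk(top_k, dim=-1)
```

Capping the device count bounds the all-to-all fan-out per token, which is what reduces inter-device traffic.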
DeepSeek V3 retained the fine‑grained and shared‑expert designs but replaced the softmax gating with a sigmoid function to better handle the larger routing‑expert count (256 vs. 160 in V2). It removed most auxiliary losses, instead introducing a learnable bias b_i for each expert; the bias is decreased for overloaded experts and increased for under‑utilised ones, achieving dynamic load balancing. V3 also adds a sequence‑wise auxiliary loss that balances token distribution at the per‑sequence level, and eliminates token dropping.
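The V3-style gating change can be sketched as follows. Note this is a simplified illustration under my reading of the technical report: the bias influences only which experts are selected, the gate weights come from the raw sigmoid affinities, and the bias is nudged outside of gradient descent:

```python
import torch
import torch.nn as nn

class V3StyleGate(nn.Module):
    """Sketch of sigmoid gating with bias-based load balancing:
    sigmoid affinities replace softmax, a non-learned bias is added
    only for top-k selection, and the bias is adjusted toward a
    uniform expert load instead of using an auxiliary loss."""
    def __init__(self, d_model, n_experts, top_k, bias_update_speed=0.001):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_experts, d_model) * 0.02)
        self.register_buffer("bias", torch.zeros(n_experts))
        self.top_k = top_k
        self.gamma = bias_update_speed

    def forward(self, x):                                  # x: (tokens, d_model)
        affinity = torch.sigmoid(x @ self.weight.t())      # sigmoid, not softmax
        _, idx = (affinity + self.bias).topk(self.top_k, dim=-1)  # bias: selection only
        w = affinity.gather(-1, idx)
        w = w / w.sum(dim=-1, keepdim=True)                # normalize chosen gates
        return w, idx

    @torch.no_grad()
    def update_bias(self, idx):
        # decrease bias for overloaded experts, increase for under-used ones
        load = torch.bincount(idx.flatten(), minlength=self.bias.numel()).float()
        self.bias += self.gamma * torch.sign(load.mean() - load)
```

Because sigmoid scores each expert independently, adding experts (256 in V3) does not flatten every score the way a wider softmax would.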
The following PyTorch snippet, adapted from the open-source DeepSeekMoE code, illustrates the V1 MoEGate: per-batch token flattening and the computation of the expert-level auxiliary loss. Imports and a minimal constructor are added here so the snippet is self-contained:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEGate(nn.Module):
    def __init__(self, hidden_size, n_routed_experts, top_k, alpha):
        super().__init__()
        self.n_routed_experts = n_routed_experts
        self.top_k = top_k
        self.alpha = alpha  # weight of the auxiliary load-balancing loss
        self.weight = nn.Parameter(torch.empty(n_routed_experts, hidden_size))
        nn.init.kaiming_uniform_(self.weight)

    def forward(self, hidden_states):
        bsz, seq_len, h = hidden_states.shape
        # flatten to (bsz * seq_len, h): one row per token in the batch
        hidden_states = hidden_states.view(-1, h)
        logits = F.linear(hidden_states, self.weight, None)
        scores_for_aux = logits.softmax(dim=-1)
        topk_weight, topk_idx = torch.topk(scores_for_aux, k=self.top_k, dim=-1, sorted=False)
        topk_idx_for_aux_loss = topk_idx.view(bsz, -1)
        # f_i: fraction of tokens routed to expert i (scaled by expert count);
        # P_i: mean gate probability assigned to expert i over the batch
        mask_ce = F.one_hot(topk_idx_for_aux_loss.view(-1), num_classes=self.n_routed_experts)
        ce = mask_ce.float().mean(0)
        Pi = scores_for_aux.mean(0)
        fi = ce * self.n_routed_experts
        aux_loss = (Pi * fi).sum() * self.alpha
        return topk_weight, topk_idx, aux_loss

In summary, DeepSeek's MoE evolution introduced (1) fine-grained and shared experts for better specialization, (2) multiple load-balancing mechanisms (expert-level, device-level, communication, and sequence-wise losses) and token dropping to keep compute and communication balanced, and (3) a gating redesign from softmax to sigmoid with dynamic bias-based load balancing in V3.
References: [1] Adaptive Mixtures of Local Experts (1991). [2] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020). [5] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture‑of‑Experts Language Models (2024). [6] DeepSeek‑V2: A Strong, Economical, and Efficient Mixture‑of‑Experts Language Model (2024). [7] DeepSeek‑V3 Technical Report (2024).