Why Do MOE Experts Collapse? An In‑Depth Look at HOME’s Multi‑Task Architecture

This article analyzes the polarization issues in industrial Mixture‑of‑Experts (MoE) frameworks, explains expert collapse, degradation, and under‑fitting, and details the HOME model’s input types, architectural innovations, normalization, gating mechanisms, and related DICE‑BN insights.

Polarization Phenomena in MoE

In industry, the most widely used multi‑task framework is the Mixture‑of‑Experts (MoE) paradigm, which introduces shared and task‑specific experts and uses a gating network to weight their contributions. Three main polarization problems are observed in practice:

Expert collapse (all experts): Expert output distributions differ significantly in mean and variance, and many experts have ReLU zero‑activation rates exceeding 90%, making it hard for the gate to allocate weights fairly (a small diagnostic sketch follows this list).

Expert degradation (shared experts): Some shared experts are dominated by a single task, losing their ability to serve multiple tasks.

Expert under‑fitting (task‑specific experts): Sparse tasks often ignore their dedicated experts, assigning larger weights to shared experts, because shared experts receive richer gradient signals from dense tasks.
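
As a quick way to spot this kind of collapse in practice, the sketch below (an illustrative diagnostic, not part of the HOME paper) computes the fraction of post‑ReLU zeros per expert; the tensor shapes are assumptions.

import tensorflow as tf

def relu_zero_rate(expert_outputs):
    # expert_outputs: [batch, num_experts, dim] post-ReLU activations.
    # Returns the fraction of exact zeros per expert, shape [num_experts];
    # a collapsed expert shows a rate close to 1.0.
    is_zero = tf.cast(tf.equal(expert_outputs, 0.0), tf.float32)
    return tf.reduce_mean(is_zero, axis=[0, 2])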

Model Input

The HOME model’s features are roughly divided into four categories:

ID and categorical features obtained via simple lookups (e.g., user ID, item ID, tag ID, activity flags, author follow status, scene ID).

Statistical features that are bucketed and assigned IDs (e.g., short‑video view counts and durations over the past month).

Sequential features reflecting short‑ and long‑term user interests, modeled with one‑ or two‑stage attention mechanisms such as DIN, DIEN, SIM, and TWIN.

Pre‑trained multimodal embeddings (text, ASR, video, etc.).

Model Architecture

MOE Series Review

The evolution of expert‑based multi‑task modeling for tasks without explicit dependencies includes:

Shared‑bottom architecture where multiple tasks share the same MLP.

Mixture‑of‑Experts (MoE/MMoE), where tasks share a pool of experts and separate per‑task gates distinguish the tasks.

ML‑MMoE, which stacks multiple MMoE layers to boost expert learning capacity.

CGC (Customized Gate Control), which introduces task‑specific experts alongside shared ones.

PLE (Progressive Layered Extraction), which employs both task‑specific and shared experts at multiple levels, from the embedding layer up through the stacked MLPs.

HOME Network Structure

Expert Normalization & Swish Mechanism Experts (MLP_E) exhibit significant mean and variance differences. Applying Batch Normalization aligns their distributions toward a standard normal shape, reducing the proportion of values that become zero after ReLU. To avoid activation saturation, the Swish function (x·sigmoid(x)) replaces ReLU, yielding more balanced expert outputs.

Hierarchy Mask Mechanism

Tasks are grouped into two categories: active interaction tasks (e.g., likes, comments) and passive watch‑time tasks (e.g., effective duration). Grouping amplifies the differences in expert values between the two groups.
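
The source does not spell out the mask's exact form; one plausible reading, sketched below with an assumed expert‑to‑group assignment, is a binary mask that restricts each task group's gate to its own experts plus the shared ones:

import tensorflow as tf

# Hypothetical grouping: experts 0-1 serve interaction tasks, experts 2-3
# serve watch-time tasks, expert 4 is shared by both groups.
GROUP_MASKS = {
    'interaction': tf.constant([1., 1., 0., 0., 1.]),
    'watch_time':  tf.constant([0., 0., 1., 1., 1.]),
}

def masked_gate(logits, task_group):
    # logits: [batch, num_experts] raw gate scores for one task.
    # Experts outside the task's group get a large negative logit,
    # so the softmax assigns them (near-)zero weight.
    mask = GROUP_MASKS[task_group]               # [num_experts]
    masked_logits = logits + (1.0 - mask) * -1e9
    return tf.nn.softmax(masked_logits, axis=-1)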

Feature‑gate & Self‑gate Mechanisms Sparse tasks receive low gate weights, causing their specific experts to be ignored. Two gate mechanisms are introduced to ensure adequate gradient flow:

Feature‑gate: Computes an importance weight for each element of the input feature vector, producing a task‑specific feature representation. Inspired by LoRA in LLMs, two small low‑rank matrices approximate one large projection matrix, and the gate output is scaled so that its mean is 1.

Self‑gate: Differentiates the K tasks by giving each task its own gate over its specific experts, providing task‑specific gating.
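
A rough sketch of such a low‑rank feature‑gate, assuming two small Dense layers in place of one large projection and 2·sigmoid(·) as the scaling that keeps the gate's mean near 1 (the exact scaling is an assumption, not confirmed by the source):

import tensorflow as tf

class LowRankFeatureGate(tf.keras.layers.Layer):
    # Per-task feature gate: two small matrices (rank r) replace one large
    # feature_dim x feature_dim projection, LoRA-style.
    def __init__(self, feature_dim, rank=16):
        super().__init__()
        self.down = tf.keras.layers.Dense(rank, activation='relu')
        self.up = tf.keras.layers.Dense(feature_dim)

    def call(self, x):
        # 2 * sigmoid keeps gate values in (0, 2) with mean around 1,
        # so the gated features stay on the original scale.
        gate = 2.0 * tf.sigmoid(self.up(self.down(x)))
        return gate * x  # element-wise importance per feature dimension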

Summary

The HOME design improves upon PLE in two key ways:

It normalizes expert network outputs, allowing shared and task‑specific experts to contribute on a comparable scale.

It adds per‑task feature‑gate and self‑gate components, ensuring balanced expert contributions.

Appendix

DICE and Batch Normalization Relationship

DICE, the activation used in DIN, adapts its rectification point to the data distribution, preventing activations from clustering at 0 or 1. Its formula is essentially a combination of BN and a sigmoid.
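
Concretely, for an input x the standard DICE formulation (which the code below implements) is f(x) = p(x)·x + (1 − p(x))·α·x, where p(x) = sigmoid(BN(x)) and α is a learned parameter.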

import tensorflow as tf

class Dice(tf.keras.layers.Layer):
    # DICE activation: f(x) = p(x)*x + (1 - p(x))*alpha*x, with p(x) = sigmoid(BN(x)).
    def __init__(self):
        super(Dice, self).__init__()
        # BN without its own scale/shift: it only standardizes the input so
        # the sigmoid's switch point follows the data distribution.
        self.bn = tf.keras.layers.BatchNormalization(
            center=False,
            scale=False,
            epsilon=1e-9,
            axis=-1)
        # Learned scale for the non-activated branch (scalar here; some
        # implementations use one alpha per channel).
        self.alpha = self.add_weight(name='dice_alpha', initializer='zeros', trainable=True)

    def call(self, x, training=None, **kwargs):
        bnx = self.bn(x, training=training)
        px = tf.sigmoid(bnx)
        return px * x + (1 - px) * self.alpha * x

HOME and BN Connection

HOME likely borrows the DICE idea by applying BN to each expert before the gating network, preventing expert outputs from collapsing to 0 or 1 and enabling the gate to allocate reasonable weights. Directly replacing expert activations with DICE would only balance intra‑expert distributions, not inter‑expert ones; thus, placing BN before the gate remains the most elegant solution.
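
To make the "BN before the gate" idea concrete, here is a minimal sketch; the expert/gate wiring, dimensions, and softmax gate are assumptions for illustration and may differ from HOME's exact design:

import tensorflow as tf

class NormalizedMoELayer(tf.keras.layers.Layer):
    # Batch-normalizes every expert's output before the softmax gate,
    # so no single expert dominates the mixture by sheer magnitude.
    def __init__(self, num_experts=4, expert_dim=64):
        super().__init__()
        self.experts = [tf.keras.layers.Dense(expert_dim, activation='swish')
                        for _ in range(num_experts)]
        self.expert_bns = [tf.keras.layers.BatchNormalization()
                           for _ in range(num_experts)]
        self.gate = tf.keras.layers.Dense(num_experts, activation='softmax')

    def call(self, x, training=False):
        outs = [bn(expert(x), training=training)        # BN per expert
                for expert, bn in zip(self.experts, self.expert_bns)]
        stacked = tf.stack(outs, axis=1)                 # [batch, E, dim]
        weights = self.gate(x)[:, :, tf.newaxis]         # [batch, E, 1]
        return tf.reduce_sum(weights * stacked, axis=1)  # weighted mixture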
