Identity Constraint Beats DeepSeek mHC After 150B Tokens: A Surprising Reversal

Extensive experiments on Qwen-3 1.7B and 8B dense models reveal that replacing the manifold constraint of DeepSeek's mHC (manifold hyper-connection) with a simple identity matrix consistently yields better results than the original mHC, keeps signal flow stable, and avoids the collapse caused by repeated Sinkhorn-Knopp projections.

Several weeks of studying DeepSeek's mHC (manifold hyper-connection) led to an unexpected experimental conclusion: the paper's core algorithmic improvement, applying a manifold constraint that forces the mixing matrix to be doubly-stochastic via Sinkhorn-Knopp, may not be necessary at all.

Key observation: directly using a fixed identity matrix (i.e., no learned cross-stream mixing at all) yields noticeably better results than the original mHC, mHC-lite, or orthogonal variants such as Cayley.

Experiments were conducted on Qwen-3 1.7B and 8B dense models trained for 150B tokens, with the setup controlled to eliminate the influence of a known Megatron coding bug. The variants rank as follows:

Identity HC > mHC > mHC-lite > mHC-orthogonal (e.g., Cayley)

The identity matrix has ones on the diagonal and zeros elsewhere, every row and column sums to 1, and its spectral norm is 1, making it the simplest possible “manifold constraint”. Intuitively, this means each residual stream preserves its own information without mixing with others.
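As a trivial check (a quick NumPy one-off, not from the original article), the identity already satisfies every property the Sinkhorn projection is meant to enforce:

    import numpy as np

    I = np.eye(4)                          # 4 residual streams
    print(I.sum(axis=1), I.sum(axis=0))    # row and column sums are all exactly 1
    print(np.linalg.norm(I, 2))            # spectral norm (largest singular value) is exactly 1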

In practice, the original mHC learns patterns such as:

Single‑layer behavior: matrices close to the identity (diagonal ≈ 0.96, off‑diagonal ≈ 0.01).

Accumulated product: collapses to a uniform 0.25 matrix after multiple layers.

Mathematically, when a doubly‑stochastic matrix satisfies a uniform positivity condition, its Dobrushin contraction coefficient decays geometrically, forcing all rows to converge to the same distribution and ultimately yielding a uniform matrix. Strict proofs rely on the multiplicativity of the Dobrushin coefficient; pure permutation or reducible matrices break this condition, but Sinkhorn outputs are strictly positive and thus satisfy it.
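A minimal NumPy sketch of this collapse (illustrative only: synthetic near-identity matrices with the magnitudes quoted above, not trained mHC weights):

    import numpy as np

    n, n_modules = 4, 56
    rng = np.random.default_rng(0)

    def sinkhorn(M, iters=20):
        # alternating row/column normalization toward double stochasticity
        for _ in range(iters):
            M = M / M.sum(axis=1, keepdims=True)
            M = M / M.sum(axis=0, keepdims=True)
        return M

    P = np.eye(n)
    for _ in range(n_modules):
        # near-identity (diag ~0.96, off-diag ~0.01), strictly positive, doubly-stochastic
        H = sinkhorn(np.eye(n) * 0.95 + 0.01 + 0.01 * rng.random((n, n)))
        P = H @ P

    print(np.round(P, 2))   # every entry drifts toward the uniform value 1/n = 0.25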

By fixing the learned single‑layer behavior to the identity, we retain the desirable per‑stream residual while eliminating the rank‑1 uniform collapse caused by repeated doubly‑stochastic multiplication.

A remaining open question is whether degenerating to a permutation matrix brings benefit or harm. Empirically, different layers of mHC learn distinct approximate permutation matrices, so streams get reordered across depth (e.g., stream 1 becomes stream 3 after layer 1, then stream 2 after layer 5). This reordering forces the model to track where each stream currently lives, increasing learning difficulty.

Identity offers three clear advantages:

Stream 0 always stays at position 0, stream 1 at position 1—semantic consistency across depth.

No need to adapt to stream reordering; the model simply learns “which stream to read from and write to”.

The accumulated product neither collapses nor becomes chaotic.

Standard residual connections in Transformers add an identity mapping to the transformed output (a design inherited from ResNet). Hyper-Connections (HC) widen the residual flow from one stream to multiple parallel streams (four in the experiments described here), with three learnable mappings handling different aspects of the flow: cross-stream mixing carried across depth, plus the read and write weights that appear in the pseudocode below.
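A minimal shape-level sketch (hypothetical names and dimensions, four streams) of the unconstrained HC update that the mHC constraint later acts on, with the three learnable pieces spelled out:

    import torch

    s, b, C, n = 16, 2, 1024, 4              # sequence, batch, hidden size, residual streams
    x = torch.randn(s, b, n, C)              # n parallel residual streams

    H = torch.randn(n, n)                    # (1) cross-stream mixing across depth; mHC constrains this matrix
    h_pre = torch.rand(n)                    # (2) read weights: how the sublayer aggregates the streams
    h_post = torch.rand(n)                   # (3) write weights: how the output is written back to each stream
    f = torch.nn.Linear(C, C)                # stand-in for Attention / MLP

    aggregated = (h_pre.view(n, 1) * x).sum(dim=2)                                     # [s, b, C]
    x_next = torch.einsum('ij,sbjc->sbic', H, x) + h_post.view(n, 1) * f(aggregated).unsqueeze(2)
    print(x_next.shape)                      # torch.Size([16, 2, 4, 1024])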

DeepSeek’s core innovation constrains the residual-mixing matrix H to the doubly-stochastic manifold, which bounds its norm: multiplying H across many layers cannot cause signal explosion, though double stochasticity does not guarantee signal preservation. In our experiments, mHC’s spectral norm indeed stays near 1.
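A minimal sketch of that projection (standard alternating row/column normalization is assumed here, not DeepSeek's actual kernel), showing near-unit row/column sums and a spectral norm close to 1:

    import torch

    def sinkhorn_knopp(W: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
        # approximately project a matrix onto the doubly-stochastic manifold
        M = W.exp()                              # enforce strict positivity
        for _ in range(n_iters):
            M = M / M.sum(dim=-1, keepdim=True)  # normalize rows
            M = M / M.sum(dim=-2, keepdim=True)  # normalize columns
        return M

    H = sinkhorn_knopp(torch.randn(4, 4))
    print(H.sum(dim=-1), H.sum(dim=-2))          # row/column sums: all close to 1 after 20 iterations
    print(torch.linalg.svdvals(H)[0])            # spectral norm stays near 1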

The original mHC paper does a good job of presenting the HC formulation clearly and of changing the gating activation from tanh to sigmoid, which restricts the mixing weights to be non-negative.

Input: x_l = [s, b, n*C]  (4 streams flattened)
Step 1: φ projection (cross‑stream fusion occurs here)
   x̂ = φ → [s, b, 2n]  (identity mode projects only 2n=8 dimensions)
   φ receives information from all 4 streams
Step 2: Activation
   h_pre = sigmoid(α_pre · proj[:n] + b[:n])   ← dynamic aggregation weights
   h_post = 2·sigmoid(α_post · proj[n:2n] + b[n:2n])   ← dynamic expansion weights
Step 3: Aggregation (explicit cross‑stream fusion)
   aggregated = Σ h_pre_i · x_stream_i   ← [s, b, C]
Step 4: Transformation
   output = f(aggregated)   ← Attention or MLP
Step 5: Identity residual + dynamic write‑back
   x_{l+1} = I · x_l + diag(h_post) · f(...)
            = x_l + diag(h_post) · f(...)   ← each stream keeps its residual independently
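A runnable PyTorch sketch of the identity-HC forward pass above (module and dimension names are illustrative, and the per-channel scales α_pre, α_post and biases are folded into the linear projection for brevity):

    import torch
    import torch.nn as nn

    class IdentityHC(nn.Module):
        # Identity hyper-connection: per-stream residual plus dynamic read/write weights
        def __init__(self, dim: int, n_streams: int = 4):
            super().__init__()
            self.n = n_streams
            self.phi = nn.Linear(n_streams * dim, 2 * n_streams)   # Step 1: cross-stream projection
            self.f = nn.Linear(dim, dim)                            # stand-in for Attention / MLP

        def forward(self, x):                                       # x: [s, b, n, C]
            s, b, n, C = x.shape
            proj = self.phi(x.reshape(s, b, n * C))                 # [s, b, 2n]
            h_pre = torch.sigmoid(proj[..., :n])                    # Step 2: dynamic aggregation weights
            h_post = 2 * torch.sigmoid(proj[..., n:])               #         dynamic expansion weights
            aggregated = (h_pre.unsqueeze(-1) * x).sum(dim=2)       # Step 3: [s, b, C]
            out = self.f(aggregated)                                # Step 4: transformation
            return x + h_post.unsqueeze(-1) * out.unsqueeze(2)      # Step 5: identity residual + write-back

    x = torch.randn(16, 2, 4, 64)
    print(IdentityHC(64)(x).shape)    # torch.Size([16, 2, 4, 64])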

Empirically, the doubly-stochastic constraint provides little benefit and can be harmful: the product of doubly-stochastic matrices inevitably collapses. By Perron-Frobenius theory, the largest eigenvalue is exactly 1 (with the all-ones vector as its eigenvector), while every other eigenvalue of a strictly positive doubly-stochastic matrix has modulus below 1, so in the accumulated product all directions except the mean decay and the smallest singular value approaches zero after many layers.

Measurements on Qwen‑3‑1.7B (28 layers, 56 HC modules) show the average minimal singular value of the Sinkhorn‑projected matrices is about 0.49, matching theoretical estimates. After 56 HC modules, shallow signals are likely to vanish except for the mean direction, which can be advantageous.
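As a rough back-of-the-envelope number (a heuristic that treats the measured per-module value as a typical multiplicative attenuation of the non-mean directions, not a rigorous bound):

    print(0.49 ** 56)   # ~4.5e-18: after 56 modules, shallow-layer components outside the mean direction are numerically gone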

Twenty Sinkhorn-Knopp iterations only approximate the projection onto the doubly-stochastic manifold and do not guarantee convergence; we observed a row-sum standard deviation of 0.12, an error that can accumulate across layers. In mHC-lite, roughly 27.9% of inputs have an extreme enough relative range of entries that column-sum errors reach up to 100% even after 20 iterations.

Several alternative constructions were also tried and all underperformed:

Convex combinations of mHC‑lite with exact doubly‑stochastic constraints using softmax weighting performed worse than the original mHC.

Increasing the initialization scale caused the softmax temperature to drop, making outputs near one‑hot and reducing mixing.

Orthogonalization (Cayley transform, Givens rotations) kept the spectral norm at 1 but introduced negative values, causing some streams to flip sign and collapse model capacity.
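For reference, a minimal sketch (illustrative, not the exact parameterization that was tried) of the Cayley construction: it keeps the spectral norm at exactly 1 but produces negative entries, which is what lets a stream flip sign:

    import torch

    def cayley(W: torch.Tensor) -> torch.Tensor:
        # map an unconstrained square matrix to an orthogonal one via the Cayley transform
        A = W - W.T                                # skew-symmetric part
        I = torch.eye(W.shape[0])
        return torch.linalg.solve(I + A, I - A)    # Q = (I + A)^{-1} (I - A) is orthogonal

    Q = cayley(torch.randn(4, 4))
    print(torch.linalg.svdvals(Q))   # all singular values equal 1: spectral norm preserved
    print((Q < 0).any())             # tensor(True): negative entries can flip a stream's sign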

Additional analysis visualizations (produced with Opus) illustrate channel strength between layers, showing a “mid‑layer radiation, deep‑layer aggregation” pattern: layers 20‑40 act as information hubs, while the final layers receive heavily but emit little.

Overall, discarding the “m” manifold constraint in DeepSeek’s mHC and using a simple identity matrix yields better performance, simplifies implementation (no need for custom Sinkhorn‑Knopp kernels), and maintains stable signal flow.

Tags: Transformer, DeepSeek, Identity, Sinkhorn, mHC, Hyper-Connection