DeepSeek Unveils Tile Kernels and DeepEP V2 – Is V4 on the Horizon?

DeepSeek recently opened the Tile Kernels repository and released DeepEP V2, detailing new GPU kernel features and a fully JIT-enabled expert-parallelism redesign that boosts peak performance by up to 1.3× while cutting SM usage by as much as four times, fueling speculation that a V4 release is on the horizon.

Machine Heart

Tile Kernels Release

DeepSeek’s GitHub now hosts the open‑source Tile Kernels library, built with the domain‑specific language TileLang for expressing high‑performance GPU kernels in Python. TileLang aims for easy migration, agile development, and automatic optimization.

The project’s README states that most kernels approach hardware limits in compute intensity and memory bandwidth, and some are already used in internal training and inference workloads, though they are not yet considered best practice.

Key Features of Tile Kernels

Gate mechanism for Top‑k expert selection and scoring in MoE routing.

MoE routing with token‑to‑expert mapping, fused expansion/reduction, and weight normalization (a minimal PyTorch sketch of this gating and routing logic follows the list).

Quantization supporting per‑token, per‑block, per‑channel FP8/FP4/E5M6 conversions, along with fused SwiGLU + quantization (see the per‑token FP8 sketch after the list).

Batch transpose operations.

Engram gating kernels that fuse RMSNorm, forward/backward passes, and weight‑gradient reduction.

Manifold HyperConnection kernels with Sinkhorn normalization and mix splitting.

Modeling layer that wraps low‑level kernels into trainable torch.autograd.Function components (engram gate, mHC pipeline); the autograd wrapper sketch after the list illustrates the pattern.
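
To make the gating and routing items above concrete, here is a minimal PyTorch sketch of the computation those kernels fuse: expert scoring, Top‑k selection with weight normalization, and the token‑to‑expert mapping used for expansion. This is a conceptual reference only; the function names, the sigmoid scoring, and the Top‑8 default are illustrative assumptions, not the actual Tile Kernels interface.

```python
import torch

def topk_gate(hidden, gate_weight, top_k=8):
    """Score experts and pick Top-k per token (illustrative only)."""
    # hidden: [num_tokens, hidden_size]; gate_weight: [num_experts, hidden_size]
    logits = hidden @ gate_weight.t()                   # [num_tokens, num_experts]
    scores = torch.sigmoid(logits)                      # scoring function is an assumption
    topk_scores, topk_ids = scores.topk(top_k, dim=-1)  # Top-k expert selection
    weights = topk_scores / topk_scores.sum(-1, keepdim=True)  # weight normalization
    return topk_ids, weights

def expand_by_expert(hidden, topk_ids, num_experts):
    """Token-to-expert mapping: duplicate tokens and group them per destination expert."""
    flat_ids = topk_ids.reshape(-1)                     # flattened (token, expert) pairs
    order = flat_ids.argsort()                          # sort by destination expert
    token_index = order // topk_ids.shape[1]            # source token for each slot
    expanded = hidden[token_index]                      # expansion: contiguous slice per expert
    tokens_per_expert = torch.bincount(flat_ids, minlength=num_experts)
    return expanded, order, tokens_per_expert

# Usage with made-up sizes
x = torch.randn(16, 7168)
w = torch.randn(256, 7168)
ids, wts = topk_gate(x, w)
expanded, order, counts = expand_by_expert(x, ids, num_experts=256)
```

In Tile Kernels these steps are fused into single GPU kernels; the sketch only spells out the logic they implement.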
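
The quantization item can be illustrated the same way with a simplified per‑token FP8 routine, using the torch.float8_e4m3fn dtype available in recent PyTorch releases. The scaling scheme and the clamp constant are assumptions for illustration; the per‑block, per‑channel, and FP4 paths are not covered.

```python
import torch

FP8_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def quantize_per_token_fp8(x):
    """Per-token FP8 quantization (simplified reference, not the fused kernel).

    x: [num_tokens, hidden_size] in BF16/FP32. Returns FP8 values plus one
    FP32 scale per token so the original range can be recovered.
    """
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)  # per-token max magnitude
    scale = amax / FP8_MAX                                     # map amax onto the FP8 range
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale.float()

def dequantize_fp8(x_fp8, scale):
    return x_fp8.to(torch.float32) * scale

# Round-trip check
x = torch.randn(8, 7168, dtype=torch.bfloat16)
q, s = quantize_per_token_fp8(x)
max_err = (dequantize_fp8(q, s) - x.float()).abs().max()
```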
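
Finally, the modeling‑layer item can be sketched with a plain torch.autograd.Function. Only the wrapping pattern is shown; the real modeling layer would call the fused Tile Kernels (engram gate, mHC pipeline) inside forward and backward, and the RMSNorm math below is a stand‑in written in plain PyTorch, not DeepSeek's implementation.

```python
import torch

class RMSNormFunction(torch.autograd.Function):
    """Wrapping pattern only: forward/backward would dispatch to fused kernels."""

    @staticmethod
    def forward(ctx, x, weight, eps):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
        ctx.save_for_backward(x, weight, rms)
        return x * rms * weight

    @staticmethod
    def backward(ctx, grad_out):
        x, weight, rms = ctx.saved_tensors
        x_hat = x * rms
        # weight-gradient reduction over all leading (token/batch) dimensions
        grad_w = (grad_out * x_hat).sum(dim=tuple(range(grad_out.dim() - 1)))
        g = grad_out * weight
        # gradient of x * rsqrt(mean(x^2) + eps) with respect to x
        grad_x = rms * (g - x_hat * (g * x_hat).mean(dim=-1, keepdim=True))
        return grad_x, grad_w, None  # no gradient for eps

# Usage
x = torch.randn(4, 7168, requires_grad=True)
w = torch.ones(7168, requires_grad=True)
y = RMSNormFunction.apply(x, w, 1e-6)
y.sum().backward()
```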

DeepEP V2 Enhancements

Earlier on the same day, DeepSeek announced DeepEP V2, a redesign of expert parallelism that delivers faster expert-parallel (EP) execution and adds support for Engram, pipeline parallelism (PP), and context parallelism (CP).

Compared with V1, V2 achieves up to 1.3× peak performance while using only a fraction of the SM resources (up to four‑fold reduction). The update also introduces experimental 0‑SM solutions for Engram, PP, and CP, leveraging RDMA and a lightweight NCCL‑Gin backend that is header‑only and can reuse existing NCCL communicators.

Additional DeepEP V2 features include:

Full JIT compilation.

NCCL‑Gin backend – header‑only, extremely lightweight, and compatible with existing NCCL.

Unified high‑throughput, low‑latency EP API with a new GEMM layout.

Support for scaling up to EP2048.

Analytical SM and QP count calculations that eliminate the need for auto‑tuning.

Continued support for hybrid and direct execution modes.

For legacy V3‑style training tasks, SM usage drops from 24 to 4‑6 while maintaining or improving performance.

0‑SM Engram (with RDMA), 0‑SM PP (with RDMA), and 0‑SM CP (with Copy Engine).

Performance Evaluation

DeepEP V2 was benchmarked using the DeepSeek‑V3 configuration (8K tokens per batch, hidden size 7168, Top‑8 experts, FP8 dispatch, and BF16 combine). The results show improved logical bandwidth, with the caveat that the 90 GB/s reported for the EP 8 × 2 setup includes intra‑GPU traffic.

Compared to V1, V2 delivers a 1.3× increase in peak performance and reduces SM resource consumption by up to four times.

Outlook

The post closes with a light‑hearted plea for DeepSeek to release V4 soon, reflecting the community's anticipation for the next generation.
