DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

DeepSeek has released TileKernels, a GPU kernel library written in the TileLang DSL that targets H100/H200/B200 GPUs and claims to approach hardware limits in compute intensity and memory bandwidth. It covers MoE routing, FP8/FP4 quantization, and paired PyTorch reference implementations for deep-learning engineers.

DeepSeek has quietly open-sourced TileKernels, a collection of GPU kernels specifically optimized for large-language-model (LLM) training and inference. According to DeepSeek, most kernels in the project approach the hardware ceiling for compute intensity and memory bandwidth, and many have already been used in internal training and inference pipelines.

Core Functionality

Gating – Top‑k expert selection and scoring in MoE routing.

MoE Routing – Token‑to‑expert mapping, fusion/expansion, reduction, and weight normalization.

Quantization – Per‑token, per‑block, per‑channel FP8/FP4/E5M6 quantization, with integrated SwiGLU.

Transpose – Batched matrix transpose.

Engram – RMSNorm‑based gated kernel that combines forward/backward passes and weight‑gradient reduction.

Manifold HyperConnection (mHC) – Kernel implementing Sinkhorn normalization and mixed split/apply operations.

Modeling – Wraps low‑level kernels as trainable layers using torch.autograd.Function (e.g., Engram Gate, mHC pipeline).
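
The wrapping described in the Modeling item is the standard torch.autograd.Function recipe. A minimal sketch of that pattern follows; fused_gate_fwd and fused_gate_bwd are hypothetical placeholders for the low-level kernels, not TileKernels' actual API:

import torch

class EngramGateFn(torch.autograd.Function):
    # Sketch of the wrapping pattern only; the kernel calls below are placeholders.
    @staticmethod
    def forward(ctx, x, weight):
        out, rstd = fused_gate_fwd(x, weight)    # hypothetical low-level forward kernel
        ctx.save_for_backward(x, weight, rstd)   # stash what backward will need
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x, weight, rstd = ctx.saved_tensors
        grad_x, grad_w = fused_gate_bwd(grad_out, x, weight, rstd)  # hypothetical backward kernel
        return grad_x, grad_w

A layer then calls EngramGateFn.apply(x, weight) inside its forward method, and autograd routes gradients through the custom backward.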

The listed components correspond closely to the MoE + FP8 training stack introduced in DeepSeek V3/R1, suggesting that TileKernels extracts the core pieces from DeepSeek’s internal pipeline rather than being a research prototype.
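
For orientation, the gating step these kernels accelerate reduces to only a few lines in reference PyTorch. The sketch below is illustrative (sigmoid scoring plus normalized top-k weights) and is not TileKernels' implementation:

import torch

def topk_gating_reference(logits: torch.Tensor, k: int):
    # logits: (num_tokens, num_experts) affinity scores from the router
    scores = torch.sigmoid(logits)                                  # per-expert affinity
    topk_scores, topk_idx = scores.topk(k, dim=-1)                  # pick k experts per token
    weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)   # normalize gate weights
    return weights, topk_idx

weights, experts = topk_gating_reference(torch.randn(4, 8), k=2)    # 4 tokens, 8 experts, top-2

The optimized kernels fuse this selection with scoring, token-to-expert mapping, and weight normalization instead of running them as separate passes.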

Why TileLang Instead of Direct CUDA?

TileKernels is written in TileLang (tile-ai/tilelang), a domain-specific language built on TVM for high-performance GPU kernels. TileLang aims to be more "Pythonic" than Triton while delivering performance comparable to hand-written CUDA, which lets DeepSeek express its kernels concisely and keeps the stack accessible to engineers who find raw CUDA daunting.
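
To give a feel for the style, here is a tiled GEMM adapted from TileLang's public examples. It is not a TileKernels kernel, and exact API details (e.g., T.Tensor vs. T.Buffer annotations) vary across TileLang versions:

import tilelang
import tilelang.language as T

def matmul(M, N, K, block_M=128, block_N=128, block_K=32, dtype="float16", accum_dtype="float"):
    @T.prim_func
    def main(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype),
    ):
        # One thread block per (block_M, block_N) output tile
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            # Software-pipelined loop over K: stage tiles in shared memory, then tensor-core GEMM
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])
    return main

kernel = tilelang.compile(matmul(4096, 4096, 4096), out_idx=[2])  # compile to a callable GPU kernel

Tile sizes, pipelining depth, and memory placement stay explicit, while thread indexing, synchronization, and tensor-core instruction selection are left to the compiler.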

Installation Requirements

The hard constraint is the GPU: a Hopper- or Blackwell-class data-center part (SM90 or SM100). The full list of prerequisites:

Python 3.10+

PyTorch 2.10+

TileLang 0.1.9+

NVIDIA SM90/SM100 GPU (H100/H200/B200)

CUDA Toolkit 13.1+

Two installation methods are provided:

# Local development (editable)
pip install -e ".[dev]"

# Pre‑built release
pip install tile-kernels

Note that the library does not support consumer-grade GPUs such as the RTX 4090, nor earlier data-center parts such as the A100 (SM80).

Testing and Benchmarking

DeepSeek supplies a pytest‑based test suite that can verify correctness or run performance benchmarks. Example commands:

# Verify correctness with 4 parallel workers
pytest tests/transpose/test_transpose.py -n 4

# Run correctness + performance benchmark
pytest tests/transpose/test_transpose.py --run-benchmark

# Full benchmark suite (TK_FULL_TEST=1 enables it; --count 2 runs two rounds)
TK_FULL_TEST=1 pytest -n 4 --count 2

The test framework combines correctness checks, benchmarks, and a stress-test mode, reflecting a high level of engineering completeness.

Project Layout

tile_kernels/
├── moe/          # MoE routing kernels
├── quant/        # FP8/FP4/E5M6 quantization
├── transpose/    # Batched transpose
├── engram/       # Engram gated kernel
├── mhc/          # Manifold HyperConnection kernel
├── modeling/     # Autograd wrappers (Engram, mHC)
├── torch/        # Pure PyTorch reference implementations
└── testing/      # Tests and benchmark utilities

Each kernel is paired with a pure PyTorch reference in the torch/ directory, allowing developers to compare the high‑performance implementation with an easy‑to‑understand baseline.
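
As an example of what such a baseline looks like, here is a pure-PyTorch sketch of per-token FP8 (E4M3) quantization. It is illustrative only, not TileKernels' code: each token (row) gets one scale so its largest magnitude maps to the E4M3 maximum of 448:

import torch

def per_token_fp8_quant_reference(x: torch.Tensor):
    # x: (num_tokens, hidden_dim) activations in fp32/bf16
    fp8_max = 448.0                                             # torch.finfo(torch.float8_e4m3fn).max
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)   # per-token absolute maximum
    scale = amax / fp8_max                                      # one scale per token
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)                 # quantize
    return x_fp8, scale                                         # dequantize later as x_fp8.float() * scale

x_fp8, scale = per_token_fp8_quant_reference(torch.randn(16, 4096))

Per the feature list above, the optimized kernels additionally integrate SwiGLU so quantization happens in the same pass as the activation function.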

Personal Assessment

Pros:

Real‑world code that has been used internally, achieving performance near hardware limits.

Dual‑language (TileLang + PyTorch) design lowers the learning curve for kernel developers.

Comprehensive engineering quality: tests, benchmarks, and MIT licensing.

Focused coverage on the most performance‑critical parts of LLM training: MoE routing and low‑precision (FP8/FP4) quantization.

Cons:

High hardware barrier: requires SM90‑class GPUs; consumer GPUs are unsupported.

Documentation is minimal; users must read the source to understand details.

TileLang is a relatively new DSL with a smaller community than Triton.

Target audience is framework developers or researchers building training/inference engines, not casual API users.

Conclusion

TileKernels continues DeepSeek’s pattern of openly releasing production‑grade, performance‑critical code rather than just model weights. The library provides a valuable learning resource for engineers who want to understand or build the low‑level kernels behind MoE training and FP8 quantization, provided they have access to the required high‑end GPUs.

Suitable readers include developers interested in the internals of MoE kernels, engineers building inference engines or training frameworks, researchers exploring FP8/FP4 low‑precision training, and anyone wanting to master the TileLang DSL.

Tags: MoE, GPU Optimization, LLM Training, FP8 Quantization, TileLang, TileKernels
Written by Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.