Simplifying AI Operator Development with TileLang DSL
TileLang is a Python‑style DSL built on TVM that separates algorithm logic from hardware scheduling, offers beginner to expert interfaces, supports multiple GPU and CPU backends, and delivers performance on par with or better than existing AI kernels, as demonstrated with GEMM, FlashAttention and other benchmarks.
In the era of large AI models, optimizing low‑level operators such as GEMM, de‑quantized GEMM, FlashAttention and LinearAttention is crucial for inference speed and training efficiency, yet traditional development is time‑consuming and requires deep hardware expertise.
TileLang introduces a new programming paradigm: a concise domain‑specific language that lets developers describe what to compute while the compiler automatically handles how to map it to hardware. This separation of compute flow and scheduling lets users focus on algorithmic logic and lowers the barrier to entry.
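As a first taste of that separation, the sketch below writes an elementwise add in TileLang's style: the loop body states what each output element is, and the compiler maps the T.Parallel iterations onto threads. It follows the pattern of TileLang's public elementwise examples rather than code from this article, so treat it as illustrative:

import tilelang.language as T

def elementwise_add(M, N, block_M, block_N, dtype="float16"):
    @T.prim_func
    def main(
        A: T.Tensor((M, N), dtype),
        B: T.Tensor((M, N), dtype),
        C: T.Tensor((M, N), dtype)
    ):
        # One thread block per (block_M, block_N) tile of the output.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            # T.Parallel declares *what* to compute per tile; the compiler
            # decides how to spread the iterations across the 128 threads.
            for i, j in T.Parallel(block_M, block_N):
                C[by * block_M + i, bx * block_N + j] = (
                    A[by * block_M + i, bx * block_N + j]
                    + B[by * block_M + i, bx * block_N + j]
                )
    return main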
The language provides three hierarchical interfaces:
Beginner Interface: a fully hardware‑agnostic environment where only the computational graph is expressed.
Developer Interface: high‑level operators such as T.gemm and T.copy, plus annotations for memory layout or pipeline depth (see the sketch after this list), similar to Triton but with stronger automation.
Expert Interface: direct access to thread primitives for fine‑grained control, achieving performance comparable to hand‑written CUDA/HIP kernels.
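The developer interface's annotations are easiest to see as a delta against plain code. The sketch below is essentially the same GEMM the article walks through next, with two hint calls added: T.annotate_layout requests swizzled shared‑memory layouts, and T.use_swizzle enables block rasterization for L2 locality. Both calls appear in TileLang's published examples, but the layout helper's import path and arguments vary across versions, so read this as a sketch:

import tilelang
import tilelang.language as T
# Layout helper used in TileLang's examples; import path may vary by version.
from tilelang.intrinsics import make_mma_swizzle_layout as make_swizzle_layout

def annotated_matmul(M, N, K, block_M, block_N, block_K,
                     dtype="float16", accum_dtype="float"):
    @T.prim_func
    def main(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype)
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)

            # Developer-level hint: swizzle shared-memory layouts to reduce
            # bank conflicts during tensor-core loads.
            T.annotate_layout({
                A_shared: make_swizzle_layout(A_shared),
                B_shared: make_swizzle_layout(B_shared),
            })
            # Developer-level hint: rasterize thread blocks for L2 reuse.
            T.use_swizzle(panel_size=10, enable=True)

            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main

Everything else stays untouched: the annotations are hints layered on top of the algorithm, not a rewrite of it.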
Built on TVM’s compiler stack, TileLang inherits cross‑platform capabilities. It runs on NVIDIA GPUs (H100 with Auto TMA/WGMMA, A100, V100, RTX 4090, RTX 3090, RTX A6000), AMD GPUs (MI250 with Auto MatrixCore, MI300X with Async Copy), and x86_64 CPUs supporting AVX2/AVX‑512.
Benchmark results show that TileLang‑generated kernels match or exceed the performance of other mainstream implementations, with examples such as MLA on H100 and FlashAttention on H100 (see figures).
A concrete GEMM example illustrates TileLang’s workflow: developers declare shared memory (T.alloc_shared) and register fragments (T.alloc_fragment), define a kernel grid with T.Kernel, and write a three‑stage software pipeline using T.Pipelined. The compiler inserts the necessary synchronisation and scheduling, eliminating manual __syncthreads() calls and complex index calculations.
import tilelang
from tilelang import Profiler
import tilelang.language as T

def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):
    """Create a TileLang GEMM operator.

    C = A @ B
    """
    @T.prim_func
    def main(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype)
    ):
        # Launch a grid of thread blocks, one per (block_M, block_N) output tile.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            # Tile buffers: operands staged in shared memory, accumulator in registers.
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)

            T.clear(C_local)
            # Three-stage software pipeline over the K dimension; the compiler
            # inserts the required synchronisation between stages.
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)

            # Write the accumulated tile back to global memory.
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main

This example demonstrates TileLang’s ability to express complex parallel computation succinctly while the underlying compiler handles memory management, thread mapping, and pipeline generation.
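Putting the pieces together, a usage sketch in the spirit of TileLang's quick‑start follows. It assumes the tilelang.lower / Profiler API matching the import at the top of the listing; newer releases expose a tilelang.compile JIT path instead, so treat the exact entry points as version‑dependent:

import torch

# Instantiate a 1024^3 GEMM with 128x128 output tiles and K-chunks of 32.
func = matmul(1024, 1024, 1024, 128, 128, 32)

# Lower to a runnable module; result_idx=[2] marks C as the output tensor.
rt_mod, params = tilelang.lower(func)
profiler = Profiler(rt_mod, params, result_idx=[2])

a = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
b = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")

# Run the generated kernel and validate against PyTorch's reference GEMM.
c = profiler(a, b)
torch.testing.assert_close(c, a @ b, rtol=1e-2, atol=1e-2)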
Network Intelligence Research Center (NIRC)
NIRC is based at the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.