Tagged articles
8 articles
Machine Learning Algorithms & Natural Language Processing
May 7, 2026 · Artificial Intelligence

How TileLang Enables Efficient Small Operators in Large LLMs (DeepSeek V4 Report)

The article analyzes TileLang, the DSL behind DeepSeek V4, showing how its Fragment and Parallel abstractions, host‑side codegen via TVM‑FFI, and Z3 prover integration let developers implement fused small operators that match hand‑written kernel performance while being faster to develop and easier to maintain.

DSL · DeepSeek · GPU compiler
11 min read
Network Intelligence Research Center (NIRC)
Nov 24, 2025 · Artificial Intelligence

Simplifying AI Operator Development with TileLang DSL

TileLang is a Python‑style DSL built on TVM that separates algorithm logic from hardware scheduling, offers interfaces for beginners through experts, supports multiple GPU and CPU backends, and delivers performance on par with or better than existing AI kernels, as demonstrated by GEMM, FlashAttention, and other benchmarks.

AI operators · DSL · GEMM
10 min read
Network Intelligence Research Center (NIRC)
Sep 3, 2025 · Artificial Intelligence

Understanding AI Compilers: A TVM Example

The article explains how AI compilers transform high‑level models into efficient hardware code, using TVM to illustrate operator optimization, automated scheduling, and the end‑to‑end compilation workflow, with concrete code examples and performance considerations.

AI compiler · Deep Learning · TVM
8 min read
DaTaobao Tech
Jul 15, 2022 · Artificial Intelligence

Edge AI Model Evaluation and Optimization with TensorFlow, JAX, and TVM

The article demonstrates how to evaluate, compress, and convert deep‑learning models for edge devices using TensorFlow, JAX, and TVM—showing a faster iPhone‑based MNIST training benchmark, FLOPs measurement scripts, TFLite/ONNX/CoreML conversion, TVM compilation with auto‑tuning, and up to 50 % speed improvements on mobile NPU hardware.

JAX · TVM · TensorFlow
29 min read
Alibaba Terminal Technology
Jun 22, 2022 · Artificial Intelligence

How Fast Can Your Smartphone Run ML Models? Exploring Edge AI Optimization

This article examines the computational capabilities of modern mobile devices for machine learning, compares training times on a MacBook and iPhone, explains model evaluation metrics like FLOPs, and provides step‑by‑step guides for converting and optimizing models using TensorFlow, PyTorch, ONNX, JAX, and TVM for edge deployment.

JAX · Model Optimization · TVM
29 min read
Meituan Technology Team
Mar 3, 2022 · Artificial Intelligence

GPU Optimization Practices for Meituan Delivery Search and Recommendation Model Inference

Meituan’s delivery search and recommendation service migrated from separate CPU‑only models to a unified multi‑task model running on a heterogeneous CPU‑GPU architecture. It applied system‑level placement, All‑On‑GPU lookup, FP16 mixed precision, operator fusion, and TensorRT and TVM compilation, which together delivered roughly a four‑fold increase in inference throughput at the same cost.

GPU · TVM · TensorFlow
24 min read
Meituan Technology Team
Sep 9, 2021 · Artificial Intelligence

GPU Optimization Practices for CTR Models at Meituan

Meituan accelerates CTR model inference by fusing operators with TVM, optimizing CPU‑GPU data transfers, manually tuning high‑frequency subgraphs, and dynamically offloading workloads. These measures achieve up to ten‑fold throughput gains on Tesla T4 GPUs, with latency that stays stable and rises only modestly beyond 128 QPS. Compilation remains slow, however, and large‑model support needs improvement.

CTR · Deep Learning · GPU
16 min read
Ctrip Technology
Jul 23, 2020 · Artificial Intelligence

Inference Performance Optimization for AI Applications: Methods, Case Studies, and Future Directions

This article examines the challenges of deep learning inference, outlines general optimization methodologies—including system-level and model-level techniques—presents practical case studies such as Transformer translation model improvements, and discusses future trends in automated compilation and performance tuning for AI services.

AI inference · Deep Learning · TVM
15 min read