Tagged articles

kernel fusion

6 articles

May 13, 2026 · Artificial Intelligence

Why vLLM Now Leads Open‑Source LLM Inference Benchmarks

vLLM tops the Artificial Analysis ranking by delivering the highest throughput for DeepSeek V3.2, Qwen 3.5 397B, and MiniMax‑M2.5 on identical NVIDIA Blackwell Ultra hardware, thanks to extensive kernel‑fusion optimizations that remain in the main branch.

DeepSeek · LLM Inference · Qwen

7 min read

Old Zhang's AI Learning

Apr 26, 2026 · Artificial Intelligence

Why Deploying DeepSeek‑V4 Locally with vLLM Is So Challenging

The article dissects DeepSeek‑V4’s local deployment using vLLM, explaining the steep hardware requirements, the complex heterogeneous KV‑cache architecture, and the aggressive kernel‑fusion and multi‑stream optimizations that together make high‑context inference both memory‑intensive and engineering‑heavy.

DeepSeek-V4 · GPU memory · KV cache

15 min read

Linux Kernel Journey

Sep 24, 2025 · Fundamentals

Fine-Grained GPU Code Modifications: Boosting CUDA Performance

This article explains why certain GPU performance gains require direct CUDA kernel edits and walks through fine‑grained techniques such as data‑layout restructuring, warp‑level primitives, tiled memory accesses, kernel fusion, and dynamic execution paths, backed by code examples and benchmark insights.

CUDA · GPU Optimization · dynamic execution

12 min read

Network Intelligence Research Center (NIRC)

Jul 15, 2025 · Fundamentals

How to Write High‑Performance GPU Code with OpenAI Triton

This article introduces OpenAI's Triton language, compares its block‑wise programming model to traditional CUDA, walks through vector‑addition and fused‑softmax kernel implementations, and presents benchmark results that demonstrate significant speedups over native PyTorch operations.

CUDA · GPU programming · PyTorch

10 min read

iQIYI Technical Product Team

Mar 15, 2024 · Artificial Intelligence

Optimizing GPU Inference for CTR Models: Kernel Fusion, Multi‑Stream Execution, and Batch Merging

By fusing sparse‑feature operators, enabling multi‑stream execution, consolidating data copies, and merging inference batches, iQIYI cut GPU CTR‑model latency to CPU levels, raised throughput more than sixfold, and reduced operating costs by more than 40%, overcoming launch‑overhead bottlenecks.

CTR · GPU · Inference Optimization

10 min read

Alibaba Cloud Big Data AI Platform

Dec 11, 2023 · Artificial Intelligence

How PAI‑Blade Supercharges PyTorch Training with Up to 41% Speedup

This article explains how PAI‑Blade uses compiler optimizations, TorchDynamo, MHLO conversion, and aggressive kernel fusion to accelerate PyTorch training, provides simple two‑line integration code, showcases benchmark results on A10 and A100 GPUs, and details deployment steps on PAI‑DSW.

BladeDISC · GPU Optimization · PAI-Blade

8 min read
