
How SGLang Encoded Engineering Experience into Agents and Achieved Up to 2.75× Kernel Speedups

The SGLang team turned its know-how in benchmarking, profiling, CUDA kernel tuning, and production-issue triage into reusable agent skills. Reported results include three merged KDA-Pilot PRs, kernel speedups of up to 2.75× on tracked diffusion kernel tasks, and a 71.4% throughput gain for Qwen3-Next with average TTFT falling from 456 ms to 168 ms. The article also outlines a repeatable workflow and practical rules for large-scale performance engineering.

AI Engineering

Jul 4, 2026


Why SGLang Is Suited for Agent‑Assisted Development

SGLang is a high-performance LLM and multimodal serving framework. Its development repeatedly runs into complex LLM code paths and diffusion pipelines, high validation costs across GPUs (H100, H200, B200, RTX 5090), profiling traces that are hard to reuse, and performance conclusions that depend heavily on context. Work of this kind maps naturally onto scripted agents with clear inputs and outputs.

From Prompt Engineering to Skills

The team encoded common workflows into .claude/skills files. Each file describes when to use the skill, how to launch it, how to verify results, how to make decisions, and which artifacts to deliver. The covered layers include:

debug-cuda-crash: records custom op/kernel inputs, exceptions, and dumps for offline analysis.

llm-serving-auto-benchmark: runs fair, bounded, reproducible serving benchmarks on SGLang and other OpenAI-compatible frameworks.

llm-serving-capacity-planner: parses launch logs to explain weight memory, KV-cache budget, CUDA-graph overhead, request capacity, and OOM pressure.

llm-torch-profiler-analysis: produces fixed kernel, overlap-opportunity, and fuse-pattern tables, mapping kernels back to Python source.

llm-pipeline-analysis: slices torch-profiler traces into forward passes, layers, and kernel streams, generating per-layer timelines and clustering statistics.

Many other skills cover diffusion benchmarking, model addition, performance tuning, production incident triage, and PR history humanisation.

Recent Merged Cases

Router long-context tokenization deduplication (PR #28744): on DeepSeek-V4-Flash, idle TTFT dropped by 29% and 41% for 60k- and 125k-token prompts respectively, and TTFT under load fell by 34%–49%.

Qwen3-Next FlashInfer all-reduce fusion (PR #22664): on H100 with TP=4, request throughput rose from 5.49 req/s to 9.41 req/s (+71.4%) and average TTFT fell from 456 ms to 168 ms after profile-driven collective optimization.

Cohere2Moe NVFP4 fused-MoE (PR #27401): on 1× B300, throughput improved by 26% (chat) and 21% (summarisation) over the previous default path, surpassing another open-source framework by 4.1% and 6.8% respectively.

Kimi Delta Attention prefill kernel (PR #27488): on B200, Delta Attention prefill outperformed the Triton baseline by 1.08×–1.52× after extensive validation.

Spectral Progressive Diffusion (PR #27524): denoising sped up by 1.6×–2.32× across several diffusion models through early low-resolution passes and GPU-DCT up-sampling.

LTX-2 VAE decode channels-last-3d (PR #27431): decode time was cut from 5.41 s to 3.84 s (1.41×) and peak memory fell by about 9.7 GiB.
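The headline ratios in the cases above can be reproduced from the raw before/after figures. The short Python check below is only a sanity calculation on the numbers quoted in this article; the helper names are made up for illustration, and the 2.71× TTFT ratio is derived here rather than stated in the source.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (higher is better)."""
    return (after - before) / before * 100.0


def speedup(before: float, after: float) -> float:
    """Speedup factor for lower-is-better metrics such as latency."""
    return before / after


# Figures quoted in the merged cases above.
throughput_gain = pct_change(5.49, 9.41)  # Qwen3-Next request throughput, req/s
ttft_speedup = speedup(456.0, 168.0)      # Qwen3-Next average TTFT, ms
vae_speedup = speedup(5.41, 3.84)         # LTX-2 VAE decode time, s

print(f"throughput: +{throughput_gain:.1f}%")    # throughput: +71.4%
print(f"TTFT: {ttft_speedup:.2f}x faster")       # TTFT: 2.71x faster
print(f"VAE decode: {vae_speedup:.2f}x faster")  # VAE decode: 1.41x faster
```

Keeping the two helpers separate avoids a common reporting slip: throughput gains are usually quoted as a percentage increase, while latency improvements are quoted as a ratio.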

Two‑Step Profile Analysis

Step 1 uses llm-torch-profiler-analysis to convert a global profile into three fixed tables: a Kernel Table (GPU‑time share, launch count, kernel type, Python mapping), an Overlap Opportunity Table (exclusive/hidden time ratios), and a Fuse Pattern Table (comparisons with SGLang, other frameworks, and kernel libraries).
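To make the Kernel Table concrete, here is a minimal Python sketch of the aggregation it implies: group kernel events by name, count launches, and compute each kernel's share of total GPU time. The event layout (name, dur_us) and the sample values are assumptions for illustration, not the skill's actual input format or output.

```python
from collections import defaultdict


def kernel_table(events):
    """Aggregate kernel events into rows sorted by total GPU time, descending."""
    total_us = defaultdict(float)
    launches = defaultdict(int)
    for ev in events:
        total_us[ev["name"]] += ev["dur_us"]
        launches[ev["name"]] += 1
    grand_total = sum(total_us.values())
    rows = [
        {
            "name": name,
            "launches": launches[name],
            "total_us": total,
            "share": total / grand_total if grand_total else 0.0,
        }
        for name, total in total_us.items()
    ]
    rows.sort(key=lambda row: row["total_us"], reverse=True)
    return rows


# Hypothetical kernel events, invented for this example.
events = [
    {"name": "gemm", "dur_us": 300.0},
    {"name": "all_reduce", "dur_us": 120.0},
    {"name": "gemm", "dur_us": 300.0},
    {"name": "rms_norm", "dur_us": 80.0},
]

rows = kernel_table(events)
for row in rows:
    print(f'{row["name"]:<12}{row["launches"]:>4}{row["total_us"]:>9.1f}{row["share"]:>8.1%}')
```

A fixed table shape like this is what lets results from different runs be compared side by side, since every profile is reduced to the same columns.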

Step 2 runs llm-pipeline-analysis to map hotspots to specific forward passes, layer types, and kernel streams. It reads the Chrome trace JSON and the model config, then produces forward-pass summaries, per-layer timelines, layer-cluster statistics, and compute-flow tables. This is especially useful for hybrid-attention models.
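The slicing idea in Step 2 can be sketched in a few lines of Python: read complete events (ph == "X") from a Chrome-trace-style JSON document and attribute each kernel to the layer range that contains its start timestamp. This is a simplified illustration, not the skill's implementation. The "layer" category, the layer_N names, and the sample trace are hypothetical, and real traces need more careful correlation between CPU-side ranges and asynchronously launched GPU kernels.

```python
import json
from collections import defaultdict

# Hypothetical trace: two layer ranges and four kernels, timestamps in microseconds.
TRACE_JSON = """
{"traceEvents": [
  {"ph": "X", "cat": "layer",  "name": "layer_0", "ts": 0,   "dur": 100},
  {"ph": "X", "cat": "layer",  "name": "layer_1", "ts": 100, "dur": 150},
  {"ph": "X", "cat": "kernel", "name": "attn",    "ts": 5,   "dur": 40},
  {"ph": "X", "cat": "kernel", "name": "mlp",     "ts": 50,  "dur": 45},
  {"ph": "X", "cat": "kernel", "name": "attn",    "ts": 110, "dur": 90},
  {"ph": "X", "cat": "kernel", "name": "mlp",     "ts": 205, "dur": 40}
]}
"""


def per_layer_kernel_time(trace):
    """Sum kernel durations per layer, keyed by layer name then kernel name."""
    events = [e for e in trace["traceEvents"] if e.get("ph") == "X"]
    layers = [e for e in events if e.get("cat") == "layer"]
    kernels = [e for e in events if e.get("cat") == "kernel"]
    summary = defaultdict(lambda: defaultdict(float))
    for kernel in kernels:
        for layer in layers:
            start, end = layer["ts"], layer["ts"] + layer["dur"]
            # Assumption: a kernel belongs to the layer whose range holds its start.
            if start <= kernel["ts"] < end:
                summary[layer["name"]][kernel["name"]] += kernel["dur"]
                break
    return {name: dict(per_kernel) for name, per_kernel in summary.items()}


summary = per_layer_kernel_time(json.loads(TRACE_JSON))
for layer_name, per_kernel in summary.items():
    print(layer_name, per_kernel)
```

For a hybrid-attention model, comparing the per-layer dictionaries makes it easy to see which layer type a hotspot kernel actually belongs to.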

Loop Engineering: Turning “Chasing SOTA” into a Repeatable Process

Single-round optimisations can be captured by one skill, but over many iterations the loop state (best candidate, failed directions, benchmark alignment, stop criteria) must be persisted. The SGLang SOTA Performance Loop builds on Humanize/RLCR: humanize-gen-plan creates a structured plan.md (goals, acceptance criteria, tests, boundaries, milestones), and humanize-rlcr drives the loop, with Claude Code implementing changes and Codex Review checking status, evidence, and risk after each round. A lighter Codex Goal alternative can replace the dual-model setup.
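One way to picture the persisted loop state is a small record that survives between rounds. The Python sketch below is an assumption-laden illustration, not the Humanize/RLCR format: the class name, field names, file name, and candidate labels are all hypothetical. It tracks the best candidate, the directions already tried without success, and a stop criterion, and round-trips them through JSON.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class LoopState:
    """Hypothetical state carried across optimisation rounds."""
    rounds_done: int = 0
    best_candidate: Optional[str] = None
    best_speedup: float = 1.0
    failed_directions: list = field(default_factory=list)
    target_speedup: float = 1.2
    max_rounds: int = 10

    def record(self, candidate: str, speedup: float, passed_checks: bool) -> None:
        """Keep a candidate only if it passes checks and beats the current best."""
        self.rounds_done += 1
        if passed_checks and speedup > self.best_speedup:
            self.best_candidate, self.best_speedup = candidate, speedup
        else:
            # Failed checks or no improvement: remember it so it is not retried.
            self.failed_directions.append(candidate)

    def should_stop(self) -> bool:
        return self.best_speedup >= self.target_speedup or self.rounds_done >= self.max_rounds

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> "LoopState":
        return cls(**json.loads(path.read_text()))


state = LoopState(target_speedup=1.2, max_rounds=5)
state.record("vectorised-io", 1.10, passed_checks=True)
state.record("fast-but-wrong", 1.50, passed_checks=False)
state.record("fused-reduction", 1.25, passed_checks=True)

with tempfile.TemporaryDirectory() as tmp:
    state_path = Path(tmp) / "loop_state.json"
    state.save(state_path)
    restored = LoopState.load(state_path)

print(restored.best_candidate, restored.failed_directions, restored.should_stop())
```

Recording rejected candidates matters as much as recording the winner: without that list, a later round can spend its budget rediscovering a direction that already failed a correctness check.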

KDA‑Pilot: Industrialising CUDA Kernel Optimisation

Kernel optimisation faces a scaling problem: no single kernel is optimal across hardware, workloads, and models. KDA-Pilot isolates each task so that agents cannot modify the whole repository. The workflow consists of collecting kernel metadata from 20 diffusion models; cloning the upstream baseline; ensuring identical ABI and build paths; running fixed production workloads with A/B interleaving and CUDA-event timing; running correctness checks; refreshing prompts, benchmarks, KernelWiki, and NCU reports on each iteration; and recording shape-specialised dispatch conditions.
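A/B interleaving means alternating baseline and candidate runs within one session, so that clock drift, thermal state, and background noise affect both sides roughly equally. The Python sketch below shows the idea with a pluggable timer. It is an illustration under stated assumptions, not KDA-Pilot's harness: the function names and toy workloads are hypothetical, and the wall-clock timer stands in for CUDA-event timing, which on a GPU would replace time_call (for example with PyTorch's torch.cuda.Event objects created with enable_timing=True).

```python
import statistics
import time


def time_call(fn):
    """Wall-clock timer; a stand-in for CUDA-event timing on a GPU."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def interleaved_ab(baseline, candidate, rounds=20, warmup=3, timer=time_call):
    """Alternate baseline and candidate runs and compare median times."""
    for _ in range(warmup):
        baseline()
        candidate()
    base_times, cand_times = [], []
    for i in range(rounds):
        # Swap which side goes first each round so ordering effects cancel out.
        if i % 2 == 0:
            base_times.append(timer(baseline))
            cand_times.append(timer(candidate))
        else:
            cand_times.append(timer(candidate))
            base_times.append(timer(baseline))
    base_median = statistics.median(base_times)
    cand_median = statistics.median(cand_times)
    return {
        "baseline_s": base_median,
        "candidate_s": cand_median,
        "speedup": base_median / cand_median,
    }


# Toy CPU workloads, invented for this example.
def baseline_work():
    return sum(i * i for i in range(20000))


def candidate_work():
    return sum(i * i for i in range(10000))


result = interleaved_ab(baseline_work, candidate_work)
print(f'speedup: {result["speedup"]:.2f}x')
```

Medians are used rather than means so that a single slow outlier run does not skew the comparison; a real harness would also report variance and run the correctness checks before timing.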

Three KDA‑Pilot PRs merged upstream (as of 27 June 2026) demonstrate kernel‑level evidence (e.g., 1.279× profiler‑attributed speedup for Qwen‑Image norm‑scale‑shift) and model‑path evidence (e.g., 1.125× overall request acceleration on B200). A table of ten tracked diffusion kernel tasks shows B200 speedups ranging from 1.13× to 2.75×, with primary optimisation directions such as shared RoPE staging, warp‑row RMS, vectorised I/O, and fused reductions.

Practical Rules

Define task boundaries before launching an agent (e.g., “match another framework on 2× B200 under fixed workload”).

Fix the benchmark before reading profiles; otherwise the agent may optimise a simpler, unintended problem.

Interpret NCU results based on kernel characteristics: memory‑bound kernels focus on DRAM/L2 throughput, compute‑bound on Tensor‑Core utilisation, latency‑bound on launch counts and synchronisation.

Verify backend and fallback thresholds before trusting a profile; a trace that switched to a different attention or diffusion backend is not evidence for the native path.

Kernel optimisation must use identical ABI, wrapper, and compile flags; a candidate cannot silently enable --use_fast_math only on one side.

Review becomes more critical: agents can generate many PRs and plausible errors, so reviews must check shape, dtype, distributed execution, CUDA‑graph behaviour, fallbacks, precision, serving API, metrics, and benchmark configuration.
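The rule on identical compile flags lends itself to a mechanical check. The Python sketch below compares the flag tokens of two build command lines and reports any flag present on only one side. The helper names and both command lines are hypothetical, and a real check would also need to compare flag values passed as separate arguments, wrappers, and ABI.

```python
import shlex


def flag_set(command: str) -> frozenset:
    """Tokens that look like compiler flags (anything starting with '-')."""
    return frozenset(tok for tok in shlex.split(command) if tok.startswith("-"))


def flag_mismatch(baseline_cmd: str, candidate_cmd: str) -> dict:
    """Flags that appear on only one side of an A/B build."""
    base, cand = flag_set(baseline_cmd), flag_set(candidate_cmd)
    return {
        "only_baseline": sorted(base - cand),
        "only_candidate": sorted(cand - base),
    }


# Hypothetical build commands; file names differ, flags should not.
baseline_cmd = "nvcc -O3 -std=c++17 -o baseline.o -c baseline.cu"
candidate_cmd = "nvcc -O3 -std=c++17 --use_fast_math -o candidate.o -c candidate.cu"

diff = flag_mismatch(baseline_cmd, candidate_cmd)
print(diff)  # {'only_baseline': [], 'only_candidate': ['--use_fast_math']}
```

Failing the benchmark run whenever either list is non-empty turns the rule into a gate, so a speedup that comes only from relaxed floating-point math is caught before it reaches review.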

Agent-augmented SGLang development does not replace developers. It offloads repeatable workflow steps to agents and leaves judgement, design, and review to humans, which frees time for deeper performance challenges and further agent-workflow improvements. For an open-source serving framework, that is an investment worth sustaining.

