Tagged articles

SGLang

30 articles · Page 1 of 1
AI Engineering
Jul 4, 2026 · Backend Development

How SGLang Encoded Engineering Experience into Agents and Achieved Up to 2.75× Kernel Speedups

The SGLang team turned their benchmarking, profiling, CUDA kernel tuning, and production‑issue triage know‑how into reusable agent skills. Three merged KDA‑Pilot PRs delivered up to 2.75× kernel acceleration, a 71.4% throughput boost for Qwen3‑Next, and a TTFT reduction from 456 ms to 168 ms. The article also outlines a repeatable workflow and practical rules for large‑scale performance engineering.

CUDA optimization · LLM serving · SGLang
0 likes · 16 min read
Old Zhang's AI Learning
Jun 29, 2026 · Artificial Intelligence

How Nvidia’s NVFP4 Cuts GLM‑5.2 Deployment Cost by Half

Semgrep’s benchmark shows open‑source GLM‑5.2 matching Claude’s performance while costing only $0.17 per vulnerability, and Nvidia’s NVFP4 quantization halves the model’s memory footprint with virtually unchanged accuracy, making local deployment on 8‑GPU systems far more affordable.

AI Deployment · GLM-5.2 · Model Quantization
0 likes · 11 min read
Machine Learning Algorithms & Natural Language Processing
Jun 2, 2026 · Artificial Intelligence

OSCAR Beats TurboQuant: 2‑Bit KV‑Cache for Fast, Stable Long‑Context Inference

OSCAR presents an attention‑aware rotation scheme that compresses KV caches to true 2‑bit, cutting memory usage by up to 8× and boosting decode throughput by up to 7×, while preserving inference quality within a few points of BF16 across multiple models and long‑context benchmarks, outperforming TurboQuant.

2-bit quantization · KV cache · OSCAR
0 likes · 13 min read
Machine Heart
May 14, 2026 · Artificial Intelligence

How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline

The recent SGLang × MUSA meetup revealed that MUSA’s GPU backend has been merged into SGLang’s official codebase, delivering zero‑learning‑cost integration, performance gains of up to 66% on DeepSeek‑V4, and a growing ecosystem of adapters, high‑performance kernels, and distributed inference support.

AI inference · DeepSeek · GPU
0 likes · 12 min read
Old Zhang's AI Learning
May 11, 2026 · Artificial Intelligence

Ling-2.6-1T: 1T‑Parameter, Fast‑Thinking, Agent‑Ready Model After DeepSeek‑V4

Ant Group's Ling‑2.6‑1T, a 1‑trillion‑parameter LLM built for token efficiency and fast thinking, performs strongly on elite reasoning and agentic benchmarks, offers easy local deployment via vLLM or SGLang, provides a quantized 3.6‑bit version, and includes practical usage tips for developers and knowledge workers.

Agentic Model · Claude Code Integration · Ling-2.6-1T
0 likes · 12 min read
Machine Heart
May 8, 2026 · Industry Insights

How SGLang’s $100M Seed Funding Powers the Next‑Gen Open AI Infrastructure

RadixArk raised a $100 million seed round backed by top hardware and AI investors to turn the open‑source SGLang inference engine and the Miles RL framework into day‑0 standards, aiming to democratize AI infrastructure and eliminate bottlenecks from training to inference.

AI Infrastructure · DeepSeek-V4 · Hardware‑agnostic AI
0 likes · 10 min read
Old Zhang's AI Learning
May 1, 2026 · Artificial Intelligence

DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges

The article analyzes DeepSeek‑V4's architectural innovations—including mixed sparse attention, mHC, and native FP4 weights—explains SGLang's ShadowRadix, HiSparse, and in‑graph speculative decoding solutions, presents benchmark gains, provides Docker deployment steps, and warns of key pitfalls for long‑context inference.

DeepSeek-V4 · HiSparse · SGLang
0 likes · 15 min read
Old Zhang's AI Learning
Apr 20, 2026 · Artificial Intelligence

Kimi K2.6: The Most Powerful Open-Source Agent Model – Architecture, Benchmarks, and Deployment Guide

Kimi K2.6, an open-source 1-trillion-parameter MoE model, expands Agent capabilities with 256K context, multimodal inputs, and the ability to coordinate 300 sub-Agents over 4,000 steps, achieving top scores on benchmarks like Terminal-Bench 2.0, SWE-Bench Pro, and BrowseComp, while offering flexible deployment via vLLM, SGLang, and KTransformers.

Agent Model · Benchmark · Deployment
0 likes · 11 min read
Old Zhang's AI Learning
Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to 5× speed‑up for Qwen3.5‑27B on single‑GPU and 1.5–1.9× on high‑concurrency workloads while preserving output quality.

DFlash · Qwen3.5 · SGLang
0 likes · 12 min read
Old Zhang's AI Learning
Apr 12, 2026 · Artificial Intelligence

Deploy the Open‑Source MiniMax‑M2.7 Model Locally: Step‑by‑Step Guide

MiniMax‑M2.7, the newly open‑sourced 230‑billion‑parameter MoE model, offers self‑evolution, professional software engineering and agent capabilities, and can be deployed locally using Ollama, vLLM, SGLang or Docker with 4‑8 H200 GPUs, while the article details hardware needs, performance gains and tool‑calling/Thinking features.

Deployment · GPU · LLM
0 likes · 11 min read
Old Zhang's AI Learning
Apr 11, 2026 · Artificial Intelligence

Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

This article reviews the DeepLearning.AI short course on SGLang, explains why large‑language‑model inference is slow, details how KV Cache reduces the computation from O(n²) to O(n), introduces RadixAttention for cross‑request caching, and presents code examples and benchmark results showing up to 10× speedup in real‑world RAG scenarios.

KV cache · LLM Inference · Performance Optimization
0 likes · 13 min read
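
The KV-cache idea summarized in this entry can be sketched in plain Python. This is a toy illustration, not SGLang code and not taken from the course: the `project` and `attend` helpers and the fixed scaling factors are assumptions standing in for real key/value projections. It counts projections to show, under one reading of the O(n²) to O(n) claim, how caching turns quadratic recomputation across a sequence into linear work.

```python
import math


def project(x, w):
    """Toy linear projection (hypothetical stand-in for W_k or W_v)."""
    return [w * xi for xi in x]


def attend(q, keys, values):
    """Softmax attention of one query vector over all keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def decode_without_cache(tokens):
    """Recompute every key and value at every step; count projections."""
    projections = 0
    outputs = []
    for t in range(1, len(tokens) + 1):
        keys = [project(x, 0.5) for x in tokens[:t]]
        values = [project(x, 2.0) for x in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Project each token once and append it to a growing cache."""
    keys, values = [], []
    projections = 0
    outputs = []
    for x in tokens:
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(x, keys, values))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_a, n_a = decode_without_cache(tokens)
out_b, n_b = decode_with_cache(tokens)
print(n_a, n_b)
```

For this four-token sequence the uncached loop performs 2·(1+2+3+4) = 20 projections and the cached loop 2·4 = 8, with identical attention outputs. Each cached step still attends over all previous tokens, so per-step attention remains linear in the sequence length.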
AI Engineering
Mar 25, 2026 · Artificial Intelligence

Is “Harness Engineering” Just Rebranded Engineering Common Sense?

The article examines the hype around “harness engineering” in LLM workflows, showing through SGLang’s multi‑agent experience that the approach merely repackages established software‑engineering principles such as separation of concerns, docs‑as‑code, and structured routing, and discusses its limits and future implications.

Harness Engineering · LLM · SGLang
0 likes · 8 min read
DeepHub IMBA
Mar 3, 2026 · Artificial Intelligence

The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture

The article traces five eras of KV cache management for LLM inference—from its absence before Transformers to the emerging unified hybrid memory architecture—comparing vLLM, SGLang, and TensorRT‑LLM and offering a decision framework for selecting the right solution in various deployment scenarios.

KV cache · LLM Inference · Memory Management
0 likes · 16 min read
Old Zhang's AI Learning
Jan 23, 2026 · Artificial Intelligence

Open‑Source GLM‑ASR‑Nano‑2512: Chinese Dialect‑Optimized Speech Recognition on Consumer‑Grade GPUs

GLM‑ASR‑Nano‑2512, a 1.5 B‑parameter open‑source speech‑recognition model released in December 2025, delivers state‑of‑the‑art accuracy on Chinese dialects and low‑volume audio, outperforms Whisper V3 on benchmark tests, runs on consumer GPUs, and provides detailed installation and deployment guides for transformers, vLLM and SGLang.

Chinese dialects · GLM-ASR-Nano-2512 · Open-source
0 likes · 11 min read
MaGe Linux Operations
Jan 6, 2026 · Artificial Intelligence

How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s

This guide details how switching from vLLM to SGLang on eight NVIDIA H800 GPUs increased Llama‑3‑70B‑Instruct throughput from 180 to 420 tokens per second, covering SGLang’s core innovations, environment setup, configuration tweaks, performance benchmarks, troubleshooting tips, and production‑grade deployment scripts.

FlashInfer · GPU Optimization · H800
0 likes · 19 min read
Alibaba Cloud Developer
Dec 23, 2025 · Artificial Intelligence

How Hybrid Transformer‑Mamba Architectures Overcome KVCache Challenges in Large‑Model Inference

This article explains how SGLang’s hybrid model design combines Transformer attention with Mamba state‑space layers, introduces a dual‑pool memory architecture and elastic allocation, and presents specialized prefix‑cache and speculative‑decoding techniques that together enable efficient, scalable inference for long‑context large language models.

Inference Optimization · KVCache · Large Language Models
0 likes · 22 min read
Baidu Intelligent Cloud Tech Hub
Dec 17, 2025 · Artificial Intelligence

How AFD Splits Attention and FFN to Boost DeepSeek‑V3 Inference by Up to 19%

The article details the Attention‑FFN Disaggregation (AFD) technique used by Baidu Baige to separate self‑attention and feed‑forward network stages in DeepSeek‑V3 models, describing multi‑stage scheduling, three‑batch overlap, communication optimizations, and performance results that achieve up to 19% throughput improvement under a 100 ms SLO.

3BO · AFD · Attention-FFN Disaggregation
0 likes · 17 min read
Baidu Intelligent Cloud Tech Hub
Nov 19, 2025 · Artificial Intelligence

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap, dynamically splitting sequences into balanced token chunks, enabling near‑equal compute and communication times, improving GPU utilization and achieving up to 30% throughput gains in heterogeneous request workloads, with zero accuracy loss.

Batch scheduling · GPU Utilization · LLM Inference
0 likes · 9 min read
Meituan Technology Team
Sep 11, 2025 · Artificial Intelligence

How LongCat-Flash Achieves Ultra-Fast, Low-Cost AI Agent Inference with SGLang

LongCat-Flash, an open‑source Mixture‑of‑Experts model released by Meituan, leverages model‑system co‑design, PD‑disaggregation, SBO scheduling and large‑scale expert parallelism within the SGLang framework to deliver dramatically lower latency, higher throughput and cost‑effective inference for AI agents, with detailed deployment instructions provided.

LongCat-Flash · Mixture of Experts · SGLang
0 likes · 15 min read
Volcano Engine Developer Services
Jul 17, 2025 · Artificial Intelligence

How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance

This article examines how Volcano Engine's Elastic Instant Cache (EIC) tackles the memory bottleneck, high‑concurrency latency, and cross‑node coordination challenges of large language model inference by decoupling storage and computation, pooling resources, and applying layered optimizations, ultimately boosting AI inference efficiency, scalability, and cost‑effectiveness across various deployment scenarios.

AI Infrastructure · KVCache · LLM Inference
0 likes · 30 min read
Baobao Algorithm Notes
Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Megatron · RL Training
0 likes · 16 min read
Architect's Alchemy Furnace
May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

LLM · MLX · SGLang
0 likes · 17 min read
Liangxu Linux
Apr 28, 2025 · Artificial Intelligence

Deploy DeepSeek‑R1 on Your Server in 15 Minutes with Zero Code

This guide shows how to use the lightweight OpenStation platform to install, configure, and launch the DeepSeek‑R1 large model on a personal server in under 15 minutes, covering zero‑code deployment, resource management, inference engine selection, and integration with CherryStudio.

AI model deployment · CherryStudio · DeepSeek-R1
0 likes · 7 min read
Zhihu Tech Column
Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.

GPU parallelism · Large Language Models · SGLang
0 likes · 11 min read
Meituan Technology Team
Mar 6, 2025 · Artificial Intelligence

INT8 Quantization and Inference Optimization of DeepSeek R1 Model

Meituan’s search and recommendation team converted the FP8‑only DeepSeek‑R1 model to INT8 by first casting weights to BF16 and then applying block‑wise or channel‑wise quantization. The result preserves GSM8K and MMLU accuracy while delivering 33% to 50% higher throughput on A100‑80G GPUs. The team also released the SGLang‑based inference scripts and quantized weights publicly, enabling deployment on older NVIDIA hardware without accuracy loss.

DeepSeek-R1 · GPU deployment · INT8 Quantization
0 likes · 11 min read
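
The difference between the channel‑wise and block‑wise INT8 schemes named in this entry can be sketched in plain Python. This is an illustrative assumption, not Meituan's released scripts: it uses symmetric quantization with one scale per group, a hypothetical 2×2 block size, and ordinary Python floats in place of BF16 weights (the FP8 to BF16 cast is not modeled).

```python
def quantize_int8(values):
    """Symmetric INT8 quantization of one group that shares a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate real values."""
    return [qi * scale for qi in q]


def channel_wise(matrix):
    """One scale per output channel (here: one per row)."""
    return [quantize_int8(row) for row in matrix]


def block_wise(matrix, block=2):
    """One scale per block x block tile, keyed by the tile's top-left index."""
    rows, cols = len(matrix), len(matrix[0])
    tiles = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            flat = [matrix[r][c]
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            tiles[(r0, c0)] = quantize_int8(flat)
    return tiles


# Toy weights: small values on the left, large values on the right.
weights = [[0.1, -0.2, 4.0, 3.0],
           [0.05, 0.15, -2.0, 1.0]]

per_channel = channel_wise(weights)
per_block = block_wise(weights, block=2)

print("row 0 scale:", per_channel[0][1])
print("tile (0, 0) scale:", per_block[(0, 0)][1])
```

In this toy matrix the left tile holds only small weights, so block‑wise quantization gives it a scale of 0.2/127, while channel‑wise quantization forces row 0 to share a scale of 4/127 with its large entries. A finer scale means smaller rounding error for the small weights, at the cost of storing more scales.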
Architects' Tech Alliance
Feb 27, 2025 · Artificial Intelligence

How Inspur Metabrain R1 Server Enables 1000+ Concurrent Users for DeepSeek 671B via SGLang Optimization

The Inspur Metabrain R1 inference server, equipped with FP8 acceleration and a 1128 GB HBM3e memory pool, has been tightly integrated with SGLang 0.4.3 to run the 671‑billion‑parameter DeepSeek R1 model, delivering over 1,000 concurrent user sessions and up to 3,976 tokens/s throughput.

AI server · DeepSeek · Inference Optimization
0 likes · 5 min read