Tagged articles
25 articles
Page 1 of 1
Machine Heart
Machine Heart
May 14, 2026 · Artificial Intelligence

How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline

The recent SGLang × MUSA meetup revealed that MUSA’s GPU backend has been merged into SGLang’s official codebase, delivering zero‑learning‑cost integration, performance gains of up to 66 % on DeepSeek‑V4, and a growing ecosystem of adapters, high‑performance kernels, and distributed inference support.

AI inferenceDeepSeekGPU
0 likes · 12 min read
How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline
Old Zhang's AI Learning
Old Zhang's AI Learning
May 11, 2026 · Artificial Intelligence

Ling-2.6-1T: 1T‑Parameter, Fast‑Thinking, Agent‑Ready Model After DeepSeek‑V4

Ant Group's Ling‑2.6‑1T, a 1‑trillion‑parameter LLM built for token efficiency and fast‑thinking, outperforms on elite reasoning and agentic benchmarks, offers easy local deployment via vLLM or SGLang, provides a quantized 3.6‑bit version, and includes practical usage tips for developers and knowledge workers.

Agentic ModelClaude Code IntegrationLing-2.6-1T
0 likes · 12 min read
Ling-2.6-1T: 1T‑Parameter, Fast‑Thinking, Agent‑Ready Model After DeepSeek‑V4
Machine Heart
Machine Heart
May 8, 2026 · Industry Insights

How SGLang’s $100M Seed Funding Powers the Next‑Gen Open AI Infrastructure

RadixArk raised a $100 million seed round backed by top hardware and AI investors to turn the open‑source SGLang inference engine and the Miles RL framework into day‑0 standards, aiming to democratize AI infrastructure and eliminate bottlenecks from training to inference.

AI InfrastructureDeepSeek-V4Hardware‑agnostic AI
0 likes · 10 min read
How SGLang’s $100M Seed Funding Powers the Next‑Gen Open AI Infrastructure
Old Zhang's AI Learning
Old Zhang's AI Learning
May 1, 2026 · Artificial Intelligence

DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges

The article analyzes DeepSeek‑V4's architectural innovations—including mixed sparse attention, mHC, and native FP4 weights—explains SGLang's ShadowRadix, HiSparse, and in‑graph speculative decoding solutions, presents benchmark gains, provides Docker deployment steps, and warns of key pitfalls for long‑context inference.

DeepSeek-V4HiSparseSGLang
0 likes · 15 min read
DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 20, 2026 · Artificial Intelligence

Kimi K2.6: The Most Powerful Open-Source Agent Model – Architecture, Benchmarks, and Deployment Guide

Kimi K2.6, an open-source 1-trillion-parameter MoE model, expands Agent capabilities with 256K context, multimodal inputs, and the ability to coordinate 300 sub-Agents over 4,000 steps, achieving top scores on benchmarks like Terminal-Bench 2.0, SWE-Bench Pro, and BrowseComp, while offering flexible deployment via vLLM, SGLang, and KTransformers.

Agent ModelBenchmarkDeployment
0 likes · 11 min read
Kimi K2.6: The Most Powerful Open-Source Agent Model – Architecture, Benchmarks, and Deployment Guide
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to 5× speed‑up for Qwen3.5‑27B on single‑GPU and 1.5–1.9× on high‑concurrency workloads while preserving output quality.

DFlashInference AccelerationSGLang
0 likes · 12 min read
Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 12, 2026 · Artificial Intelligence

Deploy the Open‑Source MiniMax‑M2.7 Model Locally: Step‑by‑Step Guide

MiniMax‑M2.7, the newly open‑sourced 230‑billion‑parameter MoE model, offers self‑evolution, professional software engineering and agent capabilities, and can be deployed locally using Ollama, vLLM, SGLang or Docker with 4‑8 H200 GPUs, while the article details hardware needs, performance gains and tool‑calling/Thinking features.

DeploymentGPULLM
0 likes · 11 min read
Deploy the Open‑Source MiniMax‑M2.7 Model Locally: Step‑by‑Step Guide
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 11, 2026 · Artificial Intelligence

Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

This article reviews the DeepLearning.ai short course on SGLang, explains why large‑language‑model inference is slow, details how KV Cache reduces the computation from O(n²) to O(n), introduces RadixAttention for cross‑request caching, and presents code examples and benchmark results showing up to 10× speedup in real‑world RAG scenarios.

KV cacheLLM inferencePerformance Optimization
0 likes · 13 min read
Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference
AI Engineering
AI Engineering
Mar 25, 2026 · Artificial Intelligence

Is “Harness Engineering” Just Rebranded Engineering Common Sense?

The article examines the hype around “harness engineering” in LLM workflows, showing through SGLang’s multi‑agent experience that the approach merely repackages established software‑engineering principles such as separation of concerns, docs‑as‑code, and structured routing, and discusses its limits and future implications.

Harness EngineeringLLMMulti-Agent
0 likes · 8 min read
Is “Harness Engineering” Just Rebranded Engineering Common Sense?
DeepHub IMBA
DeepHub IMBA
Mar 3, 2026 · Artificial Intelligence

The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture

The article traces five eras of KV cache management for LLM inference—from its absence before Transformers to the emerging unified hybrid memory architecture—comparing vLLM, SGLang, and TensorRT‑LLM and offering a decision framework for selecting the right solution in various deployment scenarios.

KV cacheLLM inferenceMemory Management
0 likes · 16 min read
The Evolution of KV Cache Management: From Continuous Allocation to Unified Hybrid Memory Architecture
Old Zhang's AI Learning
Old Zhang's AI Learning
Jan 23, 2026 · Artificial Intelligence

Open‑Source GLM‑ASR‑Nano‑2512: Chinese Dialect‑Optimized Speech Recognition on Consumer‑Grade GPUs

GLM‑ASR‑Nano‑2512, a 1.5 B‑parameter open‑source speech‑recognition model released in December 2025, delivers state‑of‑the‑art accuracy on Chinese dialects and low‑volume audio, outperforms Whisper V3 on benchmark tests, runs on consumer GPUs, and provides detailed installation and deployment guides for transformers, vLLM and SGLang.

Chinese dialectsGLM-ASR-Nano-2512SGLang
0 likes · 11 min read
Open‑Source GLM‑ASR‑Nano‑2512: Chinese Dialect‑Optimized Speech Recognition on Consumer‑Grade GPUs
MaGe Linux Operations
MaGe Linux Operations
Jan 6, 2026 · Artificial Intelligence

How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s

This guide details how switching from vLLM to SGLang on eight NVIDIA H800 GPUs increased Llama‑3‑70B‑Instruct throughput from 180 to 420 tokens per second, covering SGLang’s core innovations, environment setup, configuration tweaks, performance benchmarks, troubleshooting tips, and production‑grade deployment scripts.

FlashInferGPU OptimizationH800
0 likes · 19 min read
How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s
Alibaba Cloud Developer
Alibaba Cloud Developer
Dec 23, 2025 · Artificial Intelligence

How Hybrid Transformer‑Mamba Architectures Overcome KVCache Challenges in Large‑Model Inference

This article explains how SGLang’s hybrid model design combines Transformer attention with Mamba state‑space layers, introduces a dual‑pool memory architecture and elastic allocation, and presents specialized prefix‑cache and speculative‑decoding techniques that together enable efficient, scalable inference for long‑context large language models.

Inference OptimizationKVCacheSGLang
0 likes · 22 min read
How Hybrid Transformer‑Mamba Architectures Overcome KVCache Challenges in Large‑Model Inference
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Dec 17, 2025 · Artificial Intelligence

How AFD Splits Attention and FFN to Boost DeepSeek‑V3 Inference by Up to 19%

The article details the Attention‑FFN Disaggregation (AFD) technique used by Baidu Baige to separate self‑attention and feed‑forward network stages in DeepSeek‑V3 models, describing multi‑stage scheduling, three‑batch overlap, communication optimizations, and performance results that achieve up to 19% throughput improvement under a 100 ms SLO.

3BOAFDAttention-FFN Disaggregation
0 likes · 17 min read
How AFD Splits Attention and FFN to Boost DeepSeek‑V3 Inference by Up to 19%
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Nov 19, 2025 · Artificial Intelligence

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap, dynamically splitting sequences into balanced token chunks, enabling near‑equal compute and communication times, improving GPU utilization and achieving up to 30% throughput gains in heterogeneous request workloads, with zero accuracy loss.

Batch schedulingGPU utilizationLLM inference
0 likes · 9 min read
Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap
Meituan Technology Team
Meituan Technology Team
Sep 11, 2025 · Artificial Intelligence

How LongCat-Flash Achieves Ultra-Fast, Low-Cost AI Agent Inference with SGLang

LongCat-Flash, an open‑source Mixture‑of‑Experts model released by Meituan, leverages model‑system co‑design, PD‑disaggregation, SBO scheduling and large‑scale expert parallelism within the SGLang framework to deliver dramatically lower latency, higher throughput and cost‑effective inference for AI agents, with detailed deployment instructions provided.

LongCat-FlashLow latencyMixture of Experts
0 likes · 15 min read
How LongCat-Flash Achieves Ultra-Fast, Low-Cost AI Agent Inference with SGLang
Volcano Engine Developer Services
Volcano Engine Developer Services
Jul 17, 2025 · Artificial Intelligence

How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance

This article examines how Volcano Engine's Elastic Instant Cache (EIC) tackles the memory bottleneck, high‑concurrency latency, and cross‑node coordination challenges of large language model inference by decoupling storage and computation, pooling resources, and applying layered optimizations, ultimately boosting AI inference efficiency, scalability, and cost‑effectiveness across various deployment scenarios.

AI InfrastructureKVCacheLLM inference
0 likes · 30 min read
How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance
Baobao Algorithm Notes
Baobao Algorithm Notes
Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory ManagementMegatronModel Parallelism
0 likes · 16 min read
How to Train a 671B‑Scale Model with RL: Insights from a verl Internship
Architect's Alchemy Furnace
Architect's Alchemy Furnace
May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

InferenceLLMMLX
0 likes · 17 min read
Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama
Liangxu Linux
Liangxu Linux
Apr 28, 2025 · Artificial Intelligence

Deploy DeepSeek‑R1 on Your Server in 15 Minutes with Zero Code

This guide shows how to use the lightweight OpenStation platform to install, configure, and launch the DeepSeek‑R1 large‑model on a personal server in under 15 minutes, covering zero‑code deployment, resource management, inference engine selection, and integration with CherryStudio.

AI Model DeploymentCherryStudioDeepSeek-R1
0 likes · 7 min read
Deploy DeepSeek‑R1 on Your Server in 15 Minutes with Zero Code
Zhihu Tech Column
Zhihu Tech Column
Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.

GPU parallelismSGLangTensor Parallelism
0 likes · 11 min read
Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations
Meituan Technology Team
Meituan Technology Team
Mar 6, 2025 · Artificial Intelligence

INT8 Quantization and Inference Optimization of DeepSeek R1 Model

Meituan’s search and recommendation team converted the FP8‑only DeepSeek‑R1 model to INT8 by first casting weights to BF16 and then applying block‑wise or channel‑wise quantization, which preserves GSM8K and MMLU accuracy while delivering 33% to 50% higher throughput on A100‑80G GPUs, and they released the SGLang‑based inference scripts and quantized weights publicly, enabling deployment on older NVIDIA hardware without accuracy loss.

DeepSeek-R1GPU deploymentINT8 Quantization
0 likes · 11 min read
INT8 Quantization and Inference Optimization of DeepSeek R1 Model
Architects' Tech Alliance
Architects' Tech Alliance
Feb 27, 2025 · Artificial Intelligence

How Inspur Metabrain R1 Server Enables 1000+ Concurrent Users for DeepSeek 671B via SGLang Optimization

The Inspur Metabrain R1 inference server, equipped with FP8 acceleration and a 1128 GB HBM3e memory pool, has been tightly integrated with SGLang 0.4.3 to run the 671‑billion‑parameter DeepSeek R1 model, delivering over 1,000 concurrent user sessions and up to 3,976 tokens/s throughput.

AI serverDeepSeekInference Optimization
0 likes · 5 min read
How Inspur Metabrain R1 Server Enables 1000+ Concurrent Users for DeepSeek 671B via SGLang Optimization