Tagged articles

LLM serving

7 articles

Jul 4, 2026 · Backend Development

How SGLang Encoded Engineering Experience into Agents and Achieved Up to 2.75× Kernel Speedups

The SGLang team turned its benchmarking, profiling, CUDA kernel tuning, and production‑issue triage know‑how into reusable agent skills. Three merged KDA‑Pilot PRs delivered up to 2.75× kernel speedups, a 71.4% throughput gain for Qwen3‑Next, and a TTFT reduction from 456 ms to 168 ms. The article also outlines a repeatable workflow and practical rules for large‑scale performance engineering.

CUDA optimization · LLM serving · SGLang

0 likes · 16 min read

Raymond Ops

Apr 27, 2026 · Artificial Intelligence

vLLM Production Pitfalls: The Ultimate Fix for PagedAttention Memory Fragmentation and OOM

This article analyzes why vLLM's PagedAttention can cause GPU memory fragmentation and out‑of‑memory errors in production, presents four typical OOM scenarios, and provides concrete diagnostics, configuration tweaks, code examples, and monitoring strategies to eliminate the problem.

CUDA · GPU memory · LLM serving

0 likes · 22 min read

Data Party THU

Nov 2, 2025 · Operations

How to Maximize vLLM Throughput: Batch Size, Quantization, and Monitoring Tips

This guide explains how to unleash vLLM’s full potential by optimizing batch size, leveraging 4‑bit quantization, tuning concurrency parameters, planning capacity with token‑per‑second metrics, and implementing robust monitoring to balance latency, cost, and scalability in production deployments.

Batching · LLM serving · Performance Tuning

0 likes · 10 min read

ByteDance Cloud Native

Feb 21, 2025 · Artificial Intelligence

Deploy DeepSeek‑R1‑Distill on Volcengine CPU Cloud for Low‑Cost AI Inference

This guide walks you through deploying the DeepSeek‑R1‑Distill model on Volcengine CPU ECS instances, covering use‑case scenarios, recommended server types, Docker setup, environment configuration, and verification steps to achieve cost‑effective, high‑compatibility AI inference.

AI model deployment · CPU inference · DeepSeek

0 likes · 6 min read

NewBeeNLP

Jan 14, 2025 · R&D Management

How to Kickstart Your CS Research Journey and Find LLM Serving Ideas

The author shares a candid half‑year reflection on entering computer‑science research, outlining practical steps for discovering research ideas, navigating papers, focusing on LLM serving systems, and emphasizing collaboration to help newcomers succeed in academia.

LLM serving · System Design · academic journey

0 likes · 9 min read

Alibaba Cloud Big Data AI Platform

Sep 17, 2024 · Artificial Intelligence

Boosting LLM Inference: How NanoFlow Doubles Throughput

The article introduces NanoFlow, a novel serving framework that leverages intra‑device parallelism, operation‑based pipelining, and async scheduling to improve large language model serving throughput by up to 1.91×, and describes its integration with Alibaba Cloud PAI.

Alibaba Cloud PAI · GPU scheduling · LLM serving

0 likes · 7 min read

Alibaba Cloud Big Data AI Platform

Jul 11, 2024 · Artificial Intelligence

How Llumnix Cuts LLM Serving Latency by 10× with Dynamic Scheduling

Alibaba Cloud's PAI team unveiled Llumnix, a dynamic scheduling framework for large language model serving that dramatically reduces tail latency, speeds high‑priority requests, and cuts costs, earning acceptance at OSDI 2024 and now open‑sourced on GitHub.

AI Systems · Dynamic Scheduling · LLM serving

0 likes · 5 min read
