Tagged articles

vLLM

148 articles · Page 2 of 2
MaGe Linux Operations
Dec 19, 2025 · Artificial Intelligence

Boost vLLM Inference Throughput by 40% with Three Simple Config Tweaks

Observing that only a few vLLM settings truly affect performance, this guide details how tuning gpu_memory_utilization and max_num_batched_tokens and enabling chunked prefill can raise Qwen2.5‑72B‑Instruct throughput from ~1800 to over 2500 tokens/s and improve latency. It also provides comprehensive deployment, monitoring, and troubleshooting instructions.

Docker · GPU · Inference Optimization
30 min read
Baidu Geek Talk
Dec 17, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, jointly released by Baidu Baige and Kunlun Chip, provides a high‑performance, zero‑intrusion solution for deploying open‑source large language models on domestic Kunlun XPU hardware, includes fused operators, precision‑validation and profiling tools, and supports over twenty mainstream and multimodal models.

Kunlun XPU · Model Deployment · Open-source
7 min read
Baidu Intelligent Cloud Tech Hub
Dec 10, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, built on the vLLM hardware‑plugin RFC, lets developers deploy any major large language model on Baidu's Kunlun XPU instantly without modifying vLLM core code, dramatically shortening migration time, providing high‑performance fusion operators, and offering open‑source tools for precision verification and profiling.

Kunlun · LLM · Open-source
8 min read
Data Party THU
Nov 2, 2025 · Operations

How to Maximize vLLM Throughput: Batch Size, Quantization, and Monitoring Tips

This guide explains how to unleash vLLM’s full potential by optimizing batch size, leveraging 4‑bit quantization, tuning concurrency parameters, planning capacity with tokens‑per‑second metrics, and implementing robust monitoring to balance latency, cost, and scalability in production deployments.

Batching · LLM serving · Monitoring
10 min read
Amazon Cloud Developers
Oct 16, 2025 · Artificial Intelligence

Is the Bull Market Still Alive? Stock Analysis with OpenAI and AgentCore

This article walks through deploying OpenAI's open‑source GPT‑OSS models on Amazon SageMaker, building a multi‑agent stock‑analysis workflow with LangGraph, and orchestrating the agents via Amazon Bedrock AgentCore, providing end‑to‑end code, configuration steps, and cleanup procedures.

AgentCore · Amazon SageMaker · LLM
17 min read
Efficient Ops
Oct 14, 2025 · Artificial Intelligence

Unlock High‑Throughput LLM Inference with vLLM: Install, Run, and Optimize

This guide explains what vLLM is, how its PagedAttention architecture boosts LLM throughput, provides step‑by‑step installation commands, showcases core examples for text generation, chat, embedding and classification, and details advanced performance features such as quantization, LoRA support, and distributed parallelism.

GPU Acceleration · LLM Inference · Python
8 min read
Eric Tech Circle
Sep 10, 2025 · Artificial Intelligence

Deploy High‑Performance Local LLMs with vLLM: A Step‑by‑Step Guide

This article walks through installing and configuring vLLM for local large language model inference, compares it with Ollama and LM Studio, details environment setup, model download, and testing scripts, and shows how to expose an OpenAI‑compatible API for production use.

Inference Optimization · ModelScope · OpenAI API
11 min read
Volcano Engine Developer Services
Jul 17, 2025 · Artificial Intelligence

How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance

This article examines how Volcano Engine's Elastic Instant Cache (EIC) tackles the memory bottleneck, high‑concurrency latency, and cross‑node coordination challenges of large language model inference by decoupling storage and computation, pooling resources, and applying layered optimizations, ultimately boosting AI inference efficiency, scalability, and cost‑effectiveness across various deployment scenarios.

AI Infrastructure · KVCache · LLM Inference
30 min read
Instant Consumer Technology Team
Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI performance · Data Parallel · GPU inference
11 min read
Ops Development Stories
Jun 12, 2025 · Cloud Native

One-Click GPU-Enabled Kind Cluster Setup for Running Large AI Models

This tutorial walks you through using a one‑click script to create a GPU‑enabled Kind Kubernetes cluster, evenly distribute GPU resources across nodes with nvkind, install necessary drivers and toolkits, deploy a vLLM‑served large language model, and verify its operation, all in a local or cloud environment.

AI model deployment · Docker · GPU
23 min read
Baobao Algorithm Notes
Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Megatron · RL Training
16 min read
Architect's Alchemy Furnace
May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

LLM · MLX · SGLang
17 min read
AIWalker
May 6, 2025 · Artificial Intelligence

SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL

SimpleAR demonstrates that a vanilla autoregressive model with only 0.5B parameters can generate high‑fidelity 1024×1024 images. The article covers pretraining, supervised fine‑tuning, and reinforcement learning, and reports competitive GenEval (0.59) and DPG‑Bench (79.66) scores, with inference time reduced to about 14 seconds using vLLM and KV‑cache optimizations.

Benchmark · Supervised Fine‑Tuning · autoregressive
14 min read
Liangxu Linux
Apr 28, 2025 · Artificial Intelligence

Deploy DeepSeek‑R1 on Your Server in 15 Minutes with Zero Code

This guide shows how to use the lightweight OpenStation platform to install, configure, and launch the DeepSeek‑R1 large model on a personal server in under 15 minutes, covering zero‑code deployment, resource management, inference engine selection, and integration with CherryStudio.

AI model deployment · CherryStudio · DeepSeek-R1
7 min read
Alibaba Cloud Infrastructure
Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.

ACK Gateway · Distributed Inference · Kubernetes
19 min read
Alibaba Cloud Developer
Apr 7, 2025 · Artificial Intelligence

Why Does GPU Memory Keep Growing in DeepSeek‑R1 Inference? Uncovering PyTorch’s Cache

After the full‑precision DeepSeek‑R1 model was deployed on a 2×8‑GPU ACS cluster, repeated stress tests showed GPU memory usage rising continuously without being released. This article details the investigation, reproduces the behavior, examines vLLM logs and Prometheus metrics, and identifies PyTorch’s caching allocator as the root cause, offering mitigation tips.

DeepSeek · GPU memory · Memory Cache
21 min read
Infra Learning Club
Apr 4, 2025 · Artificial Intelligence

Testing Augment Code: A Powerful New Rival to Cursor

The article evaluates Augment Code, an AI‑powered coding assistant with a 200K‑token context, persistent memory, multimodal input, and top SWE‑bench scores, walks through its installation, explores its use on vLLM and PagedAttention, demonstrates adding a new model and auto‑generating a WeChat mini‑program, and compares its capabilities and speed to Cursor.

AI coding assistant · Augment Code · Cursor
8 min read
Alibaba Cloud Observability
Mar 24, 2025 · Artificial Intelligence

Achieving Full Observability for AI Inference Apps with Prometheus

This article explores the observability challenges of AI inference services, outlines a comprehensive Prometheus‑based metric collection strategy, and demonstrates practical monitoring implementations for Ray Serve, vLLM, GPU resources, and custom metrics to build stable, high‑performance inference pipelines.

AI inference · Observability · Ray Serve
19 min read
ByteDance Cloud Native
Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrix · DeepSeek-R1 · Distributed Inference
14 min read
Alibaba Cloud Developer
Mar 18, 2025 · Artificial Intelligence

How to Build a Full‑Stack Observability Solution for AI Inference with Prometheus

This article explores the monitoring challenges of large‑scale AI inference services, outlines the key observability requirements, and provides a complete Prometheus‑based metric collection framework—including Ray Serve and vLLM integrations—to help developers build stable, high‑performance inference applications.

AI inference · Ray Serve · prometheus
21 min read
Alibaba Cloud Infrastructure
Mar 17, 2025 · Cloud Native

Boost LLM Inference with ACK Gateway AI Extension: A Step‑by‑Step Guide

This guide demonstrates how to deploy the QwQ‑32B large language model on an Alibaba Cloud ACK cluster, configure OSS storage, enable the ACK Gateway with AI Extension, set up InferencePool and InferenceModel resources, and benchmark intelligent routing versus standard gateway routing, revealing latency and throughput improvements.

ACK Gateway · AI Extension · Kubernetes
16 min read
Zhihu Tech Column
Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.

GPU parallelism · Large Language Models · SGLang
11 min read
Alibaba Cloud Infrastructure
Mar 9, 2025 · Cloud Computing

Deploy QwQ-32B LLM Inference on Alibaba Cloud ACS with vLLM: Step‑by‑Step Guide

This guide walks you through using Alibaba Cloud Container Compute Service (ACS) to provision GPU resources, prepare the QwQ-32B model, configure persistent storage, deploy the model with vLLM, set up OpenWebUI, verify the service, and optionally benchmark its performance, all with detailed commands and YAML examples.

ACS · Alibaba Cloud · Benchmark
17 min read
AIWalker
Feb 27, 2025 · Artificial Intelligence

Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial

This article provides a comprehensive, hands‑on guide for installing and configuring DeepSeek‑R1 with Ollama and vLLM, setting up multi‑node multi‑GPU environments, running performance benchmarks, optimizing runtime parameters, and even generating a full PyTorch distributed‑training script.

DeepSeek-R1 · GPU Optimization · LLM deployment
39 min read
Alibaba Cloud Big Data AI Platform
Feb 25, 2025 · Artificial Intelligence

Accelerate DeepSeek‑V2‑Lite Deployment with FlashMLA: A Step‑by‑Step Guide

This tutorial walks users through installing FlashMLA, integrating it with the vLLM framework, downloading the DeepSeek‑V2‑Lite‑Chat model, benchmarking various MLA implementations, and running a local inference demo that shows FlashMLA’s speed advantage on long‑sequence generation.

DeepSeek · FlashMLA · InferenceOptimization
16 min read
Alibaba Cloud Native
Feb 18, 2025 · Cloud Native

Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes

This guide shows how to overcome on‑premise compute limits by registering a local Kubernetes cluster to Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek‑R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.

ACK One · ACS GPU · DeepSeek
15 min read
Alibaba Cloud Native
Feb 13, 2025 · Artificial Intelligence

Tackling the ‘Impossible Triangle’: Scaling vLLM on Alibaba Cloud GPU Reservations

This article examines the performance, cost, and stability challenges of large‑scale vLLM deployments, explains the “impossible triangle” dilemma, and provides a detailed, cloud‑native solution using Alibaba Cloud Function Compute GPU reserved instances with step‑by‑step deployment instructions and code examples.

Alibaba Cloud · GPU Reserved Instances · deployment guide
14 min read
Alibaba Cloud Infrastructure
Feb 13, 2025 · Cloud Computing

Deploy DeepSeek‑R1 LLM on Alibaba Cloud ACK One with ACS GPU in Minutes

This guide walks you through deploying the DeepSeek‑R1 large‑language‑model inference service on Alibaba Cloud ACK One registered clusters using ACS GPU compute, covering model preparation, OSS storage setup, PersistentVolume configuration, arena‑based service deployment, and verification steps with concrete commands and parameters.

ACK One · ACS GPU · DeepSeek
14 min read
Alibaba Cloud Infrastructure
Feb 13, 2025 · Artificial Intelligence

Deploying DeepSeek‑R1 671B Distributed Inference Service on Alibaba Cloud ACK with vLLM and Dify

This article explains how to quickly deploy the full‑parameter DeepSeek‑R1 671B model in a multi‑node GPU‑enabled Kubernetes cluster on Alibaba Cloud ACK, covering prerequisites, model parallelism, vLLM‑Ray distributed deployment, service verification, and integration with Dify to build a private AI Q&A assistant.

DeepSeek · Dify · Kubernetes
12 min read
Baidu Geek Talk
Feb 12, 2025 · Artificial Intelligence

Deploy DeepSeek, Llama, Qwen Models Fast on Baidu Baige AI Heterogeneous Platform

This guide walks you through creating a lightweight compute instance, adding it to the Baidu Baige AI heterogeneous computing platform, deploying the vLLM tool, and loading and serving small‑scale dense models such as DeepSeek, Llama and Qwen. It also provides recommended configuration lists to achieve low‑cost, high‑performance inference.

AI model deployment · Baidu Baige · DeepSeek
3 min read
Software Engineering 3.0 Era
Feb 4, 2025 · Cloud Computing

Comprehensive DeepSeek Deployment: Local, Cloud, Enterprise, Open‑Source Tools & Use Cases

With DeepSeek's official service frequently overloaded, this guide details how to run DeepSeek locally with Ollama, deploy it on major cloud platforms such as Huawei, Alibaba, Tencent, Baidu and ZStack, integrate it into enterprise private clusters, and leverage open‑source tools like HuggingFace, vLLM and Dify. It also showcases real‑world applications in finance, education, and cross‑domain testing.

DeepSeek · Enterprise AI · LLM deployment
10 min read
Baidu Geek Talk
Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines convert prompts into responses via a Prefill stage and an autoregressive Decoder, with performance measured by time to first token (TTFT) and time per output token (TPOT). Baidu’s AIAK suite improves TPOT by separating tokenization, using static slot scheduling, and executing asynchronously, cutting token‑interval latency from ~35 ms to ~14 ms and raising GPU utilization to about 75%, while also leveraging quantization and speculative execution for higher throughput.

AI acceleration · GPU Utilization · TPOT
10 min read
Baidu Intelligent Cloud Tech Hub
Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.

AIAK · GPU Utilization · LLM Performance
10 min read
DataFunSummit
Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with PagedAttention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPU · Performance · Virtual Memory
25 min read
Infra Learning Club
Nov 1, 2024 · Artificial Intelligence

Configuring vLLM swap_space and cpu_offload_gb for Stable Large-Model Inference

The article explains vLLM’s GPU compute capability requirement, describes the swap_space and cpu_offload_gb parameters, outlines their ideal usage scenarios, and provides step‑by‑step code examples that demonstrate how adjusting these settings enables loading and running a 7B‑parameter model on a 16 GB T4 GPU.

GPU Memory Management · cpu_offload_gb · large language model inference
9 min read
21CTO
Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit/4‑bit quantization techniques such as bitsandbytes and NF4 to achieve faster inference on limited‑GPU hardware.

LLM deployment · Large Language Models · Python
13 min read
Baobao Algorithm Notes
Apr 5, 2024 · Artificial Intelligence

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.

GPU memory · LLM Inference · PagedAttention
25 min read