Tagged articles

vLLM

148 articles · Page 2 of 2

Dec 19, 2025 · Artificial Intelligence

Boost vLLM Inference Throughput by 40% with Three Simple Config Tweaks

Only a few vLLM settings truly affect performance. This guide details how adjusting gpu_memory_utilization and max_num_batched_tokens and enabling chunked prefill can raise Qwen2.5‑72B‑Instruct throughput from ~1800 to over 2500 tokens/s and improve latency, and it provides comprehensive deployment, monitoring, and troubleshooting instructions.

Docker · GPU · Inference Optimization

0 likes · 30 min read

Baidu Geek Talk

Dec 17, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, jointly released by Baidu Baige and Kunlun Chip, provides a high‑performance, zero‑intrusion solution for deploying open‑source large language models on domestic Kunlun XPU hardware, includes fused operators, precision‑validation and profiling tools, and supports over twenty mainstream and multimodal models.

Kunlun XPU · Model Deployment · Open-source

0 likes · 7 min read

Baidu Intelligent Cloud Tech Hub

Dec 10, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, built on the vLLM hardware‑plugin RFC, lets developers deploy any major large language model on Baidu's Kunlun XPU instantly without modifying vLLM core code, dramatically shortening migration time, providing high‑performance fusion operators, and offering open‑source tools for precision verification and profiling.

Kunlun · LLM · Open-source

0 likes · 8 min read

Data Party THU

Nov 2, 2025 · Operations

How to Maximize vLLM Throughput: Batch Size, Quantization, and Monitoring Tips

This guide explains how to unleash vLLM’s full potential by optimizing batch size, leveraging 4‑bit quantization, tuning concurrency parameters, planning capacity with token‑per‑second metrics, and implementing robust monitoring to balance latency, cost, and scalability in production deployments.

Batching · LLM serving · Monitoring

0 likes · 10 min read

Tencent Technical Engineering

Oct 31, 2025 · Artificial Intelligence

How SpecExit Cuts LLM Reasoning Chains by 66% and Boosts Inference Speed 2.5×

SpecExit combines speculative sampling with a lightweight draft model to predict early‑exit signals, shortening the reasoning chains of large reasoning models by up to two‑thirds and achieving up to 2.5× end‑to‑end inference acceleration on vLLM without sacrificing accuracy.

AI efficiency · Early Stopping · Inference Optimization

0 likes · 12 min read

Alibaba Cloud Native

Oct 17, 2025 · Artificial Intelligence

How We Boosted Embedding Service Throughput 16× with Cloud‑Native Optimizations

This article details the cost and speed challenges of embedding vectors in large‑scale log scenarios, analyzes inference framework choices, describes GPU utilization, priority queuing, and pipeline redesigns, and reports a 16‑fold throughput increase and dramatically lower per‑request costs.

Embedding · GPU Optimization · Throughput

0 likes · 8 min read

Amazon Cloud Developers

Oct 16, 2025 · Artificial Intelligence

Is the Bull Market Still Alive? Stock Analysis with OpenAI and AgentCore

This article walks through deploying OpenAI's open‑source GPT‑OSS models on Amazon SageMaker, building a multi‑agent stock‑analysis workflow with LangGraph, and orchestrating the agents via Amazon Bedrock AgentCore, providing end‑to‑end code, configuration steps, and cleanup procedures.

AgentCore · Amazon SageMaker · LLM

0 likes · 17 min read

Efficient Ops

Oct 14, 2025 · Artificial Intelligence

Unlock High‑Throughput LLM Inference with vLLM: Install, Run, and Optimize

This guide explains what vLLM is, how its PagedAttention architecture boosts LLM throughput, provides step‑by‑step installation commands, showcases core examples for text generation, chat, embedding and classification, and details advanced performance features such as quantization, LoRA support, and distributed parallelism.

GPU Acceleration · LLM Inference · Python

0 likes · 8 min read

Eric Tech Circle

Sep 10, 2025 · Artificial Intelligence

Deploy High‑Performance Local LLMs with vLLM: A Step‑by‑Step Guide

This article walks through installing and configuring vLLM for local large language model inference, compares it with Ollama and LM Studio, details environment setup, model download, testing scripts, and shows how to expose an OpenAI‑compatible API for production use.

Inference Optimization · ModelScope · OpenAI API

0 likes · 11 min read

Volcano Engine Developer Services

Jul 17, 2025 · Artificial Intelligence

How Distributed KVCache (EIC) Revolutionizes Large‑Model Inference Performance

This article examines how Volcano Engine's Elastic Instant Cache (EIC) tackles the memory bottleneck, high‑concurrency latency, and cross‑node coordination challenges of large language model inference by decoupling storage and computation, pooling resources, and applying layered optimizations, ultimately boosting AI inference efficiency, scalability, and cost‑effectiveness across various deployment scenarios.

AI Infrastructure · KVCache · LLM Inference

0 likes · 30 min read

Instant Consumer Technology Team

Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI performance · Data Parallel · GPU inference

0 likes · 11 min read

Alibaba Cloud Native

Jun 28, 2025 · Cloud Native

Deploying vLLM with llmaz and Higress: A Step‑by‑Step Cloud‑Native Guide

This tutorial walks through deploying vLLM inference services on a GPU‑enabled Kubernetes cluster using llmaz, configuring Higress as an AI gateway for traffic control, observability, and fallback model switching, and demonstrates end‑to‑end request testing.

Fallback · Higress · Observability

0 likes · 15 min read

Ops Development Stories

Jun 15, 2025 · Artificial Intelligence

How to Deploy vLLM for Fast LLM Inference on GPU and CPU – A Step‑by‑Step Guide

This article walks through deploying the high‑performance vLLM LLM inference framework, covering GPU and CPU backend installation, environment setup, offline and online serving, API usage, and a performance comparison that highlights the ten‑fold speed advantage of GPU over CPU.

CPU deployment · GPU deployment · LLM Inference

0 likes · 38 min read

Ops Development Stories

Jun 12, 2025 · Cloud Native

One-Click GPU-Enabled Kind Cluster Setup for Running Large AI Models

This tutorial walks you through using a one‑click script to create a GPU‑enabled Kind Kubernetes cluster, evenly distribute GPU resources across nodes with nvkind, install the necessary drivers and toolkits, deploy a vLLM‑served large language model, and verify its operation, in either a local or a cloud environment.

AI model deployment · Docker · GPU

0 likes · 23 min read

Baobao Algorithm Notes

Jun 3, 2025 · Artificial Intelligence

How to Train a 671B‑Scale Model with RL: Insights from a verl Internship

This article shares a detailed, first‑hand analysis of the technical challenges, framework choices, memory management, weight conversion, precision alignment, and efficiency optimizations encountered while building reinforcement‑learning pipelines for a 671‑billion‑parameter model using the verl ecosystem.

GPU Memory Management · Megatron · RL Training

0 likes · 16 min read

Baobao Algorithm Notes

May 20, 2025 · Artificial Intelligence

Boosting RLHF Training Efficiency with Asynchronous vLLM and Ray Integration

This article explains how an asynchronous RLHF pipeline built on vLLM, Ray, and OpenRLHF dramatically reduces training bottlenecks by decoupling inference, environment interaction, and model updates, and provides detailed implementation code and design choices for scalable reinforcement learning.

OpenRLHF · RLHF · Ray

0 likes · 11 min read

Architect's Alchemy Furnace

May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

LLM · MLX · SGLang

0 likes · 17 min read

AIWalker

May 6, 2025 · Artificial Intelligence

SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL

SimpleAR demonstrates that a vanilla autoregressive model with only 0.5 B parameters can generate high‑fidelity 1024×1024 images. Trained through pretraining, supervised fine‑tuning, and reinforcement learning, it achieves competitive GenEval (0.59) and DPG‑Bench (79.66) scores, while vLLM and KV‑cache optimizations reduce inference time to about 14 seconds.

Benchmark · Supervised Fine‑Tuning · autoregressive

0 likes · 14 min read

Liangxu Linux

Apr 28, 2025 · Artificial Intelligence

Deploy DeepSeek‑R1 on Your Server in 15 Minutes with Zero Code

This guide shows how to use the lightweight OpenStation platform to install, configure, and launch the DeepSeek‑R1 large‑model on a personal server in under 15 minutes, covering zero‑code deployment, resource management, inference engine selection, and integration with CherryStudio.

AI model deployment · CherryStudio · DeepSeek-R1

0 likes · 7 min read

Alibaba Cloud Infrastructure

Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.

ACK Gateway · Distributed Inference · Kubernetes

0 likes · 19 min read

Alibaba Cloud Developer

Apr 7, 2025 · Artificial Intelligence

Why Does GPU Memory Keep Growing in DeepSeek‑R1 Inference? Uncovering PyTorch’s Cache

After deploying the full‑precision DeepSeek‑R1 model on a 2×8‑GPU ACS cluster, repeated stress tests showed GPU memory usage continuously rising without release; this article details the investigation, reproduces the behavior, examines vLLM logs, Prometheus metrics, and reveals PyTorch’s caching allocator as the root cause, offering mitigation tips.

DeepSeek · GPU memory · Memory Cache

0 likes · 21 min read

Infra Learning Club

Apr 4, 2025 · Artificial Intelligence

Testing Augment Code: A Powerful New Rival to Cursor

The article evaluates Augment Code, an AI‑powered coding assistant with a 200K‑token context, persistent memory, multimodal input, and top SWE‑bench scores. It walks through installation, explores its use on vLLM and PagedAttention, demonstrates adding a new model and auto‑generating a WeChat mini‑program, and compares its capabilities and speed to Cursor.

AI coding assistant · Augment Code · Cursor

0 likes · 8 min read

Alibaba Cloud Observability

Mar 24, 2025 · Artificial Intelligence

Achieving Full Observability for AI Inference Apps with Prometheus

This article explores the observability challenges of AI inference services, outlines a comprehensive Prometheus‑based metric collection strategy, and demonstrates practical monitoring implementations for Ray Serve, vLLM, GPU resources, and custom metrics to build stable, high‑performance inference pipelines.

AI inference · Observability · Ray Serve

0 likes · 19 min read

ByteDance Cloud Native

Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrix · DeepSeek-R1 · Distributed Inference

0 likes · 14 min read

Alibaba Cloud Developer

Mar 18, 2025 · Artificial Intelligence

How to Build a Full‑Stack Observability Solution for AI Inference with Prometheus

This article explores the monitoring challenges of large‑scale AI inference services, outlines the key observability requirements, and provides a complete Prometheus‑based metric collection framework—including Ray Serve and vLLM integrations—to help developers build stable, high‑performance inference applications.

AI inference · Ray Serve · Prometheus

0 likes · 21 min read

Alibaba Cloud Infrastructure

Mar 17, 2025 · Cloud Native

Boost LLM Inference with ACK Gateway AI Extension: A Step‑by‑Step Guide

This guide demonstrates how to deploy the QwQ‑32B large language model on an Alibaba Cloud ACK cluster, configure OSS storage, enable the ACK Gateway with AI Extension, set up InferencePool and InferenceModel resources, and benchmark intelligent routing versus standard gateway routing, revealing latency and throughput improvements.

ACK Gateway · AI Extension · Kubernetes

0 likes · 16 min read

Zhihu Tech Column

Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.

GPU parallelism · Large Language Models · SGLang

0 likes · 11 min read

Alibaba Cloud Infrastructure

Mar 9, 2025 · Cloud Computing

Deploy QwQ-32B LLM Inference on Alibaba Cloud ACS with vLLM: Step‑by‑Step Guide

This guide walks you through using Alibaba Cloud Container Compute Service (ACS) to provision GPU resources, prepare the QwQ-32B model, configure persistent storage, deploy the model with vLLM, set up OpenWebUI, verify the service, and optionally benchmark its performance, all with detailed commands and YAML examples.

ACS · Alibaba Cloud · Benchmark

0 likes · 17 min read

Alibaba Cloud Infrastructure

Mar 8, 2025 · Artificial Intelligence

Deploying QwQ-32B LLM with vLLM on Alibaba Cloud ACK and Configuring Intelligent Routing

This guide explains how to deploy the QwQ-32B large language model using vLLM on an Alibaba Cloud ACK Kubernetes cluster, configure storage, set up OpenWebUI, enable ACK Gateway with AI Extension for intelligent routing, and benchmark the inference service performance.

ACK · Benchmark · Kubernetes

0 likes · 17 min read

ByteDance Cloud Native

Mar 7, 2025 · Artificial Intelligence

How to Deploy the QwQ-32B Large Language Model on Volcengine Cloud in Minutes

This guide walks you through the end‑to‑end process of deploying the open‑source QwQ‑32B inference model on Volcengine's cloud platform, covering GPU ECS selection, VKE cluster creation, continuous delivery CP setup, vLLM service launch, and API gateway exposure.

GPU ECS · QwQ-32B · VKE

0 likes · 8 min read

AIWalker

Feb 27, 2025 · Artificial Intelligence

Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial

This article provides a comprehensive, hands‑on guide for installing and configuring DeepSeek‑R1 with Ollama and vLLM, setting up multi‑node multi‑GPU environments, running performance benchmarks, optimizing runtime parameters, and even generating a full PyTorch distributed‑training script.

DeepSeek-R1 · GPU Optimization · LLM deployment

0 likes · 39 min read

Alibaba Cloud Big Data AI Platform

Feb 25, 2025 · Artificial Intelligence

Accelerate DeepSeek‑V2‑Lite Deployment with FlashMLA: A Step‑by‑Step Guide

This tutorial walks users through installing FlashMLA, integrating it with the vLLM framework, downloading the DeepSeek‑V2‑Lite‑Chat model, benchmarking various MLA implementations, and running a local inference demo that shows FlashMLA’s speed advantage on long‑sequence generation.

DeepSeek · FlashMLA · InferenceOptimization

0 likes · 16 min read

Alibaba Cloud Native

Feb 18, 2025 · Cloud Native

Deploy DeepSeek‑R1 on Alibaba Cloud ACK One Using ACS GPU in Minutes

This guide shows how to overcome on‑premise compute limits by registering a local Kubernetes cluster to Alibaba Cloud ACK One, provisioning ACS GPU resources, and deploying the DeepSeek‑R1 inference model with the vLLM framework through a series of concrete commands and YAML configurations.

ACK One · ACS GPU · DeepSeek

0 likes · 15 min read

Alibaba Cloud Native

Feb 13, 2025 · Artificial Intelligence

Tackling the ‘Impossible Triangle’: Scaling vLLM on Alibaba Cloud GPU Reservations

This article examines the performance, cost, and stability challenges of large‑scale vLLM deployments, explains the “impossible triangle” dilemma, and provides a detailed, cloud‑native solution using Alibaba Cloud Function Compute GPU reserved instances with step‑by‑step deployment instructions and code examples.

Alibaba Cloud · GPU Reserved Instances · deployment guide

0 likes · 14 min read

Alibaba Cloud Infrastructure

Feb 13, 2025 · Cloud Computing

Deploy DeepSeek‑R1 LLM on Alibaba Cloud ACK One with ACS GPU in Minutes

This guide walks you through deploying the DeepSeek‑R1 large‑language‑model inference service on Alibaba Cloud ACK One registered clusters using ACS GPU compute, covering model preparation, OSS storage setup, PersistentVolume configuration, arena‑based service deployment, and verification steps with concrete commands and parameters.

ACK One · ACS GPU · DeepSeek

0 likes · 14 min read

Alibaba Cloud Infrastructure

Feb 13, 2025 · Artificial Intelligence

Deploying DeepSeek‑R1 671B Distributed Inference Service on Alibaba Cloud ACK with vLLM and Dify

This article explains how to quickly deploy the full‑parameter DeepSeek‑R1 671B model in a multi‑node GPU‑enabled Kubernetes cluster on Alibaba Cloud ACK, covering prerequisites, model parallelism, vLLM‑Ray distributed deployment, service verification, and integration with Dify to build a private AI Q&A assistant.

DeepSeek · Dify · Kubernetes

0 likes · 12 min read

Alibaba Cloud Infrastructure

Feb 12, 2025 · Artificial Intelligence

Deploying DeepSeek‑R1 Distilled Qwen‑32B‑FP8 Model on Alibaba Cloud GPU Instances with Docker and OpenWebUI

This guide explains how to prepare an Alibaba Cloud GPU instance, install Docker and NVIDIA tools, pull or build a container image, and run the FP8‑quantized DeepSeek‑R1‑Distill‑Qwen‑32B model using vLLM and OpenWebUI for both offline and online inference.

DeepSeek · FP8 quantization · GPU

0 likes · 18 min read

Baidu Geek Talk

Feb 12, 2025 · Artificial Intelligence

Deploy DeepSeek, Llama, Qwen Models Fast on Baidu Baige AI Heterogeneous Platform

This guide walks you through creating a lightweight compute instance, adding it to Baidu Baige AI heterogeneous computing platform, deploying the vLLM tool, loading and serving small‑scale dense models such as DeepSeek, Llama and Qwen, and provides recommended configuration lists to achieve low‑cost, high‑performance inference.

AI model deployment · Baidu Baige · DeepSeek

0 likes · 3 min read

Alibaba Cloud Developer

Feb 5, 2025 · Artificial Intelligence

Deploy DeepSeek Models on Alibaba Cloud PAI with One-Click: A Step-by-Step Guide

This tutorial shows how to log into Alibaba Cloud PAI, navigate to the Model Gallery, select a DeepSeek model such as the distilled DeepSeek‑R1‑Distill‑Qwen‑7B, and deploy it with a single click using vLLM or BladeLLM, providing endpoint and token details for immediate use.

AI · Alibaba Cloud · BladeLLM

0 likes · 3 min read

Software Engineering 3.0 Era

Feb 4, 2025 · Cloud Computing

Comprehensive DeepSeek Deployment: Local, Cloud, Enterprise, Open‑Source Tools & Use Cases

Facing frequent overloads on DeepSeek's official service, this guide details how to run DeepSeek locally with Ollama, deploy it on major cloud platforms such as Huawei, Alibaba, Tencent, Baidu and ZStack, integrate it into enterprise private clusters, leverage open‑source tools like HuggingFace, vLLM and Dify, and showcases real‑world applications in finance, education, and cross‑domain testing.

DeepSeek · Enterprise AI · LLM deployment

0 likes · 10 min read

Baidu Geek Talk

Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines convert prompts into responses via a Prefill stage and an autoregressive Decode stage, with performance measured by TTFT and TPOT. Baidu’s AIAK suite improves TPOT by separating tokenization, using static slot scheduling, and executing asynchronously, cutting token‑interval latency from ~35 ms to ~14 ms and raising GPU utilization to about 75 %, while also leveraging quantization and speculative execution for higher throughput.

AI acceleration · GPU Utilization · TPOT

0 likes · 10 min read

Baobao Algorithm Notes

Jan 9, 2025 · Artificial Intelligence

How to Efficiently Deploy and Manage 100 LoRA‑Enhanced LLMs with vLLM

A technical walkthrough shows how to use vLLM to load multiple LoRA adapters for role‑playing LLMs, analyzes the massive GPU and labor costs of naïve deployment, and presents a hosted multi‑LoRA platform as a cost‑effective solution.

AI inference · LLM · LoRA

0 likes · 11 min read

Baidu Intelligent Cloud Tech Hub

Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.

AIAK · GPU Utilization · LLM Performance

0 likes · 10 min read

DataFunSummit

Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with PagedAttention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPU · Performance · Virtual Memory

0 likes · 25 min read

Infra Learning Club

Nov 1, 2024 · Artificial Intelligence

Configuring vLLM swap_space and cpu_offload_gb for Stable Large-Model Inference

The article explains vLLM’s GPU compute capability requirement, describes the swap_space and cpu_offload_gb parameters, outlines their ideal usage scenarios, and provides step‑by‑step code examples that demonstrate how adjusting these settings enables loading and running a 7B‑parameter model on a 16 GB T4 GPU.

GPU Memory Management · cpu_offload_gb · large language model inference

0 likes · 9 min read

DeWu Technology

Aug 19, 2024 · Artificial Intelligence

Multi‑LoRA Deployment for Large Language Models: Concepts, Fine‑tuning, and Cost‑Effective Strategies

The article introduces a multi‑LoRA strategy that lets many scenario‑specific adapters share a single base LLM, dramatically cutting GPU usage and cost while preserving performance, and explains how to fine‑tune with LoRA, merge adapters, and serve them efficiently using vLLM.

LoRA · Model Deployment · fine-tuning

0 likes · 10 min read

21CTO

Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit/4‑bit quantization techniques such as bitsandbytes and NF4 to achieve faster inference on limited‑GPU hardware.

LLM deployment · Large Language Models · Python

0 likes · 13 min read

Baobao Algorithm Notes

Apr 5, 2024 · Artificial Intelligence

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.

GPU memory · LLM Inference · PagedAttention

0 likes · 25 min read
