MaGe Linux Operations
Dec 27, 2025 · Artificial Intelligence

How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

This guide walks you through deploying large language models such as ChatGLM and Llama in production, covering environment setup, model quantization, dynamic batching, service configuration, Nginx load balancing, monitoring, troubleshooting, and best‑practice recommendations for high‑performance, cost‑effective AI inference.

GPU · LLM · Quantization
0 likes · 48 min read
Old Zhang's AI Learning
Jun 7, 2026 · Artificial Intelligence

Hands‑On LLM Local Deployment: vLLM Inference Optimizations Explained

The article explains why LLM inference is memory‑bound, introduces vLLM’s three core optimizations—Continuous Batching, PagedAttention, and Prefix Caching—shows how to launch a vLLM server, run Python code to benchmark performance, and examines KV‑Cache memory usage with concrete numbers.

Continuous Batching · KV cache · LLM inference
0 likes · 11 min read
Baidu Intelligent Cloud Tech Hub
Jan 12, 2026 · Artificial Intelligence

How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

This article details how Baidu Cloud's hybrid‑cloud team leveraged the vLLM framework to cut the cold‑start time of massive models like Qwen3‑235B‑A22B from minutes to a few seconds through accelerated weight loading, CUDA‑graph capture postponement, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming techniques.

CUDA Graph · cold-start optimization · large-model inference
0 likes · 16 min read
Baidu Intelligent Cloud Tech Hub
Mar 18, 2026 · Artificial Intelligence

How vLLM‑Kunlun Brings CUDA‑Like Inference to Kunlun XPU: Architecture, Adaptation, and Performance Wins

This article details the vLLM‑Kunlun open‑source project that adapts the high‑performance vLLM inference engine to Baidu's Kunlun XPU, covering platform overview, model‑porting workflow, plugin architecture, concrete case studies with MIMO‑Flash‑V2 and Qwen 3.5, and the performance‑tuning techniques that enable seamless, GPU‑level inference on domestic hardware.

AI · Kunlun · hardware
0 likes · 12 min read
Geek Labs
May 7, 2026 · Artificial Intelligence

Running Large Language Models Locally on RTX 3090: Two Open‑Source Solutions

This article introduces two recent GitHub projects—club‑3090, which enables single‑ or dual‑RTX 3090 inference of 27‑billion‑parameter models with detailed performance benchmarks, and library‑skills, a tool that keeps AI agents synchronized with the latest official library APIs—explaining their configurations, usage steps, hardware requirements, and target audiences.

AI Agents · Docker · RTX 3090
0 likes · 7 min read
58 Tech
Jan 6, 2026 · Artificial Intelligence

How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

This article provides a step‑by‑step technical walkthrough of vLLM 0.8.4 on a single GPU, detailing the platform’s startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, while exposing key source‑code snippets and architectural diagrams.

GPU inference · LoRA adapters · Multi-LoRA
0 likes · 35 min read
Ops Development Stories
Sep 19, 2024 · Artificial Intelligence

How to Connect Qwen LLMs with Higress AI Gateway: A Hands‑On Guide

This tutorial walks through setting up a local k3d cluster, installing Higress, and using its AI plugins—including AI Proxy, AI JSON formatter, AI Agent, and AI Statistics—to integrate and observe Alibaba Cloud's Qwen large language models across various use cases such as weather and flight queries.

AI gateway · AI plugins · Higress
0 likes · 30 min read
Old Zhang's AI Learning
May 31, 2026 · Artificial Intelligence

vLLM 0.22 Release: Production-Ready DeepSeek V4 and Extreme KV Cache Compression

The vLLM 0.22 stable release introduces production‑grade DeepSeek V4 support, massive kernel fusions, up to 10‑20× speedups, Batch Invariance with 28.9% latency gain, a Rust front‑end, multi‑level KV cache offload that can double context length, and broad hardware coverage across NVIDIA, AMD, CPU and RISC‑V, making it a pivotal upgrade for inference infrastructure teams.

Batch Invariance · DeepSeek V4 · KV cache
0 likes · 13 min read
Old Meng AI Explorer
Apr 20, 2026 · Artificial Intelligence

Unlock Free High‑Performance LLM APIs with NVIDIA NIM – A Step‑by‑Step Guide

This article explains what NVIDIA NIM is, compares its generous free quota to other LLM providers, lists the supported free models, walks through a five‑minute sign‑up, shows three code examples for calling the API, offers model‑selection advice, and provides a hands‑on case for building a free AI chat interface.

AI models · API integration · Free LLM API
0 likes · 16 min read
Alibaba Cloud Native
Aug 21, 2025 · Cloud Native

How Higress AI Gateway Optimizes LLM Load Balancing with Global, Prefix, and GPU‑Aware Algorithms

This article explains why traditional load‑balancing methods fall short for large language model services and introduces Higress AI Gateway's three specialized algorithms—global minimum‑request, prefix‑matching, and GPU‑aware load balancing—detailing their design, Redis‑based implementation, deployment steps, and performance gains.

GPU · LLM · Redis
0 likes · 11 min read
Baidu Intelligent Cloud Tech Hub
Dec 4, 2025 · Artificial Intelligence

How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report details the analysis of memory bottlenecks in DeepSeek‑V3.2‑Exp, proposes the Expanded Sparse Server (ESS) that offloads latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach, combined with cache‑warmup and overlap techniques, can double decoding throughput for long‑context inference.

Cache offload · GPU‑CPU optimization · LLM inference
0 likes · 21 min read
Huawei Cloud Developer Alliance
Apr 29, 2026 · Artificial Intelligence

Deploy DeepSeek‑V4 on Ascend NPU with Kthena in 3 Minutes (Prefill‑Decode Separation)

This guide walks through deploying the DeepSeek‑V4‑Flash model on Ascend NPU using Kthena’s ModelRoute, detailing the Prefill‑Decode (P/D) separation architecture, KV cache transfer via Mooncake, configuration of ModelServing and ModelRoute resources, and flexible scaling of Prefill and Decode replicas for optimal performance.

Ascend NPU · DeepSeek V4 · KV cache
0 likes · 22 min read
Lao Guo's Learning Space
Apr 19, 2026 · Artificial Intelligence

Which Framework Wins for Running Large Models? vLLM vs llama.cpp vs MLX (2026 Deep Comparison)

The article provides a 2026 deep comparative analysis of three major large‑model inference frameworks—vLLM, llama.cpp, and MLX—detailing their core designs, recent updates, benchmark results on various hardware, deployment complexity, and recommended use cases to help developers choose the right tool.

MLX · benchmark · framework comparison
0 likes · 15 min read
21CTO
Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit/4‑bit quantization techniques such as bitsandbytes and NF4 to achieve faster inference on GPU‑constrained hardware.

LLM deployment · Python · Quantization
0 likes · 13 min read
Alibaba Cloud Native
Jan 26, 2024 · Artificial Intelligence

Deploy a Serverless Stable Diffusion API for Scalable AI Image Generation

This guide explains how to overcome GPU cost, high‑concurrency, and model‑switching challenges by using Alibaba Cloud's Serverless Stable Diffusion API, detailing deployment steps, supported use cases, performance advantages, and the full set of RESTful endpoints for AI image creation.

AI · API · Function Compute
0 likes · 19 min read
Architect
Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a comprehensive set of engineering techniques—including CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding—to dramatically boost inference throughput and cut response latency.

LLM inference · Performance Optimization · chunked prefill
0 likes · 23 min read
Baobao Algorithm Notes
Dec 24, 2023 · Artificial Intelligence

Must‑Read AI Agent and LLM Research Papers for Deep Understanding

This curated reading list compiles essential papers on AI agents, task planning, hallucination mitigation, multimodal models, image/video generation, foundational LLM research, open‑source large models, fine‑tuning techniques, and performance optimization, providing a comprehensive roadmap for anyone aiming to master modern generative AI.

AI Agents · Multimodal Learning · Performance Optimization
0 likes · 23 min read