99 articles · Page 2 of 5
58 Tech
Jan 6, 2026 · Artificial Intelligence

How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

This article provides a step‑by‑step technical walkthrough of vLLM 0.8.4 on a single GPU, detailing the platform’s startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, illustrated throughout with key source‑code excerpts and architecture diagrams.
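As a taste of what the walkthrough covers, here is a minimal sketch of Multi‑LoRA with vLLM's offline Python API; the base model, adapter name, ID, and path are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model once; enable_lora lets the engine attach
# different adapters on a per-request basis.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=2)

params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request can name its own adapter; vLLM batches them together.
# Adapter name/path are hypothetical placeholders.
outputs = llm.generate(
    ["Summarize the quarterly report."],
    params,
    lora_request=LoRARequest("finance-adapter", 1, "/adapters/finance"),
)
print(outputs[0].outputs[0].text)
```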

GPU inference · LoRA adapters · Model Serving
35 min read
Ops Development Stories
Sep 19, 2024 · Artificial Intelligence

How to Connect Qwen LLMs with Higress AI Gateway: A Hands‑On Guide

This tutorial walks through setting up a local k3d cluster, installing Higress, and using its AI plugins—including AI Proxy, AI JSON formatter, AI Agent, and AI Statistics—to integrate and observe Alibaba Cloud's Qwen large language models across various use cases such as weather and flight queries.
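For context, once Higress's AI Proxy plugin is routing traffic to Qwen, a client only needs a standard OpenAI‑compatible call; the gateway address and API key below are placeholders:

```python
from openai import OpenAI

# Assumes the Higress AI Proxy plugin exposes an OpenAI-compatible route
# on the local k3d cluster; host and key are placeholders.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_DASHSCOPE_KEY")

resp = client.chat.completions.create(
    model="qwen-turbo",
    messages=[{"role": "user", "content": "What's the weather like in Hangzhou?"}],
)
print(resp.choices[0].message.content)
```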

AI gateway · AI plugins · Higress
30 min read
Old Meng AI Explorer
Apr 20, 2026 · Artificial Intelligence

Unlock Free High‑Performance LLM APIs with NVIDIA NIM – A Step‑by‑Step Guide

This article explains what NVIDIA NIM is, compares its generous free quota with other LLM providers, lists the supported free models, walks through a five‑minute sign‑up, shows three code examples for calling the API, offers model‑selection advice, and closes with a hands‑on example of building a free AI chat interface.
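The article's code examples build on NIM's OpenAI‑compatible endpoint. A minimal sketch follows; the model ID is one of the hosted free models and may rotate over time:

```python
from openai import OpenAI

# NVIDIA NIM exposes an OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-...",  # obtained from the sign-up flow described above
)

resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # example hosted model; may change
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```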

AI Models · API integration · Free LLM API
16 min read
Alibaba Cloud Native
Aug 21, 2025 · Cloud Native

How Higress AI Gateway Optimizes LLM Load Balancing with Global, Prefix, and GPU‑Aware Algorithms

This article explains why traditional load‑balancing methods fall short for large language model services and introduces Higress AI Gateway's three specialized algorithms—global minimum‑request, prefix‑matching, and GPU‑aware load balancing—detailing their design, Redis‑based implementation, deployment steps, and performance gains.
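To make the global minimum‑request idea concrete, here is an illustrative Python sketch of shared in‑flight counters in Redis — not Higress's actual implementation, which the article covers in detail:

```python
import redis

# Every gateway instance shares one Redis sorted set whose score is the
# number of in-flight requests per backend (key name is illustrative).
r = redis.Redis(host="localhost", port=6379)

BACKENDS_KEY = "llm:inflight"

def acquire_backend() -> str:
    # Pick the backend with the fewest in-flight requests across ALL gateways.
    backend = r.zrange(BACKENDS_KEY, 0, 0)[0].decode()
    r.zincrby(BACKENDS_KEY, 1, backend)
    return backend

def release_backend(backend: str) -> None:
    # Decrement when the (possibly long-running) LLM response finishes.
    r.zincrby(BACKENDS_KEY, -1, backend)

# Seed the backend pool once, e.g. at gateway startup.
r.zadd(BACKENDS_KEY, {"vllm-0:8000": 0, "vllm-1:8000": 0}, nx=True)
```

A production version would wrap the pick‑and‑increment in a Lua script so the selection is atomic across gateway instances.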

GPU · LLM · Load Balancing
11 min read
Baidu Intelligent Cloud Tech Hub
Dec 4, 2025 · Artificial Intelligence

How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report details the analysis of memory bottlenecks in DeepSeek‑V3.2‑Exp, proposes the Expanded Sparse Server (ESS) that offloads latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach, combined with cache‑warmup and overlap techniques, can double decoding throughput for long‑context inference.
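The core offload‑and‑overlap idea can be sketched in a few lines of PyTorch — an illustration of the technique, not the ESS code itself; all sizes are made up:

```python
import torch

# Latent-cache blocks live in pinned CPU memory and are copied in on a side
# stream so the transfer overlaps with decode compute on the default stream.
num_blocks, block_size, latent_dim = 4096, 64, 512
cpu_cache = torch.empty(num_blocks, block_size, latent_dim,
                        dtype=torch.float16, pin_memory=True)
gpu_block = torch.empty(block_size, latent_dim,
                        dtype=torch.float16, device="cuda")

copy_stream = torch.cuda.Stream()

def prefetch(block_id: int) -> torch.cuda.Event:
    # Async H2D copy; pinned memory makes non_blocking=True effective.
    event = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        gpu_block.copy_(cpu_cache[block_id], non_blocking=True)
        event.record()
    return event

# During decode: kick off the copy for the next block, keep computing on the
# current one, and wait only when the prefetched block is actually needed.
evt = prefetch(0)
torch.cuda.current_stream().wait_event(evt)
```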

Cache offload · GPU‑CPU optimization · LLM inference
21 min read
21CTO
Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit and 4‑bit quantization with the bitsandbytes library (including the NF4 data type) to achieve faster inference on limited GPU hardware.
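For reference, loading a model in 4‑bit NF4 with Hugging Face Transformers and bitsandbytes looks roughly like this; the model name is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization config; compute still happens in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Placeholder model; substitute whichever open-source LLM you are serving.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
```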

LLM deployment · Python · Quantization
13 min read
Alibaba Cloud Infrastructure
Jun 12, 2024 · Artificial Intelligence

Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Kubernetes (ACK) using KServe and Triton Inference Server with the TensorRT‑LLM backend, covering prerequisites, model preparation, YAML configuration, PV/PVC setup, runtime creation, and troubleshooting steps.
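The same InferenceService the tutorial defines in YAML can be created from Python with the Kubernetes client; the runtime name and storage URI below stand in for the values prepared in the guide:

```python
from kubernetes import client, config

config.load_kube_config()

# Sketch of a KServe InferenceService pointing at a Triton/TensorRT-LLM
# ServingRuntime; runtime name and storageUri are placeholders.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama2-trt-llm", "namespace": "default"},
    "spec": {
        "predictor": {
            "model": {
                "runtime": "triton-trt-llm",           # custom ServingRuntime
                "modelFormat": {"name": "triton"},
                "storageUri": "pvc://llama2-models/",  # PV/PVC from the guide
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace="default", plural="inferenceservices", body=inference_service,
)
```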

AI inference · KServe · Kubernetes
13 min read
Architect
Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a comprehensive set of engineering techniques—including CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding—to dramatically boost inference throughput and cut response latency.
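A toy block allocator shows the bookkeeping behind Paged Attention: logical token positions are mapped to fixed‑size physical cache blocks allocated on demand, instead of reserving memory for the maximum sequence length up front. This is illustrative only:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:      # crossed into a new block: allocate one
            table.append(self.free.pop())
        block = table[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE  # physical slot index

    def free_sequence(self, seq_id: int) -> None:
        # Return all of a finished sequence's blocks to the free pool.
        self.free.extend(self.tables.pop(seq_id, []))
```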

LLM inference · Performance optimization · Speculative Decoding
23 min read
Baobao Algorithm Notes
Dec 24, 2023 · Artificial Intelligence

Must‑Read AI Agent and LLM Research Papers for Deep Understanding

This curated reading list compiles essential papers on AI agents, task planning, hallucination mitigation, multimodal models, image/video generation, foundational LLM research, open‑source large models, fine‑tuning techniques, and performance optimization, providing a comprehensive roadmap for anyone aiming to master modern generative AI.

AI agents · Multimodal Learning · Performance optimization
23 min read
ByteDance Cloud Native
Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrix · DeepSeek-R1 · Distributed inference
14 min read
HyperAI Super Neural
Apr 8, 2026 · Artificial Intelligence

One‑Click Deploy Gemma‑4‑31B with 256K Context, Matching Qwen 3.5 397B Performance

HyperAI’s tutorial lets developers instantly launch the open‑source Gemma‑4‑31B model—supporting multimodal input, a context of up to 256K tokens, and over 140 languages—through a one‑click deployment on RTX 6000 or RTX 5090 GPUs, with detailed step‑by‑step instructions and optional compute credits.

256k context · Gemma-4-31B · HyperAI
5 min read
Alibaba Cloud Native
May 1, 2023 · Cloud Native

Deploy FastChat on Alibaba Cloud ASK: A Serverless AI Model Tutorial

This guide shows how to quickly deploy the open‑source FastChat AI assistant on Alibaba Cloud ASK's serverless Kubernetes platform, covering prerequisites, YAML configuration, GPU handling, verification steps, and three usage scenarios including web UI, API calls, and a VSCode extension.
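For the API‑call scenario, FastChat's OpenAI‑compatible server can be queried like any OpenAI endpoint; the service address and model name below are placeholders:

```python
from openai import OpenAI

# FastChat's openai_api_server speaks the OpenAI protocol; replace the host
# with your ASK service address from the tutorial. No real key is needed.
client = OpenAI(base_url="http://<service-ip>:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="vicuna-7b-v1.5",  # example model served by FastChat
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(resp.choices[0].message.content)
```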

AI · ASK · FastChat
12 min read
Alibaba Cloud Native
Mar 27, 2025 · Cloud Native

Deploy the QwQ‑32B LLM on Alibaba Cloud Function Compute with CAP in Minutes

This guide walks you through deploying the open‑source QwQ‑32B model on Alibaba Cloud Function Compute using the Cloud Application Platform (CAP), covering architecture, required services, account setup, step‑by‑step deployment, cost considerations, model interaction via Open WebUI and Chatbox, scaling configuration, and resource cleanup.

CAP · Ollama · Open WebUI
8 min read
Huawei Cloud Developer Alliance
Apr 2, 2026 · Cloud Native

How Kthena Enables Production‑Grade LLM Inference on Kubernetes

This article analyzes the cloud‑native challenges of deploying large‑model inference on Kubernetes and presents Kthena’s architecture—ModelServing, Router, Autoscaler, and ModelBooster—along with Volcano integration, vLLM‑Ascend setup, and a real‑world Qwen3‑235B deployment case, highlighting performance gains and future directions.

Kthena · Kubernetes · LLM
13 min read
Old Zhang's AI Learning
Apr 25, 2026 · Artificial Intelligence

Deploying DeepSeek‑V4‑Flash Locally on 2 × NVIDIA H20 (96 GB) – Quick Performance Test

This article walks through deploying DeepSeek‑V4‑Flash on a server with two NVIDIA H20 GPUs (96 GB each), detailing model download, Docker image preparation, launch script tweaks, memory compression via FP8 and expert parallelism, and reports observed concurrency limits and token‑per‑second speeds, including a test that disables the model's thinking mode.

DeepSeek V4 · Docker · FP8 quantization
6 min read
MaGe Linux Operations
Jul 21, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Zero to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA setup, native and Docker deployment methods, detailed parameter tuning, advanced sharding strategies, troubleshooting, performance optimization, and production‑grade monitoring to maximize throughput and stability of large language models.
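As a minimal illustration of the load‑balancing idea, the sketch below round‑robins requests across two OLLAMA instances pinned to different GPUs (e.g. via CUDA_VISIBLE_DEVICES); the ports and model name are placeholders:

```python
import itertools
import requests

# Two OLLAMA instances, one per GPU; ports are placeholders for whatever
# OLLAMA_HOST values you configured.
BACKENDS = itertools.cycle(["http://localhost:11434", "http://localhost:11435"])

def generate(prompt: str, model: str = "llama3") -> str:
    backend = next(BACKENDS)  # naive round-robin across instances
    resp = requests.post(
        f"{backend}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("Why is the sky blue?"))
```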

AI Deployment · CUDA · Load Balancing
16 min read