Collection size: 99 articles · Page 3 of 5
Alibaba Cloud Native
Dec 19, 2024 · Artificial Intelligence

Deploy Open-Source LLMs on Alibaba Cloud Function Compute in 10 Minutes

This guide explains how to quickly launch an open‑source large language model from ModelScope on Alibaba Cloud Function Compute, covering the required cloud services, step‑by‑step deployment, reserved‑instance configuration, and how to invoke the model via the provided domain.

AI · Alibaba Cloud · ModelScope
0 likes · 7 min read
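
The guide ends with invoking the deployed model through the domain Function Compute assigns. As a rough illustration (the endpoint, path, and payload schema below are hypothetical placeholders, not taken from the article; substitute the values shown in your own Function Compute console), a call could look like:

```python
# Hypothetical invocation of a Function Compute-hosted model over HTTP.
# Endpoint and payload schema are placeholders, not the article's values.
import json
import urllib.request

ENDPOINT = "https://your-fc-domain.example.com/invoke"  # hypothetical domain

payload = {"prompt": "Briefly introduce Function Compute.", "max_tokens": 256}
req = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```
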
Architect's Alchemy Furnace
Jul 17, 2025 · Artificial Intelligence

Explore the Ultimate Open-Source LLM Catalog: Models, Tools, and Resources

This article compiles a comprehensive, up‑to‑date inventory of open‑source large language models from Chinese and international organizations, detailing each model’s architecture, parameter count, multilingual capabilities, deployment requirements, and associated tools, offering a valuable reference for AI researchers and developers.

AI · LLM · Large Language Model
0 likes · 50 min read
AI Explorer
Mar 3, 2026 · Artificial Intelligence

How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency

LMCache separates the KV cache from the vLLM instance into a shared service, dramatically cutting first‑token latency for repeated text and letting multiple GPU instances reuse cached vectors for better hardware utilization. The article covers use cases such as long‑document QA, multi‑GPU load balancing, and prompt engineering, and closes with a quick Docker‑based demo.

Docker · KV cache · LLM inference
0 likes · 6 min read
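
The core idea is easy to picture: index cached KV tensors by a hash of the token prefix, so any serving instance that sees the same prefix can skip prefill on a hit. A toy sketch of that lookup, not LMCache's actual API:

```python
# Conceptual sketch of a shared KV-cache service: KV tensors keyed by a
# hash of the token prefix. Not LMCache's real interface.
import hashlib

class SharedKVStore:
    def __init__(self):
        self._store = {}  # prefix hash -> KV tensors (opaque here)

    @staticmethod
    def _key(token_ids):
        return hashlib.sha256(str(token_ids).encode()).hexdigest()

    def put(self, token_ids, kv_tensors):
        self._store[self._key(token_ids)] = kv_tensors

    def get(self, token_ids):
        # A hit means prefill for this prefix is skipped entirely,
        # which is where the first-token latency win comes from.
        return self._store.get(self._key(token_ids))

store = SharedKVStore()
store.put([1, 2, 3], "kv-for-prefix-123")
assert store.get([1, 2, 3]) == "kv-for-prefix-123"  # reused by another instance
```
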
Baidu Intelligent Cloud Tech Hub
Nov 19, 2025 · Artificial Intelligence

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap: it dynamically splits sequences into balanced token chunks so that compute and communication take nearly equal time, improving GPU utilization and delivering up to 30% throughput gains on heterogeneous request workloads with zero accuracy loss.

Batch scheduling · GPU utilization · LLM inference
0 likes · 9 min read
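
The balancing step at the heart of the technique can be sketched in a few lines: split the batch's tokens, not its requests, into two near‑equal halves, so both chunks finish their compute and communication phases at about the same time. A hypothetical illustration, not the framework's implementation:

```python
# Token-level splitting of a mixed batch into two balanced chunks.
# Hypothetical sketch: a request may straddle the split point, which is
# exactly what distinguishes this from a per-request (batch-level) split.
def split_two_chunks(seq_lens):
    """seq_lens: tokens per request. Returns the split point in flat token space."""
    total = sum(seq_lens)
    return total // 2

# Heterogeneous batch: one long prefill plus several short decodes.
seq_lens = [900, 8, 8, 8]
mid = split_two_chunks(seq_lens)
print(mid, sum(seq_lens) - mid)  # -> 462 462: balanced token chunks
# A per-request split would yield 900 vs 24 and leave one chunk idle.
```
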
DaTaobao Tech
Sep 27, 2023 · Artificial Intelligence

FlashAttention-2: Efficient Attention Algorithm for Transformer Acceleration and AIGC Applications

FlashAttention‑2 is an IO‑aware exact attention algorithm that cuts GPU HBM traffic through tiling and recomputation, reduces non‑matmul FLOPs, and expands sequence parallelism and warp‑level work distribution, delivering up to a 2× speedup over FlashAttention at near‑GEMM efficiency and enabling longer‑context Transformer training and inference for AIGC with negligible accuracy loss.

AIGC · FlashAttention-2 · GPU
0 likes · 20 min read
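
The tiling and online‑softmax trick the algorithm builds on can be shown in plain NumPy. This is a reference sketch of the idea only; the real kernel fuses these steps on‑chip in SRAM:

```python
# Exact attention computed block-by-block over K/V with online softmax,
# so the full N x N score matrix is never materialized.
import numpy as np

def tiled_attention(q, k, v, block=64):
    d = q.shape[-1]
    out = np.zeros_like(q)
    m = np.full(q.shape[0], -np.inf)  # running row-wise max of scores
    l = np.zeros(q.shape[0])          # running softmax normalizer
    for s in range(0, k.shape[0], block):
        scores = q @ k[s:s + block].T / np.sqrt(d)  # only an (Nq, block) tile
        m_new = np.maximum(m, scores.max(axis=-1))
        p = np.exp(scores - m_new[:, None])
        scale = np.exp(m - m_new)                   # rescale earlier partials
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ v[s:s + block]
        m = m_new
    return out / l[:, None]

# Check against naive attention.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 128, 64))
s = q @ k.T / np.sqrt(64)
p = np.exp(s - s.max(-1, keepdims=True))
assert np.allclose(tiled_attention(q, k, v), (p / p.sum(-1, keepdims=True)) @ v)
```
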
Old Zhang's AI Learning
Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

ASR · Anthropic API · FlashAttention
0 likes · 12 min read
MaGe Linux Operations
Mar 10, 2026 · Artificial Intelligence

Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues

This guide explains the five common sources of GPU memory consumption in large‑model inference services and walks through a step‑by‑step diagnosis workflow, from static usage and KV‑cache analysis to concurrency and K8s scheduling, with concrete command‑line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.

GPU memory · KV cache · LLM OOM
0 likes · 28 min read
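
One check such a workflow typically starts with is KV‑cache sizing: per token, KV memory is 2 (K and V) × layers × kv_heads × head_dim × bytes per element. A quick estimate with illustrative 7B‑class FP16 numbers (assumptions for the sketch, not figures from the article):

```python
# Back-of-the-envelope KV-cache sizing for an inference service.
def kv_cache_bytes(layers, kv_heads, head_dim, dtype_bytes, tokens):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

# Illustrative 7B-class model in FP16, no grouped-query attention.
per_req = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                         dtype_bytes=2, tokens=4096)
print(f"{per_req / 2**30:.1f} GiB per 4K-token request")   # ~2.0 GiB
print(f"{16 * per_req / 2**30:.1f} GiB at 16 concurrent")  # ~32 GiB -> easy OOM
```
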
Alibaba Cloud Infrastructure
Jan 21, 2026 · Artificial Intelligence

Boost LLM Performance: Deploy Qwen3‑235B with PD‑Separation, MoE, SGLang & RBG

This article details how to deploy the 235‑billion‑parameter Qwen3‑235B model using PD‑separation and MoE techniques, explains the associated challenges, and demonstrates a production‑grade solution built on the high‑performance SGLang inference engine and the RoleBasedGroup (RBG) orchestration framework, complete with benchmark results and best‑practice YAML examples.

AI · Inference · Kubernetes
0 likes · 21 min read
Baobao Algorithm Notes
Jan 7, 2025 · Artificial Intelligence

How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%

This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) using publicly available data, showing an MFU of roughly 37%—about a 60% improvement over V2—and provides detailed formulas, parameter settings, and a reproducible Python script.

AI performance · DeepSeek · Large Language Model
0 likes · 8 min read
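
The estimate is straightforward to reproduce from DeepSeek V3's published figures (37B activated parameters, 14.8T training tokens, ~2.788M H800 GPU‑hours) and an H800's ~989 TFLOPS peak dense BF16. The standard 6·N·D approximation alone gives about 33%; the article's more detailed accounting lands near 37%, presumably by also counting attention FLOPs:

```python
# MFU = achieved training FLOPs / FLOPs the GPUs could theoretically deliver.
N_active  = 37e9      # activated parameters per token (DeepSeek V3 report)
D_tokens  = 14.8e12   # training tokens
gpu_hours = 2.788e6   # H800 GPU-hours
peak      = 989e12    # ~peak dense BF16 FLOPS per H800

achieved  = 6 * N_active * D_tokens        # 6*N*D: forward + backward passes
available = gpu_hours * 3600 * peak
print(f"MFU ~= {achieved / available:.1%}")  # ~33% from 6ND alone
```
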
Alimama Tech
Feb 12, 2025 · Artificial Intelligence

HighService: A High‑Performance Pythonic AI Service Framework for Model Inference and Global Resource Scheduling

HighService, Alibaba's Pythonic AI service framework, accelerates large‑model inference and maximizes GPU utilization: it separates CPU and GPU processes, offers out‑of‑the‑box quantization, parallelism, and caching, and dynamically reallocates idle GPUs across clusters through a master‑worker scheduler, keeping online latency low while boosting offline throughput for diffusion and LLM workloads.

AI Service · High Performance · Model Inference
0 likes · 16 min read
Alibaba Cloud Infrastructure
Feb 13, 2025 · Artificial Intelligence

Deploying DeepSeek‑R1 671B Distributed Inference Service on Alibaba Cloud ACK with vLLM and Dify

This article explains how to quickly deploy the full‑parameter DeepSeek‑R1 671B model in a multi‑node GPU‑enabled Kubernetes cluster on Alibaba Cloud ACK, covering prerequisites, model parallelism, vLLM‑Ray distributed deployment, service verification, and integration with Dify to build a private AI Q&A assistant.

DeepSeek · Dify · Distributed Deployment
0 likes · 12 min read
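
For the parallelism piece, vLLM's offline API exposes tensor‑ and pipeline‑parallel degrees directly, with pipeline stages spanning nodes via the Ray backend. A minimal sketch with illustrative degrees, not the article's exact configuration:

```python
# Multi-node distributed inference with vLLM: tensor parallelism within a
# node, pipeline parallelism across nodes (Ray backend). Degrees are
# illustrative; total GPUs = tensor_parallel_size * pipeline_parallel_size.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,    # GPUs per node
    pipeline_parallel_size=2,  # nodes in the Ray cluster
    trust_remote_code=True,
)
out = llm.generate(["Why is the sky blue?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```
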
Alibaba Cloud Native
Jul 30, 2024 · Cloud Native

Deploy ComfyUI as a Serverless API for Scalable AI Image Generation

This article explains how to transform ComfyUI into a serverless API using Alibaba Cloud Function Compute, detailing the challenges of GPU resource costs, high concurrency, and usability, while providing a step‑by‑step guide, code examples, and best‑practice recommendations for building scalable AI drawing applications.

AI image generation · API · ComfyUI
0 likes · 21 min read
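
Behind such an API sits ComfyUI's own HTTP interface: a workflow graph exported from the UI in API format is POSTed to the /prompt endpoint. A minimal sketch, with the host as a placeholder for your function's HTTP trigger domain:

```python
# Queue a ComfyUI workflow over HTTP. The host below is a placeholder;
# behind Function Compute it would be the function's trigger domain.
import json
import urllib.request

with open("workflow_api.json") as f:   # exported via "Save (API Format)"
    workflow = json.load(f)

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",    # placeholder host
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))     # response includes the queued prompt_id
```
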
Old Zhang's AI Learning
Apr 26, 2026 · Artificial Intelligence

Why Deploying DeepSeek‑V4 Locally with vLLM Is So Challenging

The article dissects DeepSeek‑V4’s local deployment using vLLM, explaining the steep hardware requirements, the complex heterogeneous KV‑cache architecture, and the aggressive kernel‑fusion and multi‑stream optimizations that together make high‑context inference both memory‑intensive and engineering‑heavy.

DeepSeek V4 · GPU memory · KV cache
0 likes · 15 min read
Alibaba Cloud Developer
Jan 15, 2026 · Artificial Intelligence

How Hierarchical Sparse Attention Breaks KVCache Limits for Ultra‑Long Context LLMs

This article explains how a hierarchical sparse‑attention framework redesigns KVCache storage across GPU, CPU, and remote memory, eliminates bandwidth and capacity bottlenecks, and enables efficient inference for 128K‑token and larger contexts with dramatically reduced GPU memory usage and higher throughput.

Dynamic Sparse Attention · GPU memory optimization · Hierarchical Storage
0 likes · 20 min read
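
The storage layout can be pictured as an LRU hierarchy: hot KV blocks stay in GPU memory, warm ones spill to CPU RAM, cold ones to remote storage, with promotion back up on access. A conceptual toy, not the framework's actual design:

```python
# Three-tier KV-block cache with LRU spill-down and promotion on access.
# Tier names and capacities are illustrative assumptions.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_blocks, cpu_blocks):
        self.tiers = [OrderedDict(), OrderedDict(), OrderedDict()]  # GPU, CPU, remote
        self.caps = [gpu_blocks, cpu_blocks, float("inf")]

    def put(self, block_id, kv, tier=0):
        self.tiers[tier][block_id] = kv
        while len(self.tiers[tier]) > self.caps[tier]:
            victim, data = self.tiers[tier].popitem(last=False)  # evict LRU
            self.put(victim, data, tier + 1)                     # spill down

    def get(self, block_id):
        for tier in self.tiers:
            if block_id in tier:
                kv = tier.pop(block_id)
                self.put(block_id, kv)   # promote hot block back to GPU tier
                return kv
        return None

cache = TieredKVCache(gpu_blocks=2, cpu_blocks=4)
for i in range(8):
    cache.put(i, f"kv{i}")  # early blocks cascade down through the tiers
print(cache.get(0))         # found in remote tier, promoted back to GPU
```
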
DataFunSummit
Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu's ZhiLight Large‑Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes a presentation by Wang Xin, Zhihu's machine‑learning platform lead, on the ZhiLight large‑model inference framework, covering model execution mechanisms, GPU workload analysis, pipeline and tensor parallelism, GPU architecture evolution, comparisons with open‑source engines, ZhiLight's compute‑communication overlap and quantization optimizations, benchmark results, supported models, and future directions.

GPU · Inference · LLM
0 likes · 13 min read