AI-Native Cloud — Curated Series · 99 articles

Feb 24, 2025 · Artificial Intelligence

Flash-MLA: Boosting LLM Inference Speed on Nvidia Hopper GPUs

Flash-MLA is an open‑source decoding kernel for Multi‑head Latent Attention (MLA), optimized for Nvidia Hopper GPUs. MLA compresses the KV cache, cutting its memory usage by up to 93.3%, and the kernel reaches up to 580 TFLOPS of compute, accelerating large‑language‑model inference while lowering cost.

DeepSeek · Flash-MLA · GPU Optimization

0 likes · 8 min read

Alibaba Cloud Observability

Oct 20, 2025 · Artificial Intelligence

How We Boosted Embedding Throughput 16× and Cut Vector Index Costs in a Cloud‑Native Setup

This article examines the high cost and low throughput of generating embedding vectors in log‑processing scenarios, analyzes the performance bottlenecks of inference frameworks, and details a series of cloud‑native optimizations (switching to vLLM, deploying multiple model replicas with Triton, decoupling tokenization, and priority queuing) that together raise throughput by 16× and reduce per‑token pricing by two orders of magnitude.

Embedding · GPU inference · Performance optimization

0 likes · 9 min read

Alibaba Cloud Big Data AI Platform

Sep 19, 2023 · Artificial Intelligence

BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner

BladeLLM, Alibaba Cloud’s large‑model inference engine, supports ultra‑long contexts of up to 70K tokens. It combines a novel RaggedAttention mechanism with a DNN‑based AutoTuner to improve performance, memory efficiency, and inference latency across diverse workloads.

AI infrastructure · AutoTuner · LLM inference

0 likes · 11 min read

HyperAI Super Neural

Mar 10, 2026 · Artificial Intelligence

Deploy Popular Open‑Source LLMs on Free CPU in Minutes – Qwen3.5, DeepSeek‑R1, Gemma 3, Llama 3.2 and More

This guide shows how to use HyperAI’s free CPU quota to quickly deploy popular open‑source LLMs such as Qwen3.5, DeepSeek‑R1, Gemma 3 and Llama 3.2, walking through environment setup, model download, and inference execution without needing local GPU hardware.

CPU · HyperAI · LLM

0 likes · 6 min read

Kuaishou Large Model

Jul 11, 2024 · Artificial Intelligence

Pipeline-Aware Offloading & Balanced Checkpointing Accelerate LLM Training

Researchers from Kwai’s large-model team present a training system that combines pipeline-parallel-aware activation offloading with a compute-memory balanced checkpointing strategy. The system accelerates large language model training losslessly, reaching up to 42.7% MFU on 256 NVIDIA H800 GPUs while reducing memory usage.

GPU training · Kwai · activation offloading

0 likes · 13 min read

Alibaba Cloud Infrastructure

Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.

ACK Gateway · Distributed inference · Kubernetes

0 likes · 19 min read

Old Zhang's AI Learning

Mar 28, 2026 · Artificial Intelligence

vLLM, llama.cpp, and MLX Embrace Google’s TurboQuant: 8× Memory Savings for Local LLMs

The article reviews how the leading LLM inference frameworks—oMLX, mlx‑vlm, llama.cpp, and vLLM—are integrating Google’s TurboQuant compression, showing up to 79% KV‑cache memory reduction, near‑full‑precision decoding speed, and detailed integration steps for each project.

KV cache · LLM inference · TurboQuant

0 likes · 8 min read

Alibaba Cloud Big Data AI Platform

Jul 16, 2025 · Artificial Intelligence

ChunkFlow: Accelerating Long‑Context Model Fine‑Tuning Up to 4.5× Faster

The paper introduces ChunkFlow, an efficient training framework for variable‑length and ultra‑long sequence datasets that powers Qwen models, achieving up to 4.53× speedup over Megatron‑LM and more than 2× overall performance gains by reorganizing data into fixed‑size chunks and employing a state‑aware scheduler.

AI performance · ChunkFlow · Distributed Training

0 likes · 7 min read

NewBeeNLP

Feb 11, 2024 · Industry Insights

What 2023 Taught Us About LLMs and AI‑Guided Optimization

The author reviews a year of rapid progress in large language models, highlighting breakthrough papers such as Positional Interpolation, StreamingLLM, Deja Vu, and RLCD, and discusses how AI‑guided optimization techniques like SurCo, LANCER, and GenCo are reshaping research and industry applications.

AI Optimization · LLM · Transformers

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Feb 28, 2026 · Artificial Intelligence

How DualPath Revives Idle Network Cards to Break Long‑Context I/O Bottlenecks in DeepSeek V4

The article analyzes the KV‑Cache storage I/O bottleneck that limits agentic LLM inference, introduces the DualPath architecture with a storage‑to‑decode data path and RDMA‑based scheduling, and shows up to 1.87× offline and 1.96× online throughput gains on large‑scale GPU clusters.

DeepSeek · DualPath · KV cache

0 likes · 13 min read

Old Zhang's AI Learning

Apr 22, 2026 · Artificial Intelligence

Testing NVIDIA‑Accelerated Qwen3.6‑35B on Dual RTX 4090: Real‑World Performance

This article evaluates the Red Hat‑produced NVFP4‑quantized Qwen3.6‑35B model deployed with vLLM inside Docker on a dual‑RTX 4090 server, presenting accuracy gains, memory usage, initialization times, GPU compatibility notes, and practical deployment recommendations.

Docker · NVFP4 · Quantization

0 likes · 8 min read

Old Zhang's AI Learning

Feb 3, 2026 · Artificial Intelligence

Step‑3.5‑Flash: Lightning‑Fast Inference with 196B Params, Only 11B Active (vLLM)

Step‑3.5‑Flash is a 196‑billion‑parameter open‑source LLM whose Mixture‑of‑Experts design activates only 11B parameters per token. It delivers more than 3× faster inference, matches top‑tier closed‑source models on SWE‑bench and other benchmarks, supports a 256K context, and runs on consumer‑grade hardware. It is already integrated into vLLM, SGLang, and Claude Code, though it has known token‑efficiency and domain‑stability limitations.

LLM benchmark · MoE · Multi‑Token Prediction

0 likes · 11 min read

Alibaba Cloud Native

Jan 6, 2024 · Cloud Computing

Deploy ModelScope Models to Alibaba Cloud Function Compute in 5 Minutes

This guide walks readers through using ModelScope’s SwingDeploy service to locate, configure, and instantly deploy open‑source AI models to Alibaba Cloud Function Compute, explaining the resources created, how to invoke the model via HTTP triggers, and how to optimize performance with provisioned instances, logging, and concurrency settings.

AI model serving · Alibaba Cloud · ModelScope

0 likes · 15 min read

Baidu Geek Talk

Dec 10, 2025 · Artificial Intelligence

How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report analyzes the memory bottleneck of DeepSeek‑V3.2‑Exp’s sparse‑attention decoder, proposes the Expanded Sparse Server (ESS) to offload the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach dramatically improves decode throughput while keeping latency within acceptable limits.

Cache offload · GPU memory · LLM inference

0 likes · 20 min read

Kuaishou Tech

Nov 21, 2024 · Artificial Intelligence

Best Practices for Training Large Language Models on Ultra‑Large Scale Clusters

This article summarizes the challenges of distributed training for massive language models and presents a suite of solutions—including DP/TP/PP overlap, context parallelism, efficient recomputation, and a performance‑aware cost model—that together boost training throughput by over 30% on large GPU clusters.

Distributed Training · GPU clusters · activation rematerialization

0 likes · 27 min read

DataFunSummit

Dec 4, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference with the YiNian LLM Framework

This article presents the YiNian LLM framework, detailing how KV caching, prefill/decoding separation, continuous batching, PagedAttention, and multi‑hardware scheduling are used to speed up large language model inference while managing GPU memory and latency.

AI acceleration · Continuous batching · GPU

0 likes · 20 min read

Alibaba Cloud Native

Jun 24, 2023 · Cloud Native

How to Deploy Distributed LLM Inference with DeepSpeed on Alibaba Cloud ACK

This guide walks through deploying a Bloom 7B1 large language model for distributed inference on Alibaba Cloud Container Service (ACK) using DeepSpeed, Arena, and Kubernetes, covering environment setup, model configuration, service launch, verification, and Ingress exposure.

ACK · Arena · DeepSpeed

0 likes · 14 min read

Baobao Algorithm Notes

Sep 28, 2025 · Artificial Intelligence

How Much GPU Memory Do LLMs Really Need? A Deep Dive into Training & Inference

This article breaks down the GPU memory requirements of large language models during training and inference, detailing the contributions of model weights, optimizer states, activations, KV cache, and activation recomputation, and provides concrete formulas, examples, and scaling insights for models like Qwen3 and DeepSeek V3.

GPU memory · KV cache · LLM

0 likes · 18 min read

Baobao Algorithm Notes

Apr 5, 2024 · Artificial Intelligence

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.

GPU memory · LLM inference · PagedAttention

0 likes · 25 min read
