Tagged articles
11 articles
Huolala Tech
Mar 6, 2026 · Artificial Intelligence

How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%

The article details how Huolala’s Dolphin platform engineers large‑model inference for high‑concurrency, long‑context, low‑latency production workloads, achieving a 50‑60% GPU cost reduction through systematic resource allocation, model quantization, prefill/decode (PD) separation, speculative sampling, and kernel‑level optimizations while maintaining service stability.

GPU utilization · Model Quantization · Performance Evaluation
18 min read
Baidu Intelligent Cloud Tech Hub
Jan 12, 2026 · Artificial Intelligence

How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

This article details how Baidu Cloud's hybrid‑cloud team used the vLLM framework to cut the cold‑start time of massive models like Qwen3‑235B‑A22B from minutes to a few seconds through accelerated weight loading, deferred CUDA‑graph capture, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming.

CUDA Graph · cold-start optimization · large-model inference
16 min read
Baidu Intelligent Cloud Tech Hub
Jan 5, 2026 · Artificial Intelligence

How Baidu Tianchi Supernodes Supercharge Large‑Model Inference: Architecture, Deployment, and Optimization

This article details Baidu's Tianchi supernode design and software tuning, covering hardware scale‑up, deployment planning, Prefill and Decode stage optimizations, quantization strategies, and communication schemes, which together raise large‑model inference throughput and reduce latency while lowering per‑token cost.

AI Infrastructure · Parallelism · Performance Optimization
20 min read
Alibaba Cloud Infrastructure
May 14, 2025 · Artificial Intelligence

How Mooncake’s KVCache Boosts Large‑Model Inference Efficiency and Cost

Mooncake, an open‑source large‑model inference platform, introduces a KVCache‑centric architecture that dramatically improves throughput, reduces latency and cuts inference costs by up to 20%, while integrating with frameworks like SGLang and vLLM and leveraging Alibaba Cloud’s eRDMA and GPUDirect technologies for scalable, high‑performance deployments.

AI Performance · Alibaba Cloud · Distributed Systems
7 min read
DeWu Technology
Feb 17, 2025 · Artificial Intelligence

Optimizing Large Model Inference: High‑Performance Frameworks and Techniques

The article reviews high‑performance inference strategies for large language models such as DeepSeek‑R1, detailing CPU‑GPU process separation, PagedAttention and RadixAttention, Chunked Prefill, output‑length reduction, tensor‑parallel multi‑GPU scaling, and speculative decoding, each shown to markedly boost throughput and cut latency in real deployments.

AI · Distributed inference · GPU Acceleration
22 min read
Baidu Geek Talk
Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines turn prompts into responses through a Prefill stage followed by autoregressive decoding, and are measured by time to first token (TTFT) and time per output token (TPOT). Baidu’s AIAK suite improves TPOT by separating tokenization, using static slot scheduling, and executing asynchronously, which cuts token‑interval latency from ~35 ms to ~14 ms and raises GPU utilization to about 75%. It also uses quantization and speculative execution for higher throughput.

AI acceleration · GPU utilization · TPOT
10 min read
Baidu Intelligent Cloud Tech Hub
Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.

AIAK · GPU utilization · LLM Performance
10 min read
DataFunSummit
Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with PagedAttention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPU · Virtual Memory · attention
25 min read
Alibaba Cloud Infrastructure
Nov 29, 2024 · Artificial Intelligence

Mooncake: Open-Source KVCache-Centric Large Model Inference Architecture Co-Developed by Alibaba Cloud and Tsinghua University

In June 2024, Alibaba Cloud and Tsinghua University's MADSys Lab announced the open‑source Mooncake architecture, a KVCache‑centered large‑model inference framework that boosts throughput, lowers cost, and standardizes resource‑pooling techniques for high‑performance AI inference across industry and academia.

KVCache · Tsinghua University · large-model inference
4 min read
Volcano Engine Developer Services
Jun 20, 2023 · Artificial Intelligence

Boosting Large-Model Offline Inference with Ray and Cloud-Native Architecture

Large-model offline (batch) inference, which processes massive data on billion-parameter models, faces GPU memory and distributed scheduling challenges; this article explains how Ray's cloud-native framework, model parallelism, and Ray Datasets pipelines address these issues, improve throughput, and enable elastic, efficient GPU utilization.

GPU utilization · Ray · cloud-native
16 min read
Baidu Geek Talk
Mar 9, 2023 · Industry Insights

How Baidu’s ERNIE‑ViLG 2.0 and PaddlePaddle Boost AI Painting Performance

This article analyzes Baidu’s ERNIE‑ViLG 2.0 and PaddlePaddle‑optimized Stable Diffusion models, presenting benchmark comparisons, hardware‑specific speed and memory gains, and the underlying inference optimizations that enable low‑cost, high‑throughput AI‑generated image creation.

AI painting · AIGC · GPU Acceleration
9 min read