Tagged articles
14 articles
Old Zhang's AI Learning
Apr 21, 2026 · Artificial Intelligence

Prefill-as-a-Service Boosts LLM Inference Throughput by 54%

A joint Moonshot AI and Tsinghua study shows that the Prefill-as-a-Service (PrfaaS) architecture, enabled by hybrid‑attention models that shrink KVCache size, can offload long Prefill work to a remote cluster and, with dual‑timescale scheduling, achieve a 54% throughput gain over homogeneous PD deployment and 32% over naive heterogeneous setups.

Distributed inference · KVCache optimization · LLM inference
12 min read
Tencent Technical Engineering
Jan 23, 2026 · Artificial Intelligence

Unlocking AI Infra: Distributed Inference, PD Separation, TileLang, and Next‑Gen Agent Infrastructure

This article surveys the 2025 AI infrastructure landscape, covering distributed inference with PD‑separation, dynamic DOPD scheduling, AFD attention‑FFN disaggregation, high‑bandwidth cross‑machine communication libraries, the TileLang programming model, RL train‑inference decoupling via SeamlessFlow, and secure, low‑latency agent infra designs for future large‑scale models.

AI Infrastructure · Agent Systems · Distributed inference
27 min read
Baidu Intelligent Cloud Tech Hub
Dec 24, 2025 · Artificial Intelligence

How Context Parallelism Slashes LLM First‑Token Latency by 80% for 128K Tokens

The article explains how the newly merged Context Parallelism (CP) technique in SGLang, combined with DeepSeek V3.2's Sparse Attention architecture, reduces first‑token latency by up to 80% and alleviates memory pressure for ultra‑long 128K‑token sequences, detailing both algorithmic innovations and engineering solutions.

AI Infrastructure · Context Parallelism · Distributed inference
10 min read
Baidu Intelligent Cloud Tech Hub
Dec 15, 2025 · Artificial Intelligence

Baidu Baige’s Breakthrough: Orchestrating Giant LLM Inference with Silent Instances

The article details Baidu Baige’s next‑generation distributed inference platform for trillion‑parameter LLMs. It explains how automated orchestration, the FedDeployment abstraction, the SplitService unified view, Adaptive HPA predictive scaling, Silent Instances that activate within seconds, and the Staggered Batched Scheduler remove scaling limits, reduce TTFT by 30‑40%, boost throughput by up to 20%, and deliver cost‑effective, elastic AI compute.

Distributed inference · Kubernetes · LLM
23 min read
Huolala Tech
May 29, 2025 · Artificial Intelligence

How LWS Enables Scalable Multi‑Node Large Model Deployment on Kubernetes

The article explains how the Dolphin AI platform tackles large‑model deployment challenges by replacing standard Kubernetes Deployments with LeaderWorkerSet, detailing its architecture, features, installation steps, example configurations, testing, scaling, rolling updates, fault recovery, and future roadmap for AI workloads.

AI Platform · Distributed inference · Kubernetes
12 min read
Alibaba Cloud Infrastructure
Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.

ACK Gateway · Distributed inference · Kubernetes
19 min read
ByteDance Cloud Native
Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrix · DeepSeek-R1 · Distributed inference
14 min read
DeWu Technology
Feb 17, 2025 · Artificial Intelligence

Optimizing Large Model Inference: High‑Performance Frameworks and Techniques

The article reviews high‑performance inference strategies for large language models such as DeepSeek‑R1, detailing CPU‑GPU process separation, Paged and Radix Attention, Chunked Prefill, output‑length reduction, tensor‑parallel multi‑GPU scaling, and speculative decoding, each shown to markedly boost throughput and cut latency in real deployments.

AI · Distributed inference · GPU Acceleration
22 min read
IT Services Circle
Feb 7, 2025 · Artificial Intelligence

Building Low‑Cost AI Clusters with Old Phones Using Exo and Open WebUI

This article introduces Exo, an open‑source platform that lets you turn idle smartphones, tablets, and laptops into a distributed AI cluster capable of running large language models, and shows how Open WebUI provides a user‑friendly interface for deploying private AI assistants.

AI clustering · Distributed inference · Exo
6 min read
Baidu Intelligent Cloud Tech Hub
Jul 31, 2023 · Artificial Intelligence

Boosting Large Model Inference: High‑Performance Optimization Techniques

This article explains the background, challenges, and high‑performance optimization methods for deploying large language and multimodal models, covering inference workflow analysis, distributed concurrency, latency reduction, quantization strategies, and service throughput improvements to achieve industry‑leading speed and memory efficiency.

Distributed inference · multimodal diffusion · quantization
12 min read
DataFunTalk
Jul 8, 2023 · Big Data

Key Technologies and Applications of Semantic Knowledge Management in Ant Financial Knowledge Graph Platform

This article presents Ant Group's large‑scale financial knowledge graph platform, detailing its semantic knowledge representation, hybrid graph model, distributed management architecture, core capabilities such as knowledge evolution and cross‑domain fusion, and showcases applications like anti‑fraud capital‑flow analysis and future DataFabric‑oriented knowledge sharing.

Distributed inference · Knowledge Graph · graph database
18 min read
YunZhu Net Technology Team
Oct 22, 2021 · Artificial Intelligence

Deep Learning Overview and Introduction to the Lightweight Distributed Inference Engine Avior

This article reviews deep learning and AI frameworks, highlights challenges of online model serving, and presents Avior—a lightweight, distributed inference engine designed for high‑performance AI services, detailing its architecture, layer design, benchmark results, and future development plans.

AI frameworks · Avior · Deep Learning
8 min read