Tagged articles

14 articles

Apr 21, 2026 · Artificial Intelligence

Prefill-as-a-Service Boosts LLM Inference Throughput by 54%

A joint Moonshot AI and Tsinghua study shows that the Prefill-as-a-Service (PrfaaS) architecture, enabled by hybrid‑attention models that shrink KVCache size, can offload long prefill work to a remote cluster. With dual‑timescale scheduling, it achieves 54% higher throughput than a homogeneous PD deployment and 32% higher than a naive heterogeneous setup.

Distributed inference · KVCache optimization · LLM inference

12 min read

Tencent Technical Engineering

Jan 23, 2026 · Artificial Intelligence

Unlocking AI Infra: Distributed Inference, PD Separation, TileLang, and Next‑Gen Agent Infrastructure

This article surveys the 2025 AI infrastructure landscape, covering distributed inference with PD‑separation, dynamic DOPD scheduling, AFD attention‑FFN disaggregation, high‑bandwidth cross‑machine communication libraries, the TileLang programming model, RL train‑inference decoupling via SeamlessFlow, and secure, low‑latency agent infra designs for future large‑scale models.

AI Infrastructure · Agent Systems · Distributed inference

27 min read

Baidu Intelligent Cloud Tech Hub

Dec 24, 2025 · Artificial Intelligence

How Context Parallelism Slashes LLM First‑Token Latency by 80% for 128K Tokens

The article explains how the newly merged Context Parallelism (CP) technique in SGLang, combined with DeepSeek V3.2's Sparse Attention architecture, reduces first‑token latency by up to 80% and alleviates memory pressure for ultra‑long 128K‑token sequences, detailing both algorithmic innovations and engineering solutions.

AI Infrastructure · Context Parallelism · Distributed inference

10 min read

Baidu Intelligent Cloud Tech Hub

Dec 15, 2025 · Artificial Intelligence

Baidu Baige’s Breakthrough: Orchestrating Giant LLM Inference with Silent Instances

The article details Baidu Baige’s next‑generation distributed inference platform for trillion‑parameter LLMs. It explains how automated orchestration, the FedDeployment abstraction, the SplitService unified view, Adaptive HPA predictive scaling, Silent Instances that activate within seconds, and the Staggered Batched Scheduler remove scaling limits, reduce TTFT by 30‑40%, boost throughput by up to 20%, and deliver cost‑effective, elastic AI compute.

Distributed inference · Kubernetes · LLM

23 min read

BirdNest Tech Talk

Oct 15, 2025 · Artificial Intelligence

How DeepSeek‑V3.2‑Exp Achieves Fast Distributed LLM Inference with FP8 and MoE

This article walks through the DeepSeek‑V3.2‑Exp inference codebase, detailing its MoE architecture, Multi‑Head Latent Attention, FP8 quantization, custom CUDA kernels, and 8‑GPU NCCL‑based distributed execution from initialization through prefill and decode stages.

CUDA · Distributed inference · FP8 quantization

9 min read

Huolala Tech

May 29, 2025 · Artificial Intelligence

How LWS Enables Scalable Multi‑Node Large Model Deployment on Kubernetes

The article explains how the Dolphin AI platform tackles large‑model deployment challenges by replacing standard Kubernetes Deployments with LeaderWorkerSet, detailing its architecture, features, installation steps, example configurations, testing, scaling, rolling updates, fault recovery, and future roadmap for AI workloads.

AI Platform · Distributed inference · Kubernetes

12 min read

Alibaba Cloud Infrastructure

Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.

ACK Gateway · Distributed inference · Kubernetes

19 min read

ByteDance Cloud Native

Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrix · DeepSeek-R1 · Distributed inference

14 min read

DeWu Technology

Feb 17, 2025 · Artificial Intelligence

Optimizing Large Model Inference: High‑Performance Frameworks and Techniques

The article reviews high‑performance inference strategies for large language models such as DeepSeek‑R1, detailing CPU‑GPU process separation, Paged and Radix Attention, Chunked Prefill, output‑length reduction, tensor‑parallel multi‑GPU scaling, and speculative decoding, each shown to markedly boost throughput and cut latency in real deployments.

AI · Distributed inference · GPU Acceleration

22 min read

IT Services Circle

Feb 7, 2025 · Artificial Intelligence

Building Low‑Cost AI Clusters with Old Phones Using Exo and Open WebUI

This article introduces Exo, an open‑source platform that lets you turn idle smartphones, tablets, and laptops into a distributed AI cluster capable of running large language models, and shows how Open WebUI provides a user‑friendly interface for deploying private AI assistants.

AI clustering · Distributed inference · Exo

6 min read

Baidu Intelligent Cloud Tech Hub

Jul 31, 2023 · Artificial Intelligence

Boosting Large Model Inference: High‑Performance Optimization Techniques

This article explains the background, challenges, and high‑performance optimization methods for deploying large language and multimodal models, covering inference workflow analysis, distributed concurrency, latency reduction, quantization strategies, and service throughput improvements to achieve industry‑leading speed and memory efficiency.

Distributed inference · multimodal diffusion · quantization

12 min read

DataFunTalk

Jul 8, 2023 · Big Data

Key Technologies and Applications of Semantic Knowledge Management in Ant Financial Knowledge Graph Platform

This article presents Ant Group's large‑scale financial knowledge graph platform, detailing its semantic knowledge representation, hybrid graph model, distributed management architecture, core capabilities such as knowledge evolution and cross‑domain fusion, and showcases applications like anti‑fraud capital‑flow analysis and future DataFabric‑oriented knowledge sharing.

Distributed inference · Knowledge Graph · graph database

18 min read

Alibaba Cloud Native

Jun 24, 2023 · Cloud Native

How to Deploy Distributed LLM Inference with DeepSpeed on Alibaba Cloud ACK

This guide walks through deploying a Bloom 7B1 large language model for distributed inference on Alibaba Cloud Container Service (ACK) using DeepSpeed, Arena, and Kubernetes, covering environment setup, model configuration, service launch, verification, and Ingress exposure.

ACK · Arena · Cloud Native

14 min read

YunZhu Net Technology Team

Oct 22, 2021 · Artificial Intelligence

Deep Learning Overview and Introduction to the Lightweight Distributed Inference Engine Avior

This article reviews deep learning and AI frameworks, highlights challenges of online model serving, and presents Avior—a lightweight, distributed inference engine designed for high‑performance AI services, detailing its architecture, layer design, benchmark results, and future development plans.

AI frameworks · Avior · Deep Learning

8 min read
