Unlocking AI Infra: Distributed Inference, PD Separation, TileLang, and Next‑Gen Agent Infrastructure
This article surveys the 2025 AI infrastructure landscape, covering distributed inference with PD‑separation, dynamic DOPD scheduling, AFD attention‑FFN disaggregation, high‑bandwidth cross‑machine communication libraries, the TileLang programming model, RL train‑inference decoupling via SeamlessFlow, and secure, low‑latency agent infra designs for future large‑scale models.
1. Distributed Inference and PD‑separation
Mixture‑of‑Experts (MoE) models use a gating network to activate only a small subset of expert sub‑models per token (sparse activation). Decoder‑only LLMs split inference into a compute‑heavy prefill stage (which builds the KV cache) and a memory‑bandwidth‑bound decode stage (token‑by‑token generation). Because the two stages have divergent resource demands, the PD‑separation concept (DistServe) deploys prefill (P) and decode (D) on separate devices or GPUs, eliminating interference and allowing time‑to‑first‑token (TTFT) and time‑per‑output‑token (TPOT) to be optimized independently.
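The split can be pictured as two worker pools connected by a KV‑cache handoff. The sketch below is illustrative only: the Request/PrefillWorker/DecodeWorker names and the model.prefill / model.decode_step / eos_id members are placeholders, not DistServe's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list                                   # tokenized prompt
    kv_cache: object = None                               # filled by the prefill stage
    output_tokens: list = field(default_factory=list)     # filled by the decode stage

class PrefillWorker:
    """Compute-heavy stage: one forward pass over the full prompt (governs TTFT)."""
    def __init__(self, model):
        self.model = model

    def run(self, req: Request) -> Request:
        req.kv_cache = self.model.prefill(req.prompt_tokens)   # build the KV cache
        return req                                              # ship the KV cache to a decode GPU

class DecodeWorker:
    """Memory-bound stage: one token per step, reusing the KV cache (governs TPOT)."""
    def __init__(self, model):
        self.model = model

    def run(self, req: Request, max_new_tokens: int) -> Request:
        for _ in range(max_new_tokens):
            tok = self.model.decode_step(req.kv_cache, req.output_tokens)
            req.output_tokens.append(tok)
            if tok == self.model.eos_id:                         # hypothetical end-of-sequence id
                break
        return req
```

Because the two worker classes never share a device, each pool can be batched and scaled independently.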
Dynamic PD‑separation (DOPD) adds a short‑term load predictor and a lightweight controller that automatically adjusts the P/D instance ratio, scales resources, and mitigates producer‑consumer imbalance for mixed‑length requests.
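A DOPD‑style controller can be approximated by a short feedback loop: predict the near‑term prompt/output length mix, then pick a P:D instance split that roughly matches prefill production to decode consumption. The following toy sketch uses made‑up per‑instance throughput numbers and is not the paper's actual algorithm:

```python
def predict_load(recent_requests, window=50):
    """Hypothetical short-term predictor: average prompt/output lengths over a sliding window."""
    recent = recent_requests[-window:]
    avg_prompt = sum(r["prompt_len"] for r in recent) / len(recent)
    avg_output = sum(r["output_len"] for r in recent) / len(recent)
    return avg_prompt, avg_output

def rebalance(total_instances, recent_requests,
              prefill_tok_per_s=50_000, decode_tok_per_s=2_000):
    """Pick a P:D split so per-request prefill and decode time are roughly balanced."""
    avg_prompt, avg_output = predict_load(recent_requests)
    prefill_cost = avg_prompt / prefill_tok_per_s     # seconds of prefill work per request
    decode_cost = avg_output / decode_tok_per_s       # seconds of decode work per request
    p = max(1, round(total_instances * prefill_cost / (prefill_cost + decode_cost)))
    d = max(1, total_instances - p)
    return p, d

# Long prompts with short answers shift more instances toward prefill
reqs = [{"prompt_len": 8000, "output_len": 200}] * 100
print(rebalance(total_instances=16, recent_requests=reqs))   # -> (10, 6)
```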
2. Attention‑FFN Disaggregation (AFD)
Within the decode stage, the attention module is memory‑bound (few parameters, heavy KV‑cache traffic) while the feed‑forward network (FFN) is compute‑bound (large weight matrices). AFD places attention and FFN on different devices, further improving resource utilization and reducing hardware cost. Experiments on NVIDIA A800 GPUs and Ascend 910B NPUs show higher throughput and lower deployment cost compared with monolithic deployment.
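Conceptually, a decode step alternates between a memory‑rich device holding the KV cache and a compute‑rich device holding the FFN/expert weights, shipping the per‑layer hidden state between them. A minimal sketch (illustrative only; attn_blocks and ffn_blocks stand in for real layer implementations):

```python
import torch

def decode_step_afd(hidden, kv_caches, attn_blocks, ffn_blocks,
                    attn_device="cuda:0", ffn_device="cuda:1"):
    """One decode step for a single token, alternating attention and FFN devices."""
    for layer, (attn, ffn) in enumerate(zip(attn_blocks, ffn_blocks)):
        hidden = hidden.to(attn_device)
        hidden = attn(hidden, kv_caches[layer])   # memory-bound: reads/appends the KV cache
        hidden = hidden.to(ffn_device)            # activation transfer (the AFD link)
        hidden = ffn(hidden)                      # compute-bound: large GEMMs / MoE experts
    return hidden
```

Because the per‑token hidden state is tiny compared with the KV cache, the cross‑device hop stays cheap provided the interconnect latency can be hidden, e.g. by pipelining multiple micro‑batches.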
3. Cross‑machine Expert Parallel (EP) Communication
MoE models with thousands of experts cannot fit all parameters on a single GPU. DeepEP (an open‑source library from DeepSeek) provides sparsity‑aware all‑to‑all communication that transmits each token's activations only to the ranks hosting its selected experts, drastically reducing bandwidth and latency. It supports native FP8 dispatch and NVLink‑to‑RDMA forwarding.
TRMT builds on DeepEP to optimize performance on RoCE networks, addressing the latency penalty of standard NCCL. Benchmarks on DeepSeek‑V3/R1 show significant bandwidth and latency improvements.
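The communication pattern these libraries optimize is expert‑parallel dispatch/combine: group tokens by the rank that hosts each selected expert and move only those hidden states. The snippet below shows the routing side of that idea in plain PyTorch; it is not the DeepEP API, just an illustration of why the traffic is sparse:

```python
import torch

def dispatch_plan(router_logits, experts_per_rank, top_k=2):
    """Group token indices by destination rank based on top-k routing decisions."""
    topk = torch.topk(router_logits, k=top_k, dim=-1).indices     # [num_tokens, top_k]
    sends = {}                                                    # rank -> token indices to send
    for token_idx, experts in enumerate(topk.tolist()):
        for expert_id in experts:
            rank = expert_id // experts_per_rank
            sends.setdefault(rank, []).append(token_idx)
    return sends   # in a real system these lists drive the all-to-all (dispatch) transfers

# Example: 8 tokens routed among 64 experts spread over 8 ranks
logits = torch.randn(8, 64)
print(dispatch_plan(logits, experts_per_rank=8))
```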
4. TileLang – Data‑flow‑centric Programming Model
TileLang decouples data flow from scheduling. Developers describe tensors and high‑level pipelines; the compiler automatically generates optimal thread‑block‑grid mappings. Three tiers are offered:
Entry‑level: data flow only; the compiler handles all scheduling.
Intermediate: data flow plus callable operators for flexibility.
Fine‑grained: explicit thread‑level control for extreme performance.
A matrix‑multiplication example demonstrates that a few high‑level statements can express an (M×K) × (K×N) GEMM with performance on par with hand‑optimized kernels. TileLang has been used to re‑implement several DeepSeek V3.2 operators.
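The flavor of the entry/intermediate tiers is shown below, adapted from the style of TileLang's documented GEMM example; exact API details (decorator names, the compile/JIT entry point) may differ between TileLang versions:

```python
import tilelang
import tilelang.language as T

def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
           dtype="float16", accum_dtype="float"):
    @T.prim_func
    def gemm(A: T.Tensor((M, K), dtype),
             B: T.Tensor((K, N), dtype),
             C: T.Tensor((M, N), dtype)):
        # One thread block per (block_M x block_N) output tile; the compiler derives
        # the grid/thread mapping from this data-flow description.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            # Software-pipelined reduction over the K dimension
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])
    return gemm
```

The developer states only tile shapes and data movement; shared‑memory staging, pipeline depth, and thread mapping are either inferred or exposed as optional knobs, which is what the three tiers above correspond to.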
5. RL Train‑Inference Separation (SeamlessFlow)
Training data for RL agents is generated by inference, creating a tight coupling that leads to resource bubbles and synchronization issues. SeamlessFlow introduces a trajectory manager that records every LLM‑Agent interaction (inputs, outputs, prompts) to provide a transparent data‑exchange plane.
It also uses a tag‑based resource scheduler: compute slots are labeled as “train” or “infer” and can be dynamically reassigned, eliminating idle compute periods.
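A toy version of tag scheduling (illustrative only, not the framework's real API) makes the idea concrete: every compute slot carries a tag, and slots are re‑tagged between "train" and "infer" according to which side has the larger backlog, so neither side idles:

```python
class TagScheduler:
    def __init__(self, num_slots):
        # Start with every slot tagged for inference (trajectory generation).
        self.tags = {slot: "infer" for slot in range(num_slots)}

    def retag(self, pending_rollouts, pending_train_batches):
        """Rebalance tags toward whichever side has the larger backlog."""
        total = len(self.tags)
        want_infer = round(total * pending_rollouts /
                           max(1, pending_rollouts + pending_train_batches))
        for i, slot in enumerate(sorted(self.tags)):
            self.tags[slot] = "infer" if i < want_infer else "train"
        return self.tags

sched = TagScheduler(num_slots=8)
print(sched.retag(pending_rollouts=6, pending_train_batches=2))   # mostly "infer" slots
print(sched.retag(pending_rollouts=1, pending_train_batches=7))   # mostly "train" slots
```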
6. Agent Infrastructure and Skills Security
Popular agent frameworks (AutoGPT, LangGraph, Dify, CrewAI, AutoGen) differ in autonomy and extensibility. Skills (Skill.md) encode procedural knowledge for agents but introduce attack surfaces:
Semantic hijacking – malicious skill descriptions can be triggered by ambiguous user commands.
Ghost commands – hidden instructions embedded in skill metadata.
Malicious scripts – Skills may invoke local Bash scripts with the agent’s privileges.
Over‑permissive tool activation – e.g., allowed‑tools: Bash can grant zero‑click RCE.
Recommendations (a minimal vetting check is sketched after this list):
Source‑code vetting of Skills.
Sandbox enforcement with fine‑grained permission models.
Runtime auditing and tool‑level access control (OAuth, token limits).
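As one concrete form of the source‑code vetting recommendation, a pre‑install scanner can flag over‑permissive tool grants before a Skill is enabled. The sketch below assumes a YAML‑style frontmatter block and the allowed‑tools field mentioned above; the exact Skill.md layout varies across agent frameworks:

```python
import re

RISKY_TOOLS = {"Bash", "Shell", "Exec"}      # tools that can execute arbitrary code

def vet_skill(skill_md_text: str) -> list:
    """Return a list of human-readable findings for a Skill.md file."""
    findings = []
    frontmatter = re.search(r"^---\n(.*?)\n---", skill_md_text, re.S)
    if not frontmatter:
        return ["no frontmatter found; cannot audit tool permissions"]
    tools_line = re.search(r"allowed-tools:\s*(.+)", frontmatter.group(1))
    if tools_line:
        granted = {t.strip() for t in tools_line.group(1).split(",")}
        risky = granted & RISKY_TOOLS
        if risky:
            findings.append(f"over-permissive tools granted: {sorted(risky)}")
    return findings

example = "---\nname: pdf-helper\nallowed-tools: Read, Bash\n---\n# PDF helper skill"
print(vet_skill(example))    # flags the Bash grant for human review
```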
7. Production‑grade Agent Infra Requirements
Strong isolation to prevent code escape.
Millisecond‑scale environment provisioning.
State persistence across crashes.
Full observability and audit trails.
Tencent Cloud’s stack combines Cube sandbox (~80 ms latency), MVM snapshot + image pre‑warming (zero cold start), guest‑kernel isolation, and a multi‑sandbox VPC file system to achieve secure, low‑cost, serverless‑like operation.
8. Super‑Node Hardware Foundations
NVLink‑based super‑nodes (e.g., Vera Rubin NVL144) aggregate up to 144 GPUs with an intra‑node NVLink bandwidth of 260 TB/s, enabling PD‑separated inference for ultra‑long‑context models (1 M+ tokens). Such high‑bandwidth domains are essential because attention compute grows quadratically with context length while KV‑cache memory grows linearly.
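A back‑of‑envelope calculation makes the scaling concrete. The numbers below are illustrative (a dense GQA‑style transformer with an FP16 KV cache; real long‑context models use tricks such as MLA or sparse attention to soften both curves):

```python
def attention_cost(context_len, layers=61, kv_heads=8, head_dim=128):
    """Rough per-sequence KV-cache size and per-token attention FLOPs."""
    # KV-cache memory grows linearly with context length (K and V, 2 bytes each in FP16).
    kv_bytes = 2 * layers * context_len * kv_heads * head_dim * 2
    # Attention over the whole prefix grows linearly per generated token, so total
    # attention compute over an L-token sequence grows ~quadratically in L.
    flops_per_token = 2 * 2 * layers * context_len * kv_heads * head_dim
    return kv_bytes / 1e9, flops_per_token / 1e9

for L in (128_000, 512_000, 1_000_000):
    gb, gflops = attention_cost(L)
    print(f"context {L:>9,} tokens: KV cache ~ {gb:6.0f} GB, attention ~ {gflops:6.1f} GFLOPs/token")
```

Even with such simplifying assumptions, KV state at million‑token scale exceeds a single GPU's HBM, which is why a super‑node‑wide high‑bandwidth domain is needed for PD‑separated serving.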
9. Model‑System Co‑design
Step‑3 analyzes decoding cost through the lens of arithmetic intensity (FLOPs per byte of memory traffic). Comparing a model's arithmetic intensity with the hardware roofline ridge point (peak compute divided by peak memory bandwidth) determines whether compute or memory bandwidth is the bottleneck. By designing the attention module so that its arithmetic intensity sits close to the roofline of NVIDIA A800 or Ascend 910B, Step‑3 attains superior decoding efficiency.
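The roofline check itself is one line of arithmetic. The hardware figures below are rough public specs used purely for illustration (A800 dense BF16 ≈ 312 TFLOPS, HBM ≈ 2 TB/s):

```python
def ridge_point(peak_tflops, peak_tb_per_s):
    """Arithmetic intensity (FLOPs/byte) at which the compute and bandwidth limits meet."""
    return peak_tflops / peak_tb_per_s

def bound(ai_flops_per_byte, peak_tflops, peak_tb_per_s):
    return ("compute-bound" if ai_flops_per_byte >= ridge_point(peak_tflops, peak_tb_per_s)
            else "memory-bound")

print(ridge_point(312, 2.0))     # ~156 FLOPs/byte for an A800-class device
print(bound(60, 312, 2.0))       # low-intensity decode attention  -> memory-bound
print(bound(300, 312, 2.0))      # large-batch FFN / expert GEMMs  -> compute-bound
```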
Key formulas (re‑expressed for clarity):
BatchSize_MoE ≥ FLOPs / (2 × S × Bandwidth)
S ≥ (H × FLOPs × L) / (Net × Bandwidth × β)
where S is sparsity, H the hidden size, L the number of layers, Net the network bandwidth, and β a constant reflecting quantization precision and pipeline depth. Properly balancing these variables yields high‑throughput inference for long‑context MoE models.
10. Outlook
From 2024 to 2025, model capabilities improved roughly tenfold due to co‑design of models, systems, and data. The next wave (2026) will focus on finer‑grained engineering, cost‑effective hardware, and robust AI‑Agent infrastructure that brings “Jarvis‑like” assistants closer to reality.
References
DistServe: Disaggregating Prefill and Decoding for Goodput‑optimized Large Language Model Serving – https://arxiv.org/pdf/2401.09670
DOPD: A Dynamic PD‑Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving – https://arxiv.org/pdf/2511.20982
Step‑3 is Large yet Affordable: Model‑system Co‑design for Cost‑effective Decoding – https://arxiv.org/pdf/2507.19427
DeepEP – https://github.com/deepseek-ai/DeepEP
TileLang: A Composable Tiled Programming Model for AI Systems – https://arxiv.org/pdf/2504.17577
SeamlessFlow: A Trainer‑Agent Isolation RL Framework Achieving Bubble‑Free Pipelines via Tag Scheduling – https://arxiv.org/pdf/2508.11553
NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads – https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/