Unlocking AI Infra: Distributed Inference, PD Separation, TileLang, and Next‑Gen Agent Infrastructure
This article surveys the 2025 AI infrastructure landscape, covering distributed inference with PD‑separation, dynamic DOPD scheduling, AFD attention‑FFN disaggregation, high‑bandwidth cross‑machine communication libraries, the TileLang programming model, RL train‑inference decoupling via SeamlessFlow, and secure, low‑latency agent infra designs for future large‑scale models.
1. Distributed Inference and PD‑separation
Mixture‑of‑Experts (MoE) models use a gating network to activate only a small subset of expert sub‑models per token (sparse activation). Decoder‑only LLMs split inference into a compute‑heavy prefill stage (which builds the KV cache) and a memory‑bandwidth‑bound decode stage (token‑by‑token generation). Because the two stages have divergent resource demands, the PD‑separation concept (DistServe) deploys prefill (P) and decode (D) on separate devices or GPUs, eliminating interference and allowing time‑to‑first‑token (TTFT) and time‑per‑output‑token (TPOT) to be optimized independently.
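The split can be pictured as two worker pools connected by a KV‑cache handoff. The sketch below is illustrative only: the Request/PrefillWorker/DecodeWorker names and the model.prefill / model.decode_step / eos_id members are placeholders, not DistServe's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list                                   # tokenized prompt
    kv_cache: object = None                               # filled by the prefill stage
    output_tokens: list = field(default_factory=list)     # filled by the decode stage

class PrefillWorker:
    """Compute-heavy stage: one forward pass over the full prompt (governs TTFT)."""
    def __init__(self, model):
        self.model = model

    def run(self, req: Request) -> Request:
        req.kv_cache = self.model.prefill(req.prompt_tokens)   # build the KV cache
        return req                                              # ship the KV cache to a decode GPU

class DecodeWorker:
    """Memory-bound stage: one token per step, reusing the KV cache (governs TPOT)."""
    def __init__(self, model):
        self.model = model

    def run(self, req: Request, max_new_tokens: int) -> Request:
        for _ in range(max_new_tokens):
            tok = self.model.decode_step(req.kv_cache, req.output_tokens)
            req.output_tokens.append(tok)
            if tok == self.model.eos_id:                         # hypothetical end-of-sequence id
                break
        return req
```

Because the two worker classes never share a device, each pool can be batched and scaled independently.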
Dynamic PD‑separation (DOPD) adds a short‑term load predictor and a lightweight controller that automatically adjusts the P/D instance ratio, scales resources, and mitigates producer‑consumer imbalance for mixed‑length requests.
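A DOPD‑style controller can be approximated by a short feedback loop: predict the near‑term prompt/output length mix, then pick a P:D instance split that roughly matches prefill production to decode consumption. The following toy sketch uses made‑up per‑instance throughput numbers and is not the paper's actual algorithm:

```python
def predict_load(recent_requests, window=50):
    """Hypothetical short-term predictor: average prompt/output lengths over a sliding window."""
    recent = recent_requests[-window:]
    avg_prompt = sum(r["prompt_len"] for r in recent) / len(recent)
    avg_output = sum(r["output_len"] for r in recent) / len(recent)
    return avg_prompt, avg_output

def rebalance(total_instances, recent_requests,
              prefill_tok_per_s=50_000, decode_tok_per_s=2_000):
    """Pick a P:D split so per-request prefill and decode time are roughly balanced."""
    avg_prompt, avg_output = predict_load(recent_requests)
    prefill_cost = avg_prompt / prefill_tok_per_s     # seconds of prefill work per request
    decode_cost = avg_output / decode_tok_per_s       # seconds of decode work per request
    p = max(1, round(total_instances * prefill_cost / (prefill_cost + decode_cost)))
    d = max(1, total_instances - p)
    return p, d

# Long prompts with short answers shift more instances toward prefill
reqs = [{"prompt_len": 8000, "output_len": 200}] * 100
print(rebalance(total_instances=16, recent_requests=reqs))   # -> (10, 6)
```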
2. Attention‑FFN Disaggregation (AFD)
Within the decode stage, the attention module is memory‑bound (few parameters, heavy KV‑cache traffic) while the feed‑forward network (FFN) is compute‑bound (large weight matrices). AFD places attention and FFN on different devices, further improving resource utilization and reducing hardware cost. Experiments on NVIDIA A800 GPUs and Ascend 910B NPUs show higher throughput and lower deployment cost compared with monolithic deployment.
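Conceptually, a decode step alternates between a memory‑rich device holding the KV cache and a compute‑rich device holding the FFN/expert weights, shipping the per‑layer hidden state between them. A minimal sketch (illustrative only; attn_blocks and ffn_blocks stand in for real layer implementations):

```python
import torch

def decode_step_afd(hidden, kv_caches, attn_blocks, ffn_blocks,
                    attn_device="cuda:0", ffn_device="cuda:1"):
    """One decode step for a single token, alternating attention and FFN devices."""
    for layer, (attn, ffn) in enumerate(zip(attn_blocks, ffn_blocks)):
        hidden = hidden.to(attn_device)
        hidden = attn(hidden, kv_caches[layer])   # memory-bound: reads/appends the KV cache
        hidden = hidden.to(ffn_device)            # activation transfer (the AFD link)
        hidden = ffn(hidden)                      # compute-bound: large GEMMs / MoE experts
    return hidden
```

Because the per‑token hidden state is tiny compared with the KV cache, the cross‑device hop stays cheap provided the interconnect latency can be hidden, e.g. by pipelining multiple micro‑batches.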
3. Cross‑machine Expert Parallel (EP) Communication
MoE models with thousands of experts cannot fit all parameters on a single GPU. DeepEP (an open‑source library from DeepSeek) provides sparsity‑aware all‑to‑all communication that transmits each token's activations only to the ranks hosting its selected experts, drastically reducing bandwidth and latency. It supports native FP8 dispatch and NVLink‑to‑RDMA forwarding.
TRMT builds on DeepEP to optimize performance on RoCE networks, addressing the latency penalty of standard NCCL. Benchmarks on DeepSeek‑V3/R1 show significant bandwidth and latency improvements.
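The communication pattern these libraries optimize is expert‑parallel dispatch/combine: group tokens by the rank that hosts each selected expert and move only those hidden states. The snippet below shows the routing side of that idea in plain PyTorch; it is not the DeepEP API, just an illustration of why the traffic is sparse:

```python
import torch

def dispatch_plan(router_logits, experts_per_rank, top_k=2):
    """Group token indices by destination rank based on top-k routing decisions."""
    topk = torch.topk(router_logits, k=top_k, dim=-1).indices     # [num_tokens, top_k]
    sends = {}                                                    # rank -> token indices to send
    for token_idx, experts in enumerate(topk.tolist()):
        for expert_id in experts:
            rank = expert_id // experts_per_rank
            sends.setdefault(rank, []).append(token_idx)
    return sends   # in a real system these lists drive the all-to-all (dispatch) transfers

# Example: 8 tokens routed among 64 experts spread over 8 ranks
logits = torch.randn(8, 64)
print(dispatch_plan(logits, experts_per_rank=8))
```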
4. TileLang – Data‑flow‑centric Programming Model
TileLang decouples data flow from scheduling. Developers describe tensors and high‑level pipelines; the compiler automatically generates optimal thread‑block‑grid mappings. Three tiers are offered:
Entry‑level: data flow only; the compiler handles all scheduling.
Intermediate: data flow plus callable operators for flexibility.
Fine‑grained: explicit thread‑level control for extreme performance.
A matrix‑multiplication example demonstrates that a few high‑level statements can express an (M×K) × (K×N) GEMM with performance on par with hand‑optimized kernels. TileLang has been used to re‑implement several DeepSeek V3.2 operators.
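The flavor of the entry/intermediate tiers is shown below, adapted from the style of TileLang's documented GEMM example; exact API details (decorator names, the compile/JIT entry point) may differ between TileLang versions:

```python
import tilelang
import tilelang.language as T

def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
           dtype="float16", accum_dtype="float"):
    @T.prim_func
    def gemm(A: T.Tensor((M, K), dtype),
             B: T.Tensor((K, N), dtype),
             C: T.Tensor((M, N), dtype)):
        # One thread block per (block_M x block_N) output tile; the compiler derives
        # the grid/thread mapping from this data-flow description.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            # Software-pipelined reduction over the K dimension
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])
    return gemm
```

The developer states only tile shapes and data movement; shared‑memory staging, pipeline depth, and thread mapping are either inferred or exposed as optional knobs, which is what the three tiers above correspond to.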
5. RL Train‑Inference Separation (SeamlessFlow)
Training data for RL agents is generated by inference, creating a tight coupling that leads to resource bubbles and synchronization issues. SeamlessFlow introduces a trajectory manager that records every LLM‑Agent interaction (inputs, outputs, prompts) to provide a transparent data‑exchange plane.
It also uses a tag‑based resource scheduler: compute slots are labeled as “train” or “infer” and can be dynamically reassigned, eliminating idle compute periods.
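A toy version of tag scheduling (illustrative only, not the framework's real API) makes the idea concrete: every compute slot carries a tag, and slots are re‑tagged between "train" and "infer" according to which side has the larger backlog, so neither side idles:

```python
class TagScheduler:
    def __init__(self, num_slots):
        # Start with every slot tagged for inference (trajectory generation).
        self.tags = {slot: "infer" for slot in range(num_slots)}

    def retag(self, pending_rollouts, pending_train_batches):
        """Rebalance tags toward whichever side has the larger backlog."""
        total = len(self.tags)
        want_infer = round(total * pending_rollouts /
                           max(1, pending_rollouts + pending_train_batches))
        for i, slot in enumerate(sorted(self.tags)):
            self.tags[slot] = "infer" if i < want_infer else "train"
        return self.tags

sched = TagScheduler(num_slots=8)
print(sched.retag(pending_rollouts=6, pending_train_batches=2))   # mostly "infer" slots
print(sched.retag(pending_rollouts=1, pending_train_batches=7))   # mostly "train" slots
```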
6. Agent Infrastructure and Skills Security
Popular agent frameworks (AutoGPT, LangGraph, Dify, CrewAI, AutoGen) differ in autonomy and extensibility. Skills (Skill.md) encode procedural knowledge for agents but introduce attack surfaces:
Semantic hijacking – malicious skill descriptions can be triggered by ambiguous user commands.
Ghost commands – hidden instructions embedded in skill metadata.
Malicious scripts – Skills may invoke local Bash scripts with the agent’s privileges.
Over‑permissive tool activation – e.g., allowed‑tools: Bash can grant zero‑click RCE.
Recommendations (a minimal vetting check is sketched after this list):
Source‑code vetting of Skills.
Sandbox enforcement with fine‑grained permission models.
Runtime auditing and tool‑level access control (OAuth, token limits).
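As one concrete form of the source‑code vetting recommendation, a pre‑install scanner can flag over‑permissive tool grants before a Skill is enabled. The sketch below assumes a YAML‑style frontmatter block and the allowed‑tools field mentioned above; the exact Skill.md layout varies across agent frameworks:

```python
import re

RISKY_TOOLS = {"Bash", "Shell", "Exec"}      # tools that can execute arbitrary code

def vet_skill(skill_md_text: str) -> list:
    """Return a list of human-readable findings for a Skill.md file."""
    findings = []
    frontmatter = re.search(r"^---\n(.*?)\n---", skill_md_text, re.S)
    if not frontmatter:
        return ["no frontmatter found; cannot audit tool permissions"]
    tools_line = re.search(r"allowed-tools:\s*(.+)", frontmatter.group(1))
    if tools_line:
        granted = {t.strip() for t in tools_line.group(1).split(",")}
        risky = granted & RISKY_TOOLS
        if risky:
            findings.append(f"over-permissive tools granted: {sorted(risky)}")
    return findings

example = "---\nname: pdf-helper\nallowed-tools: Read, Bash\n---\n# PDF helper skill"
print(vet_skill(example))    # flags the Bash grant for human review
```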
7. Production‑grade Agent Infra Requirements
Strong isolation to prevent code escape.
Millisecond‑scale environment provisioning.
State persistence across crashes.
Full observability and audit trails.
Tencent Cloud’s stack combines Cube sandbox (~80 ms latency), MVM snapshot + image pre‑warming (zero cold start), guest‑kernel isolation, and a multi‑sandbox VPC file system to achieve secure, low‑cost, serverless‑like operation.
8. Super‑Node Hardware Foundations
NVLink‑based super‑nodes (e.g., Vera Rubin NVL144) aggregate up to 144 GPUs with an intra‑node NVLink bandwidth of 260 TB/s, enabling PD‑separated inference for ultra‑long‑context models (1 M+ tokens). Such high‑bandwidth domains are essential because attention compute grows quadratically with context length while KV‑cache memory grows linearly.
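A back‑of‑envelope calculation makes the scaling concrete. The numbers below are illustrative (a dense GQA‑style transformer with an FP16 KV cache; real long‑context models use tricks such as MLA or sparse attention to soften both curves):

```python
def attention_cost(context_len, layers=61, kv_heads=8, head_dim=128):
    """Rough per-sequence KV-cache size and per-token attention FLOPs."""
    # KV-cache memory grows linearly with context length (K and V, 2 bytes each in FP16).
    kv_bytes = 2 * layers * context_len * kv_heads * head_dim * 2
    # Attention over the whole prefix grows linearly per generated token, so total
    # attention compute over an L-token sequence grows ~quadratically in L.
    flops_per_token = 2 * 2 * layers * context_len * kv_heads * head_dim
    return kv_bytes / 1e9, flops_per_token / 1e9

for L in (128_000, 512_000, 1_000_000):
    gb, gflops = attention_cost(L)
    print(f"context {L:>9,} tokens: KV cache ~ {gb:6.0f} GB, attention ~ {gflops:6.1f} GFLOPs/token")
```

Even with such simplifying assumptions, KV state at million‑token scale exceeds a single GPU's HBM, which is why a super‑node‑wide high‑bandwidth domain is needed for PD‑separated serving.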
9. Model‑System Co‑design
Step‑3 analyzes decoding cost through the lens of arithmetic intensity (FLOPs per byte of memory traffic). Comparing a model's arithmetic intensity with the hardware roofline ridge point (peak compute divided by peak memory bandwidth) determines whether compute or memory bandwidth is the bottleneck. By designing the attention module so that its arithmetic intensity sits close to the roofline of NVIDIA A800 or Ascend 910B, Step‑3 attains superior decoding efficiency.
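The roofline check itself is one line of arithmetic. The hardware figures below are rough public specs used purely for illustration (A800 dense BF16 ≈ 312 TFLOPS, HBM ≈ 2 TB/s):

```python
def ridge_point(peak_tflops, peak_tb_per_s):
    """Arithmetic intensity (FLOPs/byte) at which the compute and bandwidth limits meet."""
    return peak_tflops / peak_tb_per_s

def bound(ai_flops_per_byte, peak_tflops, peak_tb_per_s):
    return ("compute-bound" if ai_flops_per_byte >= ridge_point(peak_tflops, peak_tb_per_s)
            else "memory-bound")

print(ridge_point(312, 2.0))     # ~156 FLOPs/byte for an A800-class device
print(bound(60, 312, 2.0))       # low-intensity decode attention  -> memory-bound
print(bound(300, 312, 2.0))      # large-batch FFN / expert GEMMs  -> compute-bound
```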
Key formulas (re‑expressed for clarity):
BatchSize_MoE ≥ FLOPs / (2 × S × Bandwidth)
S ≥ (H × FLOPs × L) / (Net × Bandwidth × β)
where S is sparsity, H the hidden size, L the number of layers, Net the network bandwidth, and β a constant reflecting quantization precision and pipeline depth. Properly balancing these variables yields high‑throughput inference for long‑context MoE models.
10. Outlook
From 2024 to 2025, model capabilities improved roughly tenfold due to co‑design of models, systems, and data. The next wave (2026) will focus on finer‑grained engineering, cost‑effective hardware, and robust AI‑Agent infrastructure that brings “Jarvis‑like” assistants closer to reality.
References
DistServe: Disaggregating Prefill and Decoding for Goodput‑optimized Large Language Model Serving – https://arxiv.org/pdf/2401.09670
DOPD: A Dynamic PD‑Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving – https://arxiv.org/pdf/2511.20982
Step‑3 is Large yet Affordable: Model‑system Co‑design for Cost‑effective Decoding – https://arxiv.org/pdf/2507.19427
DeepEP – https://github.com/deepseek-ai/DeepEP
TileLang: A Composable Tiled Programming Model for AI Systems – https://arxiv.org/pdf/2504.17577
SeamlessFlow: A Trainer‑Agent Isolation RL Framework Achieving Bubble‑Free Pipelines via Tag Scheduling – https://arxiv.org/pdf/2508.11553
NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads – https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/