How Baidu Tianchi Supernodes Supercharge Large‑Model Inference: Architecture, Deployment, and Optimization

This article details Baidu's Tianchi supernode design and the accompanying software tuning, covering hardware scale-up, deployment planning, Prefill- and Decode-stage optimizations, quantization strategies, and communication schemes, which together raise large-model inference throughput, cut latency, and lower per-token cost.

Background

Supernodes are a scale-up hardware architecture that interconnects 32 Kunlun XPU cards into one fully connected domain, removing the interconnect bottleneck of traditional 8-card nodes. Within the domain this provides extremely high card-to-card bandwidth and a unified memory pool, enabling large-model inference with lower time-to-first-token (TTFT) and time-per-output-token (TPOT).

Deployment Design for DeepSeek‑R1

The DeepSeek-R1 model is deployed on Baidu Tianchi supernodes using the SGLang inference framework with a PD-separated (Prefill/Decode-disaggregated) architecture. Parallelism is tuned to satisfy the SLA constraints while maximizing hardware utilization; the resulting plan is summarized in the sketch after the list below.

Prefill stage: 16‑card compute units with Tensor Parallelism = 4 (TP4) and Sequence Parallelism = 4 (SP4) keep TTFT < 1 s and increase batch size.

Ultra‑long sequences (128 K): Larger parallelism (TP16 + SP16 or TP32 + SP32) reduces TTFT by >5× and improves per‑card throughput by ~80%.

Decode stage: 32‑card compute unit (one P900 node) uses TP1 (no tensor split) and Expert Parallel = 32 (EP32) to avoid costly AllReduce communication and to maximize memory efficiency.
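
As a rough, hypothetical illustration of this plan (not SGLang's actual launch interface), the sketch below records the prefill and decode parallelism as plain Python dataclasses; all names and fields are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class StagePlan:
    """Parallelism plan for one stage of the PD-separated deployment (illustrative only)."""
    num_cards: int          # cards in one compute unit
    tensor_parallel: int    # TP degree
    sequence_parallel: int  # SP degree (used in prefill)
    expert_parallel: int    # EP degree (used in decode)

# Prefill: 16-card unit, TP4 + SP4, targeting TTFT < 1 s.
prefill_plan = StagePlan(num_cards=16, tensor_parallel=4,
                         sequence_parallel=4, expert_parallel=1)

# Decode: one 32-card supernode, no tensor split (TP1), EP32 to avoid AllReduce.
decode_plan = StagePlan(num_cards=32, tensor_parallel=1,
                        sequence_parallel=1, expert_parallel=32)

for name, plan in (("prefill", prefill_plan), ("decode", decode_plan)):
    # Sanity check: the parallel degrees must fit inside the compute unit.
    assert plan.tensor_parallel * plan.sequence_parallel <= plan.num_cards
    assert plan.expert_parallel <= plan.num_cards
    print(name, plan)
```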

Prefill Optimizations

1. Overlap (dual‑stream) Optimization

For a 4K × 2 request, the input is split into two 4K micro‑batches. While the first micro‑batch performs attention/MLP computation, the second micro‑batch executes the dispatch/combine communication. This interleaving hides communication latency and yields ~20% higher throughput.
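
A minimal sketch of the dual-stream idea, using torch.cuda streams as a stand-in for the Kunlun XPU runtime; attention_mlp and dispatch_combine are placeholder functions, not the real kernels.

```python
import torch

compute_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

def attention_mlp(x):
    # Stand-in for the attention/MLP computation of one micro-batch.
    return x @ torch.randn(x.shape[-1], x.shape[-1], device=x.device)

def dispatch_combine(x):
    # Stand-in for the MoE dispatch/combine communication of one micro-batch.
    return x.clone()

def prefill_layer_step(request_tokens):
    # Split the 4K x 2 input into two 4K micro-batches along the sequence dimension.
    mb0, mb1 = request_tokens.chunk(2, dim=1)
    with torch.cuda.stream(compute_stream):
        out0 = attention_mlp(mb0)        # micro-batch 0: compute
    with torch.cuda.stream(comm_stream):
        out1 = dispatch_combine(mb1)     # micro-batch 1: communication, overlapped with the compute above
    torch.cuda.synchronize()             # join both streams before the next phase
    return out0, out1
```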

2. Communication Scheme Selection

Scheme 1: No sequence split for q_down/kv_down, leading to heavy redundancy and communication volume of BS × 16128.

Scheme 2: TP + SP with AllGather + ReduceScatter, communication volume BS × 5772.

Scheme 3 (adopted): Adds an AlltoAll after the MHA block, achieving the lowest communication volume BS × 3468.
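
For reference, the per-token constants above can be tabulated directly; the helper below simply multiplies each scheme's per-token volume by the batch size, so the relative savings of Scheme 3 are easy to read off (units follow whatever the original figures use).

```python
# Per-token communication volume constants for each scheme, taken from the text above.
COMM_PER_TOKEN = {
    "scheme_1_no_seq_split": 16128,
    "scheme_2_tp_sp_allgather_reducescatter": 5772,
    "scheme_3_tp_sp_plus_alltoall": 3468,   # adopted
}

def comm_volume(batch_size: int) -> dict:
    # Total volume is BS x per-token constant for each scheme.
    return {name: batch_size * per_token for name, per_token in COMM_PER_TOKEN.items()}

for name, volume in comm_volume(batch_size=64).items():
    print(f"{name}: {volume}")
```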

3. Long‑Sequence Prefill

Using TP16 + SP16 or TP32 + SP32 reduces per‑card compute load, shortens TTFT by a factor of five for 128 K inputs, and increases chunk size, resulting in ~80% higher per‑card throughput.
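
For intuition, a tiny sketch of how sequence parallelism shrinks the per-card token count for a 128 K prompt (the chunking here is a simplification of the real scheduler):

```python
def tokens_per_card(seq_len: int, sp_degree: int) -> int:
    # Tokens each card owns under sequence parallelism (ceiling division).
    return -(-seq_len // sp_degree)

for sp in (4, 16, 32):
    print(f"SP{sp}: {tokens_per_card(128 * 1024, sp)} tokens per card for a 128K prompt")
```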

Decode Optimizations

1. Model Overlap (dual‑stream) Optimization

Decode batches are small and communication already occupies under 15% of layer time, so splitting the batch provides little benefit. Instead, independent operators are overlapped: the combine operator running on the XPU cluster executes concurrently with the shared-expert computation on the SDNN, using two streams.
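
A hedged sketch of this operator-level overlap, again written against torch.cuda streams as a stand-in for the XPU runtime; moe_combine and shared_expert_ffn are placeholders rather than the actual operators.

```python
import torch

cluster_stream = torch.cuda.Stream()   # stand-in for the stream driving the combine communication
sdnn_stream = torch.cuda.Stream()      # stand-in for the stream driving shared-expert compute

def moe_combine(expert_out):
    # Stand-in for the EP combine communication across the supernode.
    return expert_out.clone()

def shared_expert_ffn(hidden, w_shared):
    # Stand-in for the shared-expert computation.
    return hidden @ w_shared

def decode_moe_block(expert_out, hidden, w_shared):
    # The combine communication and the shared-expert FFN have no data
    # dependency, so they are issued on two streams and overlap in time.
    with torch.cuda.stream(cluster_stream):
        routed = moe_combine(expert_out)
    with torch.cuda.stream(sdnn_stream):
        shared = shared_expert_ffn(hidden, w_shared)
    torch.cuda.synchronize()
    return routed + shared
```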

2. Operator Fusion and Simplification

Fuse q_k_up_absorb and v_up_absorb into a single batched fully-connected (GEMM) launch, reducing their execution share from 7.4%/6.9% to 5.8%/4.9%; a minimal sketch of this fusion appears below.

Remove unnecessary memcpy / reshape around FlashAttention and eliminate the post‑FA quant‑cast, streamlining the GEMM path.

Autotune GEMM kernels for varying batch sizes and sequence lengths to keep the kernel optimal.

These changes lower attention latency by 11.5% and overall TPOT by ~8%.
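
To illustrate the fusion item above: rather than launching the two up-projections as separate GEMMs, their weight matrices can be concatenated and applied in one matmul, then split. The sketch below uses made-up dimensions and placeholder names.

```python
import torch

def fused_up_absorb(x, w_qk_up, w_v_up):
    """Run the q/k up-projection and the v up-projection as one GEMM.

    x:        (batch, d_in)
    w_qk_up:  (d_in, d_qk)
    w_v_up:   (d_in, d_v)
    """
    w_fused = torch.cat([w_qk_up, w_v_up], dim=1)   # (d_in, d_qk + d_v)
    out = x @ w_fused                                # single fully-connected launch
    return out.split([w_qk_up.shape[1], w_v_up.shape[1]], dim=1)

# Illustrative shapes only.
x = torch.randn(8, 512)
qk, v = fused_up_absorb(x, torch.randn(512, 256), torch.randn(512, 128))
```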

3. Expert Parallel (EP) Communication Optimizations

Automatically select the optimal send strategy for combine/dispatch based on EP scale and token count (a toy selection heuristic is sketched after this list).

During the reduce phase, drop redundant quantization and storage steps, cutting communication overhead from 40% to 15% of layer time.
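
A toy illustration of the send-strategy selection mentioned in the first point above; the strategy names and thresholds are invented for the example and are not the actual heuristics.

```python
def select_send_strategy(ep_size: int, num_tokens: int) -> str:
    """Pick a dispatch/combine send strategy from the EP scale and token count.

    Strategy names and thresholds here are invented for illustration only.
    """
    if ep_size <= 8:
        return "dense_alltoall"      # small EP group: one bulk all-to-all is cheapest
    if num_tokens < 256:
        return "low_latency_p2p"     # few tokens: latency-optimized point-to-point sends
    return "chunked_pipeline"        # large EP group and many tokens: pipelined chunked sends

print(select_send_strategy(ep_size=32, num_tokens=128))
```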

4. Gap (Idle‑Period) Optimizations

MTP Gap: Split the verify step, moving H2D/D2H‑heavy parts after the MTP operation, allowing earlier launch of verify1 and MTP to hide latency (≈5% TPOT reduction).

Batch Gap: Pre‑launch the next request’s XPU operators while the current request’s batch is still executing, fully overlapping CPU batch preparation with XPU execution (≈10% TPOT reduction).
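
A schematic of the batch-gap idea, using a Python producer/consumer pair so that CPU-side batch preparation overlaps with (stand-in) device execution; all names are illustrative.

```python
import queue
import threading
import time

prepared = queue.Queue(maxsize=1)

def cpu_prepare_batches(requests):
    # CPU thread: build the next batch while the device is still executing the current one.
    for req in requests:
        batch = {"request": req, "prepared_at": time.time()}  # stand-in for tokenize/schedule work
        prepared.put(batch)
    prepared.put(None)                                        # sentinel: no more batches

def device_execute_loop():
    # Consumer: pops pre-built batches and launches their (stand-in) XPU work.
    while (batch := prepared.get()) is not None:
        time.sleep(0.01)                                      # stand-in for XPU execution
        print("executed", batch["request"])

producer = threading.Thread(target=cpu_prepare_batches, args=(["req0", "req1", "req2"],))
producer.start()
device_execute_loop()
producer.join()
```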

Results

Prefill per-card throughput: 4.1 K tokens/s (45% higher than the baseline P800).

Decode per-card throughput: 1.006 K tokens/s, a 168% improvement; TPOT ≈ 47.6 ms.

Key Takeaways

The three-pronged optimization of a PD-separated architecture, hardware scale-up to 32-card supernodes, and deep software tuning demonstrates that large-scale scale-up hardware, combined with fine-grained parallelism and communication engineering, can dramatically improve inference performance for massive language models. Future work includes scaling to 512-card supernodes for trillion-parameter models, further reducing per-token cost, and continued software enhancements for upcoming model releases.

Tags: Performance optimization, large-model inference, AI infrastructure, parallelism, supernode architecture
Written by

Baidu Intelligent Cloud Tech Hub

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.
