How We Scaled AI Compute to Millions of Nodes with Ray on WeChat
This article explains how Tencent's WeChat team built the Astra platform on Ray to manage millions of AI compute nodes, addressing challenges of massive scale, heterogeneous GPU resources, low‑priority node instability, deployment complexity, and cost, while detailing architecture, scheduling strategies, and practical usage examples.
Background
WeChat has become an essential daily platform, offering AI services such as voice‑to‑text, AIGC video, and image recognition. The massive user base creates an equally massive demand for AI compute.
Why Ray?
Traditional micro‑services handle only a few thousand nodes, but AI workloads require hundreds of thousands of nodes and millions of CPU cores. Ray provides a unified distributed platform that supports diverse compute models and simplifies scaling.
Architecture Overview
The Astra platform treats each Ray‑based application as a unit. A federated cluster layer connects multiple internal Kubernetes clusters, each running a Starlink agent, P2P download component, and TFCC runtime. The Resource layer aggregates heartbeats from all nodes, enabling management of millions of pods and efficient scheduling.
Scheduling Strategies
Three scheduler types are discussed:
Single‑layer (K8s‑like) – limited concurrency, suitable for small clusters.
Two‑layer – higher concurrency by pre‑allocating resources to an upper scheduler.
Shared scheduling (inspired by Google Omega) – unlimited schedulers, optimistic resource allocation, ideal for hundreds of thousands of nodes.
We adopt shared scheduling in Astra‑Ray to meet WeChat's AI compute demands.
Low‑Priority Resource Management
Low‑priority nodes are abundant but unstable. We use Kubernetes PreStop hooks and per‑second heartbeats to quickly remove unhealthy nodes, and dynamically adjust routing weights based on node performance, ensuring stable service despite resource churn.
Ray Federation and Fault Tolerance
Each Ray cluster acts as an independent service unit. Faults in head or worker nodes trigger automatic replacement or removal, and the system supports vertical scaling of workers to handle large‑scale deployments.
Astra‑Ray Deployment Steps
Modify code to use Ray Serve.
Package the code, select a Python version, and upload to a Git repository.
Adjust Ray deployment configuration with gray‑release support.
Scale the service; the platform shows thousands of requesters across many nodes.
Additional features include a Ray Dashboard for monitoring and built‑in log debugging.
Q&A Highlights
Addressed topics such as handling performance fluctuations of low‑priority nodes, the impact of per‑second heartbeat frequency, cross‑cluster scheduling mechanisms, fault recovery for large numbers of replicas, and the current lack of distributed memory sharing between Ray clusters.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
