How We Scaled AI Compute to Millions of Nodes with Ray on WeChat

This article explains how Tencent's WeChat team built the Astra platform on Ray to manage millions of AI compute nodes, addressing challenges of massive scale, heterogeneous GPU resources, low‑priority node instability, deployment complexity, and cost, while detailing architecture, scheduling strategies, and practical usage examples.

DataFunSummit

Aug 28, 2025

Background

WeChat has become an essential daily platform, offering AI services such as voice‑to‑text, AIGC video, and image recognition. The massive user base creates an equally massive demand for AI compute.

Why Ray?

Traditional microservice deployments typically span only a few thousand nodes, whereas AI workloads at this scale require hundreds of thousands of nodes and millions of CPU cores. Ray provides a unified distributed platform that supports diverse compute models and simplifies scaling.

Architecture Overview

The Astra platform treats each Ray‑based application as a unit. A federated cluster layer connects multiple internal Kubernetes clusters, each running a Starlink agent, P2P download component, and TFCC runtime. The Resource layer aggregates heartbeats from all nodes, enabling management of millions of pods and efficient scheduling.

Scheduling Strategies

Three scheduler types are discussed:

Single‑layer (K8s‑like) – limited concurrency, suitable for small clusters.

Two‑layer – higher concurrency by pre‑allocating resources to an upper scheduler.

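Shared scheduling (inspired by Google Omega) – no fixed limit on the number of schedulers; each one plans against a shared view of cluster state and claims resources optimistically, with conflicts resolved at commit time. Ideal for hundreds of thousands of nodes.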

We adopt shared scheduling in Astra‑Ray to meet WeChat's AI compute demands.
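
The optimistic allocation behind shared scheduling can be illustrated with a minimal sketch. It is not Astra's implementation: the names `SharedClusterState`, `NodeState`, `try_commit`, and `schedule`, the per-node version counter, and the "most free GPUs" placement policy are all assumptions made for illustration. Each scheduler plans against a snapshot and commits only if the node has not changed in the meantime.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NodeState:
    free_gpus: int
    version: int = 0  # bumped on every successful commit


@dataclass
class SharedClusterState:
    """Shared view of the cluster that every scheduler reads and commits to."""
    nodes: Dict[str, NodeState] = field(default_factory=dict)

    def snapshot(self) -> Dict[str, NodeState]:
        # Each scheduler plans against its own private copy of the state.
        return {name: NodeState(n.free_gpus, n.version) for name, n in self.nodes.items()}

    def try_commit(self, node: str, gpus: int, seen_version: int) -> bool:
        """Apply a claim only if the node is unchanged since the snapshot.

        A real system would make this check-and-update atomic (lock or transaction).
        """
        current = self.nodes[node]
        if current.version != seen_version or current.free_gpus < gpus:
            return False  # conflict: another scheduler changed this node first
        current.free_gpus -= gpus
        current.version += 1
        return True


def schedule(state: SharedClusterState, gpus: int, max_retries: int = 3) -> Optional[str]:
    """Pick a node from a snapshot, commit optimistically, and retry on conflict."""
    for _ in range(max_retries):
        view = state.snapshot()
        candidates = [name for name, n in view.items() if n.free_gpus >= gpus]
        if not candidates:
            return None
        # Placeholder policy: prefer the node with the most free GPUs.
        best = max(candidates, key=lambda name: view[name].free_gpus)
        if state.try_commit(best, gpus, view[best].version):
            return best
    return None
```

Because no scheduler holds a lock while planning, many schedulers can work in parallel; the cost is an occasional retry when two of them claim the same node.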

Low‑Priority Resource Management

Low‑priority nodes are abundant but unstable. We use Kubernetes PreStop hooks and per‑second heartbeats to quickly remove unhealthy nodes, and dynamically adjust routing weights based on node performance, ensuring stable service despite resource churn.
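
As a rough sketch of how heartbeat-based removal and performance-based routing weights could fit together, consider the following. Everything here is assumed for illustration: the `Router` and `Node` names, the 3-second timeout, the `drain` call (standing in for whatever a PreStop hook would trigger), and the inverse-latency weighting are not taken from the Astra platform.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    last_heartbeat: float   # time of the most recent heartbeat, in seconds
    latency_ms: float       # recent average latency reported by the node
    draining: bool = False  # set when the node announces it is shutting down


class Router:
    def __init__(self, heartbeat_timeout_s: float = 3.0):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.nodes: Dict[str, Node] = {}

    def heartbeat(self, name: str, now: float, latency_ms: float) -> None:
        """Record a heartbeat; a draining node stays draining."""
        node = self.nodes.get(name)
        if node is None:
            self.nodes[name] = Node(last_heartbeat=now, latency_ms=latency_ms)
        else:
            node.last_heartbeat = now
            node.latency_ms = latency_ms

    def drain(self, name: str) -> None:
        """Stop routing to a node that is about to be reclaimed."""
        if name in self.nodes:
            self.nodes[name].draining = True

    def weights(self, now: float) -> Dict[str, float]:
        """Healthy nodes only, weighted by inverse latency (faster nodes get more traffic)."""
        return {
            name: 1.0 / max(node.latency_ms, 1.0)
            for name, node in self.nodes.items()
            if not node.draining and now - node.last_heartbeat <= self.heartbeat_timeout_s
        }

    def pick(self, now: float) -> Optional[str]:
        """Choose a node at random in proportion to its current weight."""
        w = self.weights(now)
        if not w:
            return None
        names = list(w)
        return random.choices(names, weights=[w[n] for n in names], k=1)[0]
```

With per-second heartbeats, a timeout of a few seconds removes a silent node quickly, while the explicit drain path removes a reclaimed node before it misses any heartbeat at all.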

Ray Federation and Fault Tolerance

Each Ray cluster acts as an independent service unit. Faults in head or worker nodes trigger automatic replacement or removal, and the system supports vertical scaling of workers to handle large‑scale deployments.

Astra‑Ray Deployment Steps

Modify code to use Ray Serve.

Package the code, select a Python version, and upload to a Git repository.

Adjust Ray deployment configuration with gray‑release support.

Scale the service; the platform shows thousands of requesters across many nodes.
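
The first step, adapting code to Ray Serve, might look roughly like the sketch below. `EchoModel`, `predict`, and `build_app` are hypothetical names, the echo logic stands in for real model inference, and the replica count is arbitrary; none of this reflects Astra's actual configuration. Ray is imported lazily so the handler can be exercised without a cluster.

```python
class EchoModel:
    """Plain Python handler; the echo stands in for real model inference."""

    def __init__(self, prefix: str = "echo"):
        self.prefix = prefix

    def predict(self, text: str) -> dict:
        return {"result": f"{self.prefix}: {text}"}

    async def __call__(self, request):
        # For HTTP traffic, Ray Serve passes a Starlette Request object.
        payload = await request.json()
        return self.predict(payload["text"])


def build_app(num_replicas: int = 2):
    """Wrap the handler as a Ray Serve application (needs `pip install "ray[serve]"`)."""
    from ray import serve  # lazy import: the handler above has no Ray dependency

    # Equivalent to decorating the class with @serve.deployment(num_replicas=...).
    deployment = serve.deployment(num_replicas=num_replicas)(EchoModel)
    # bind() supplies the constructor arguments and returns a deployable application.
    return deployment.bind(prefix="echo")
```

With Ray installed, passing the result of `build_app()` to `serve.run` deploys the application; scaling out is then a matter of raising `num_replicas`.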

Additional features include a Ray Dashboard for monitoring and built‑in log debugging.

Q&A Highlights

The Q&A session covered how to handle performance fluctuations on low‑priority nodes, the impact of per‑second heartbeat frequency, cross‑cluster scheduling mechanisms, fault recovery for large numbers of replicas, and the current lack of distributed memory sharing between Ray clusters.
