How We Scaled WeChat AI Services with Ray: Lessons from Million‑Node Deployments
This article examines how WeChat’s Astra platform leverages the Ray distributed framework to manage million‑node AI workloads, addressing challenges of scale, heterogeneous GPU resources, operational complexity, and cost, and outlines the architecture that unifies Ray services across multiple Kubernetes clusters.
Background
WeChat has become part of daily life for its users, and as AI has matured it has added many AI computing services, such as voice-to-text, AIGC in video channels, and image recognition. Because the user base is so large, the AI workloads behind these services are correspondingly large.
Why Ray?
To handle large-scale AI tasks we built the Astra platform, which now runs many AI algorithm services, including LLM and multimedia processing workloads. We mainly use Ray Serve. As a backend-focused team, we needed a way to bridge AI algorithm services and traditional micro-services.
Key challenges
Scale: Traditional micro-services run on a few thousand nodes, but AI services require tens of thousands of nodes and millions of CPU cores.
Resource diversity: AI services need GPUs of various brands (NVIDIA, ZhiXiao, Ascend), each requiring specific adapters.
Operations complexity: AI algorithms are pure compute services without business logic, often needing separate clusters per use case.
Cost: GPU hardware is expensive; reducing inference cost and improving utilization is critical.
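To make the "specific adapters" point under resource diversity concrete, here is a minimal sketch of a per-vendor adapter registry. It is not Astra's implementation: the class names, the `ADAPTERS` registry, and `worker_env` are hypothetical. It assumes the only vendor-specific step is restricting a worker to its assigned devices through the runtime's visibility variable (`CUDA_VISIBLE_DEVICES` for CUDA, `ASCEND_RT_VISIBLE_DEVICES` for Ascend). No ZhiXiao adapter is shown because the article does not describe its runtime.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class AcceleratorAdapter(ABC):
    """Hypothetical vendor-specific glue called before a worker starts."""

    vendor: str

    @abstractmethod
    def visibility_env(self, device_ids: list[int]) -> dict[str, str]:
        """Return env vars that restrict a worker to the given devices."""


class NvidiaAdapter(AcceleratorAdapter):
    vendor = "nvidia"

    def visibility_env(self, device_ids: list[int]) -> dict[str, str]:
        # CUDA_VISIBLE_DEVICES is the CUDA runtime's device-visibility variable.
        return {"CUDA_VISIBLE_DEVICES": ",".join(map(str, device_ids))}


class AscendAdapter(AcceleratorAdapter):
    vendor = "ascend"

    def visibility_env(self, device_ids: list[int]) -> dict[str, str]:
        # Assumed equivalent variable for the Ascend runtime.
        return {"ASCEND_RT_VISIBLE_DEVICES": ",".join(map(str, device_ids))}


# Registry keyed by vendor name; a new card type only needs a new adapter class.
ADAPTERS: dict[str, AcceleratorAdapter] = {
    adapter.vendor: adapter for adapter in (NvidiaAdapter(), AscendAdapter())
}


def worker_env(vendor: str, device_ids: list[int]) -> dict[str, str]:
    """Look up the adapter for a vendor and build the worker's environment."""
    try:
        adapter = ADAPTERS[vendor]
    except KeyError:
        raise ValueError(f"no adapter registered for vendor {vendor!r}") from None
    return adapter.visibility_env(device_ids)
```

With this shape, the scheduling code calls `worker_env("nvidia", [0, 2])` or `worker_env("ascend", [1])` without knowing anything about the card behind it, which is the kind of isolation an adapter layer is meant to provide.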
Choosing Ray
Ray provides a single distributed platform that covers several compute models, such as stateless tasks, stateful actors, and libraries like Ray Serve built on top of them. Having one ecosystem for all of these simplifies both development and resource management.
Adoption timeline
We have followed Ray since 2022. Encouraged by its use in high-profile projects such as ChatGPT, we invested heavily in using it to extend single-machine applications to distributed environments.
Architecture
The Astra‑Ray architecture treats each Ray‑based application as a basic unit. It runs on a federated cluster that spans several internal Kubernetes clusters. Each K8s node runs our Starlink management agent, a P2P network‑penetration component, and the TFCC AI runtime.
Images illustrate the platform layout.
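The idea of an application as the basic unit, placed onto a federation of Kubernetes clusters, can be sketched as a small data model with a placement function. This is an illustrative assumption rather than how Astra schedules: `Cluster`, `RayApp`, and `place` are hypothetical names, capacity is reduced to a single node count, and the greedy "largest free cluster first" policy is chosen only for brevity. It also assumes one application may span more than one cluster.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One member Kubernetes cluster of the federation (hypothetical model)."""
    name: str
    free_nodes: int


@dataclass
class RayApp:
    """A Ray-based application, the unit that gets placed."""
    name: str
    nodes_needed: int
    placement: dict[str, int] = field(default_factory=dict)


def place(app: RayApp, clusters: list[Cluster]) -> bool:
    """Spread an app across clusters, taking from the emptiest cluster first.

    Returns False and leaves everything unchanged if the federation as a
    whole does not have enough free nodes.
    """
    if sum(c.free_nodes for c in clusters) < app.nodes_needed:
        return False
    remaining = app.nodes_needed
    for cluster in sorted(clusters, key=lambda c: c.free_nodes, reverse=True):
        if remaining == 0:
            break
        take = min(cluster.free_nodes, remaining)
        if take == 0:
            continue
        cluster.free_nodes -= take
        app.placement[cluster.name] = take
        remaining -= take
    return True


if __name__ == "__main__":
    federation = [Cluster("k8s-a", 300), Cluster("k8s-b", 500)]
    asr = RayApp("voice-to-text", nodes_needed=600)
    print(place(asr, federation), asr.placement)
```

A real scheduler would also have to account for accelerator type, network locality, and failure domains. The point of the sketch is only that the caller asks for capacity for an application and never names a specific cluster.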
DataFunSummit
