Inside scaleX640: How China’s First 640‑Card Supernode Redefines AI Compute
The scaleX640 supernode, unveiled at the Wuzhen World Internet Conference, packs 640 AI accelerators into a single rack, combining high compute density, energy efficiency, open ecosystem compatibility, and reliability features to support large-scale AI model training and inference.
China's Sugon unveiled the scaleX640, the world's first single-rack 640-card supernode, at the Wuzhen World Internet Conference, showcasing innovations across hardware architecture, energy efficiency, ecosystem compatibility, and reliability.
Architecture Design: High‑Density Interconnect Breaks Compute Boundaries
The supernode uses a "one-to-two" high-density architecture that interconnects 640 accelerators over an ultra-high-speed bus within a single rack. A dedicated communication domain provides high bandwidth and low latency, and two racks can be combined into a 1,280-card "thousand-card" computing unit, with further expansion over a high-speed inter-rack network.
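The scale-out arithmetic above can be sketched in a few lines. This is a back-of-the-envelope illustration using only the figures quoted in the article (640 cards per rack, two racks per combined unit, and the 100,000-card cluster scale mentioned later); the function names are our own, not Sugon's.

```python
# Scale-out arithmetic for the scaleX640 topology, per the article's figures.
CARDS_PER_RACK = 640   # accelerators interconnected over the in-rack bus
RACKS_PER_UNIT = 2     # racks fused into one 1,280-card "thousand-card" unit

def cards_in_unit(racks: int = RACKS_PER_UNIT) -> int:
    """Total accelerators in a multi-rack computing unit."""
    return racks * CARDS_PER_RACK

def racks_for_cluster(target_cards: int) -> int:
    """Racks needed to reach a target card count (rounded up)."""
    return -(-target_cards // CARDS_PER_RACK)  # ceiling division

print(cards_in_unit())             # 1280
print(racks_for_cluster(100_000))  # 157 racks for a 100,000-card cluster
```

At the article's quoted density, the 100,000-card cluster scale would need on the order of 157 racks, which is why per-rack density and inter-rack networking both matter here.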
Energy‑Efficiency Optimization: Multi‑Tech Fusion for Extreme Efficiency
At the hardware level, an ultra‑dense blade design maximizes rack space usage. Cooling employs immersion phase‑change liquid cooling together with a liquid‑condensation heat exchange device (CDM), delivering 1.72 MW of cooling capacity to support stable operation of thousand‑card units. These combined technologies raise compute density by about 20× over the industry best, achieve a PUE of 1.04, and improve large‑model training and high‑throughput inference efficiency by 30‑40%.
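To put the quoted PUE of 1.04 in context: PUE (Power Usage Effectiveness) is the ratio of total facility power to IT equipment power, so 1.04 means only about 4% overhead for cooling and power delivery. The sketch below shows the standard PUE formula; the sample power figures are illustrative assumptions, not numbers from the article.

```python
# PUE = total facility power / IT equipment power (1.0 is the ideal floor).
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness of a data-center deployment."""
    if it_equipment_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_equipment_kw

# Illustrative example: a 1,000 kW IT load with 40 kW of cooling and
# power-delivery overhead yields the article's quoted figure.
print(round(pue(1040.0, 1000.0), 2))  # 1.04
```

For comparison, air-cooled facilities commonly run well above this figure, which is the efficiency case the article makes for immersion phase-change liquid cooling.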
Ecosystem Compatibility: Open Architecture Lowers Migration Costs
Built on an open AI compute architecture, the hardware supports multiple brands of AI accelerator cards, allowing flexible selection. The software stack is fully compatible with major AI frameworks and ecosystems, having already optimized over 400 mainstream large models, reducing both technical and time costs for platform migration.
Reliability Assurance: Multi‑Layer Design Supports Massive Deployments
Reliability is addressed at both hardware and software layers: the rack incorporates enhanced RAS features for fault tolerance, while the cluster integrates an intelligent operation‑and‑maintenance system and fault‑recovery mechanisms for rapid anomaly detection and handling. After more than 30 days of continuous stability testing, the product meets the needs of ultra‑large clusters with up to 100,000 cards, serving high‑availability AI compute centers and scientific research.
Architects' Tech Alliance
