Full-Stack Software‑Hardware Co‑Design Redefines China's AI Compute Landscape
The 2026 HaiGuang AI Software Ecosystem Summit in Zhengzhou revealed a decisive industry shift away from headline chip-performance claims toward system‑level effective compute, emphasizing full‑stack software‑hardware collaboration, heterogeneous scheduling, and open architecture as the keys to unlocking trillion‑parameter AI models.
Shift from peak‑chip performance to system‑level effective compute
At the 2026 HaiGuang AI Software Ecosystem Summit in Zhengzhou, participants from cloud providers, OS vendors, database vendors, and large‑model companies emphasized that raw chip specifications no longer determine AI compute advantage. The competition now centers on “system‑level effective compute” and overall efficiency.
Full‑stack hardware‑software co‑design and heterogeneous scheduling
The meeting highlighted two technical themes:
Full‑stack co‑design: integrating low‑level software optimizations (operator tuning, compiler improvements, unified compute integration) with hardware design, avoiding situations where resources are “busy but unable to compute or migrate”.
Heterogeneous scheduling: orchestrating multiple chip types, interconnects, and storage within a super‑node.
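The heterogeneous-scheduling idea can be sketched in a few lines. The following is a minimal illustration, not any real HaiGuang or DTK API: device names, chip families, and the greedy placement policy are all assumptions made for the example. It shows the core decision a super-node scheduler makes, matching each workload to a preferred chip family while falling back to any device with spare capacity.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str      # illustrative name, e.g. "dcu-0" (not a real product ID)
    chip: str      # chip family, e.g. "DCU" or "GPU"
    free_mem: int  # free device memory in GB

@dataclass
class Task:
    name: str
    mem: int               # memory the task needs in GB
    preferred: str = "any" # preferred chip family, or "any"

def schedule(tasks, devices):
    """Greedy heterogeneous placement: try the preferred chip family first,
    then fall back to any device with enough free memory."""
    placement = {}
    # Place the largest tasks first so they are not crowded out.
    for task in sorted(tasks, key=lambda t: -t.mem):
        candidates = [d for d in devices if d.free_mem >= task.mem]
        # Rank devices of the preferred family ahead of fallbacks,
        # and prefer the device with the most free memory within each group.
        candidates.sort(key=lambda d: (
            task.preferred != "any" and d.chip != task.preferred,
            -d.free_mem,
        ))
        if not candidates:
            placement[task.name] = None  # unschedulable with current resources
            continue
        chosen = candidates[0]
        chosen.free_mem -= task.mem
        placement[task.name] = chosen.name
    return placement
```

For example, a 48 GB training shard that prefers a DCU lands on the DCU device, while a smaller family-agnostic task goes to whichever device has the most headroom left. A production scheduler would also weigh interconnect topology and storage locality, which this sketch omits.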
DCU software stack details
The DCU stack released earlier in the year provides the following components:
DTK 26.04 – a mature compute library, used in MLPerf tests, that supported a stable 10‑trillion‑parameter model run.
DAS 1.8 – integrates more than 2,000 operators.
Support for over 100 mainstream AI frameworks.
MLPerf benchmark results showed that these low‑level improvements, rather than merely stacking tens of thousands of DCU units, were the decisive factor in the stability of trillion‑parameter workloads.
ScaleX ten‑thousand‑GPU super‑cluster case study
China’s Zhongke Shuguang deployed the ScaleX super‑cluster with ten thousand GPUs. The deployment demonstrated that simply adding hardware cannot meet trillion‑parameter demands. Success required an open architecture that:
Accepts multiple domestic chip brands.
Provides high‑speed interconnects for “super‑node” communication.
Enables compute‑storage co‑design.
Open‑bus protocol and software stack openness
The HSL open‑bus protocol was standardized, and the core software stacks DTK, DAS, and DAP were released under open licenses. These efforts culminated in the AI Compute Open Architecture Joint Lab, which has invested more than 10 billion RMB over three years to address the three dominant challenges facing domestic AI servers:
Difficulty adapting existing workloads.
Poor heterogeneous compatibility.
Absence of a foundational software stack.
Impact of a unified software ecosystem
When the software layers are fully integrated, heterogeneous compute resources can be allocated on demand, allowing Chinese AI workloads to move from isolated, low‑efficiency pipelines to cohesive, high‑performance systems.