Full-Stack Software‑Hardware Co‑Design Redefines China's AI Compute Landscape
The 2026 HaiGuang AI Software Ecosystem Summit in Zhengzhou revealed a decisive industry shift away from headline chip-performance claims toward system‑level effective compute, emphasizing full‑stack software‑hardware collaboration, heterogeneous scheduling, and open architecture as the keys to unlocking trillion‑parameter AI models.
Shift from peak‑chip performance to system‑level effective compute
At the 2026 HaiGuang AI Software Ecosystem Summit in Zhengzhou, participants from cloud providers, OS vendors, database vendors, and large‑model companies emphasized that raw chip specifications no longer determine AI compute advantage. The competition now centers on “system‑level effective compute” and overall efficiency.
Full‑stack hardware‑software co‑design and heterogeneous scheduling
The meeting highlighted two technical themes:
Full‑stack co‑design: integrating low‑level software optimizations (operator tuning, compiler improvements, unified compute integration) with hardware design, avoiding situations where resources are “busy but unable to compute or migrate”.
Heterogeneous scheduling: orchestrating multiple chip types, interconnects, and storage within a super‑node.
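The heterogeneous-scheduling idea can be sketched in a few lines. The following is a minimal illustration, not any real HaiGuang or DTK API: device names, chip families, and the greedy placement policy are all assumptions made for the example. It shows the core decision a super-node scheduler makes, matching each workload to a preferred chip family while falling back to any device with spare capacity.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str      # illustrative name, e.g. "dcu-0" (not a real product ID)
    chip: str      # chip family, e.g. "DCU" or "GPU"
    free_mem: int  # free device memory in GB

@dataclass
class Task:
    name: str
    mem: int               # memory the task needs in GB
    preferred: str = "any" # preferred chip family, or "any"

def schedule(tasks, devices):
    """Greedy heterogeneous placement: try the preferred chip family first,
    then fall back to any device with enough free memory."""
    placement = {}
    # Place the largest tasks first so they are not crowded out.
    for task in sorted(tasks, key=lambda t: -t.mem):
        candidates = [d for d in devices if d.free_mem >= task.mem]
        # Rank devices of the preferred family ahead of fallbacks,
        # and prefer the device with the most free memory within each group.
        candidates.sort(key=lambda d: (
            task.preferred != "any" and d.chip != task.preferred,
            -d.free_mem,
        ))
        if not candidates:
            placement[task.name] = None  # unschedulable with current resources
            continue
        chosen = candidates[0]
        chosen.free_mem -= task.mem
        placement[task.name] = chosen.name
    return placement
```

For example, a 48 GB training shard that prefers a DCU lands on the DCU device, while a smaller family-agnostic task goes to whichever device has the most headroom left. A production scheduler would also weigh interconnect topology and storage locality, which this sketch omits.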
DCU software stack details
The DCU stack released earlier in the year provides the following components:
DTK 26.04 – a mature compute library, used in MLPerf tests, that supported a stable 10‑trillion‑parameter model run.
DAS 1.8 – integrates more than 2,000 operators.
Support for over 100 mainstream AI frameworks.
MLPerf benchmark results showed that these low‑level improvements, rather than merely stacking tens of thousands of DCU units, were the decisive factor in the stability of trillion‑parameter workloads.
ScaleX ten‑thousand‑GPU super‑cluster case study
China’s Zhongke Shuguang deployed the ScaleX super‑cluster with ten thousand GPUs. The deployment demonstrated that simply adding hardware cannot meet trillion‑parameter demands. Success required an open architecture that:
Accepts multiple domestic chip brands.
Provides high‑speed interconnects for “super‑node” communication.
Enables compute‑storage co‑design.
Open‑bus protocol and software stack openness
The HSL open‑bus protocol was standardized, and the core software stacks DTK, DAS, and DAP were released under open licenses. These efforts culminated in the AI Compute Open Architecture Joint Lab, which has invested more than 10 billion RMB over three years to address the three dominant challenges facing domestic AI servers:
Difficulty adapting existing workloads.
Poor heterogeneous compatibility.
Absence of a foundational software stack.
Impact of a unified software ecosystem
When the software layers are fully integrated, heterogeneous compute resources can be allocated on demand, allowing Chinese AI workloads to move from isolated, low‑efficiency pipelines to cohesive, high‑performance systems.