Why AI Infrastructure Must Be Close to Models and Hardware – Insights from Zhu Yibo
In a WAIC 2025 interview, Zhu Yibo, co‑founder of Jieyue Xingchen (StepFun), shares deep insights on AI infrastructure, covering its evolution, the need for tight model‑hardware co‑design, cost‑efficiency metrics, industry challenges, and future directions for large‑scale AI systems.
Overview of AI Infrastructure (AI Infra)
AI Infra is the engineering stack that sits between large‑scale AI models and the underlying hardware (primarily GPUs and AI‑specific chips). Unlike traditional infrastructure, which focuses on CPUs and generic workloads, AI Infra must be tightly coupled to both model characteristics and hardware capabilities to achieve high compute utilization.
Three‑layer Stack (analogy to cloud)
IaaS : Physical servers, GPU/AI‑chip cards, networking (switches, NICs) and large‑scale storage. It provides the three fundamental resources – compute, communication, storage.
PaaS : Cluster schedulers, resource‑management services, and model‑service orchestration platforms.
SaaS : Optimized training and inference frameworks, hand‑tuned low‑level GPU kernels, and model‑specific runtime optimizations.
Key Technical Concepts
Model‑hardware co‑design : Designing models that exploit the specific instruction set, memory hierarchy, and parallelism of the target chip. Current practice is dominated by NVIDIA GPUs; domestic chips lag in performance‑per‑dollar, so a dedicated co‑design effort is required to close the gap.
Mixture‑of‑Experts (MoE) : Early adoption of MoE gives Infra teams strategic influence because MoE reduces inference cost while scaling model capacity. Infra teams must provide the scheduling and routing logic that keeps expert activation efficient (a minimal routing sketch follows this list).
Reinforcement Learning (RL) impact : RL changes the whole stack – hardware selection, system architecture, and model design – because RL training requires fast, low‑latency decoding to generate the rollouts on which reward signals are computed.
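To make the MoE routing point concrete, here is a minimal top‑k routing sketch in plain NumPy. The function name, tensor shapes, and the choice of top‑2 routing are illustrative assumptions rather than a description of any specific production system; real Infra stacks add capacity limits, load‑balancing losses, and expert‑parallel communication on top of this.

```python
# Minimal sketch of top-k expert routing for a Mixture-of-Experts layer.
# Shapes and names are hypothetical; capacity limits and load balancing omitted.
import numpy as np

def route_tokens(hidden, gate_weights, top_k=2):
    """Pick the top_k experts per token and return routing weights.

    hidden:       (num_tokens, d_model) token activations
    gate_weights: (d_model, num_experts) learned gating matrix
    """
    logits = hidden @ gate_weights                         # (tokens, experts)
    top_experts = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of chosen experts
    top_logits = np.take_along_axis(logits, top_experts, axis=-1)
    # Softmax over the selected experts only, so each token's weights sum to 1.
    probs = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return top_experts, probs

tokens = np.random.randn(4, 8)       # 4 tokens, d_model = 8
gates = np.random.randn(8, 16)       # 16 experts
experts, weights = route_tokens(tokens, gates)
print(experts.shape, weights.shape)  # (4, 2) (4, 2)
```

The Infra-relevant part is what happens after this function: tokens routed to different experts must be dispatched across devices and gathered back, which is where scheduling and communication efficiency is won or lost.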
Performance Metrics
MFU (Model FLOPs Utilization) :
MFU = (model FLOPs actually executed) / (theoretical peak FLOPs of the hardware over the same period)
Higher MFU means the hardware is being used closer to its peak capacity (a worked example follows this list).
Cost‑efficiency vs. model performance curve : Infra teams should plot cost/efficiency on the x‑axis and model quality (e.g., perplexity, downstream‑task scores) on the y‑axis, rather than judging models by parameter count alone.
Decoding speed (output latency) is now the primary metric for production systems because it directly determines user‑perceived cost and RL training throughput.
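As a worked example of the MFU definition above, the sketch below uses the common approximation of roughly 6 × N training FLOPs per token for a dense N‑parameter transformer. All numbers (model size, throughput, GPU count, peak FLOPs) are illustrative assumptions, not figures from the interview.

```python
# Back-of-the-envelope MFU for dense transformer training, using the common
# "6 * N FLOPs per token" approximation (forward + backward pass).
def training_mfu(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    achieved_flops = 6 * params * tokens_per_second    # FLOPs/s actually spent on the model
    theoretical_flops = num_gpus * peak_flops_per_gpu  # cluster peak FLOPs/s
    return achieved_flops / theoretical_flops

mfu = training_mfu(
    params=70e9,                 # hypothetical 70B-parameter dense model
    tokens_per_second=400_000,   # measured training throughput (illustrative)
    num_gpus=1_024,
    peak_flops_per_gpu=312e12,   # e.g. A100 BF16 peak
)
print(f"MFU = {mfu:.1%}")        # about 52.6% with these made-up numbers
```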
Economic Impact Example
Assume a fleet of 10,000 GPUs with a monthly rental cost of ¥1 billion. Improving utilization by 10 % saves ¥100 million per month, easily covering the salary of a small Infra team. Smaller companies perform a similar ROI analysis to decide whether to hire dedicated Infra engineers or rely on public‑cloud baselines.
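The arithmetic behind this example is simple enough to sanity‑check directly. The fleet cost and the 10 % gain come from the example above; the team‑cost figure is a hypothetical assumption added only for illustration.

```python
# Toy ROI calculation behind the interview's example: a 10% utilization gain
# on a ¥1B/month GPU fleet versus the cost of a small Infra team.
monthly_gpu_cost = 1_000_000_000        # ¥1B rental for ~10,000 GPUs (from the example)
utilization_gain = 0.10                 # 10% better utilization
infra_team_monthly_cost = 10 * 150_000  # hypothetical: 10 engineers at ¥150k/month

monthly_savings = monthly_gpu_cost * utilization_gain
net_benefit = monthly_savings - infra_team_monthly_cost
print(f"savings ¥{monthly_savings:,.0f}/month, net ¥{net_benefit:,.0f}/month")
# savings ¥100,000,000/month, net ¥98,500,000/month
```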
Model‑Hardware Co‑design Challenges
Most public models are optimized for NVIDIA GPUs; achieving comparable efficiency on domestic chips requires redesigning model architectures (e.g., adjusting attention patterns or quantization schemes; a minimal quantization sketch follows this list).
Full co‑design – where hardware designers and model architects iterate together – is rare outside of large firms (Google, OpenAI) that have the resources to build custom ASICs.
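As one concrete instance of the "quantization schemes" mentioned above, here is a minimal symmetric per‑tensor int8 weight quantization sketch. The function names are illustrative, and production schemes typically use per‑channel scales, calibration data, or quantization‑aware training rather than this bare version.

```python
# Minimal sketch of symmetric per-tensor int8 weight quantization.
import numpy as np

def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0  # map the largest-magnitude weight to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"mean absolute quantization error: {err:.5f}")
```

Whether a given chip's integer units, memory layout, and kernel library make such a scheme actually faster is exactly the kind of question that requires the model and hardware teams to iterate together.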
Visual‑Reasoning Model Release
A new 100‑billion‑parameter visual‑reasoning model operates directly on images without converting them to text, enabling end‑to‑end tasks such as robotic manipulation. The model is open‑sourced, its weights are freely licensed for use on domestic chips, and the inference stack has been tuned to keep costs competitive with NVIDIA‑based solutions.
Industry Landscape
Companies like Snowflake and Databricks are fundamentally data‑management platforms; their AI‑Infra offerings are extensions rather than core AI‑Infra products.
Third‑party AI‑Infra startups (e.g., CoreWeave, and domestic firms such as 无问芯穹 (Infinigence AI) and 潞晨科技 (Luchen Technology)) focus mainly on inference acceleration; training‑as‑a‑service remains limited because training pipelines are tightly coupled with proprietary models.
Strategic Advice for Infra Professionals
Stay close to both model development and hardware engineering; develop the ability to influence hardware roadmaps and model architecture decisions.
Prioritize the primary metric of your workload (e.g., decoding latency for RL, MFU for training) and align optimization efforts accordingly.
Balance deep systems experience with fresh ideas – senior engineers provide stability, while newcomers can introduce novel co‑design approaches.
Future Outlook
Major paradigm shifts have occurred roughly every two years (GPT‑3.5 / 2022, o1 / 2024); the next major change is expected around 2026.
Multi‑modal models are likely to converge on unified architectures that excel at both understanding and generation, similar to the transition from BERT‑only to GPT‑style models.
Continued improvements in decoding speed and compute utilization will remain the decisive factors for competitive advantage.