Why Large‑Model Deployment Stalls: Robots, Scaling Laws, and Multimodal Frontiers
The article analyzes current challenges in deploying large AI models, covering robot automation, scaling‑law limits, vertical‑domain use cases, multimodal breakthroughs, algorithmic evolution, and the hardware‑software trade‑offs of training and inference infrastructures, while questioning ROI and practical feasibility.
1. Large‑Model Deployment Discussion
Qiming Ventures observes that microprocessors drove the marginal cost of compute to near zero, the Internet made information distribution free, and AI now promises zero-cost creation. But the last claim splits into two stages: Step 1, generation of text, images, and video; Step 2, multi-step decision tasks. The author questions whether model output genuinely saves users time or merely consumes it.
1.1 Robotics
Robotics hype this year stems from two advances: LLM‑driven instruction following and mature RL‑servo integration, bringing costs to an acceptable range. However, B‑side production line upgrades may be viable, while C‑side humanoid robots remain a short‑term gimmick because current models lack trustworthy multi‑step decision ability, leaving the commercial loop open.
1.2 Scaling‑Law Ceiling
Researchers both in China and abroad believe model parameters can still grow by two orders of magnitude, to around 100 T, yet usable training data saturates near 15 T tokens, making synthetic data crucial. Scaling to 100 T also raises infrastructure-scale and power-consumption challenges, and inference ROI becomes a pressing question.
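To make the gap concrete, here is a back-of-envelope estimate using the common C ≈ 6·N·D training-FLOPs rule of thumb. The approximation and the 1 EFLOP/s throughput figure are my assumptions for illustration, not claims from the article; only the 100 T-parameter and 15 T-token figures come from the text.

```python
# Back-of-envelope training-compute estimate with the common
# C ~ 6 * N * D approximation (FLOPs ~ 6 x parameters x tokens).
# Parameter/token figures are the article's projections; the 6ND
# rule and the throughput number are illustrative assumptions.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a dense model."""
    return 6 * params * tokens

params = 100e12   # ~100 T parameters (two orders beyond today's ~1 T)
tokens = 15e12    # ~15 T tokens, the article's data-saturation point

flops = training_flops(params, tokens)
print(f"~{flops:.1e} FLOPs")

# At a hypothetical 1 EFLOP/s of sustained cluster throughput,
# wall-clock training time would be flops / 1e18 seconds.
days = flops / 1e18 / 86400
print(f"~{days:.0f} days at 1 EFLOP/s sustained")
```

Even under this crude model, the resulting wall-clock time is measured in decades, which is one way to read the article's point about infrastructure size and power consumption.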
1.3 Vertical‑Domain Models
State‑owned enterprises showcase vertical models for industry/manufacturing, improving societal efficiency but offering weak ROI for private firms. Commercial players focus on finance and healthcare; the author notes that current LLMs cannot meet financial time‑series analysis or multi‑step decision needs, and a single error can destroy trust.
1.4 Multimodal Generation
Recent demos combine video with audio; companies like Step and SenseTime demonstrate strong 1 T‑scale MoE multimodal perception, yet the impact on inference systems and required infrastructure remains an open research area.
2. Algorithmic Evolution
2.1 Multimodal Revives Computer Vision
When RNN/LSTM efficiency limited NLP, ChatGPT shifted focus to language; now multimodal advances bring CV back to the spotlight, with Step and SenseTime delivering impressive products and opening opportunities for video synthesis and physical simulation.
2.2 Grey‑Box Models
Professor Qi Yuan presented a "trustworthy lightweight language model" with billions of parameters, highlighting the emerging grey-box paradigm that blends black-box LLMs with interpretable white-box components.
The CEO of Step also discusses multimodal understanding and future System‑2 planning and abstraction.
Decoder‑only models are pure black boxes; a 1 T model may implicitly encode physical world knowledge, but a complementary white‑box component is needed for logical reasoning, an idea the author relates to recent readings on probabilistic reasoning.
These insights motivate the exploration of Adaptive Sparse‑GNN‑AutoEncoder architectures, encouraging foundation‑model vendors to open‑source SAE data for academic research.
3. Training Infrastructure
The author references a recent article "At a Crossroad in AI Scale‑Up" that proposes a layered Scale‑UP logic, advocating Ethernet‑based Scale‑UP and criticizing asymmetric topologies. Multi‑die MCM‑GPU memory‑semantic interconnects from NVIDIA are highlighted as promising, while future infrastructure should adopt disaggregated architectures and heterogeneous compute.
4. Inference Infrastructure
Building on a Zhihu article about LLM split‑inference, the author discusses software‑hardware co‑design, emphasizing differences between training and inference workloads.
4.1 Differences Between Training and Inference Systems
In training systems, arrival rate and service rate are deterministic: data arrives in fixed batches, backward-pass synchronization makes compute time predictable, and padding can be optimized away.
In inference systems, the arrival rate follows a Poisson distribution while the service rate depends on implementation and scheduling: request arrivals are well modeled as a Poisson process, token-length distributions affect service time, and prefill/decode behavior adds scheduling complexity.
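The contrast between the two arrival processes can be sketched in a few lines. The rates and step times below are illustrative assumptions, not values from the article.

```python
import random

# Training: a fixed batch lands every `step` seconds (deterministic).
# Inference: exponential inter-arrival gaps, i.e. a Poisson process
# with rate `lam` requests/second. All parameter values are
# illustrative assumptions.

def training_arrivals(n_steps: int, step: float = 1.0) -> list[float]:
    """Deterministic arrival times: one batch per training step."""
    return [i * step for i in range(n_steps)]

def inference_arrivals(n: int, lam: float = 5.0, seed: int = 0) -> list[float]:
    """Poisson-process arrival times with rate lam (req/s)."""
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)  # exponential inter-arrival gap
        out.append(t)
    return out

print(training_arrivals(5))   # evenly spaced: [0.0, 1.0, 2.0, 3.0, 4.0]
print(inference_arrivals(5))  # irregular, bursty spacing
```

The bursty inference trace is what forces the scheduler-centric control plane discussed in the next subsection; a training system never sees it.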
4.2 Software Architecture
The control plane handles request latency prediction, scheduling, cluster management, and high‑availability cache, typically on CPUs. The data plane manages prefill and decode nodes and elastic memory pool movements, borrowing ideas from recommendation‑system hierarchical parameter servers.
LLM KV‑Cache handling differs from embedding table lookup, requiring longest‑prefix match logic and adjustments in CPU memory and SSD software stacks.
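A minimal sketch of the longest-prefix-match lookup over token IDs that the text contrasts with embedding-table lookups. The class and field names are illustrative, not any real system's API.

```python
# Trie keyed by token IDs; each node may hold a handle to a cached
# KV block. Lookup returns the longest cached prefix of a request's
# token sequence. Names here are illustrative assumptions.

class KVCacheTrie:
    def __init__(self):
        self.children: dict[int, "KVCacheTrie"] = {}
        self.block = None  # handle to a cached KV block, if one ends here

    def insert(self, tokens: list[int], block) -> None:
        node = self
        for t in tokens:
            node = node.children.setdefault(t, KVCacheTrie())
        node.block = block

    def longest_prefix(self, tokens: list[int]):
        """Return (matched_len, block) for the longest cached prefix."""
        node, best = self, (0, None)
        for i, t in enumerate(tokens):
            if t not in node.children:
                break
            node = node.children[t]
            if node.block is not None:
                best = (i + 1, node.block)
        return best

trie = KVCacheTrie()
trie.insert([1, 2, 3], "blk-A")        # e.g. a shared system prompt
trie.insert([1, 2, 3, 7, 8], "blk-B")  # a longer cached conversation
print(trie.longest_prefix([1, 2, 3, 7, 9]))  # -> (3, 'blk-A')
```

An embedding-table lookup is a single exact-match probe; this prefix walk is why the CPU-memory and SSD software stacks need adjusting.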
4.3 Storage and Memory Pool Design
DRAM scarcity forces SSD usage; a distributed elastic memory pool in front of SSDs is essential. The author cites the MemServe paper (https://arxiv.org/abs/2406.17565) as a reference.
DeepSeek stores user context on SSD for 24 hours, suggesting a similar design for long‑term KV‑Cache persistence. Trie or Tree‑Bitmap structures can index KV‑Cache entries, enabling parallel token‑based workload distribution and asynchronous DMA/RDMA transfers.
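The 24-hour retention window suggests a simple time-bounded store. The sketch below models SSD-backed persistence as a plain dict with lazy expiry; the class, its injectable clock, and the eviction policy are all my assumptions for illustration.

```python
import time

# Time-bounded KV-Cache persistence, following the 24-hour retention
# window mentioned above. A dict stands in for the SSD-backed store;
# all names and the lazy-eviction policy are illustrative assumptions.

TTL_SECONDS = 24 * 3600  # 24-hour retention window

class PersistentKVStore:
    def __init__(self, ttl: float = TTL_SECONDS, clock=time.time):
        self.ttl = ttl
        self.clock = clock  # injectable for testing
        self.entries: dict[str, tuple[float, bytes]] = {}

    def put(self, key: str, kv_blob: bytes) -> None:
        self.entries[key] = (self.clock(), kv_blob)

    def get(self, key: str):
        item = self.entries.get(key)
        if item is None:
            return None
        stored_at, blob = item
        if self.clock() - stored_at > self.ttl:
            del self.entries[key]  # lazily evict expired context
            return None
        return blob
```

A real design would layer the trie index from the previous subsection on top of a store like this, with DMA/RDMA handling the actual block movement.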
4.4 Hardware Architecture
Three‑network convergence (frontend VPC Ethernet, backend IB/RoCE, and NVLink super‑node) is argued to be the future optimal AI network.
Current AI networks consist of three independent nets: frontend storage VPC (Ethernet), backend parameter plane (IB, RoCE2), and super‑node (NVLink, HCCS). Maintaining all three long‑term is unreasonable; they will eventually merge.
Prefill‑Decode M:N deployment requires high bi‑section bandwidth between H800/A800 and H20 GPUs, while avoiding hash collisions that degrade utilization.
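The collision concern can be illustrated with an ECMP-style sketch: flows between prefill and decode nodes are pinned to one of several parallel links by hashing flow identifiers, and with too few flows some links sit idle. All names, port numbers, and the link count below are assumptions for illustration.

```python
import hashlib

# ECMP-style path selection: hash a flow identifier onto one of
# N_LINKS parallel links. Few flows + an unlucky hash => collisions
# and underused links; splitting each flow across several source
# ports adds entropy and spreads load. Everything here is an
# illustrative assumption, not a real deployment's config.

N_LINKS = 4

def pick_link(src: str, dst: str, sport: int) -> int:
    """Hash a flow identifier onto one of N_LINKS parallel paths."""
    h = hashlib.md5(f"{src}:{dst}:{sport}".encode()).digest()
    return h[0] % N_LINKS

# One flow per prefill->decode pair, single source port: collisions likely.
single = {pick_link(f"prefill-{i}", "decode-0", 40000) for i in range(4)}

# Same pairs, each flow split across 8 source ports: better spread.
multi = {pick_link(f"prefill-{i}", "decode-0", 40000 + p)
         for i in range(4) for p in range(8)}

print(f"links used (1 port/flow):  {len(single)} of {N_LINKS}")
print(f"links used (8 ports/flow): {len(multi)} of {N_LINKS}")
```

The same reasoning is why high bi-section bandwidth alone is not enough: utilization also depends on how evenly flows hash across the fabric.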
Scale‑UP considerations include whether additional Load/Store mechanisms are needed; the author suggests that fine‑grained load/store may be unnecessary for most inference workloads, with existing solutions like NVIDIA GPS/PROACT/Fine‑PACK sufficing.
Elastic scaling of prefill and decode instances is highlighted as a key research direction, with cost‑effective token‑based pricing and memory‑pool elasticity shaping future services.
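A minimal sketch of queue-depth-driven elastic scaling for independent prefill and decode pools. The thresholds and capacities are illustrative assumptions; a real controller would also weigh KV-Cache migration cost and instance warm-up time.

```python
# Scale each pool's instance count to its current backlog, within
# bounds. Capacities and bounds are illustrative assumptions.

def target_instances(queue_depth: int, per_instance_capacity: int,
                     min_n: int = 1, max_n: int = 16) -> int:
    """Instances needed to absorb the backlog, clamped to [min_n, max_n]."""
    needed = -(-queue_depth // per_instance_capacity)  # ceiling division
    return max(min_n, min(max_n, needed))

# Prefill and decode pools scale independently (the M:N deployment above).
prefill_n = target_instances(queue_depth=37, per_instance_capacity=8)
decode_n = target_instances(queue_depth=120, per_instance_capacity=32)
print(prefill_n, decode_n)  # -> 5 4
```

Token-based pricing makes this loop directly visible in cost: every over-provisioned instance is idle capacity the operator pays for.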
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
