
Why Large‑Model Deployment Stalls: Robots, Scaling Laws, and Multimodal Frontiers

The article analyzes current challenges in deploying large AI models, covering robot automation, scaling‑law limits, vertical‑domain use cases, multimodal breakthroughs, algorithmic evolution, and the hardware‑software trade‑offs of training and inference infrastructures, while questioning ROI and practical feasibility.


1. Large‑Model Deployment Discussion

Qiming Ventures observes that microprocessors drove the marginal cost of compute to near zero and the Internet made information distribution essentially free; AI now promises zero-cost creation. That last claim splits into two stages: Step 1, generating text, images, and video, and Step 2, multi-step decision tasks. The author questions whether the output genuinely saves users time or consumes it.

1.1 Robotics

This year's robotics hype stems from two advances: LLM-driven instruction following and mature RL-servo integration, which together bring costs into an acceptable range. B2B production-line upgrades may be viable, but consumer-facing humanoid robots remain a short-term gimmick: current models lack trustworthy multi-step decision ability, leaving the commercial loop open.

1.2 Scaling‑Law Ceiling

Researchers both in China and abroad believe model parameters can still grow by two orders of magnitude, to roughly 100 T, yet usable training data saturates near 15 T tokens, making synthetic data crucial. Scaling to 100 T parameters also raises cluster-size and power-consumption challenges, and inference ROI becomes a pressing question.
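To make the infrastructure concern concrete, here is a back-of-envelope sketch using the widely cited 6·N·D approximation for dense-training FLOPs; the per-GPU throughput, utilization, and cluster size are illustrative assumptions, not figures from the article.

```python
# Back-of-envelope training compute for the scaling figures above.
# Assumed (not from the article): the 6*N*D dense-training FLOPs rule
# of thumb, 500 TFLOPS peak per GPU at 40% utilization, a 100k-GPU cluster.
N_params = 100e12                  # ~100 T parameters (two orders above today)
D_tokens = 15e12                   # ~15 T tokens of available training data
total_flops = 6 * N_params * D_tokens

per_gpu = 500e12 * 0.40            # sustained FLOPS per GPU (assumed)
cluster = 100_000                  # GPU count (assumed)
seconds = total_flops / (per_gpu * cluster)
print(f"compute: {total_flops:.1e} FLOPs, "
      f"wall-clock: {seconds / 86_400:,.0f} days on {cluster:,} GPUs")
```

Even under these generous assumptions the wall-clock runs to years, which is why cluster size and power draw dominate the feasibility question.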

1.3 Vertical‑Domain Models

State-owned enterprises showcase vertical models for industry and manufacturing; these improve societal efficiency but offer weak ROI for private firms. Commercial players focus on finance and healthcare, where the author notes that current LLMs cannot meet financial time-series analysis or multi-step decision needs, and a single error can destroy trust.

1.4 Multimodal Generation

Recent demos combine video with audio; companies like Step and SenseTime demonstrate strong 1 T‑scale MoE multimodal perception, yet the impact on inference systems and required infrastructure remains an open research area.

2. Algorithmic Evolution

2.1 Multimodal Revives Computer Vision

When RNN/LSTM efficiency limited NLP, ChatGPT shifted focus to language; now multimodal advances bring CV back to the spotlight, with Step and SenseTime delivering impressive products and opening opportunities for video synthesis and physical simulation.

2.2 Grey‑Box Models

Professor Qi Yuan presented a "trustworthy lightweight language model" with billions of parameters, highlighting the emerging grey-box paradigm that blends black-box LLMs with interpretable white-box components.

The CEO of Step also discusses multimodal understanding and future System‑2 planning and abstraction.

Decoder-only models are pure black boxes: a 1 T-parameter model may implicitly encode knowledge of the physical world, but a complementary white-box component is still needed for logical reasoning, an idea the author relates to recent reading on probabilistic reasoning.

These insights motivate exploring adaptive sparse GNN-autoencoder architectures, and the author encourages foundation-model vendors to open-source sparse-autoencoder (SAE) data for academic research.

3. Training Infrastructure

The author references a recent article, "At a Crossroad in AI Scale-Up," which proposes a layered Scale-UP logic, advocates Ethernet-based Scale-UP, and criticizes asymmetric topologies. NVIDIA's multi-die MCM-GPU memory-semantic interconnects are highlighted as promising, and the author argues future infrastructure should adopt disaggregated architectures and heterogeneous compute.

4. Inference Infrastructure

Building on a Zhihu article about disaggregated (split prefill/decode) LLM inference, the author discusses software-hardware co-design, emphasizing the differences between training and inference workloads.

4.1 Differences Between Training and Inference Systems

Training systems: arrival rate and service rate are deterministic. Data arrives in fixed batches, backward-pass synchronization makes compute time predictable, and padding can be optimized ahead of time.

Inference systems: arrival rate follows a Poisson process, while service rate depends on implementation and scheduling. Token-length distributions drive service-time variance, and the prefill/decode split adds scheduling complexity.
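To see why the stochastic side is harder to provision, here is a minimal single-server queue simulation contrasting the two regimes at the same 80% utilization; the arrival rate and service-time distribution are illustrative assumptions.

```python
import random

def simulate(interarrivals, services):
    """Single-server FIFO queue; returns the mean waiting time."""
    t = depart = wait_sum = 0.0
    for gap, svc in zip(interarrivals, services):
        t += gap                            # arrival time of this request
        start = max(t, depart)              # wait if the server is busy
        wait_sum += start - t
        depart = start + svc
    return wait_sum / len(services)

random.seed(0)
n, rate, mean_svc = 50_000, 1.0, 0.8        # assumed: 80% utilization

# Training-like: deterministic arrivals and service times -> no waiting.
det = simulate([1 / rate] * n, [mean_svc] * n)

# Inference-like: Poisson arrivals, token-length-driven service variance.
poi = simulate([random.expovariate(rate) for _ in range(n)],
               [random.expovariate(1 / mean_svc) for _ in range(n)])

print(f"deterministic mean wait: {det:.2f}  Poisson mean wait: {poi:.2f}")
```

At identical average load, the deterministic pipeline never waits, while the Poisson queue accumulates several service times of delay; that variance is exactly what inference schedulers have to absorb.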

4.2 Software Architecture

The control plane handles request-latency prediction, scheduling, cluster management, and a high-availability cache, and typically runs on CPUs. The data plane manages the prefill and decode nodes and elastic memory-pool data movement, borrowing ideas from the hierarchical parameter servers used in recommendation systems.
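One way to picture the split is a hypothetical control-plane scheduler that predicts request latency from queued work and routes to the least-loaded prefill node; the class names, throughput constant, and routing policy below are illustrative, not taken from any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class PrefillNode:
    name: str
    queued_tokens: int = 0                 # load signal from the data plane

@dataclass
class ControlPlane:
    """Hypothetical CPU-side scheduler: predict latency, pick a node."""
    nodes: list = field(default_factory=list)
    tokens_per_sec: float = 20_000.0       # assumed prefill throughput

    def predict_latency(self, node, prompt_tokens):
        return (node.queued_tokens + prompt_tokens) / self.tokens_per_sec

    def schedule(self, prompt_tokens):
        node = min(self.nodes,
                   key=lambda n: self.predict_latency(n, prompt_tokens))
        node.queued_tokens += prompt_tokens  # hand the work to the data plane
        return node

cp = ControlPlane(nodes=[PrefillNode("p0"), PrefillNode("p1")])
for prompt in (512, 2048, 128):
    chosen = cp.schedule(prompt)
    print(prompt, "->", chosen.name)
```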

LLM KV-Cache handling differs from embedding-table lookup: it requires longest-prefix-match logic and corresponding adjustments to the CPU-memory and SSD software stacks.
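A minimal sketch of that longest-prefix-match step, assuming token-ID keys stored in a trie; the handle it returns for cached KV blocks is a hypothetical stand-in for an engine-specific structure.

```python
class KVTrie:
    """Token-ID trie mapping prompt prefixes to cached KV-block handles."""
    def __init__(self):
        self.children, self.handle = {}, None

    def insert(self, tokens, handle):
        node = self
        for t in tokens:
            node = node.children.setdefault(t, KVTrie())
        node.handle = handle                # KV blocks covering this prefix

    def longest_prefix(self, tokens):
        node, best_len, best = self, 0, None
        for i, t in enumerate(tokens, 1):
            node = node.children.get(t)
            if node is None:
                break
            if node.handle is not None:
                best_len, best = i, node.handle
        return best_len, best               # reuse KV for tokens[:best_len]

trie = KVTrie()
trie.insert([1, 5, 9], "blk-A")             # a previously cached prefix
hit, handle = trie.longest_prefix([1, 5, 9, 4, 2])
print(f"reuse {hit} cached tokens from {handle}, recompute the rest")
```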

4.3 Storage and Memory Pool Design

DRAM scarcity forces SSD usage; a distributed elastic memory pool in front of SSDs is essential. The author cites the MemServe paper (https://arxiv.org/abs/2406.17565) as a reference.

DeepSeek stores user context on SSD for 24 hours, suggesting a similar design for long‑term KV‑Cache persistence. Trie or Tree‑Bitmap structures can index KV‑Cache entries, enabling parallel token‑based workload distribution and asynchronous DMA/RDMA transfers.
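A sketch of that persistence idea under stated assumptions: an LRU DRAM tier that spills to an SSD tier whose entries expire after the 24-hour window mentioned above; the API, capacities, and promotion policy are hypothetical.

```python
import time
from collections import OrderedDict

TTL = 24 * 3600                             # 24 h retention, per the text

class TieredKVCache:
    """Hypothetical DRAM tier with LRU spill to a TTL-bounded SSD tier."""
    def __init__(self, dram_slots):
        self.slots = dram_slots
        self.dram = OrderedDict()           # hot tier: key -> KV blob
        self.ssd = {}                       # cold tier: key -> (expiry, blob)

    def put(self, key, blob):
        self.dram[key] = blob
        self.dram.move_to_end(key)
        if len(self.dram) > self.slots:     # evict the LRU entry to SSD
            old_key, old_blob = self.dram.popitem(last=False)
            self.ssd[old_key] = (time.time() + TTL, old_blob)

    def get(self, key):
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key]
        entry = self.ssd.pop(key, None)
        if entry and entry[0] > time.time():
            self.put(key, entry[1])         # promote back into DRAM
            return entry[1]
        return None                         # miss or expired: recompute

cache = TieredKVCache(dram_slots=2)
for k in ("sess-1", "sess-2", "sess-3"):    # third put spills sess-1 to SSD
    cache.put(k, b"kv...")
print(cache.get("sess-1") is not None)      # True: promoted back within TTL
```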

4.4 Hardware Architecture

Three-network convergence (frontend VPC Ethernet, backend IB/RoCE parameter plane, and the NVLink super-node fabric) is argued to be the optimal future AI network.

Current AI networks consist of three independent nets: a frontend storage VPC (Ethernet), a backend parameter plane (IB, RoCEv2), and a super-node fabric (NVLink, HCCS). Maintaining all three long-term is unreasonable; they will eventually merge.

Prefill-decode M:N deployment requires high bisection bandwidth between the H800/A800 and H20 GPUs, while avoiding hash collisions that degrade link utilization.
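The collision concern is the classic ECMP problem: a few elephant flows hashed onto the same uplink. The toy illustration below uses hypothetical flow tuples (with RoCEv2's UDP port 4791) and an assumed link count.

```python
from zlib import crc32

LINKS = 8                                   # assumed uplinks between pools
# Hypothetical large prefill->decode KV-transfer flows (5-tuple stand-ins).
flows = [(f"10.0.0.{i}", f"10.1.0.{i % 4}", 4791) for i in range(12)]

load = [0] * LINKS
for f in flows:
    load[crc32(repr(f).encode()) % LINKS] += 1  # static ECMP-style hashing

# Uneven counts mean some links sit idle while others carry several
# elephant flows, the utilization loss the text warns about.
print("flows per link:", load)
```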

Scale-UP considerations include whether additional Load/Store mechanisms are needed; the author suggests fine-grained load/store may be unnecessary for most inference workloads, with existing approaches such as NVIDIA's GPS, PROACT, and Fine-PACK sufficing.

Elastic scaling of prefill and decode instances is highlighted as a key research direction, with cost‑effective token‑based pricing and memory‑pool elasticity shaping future services.
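As a closing sketch of the elasticity point, here is a toy scaling policy that resizes a prefill or decode pool from its queue depth; the threshold and return convention are assumptions for illustration only.

```python
import math

def scale_decision(queue_depth, instances, target_per_instance=32):
    """Toy autoscaler: size a pool so each instance carries roughly
    target_per_instance queued units; the knob is assumed, not prescribed."""
    desired = max(1, math.ceil(queue_depth / target_per_instance))
    return desired - instances              # +grow, -shrink, 0 hold

# Prefill and decode pools scale independently: their load signals
# (queued prompt tokens vs. active decode streams) behave differently.
print(scale_decision(queue_depth=180, instances=4))   # +2 -> grow
print(scale_decision(queue_depth=40,  instances=4))   # -2 -> shrink
```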

Tags: Robotics, large models, multimodal, algorithm evolution, training infrastructure, inference infrastructure
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
