How Alibaba’s AI·OS Powers 10 Years of Search & Recommendation at Scale
Alibaba’s AI·OS, a decade‑old big‑data deep‑learning online serving platform, underpins the group’s search and recommendation services, delivering sub‑10‑second updates, supporting massive models, and integrating components like TPP, RTP, HA3, DII, and iGraph to drive efficient algorithm iteration and cloud‑scale innovation.
On September 28, Alibaba Search celebrated its tenth anniversary. Over that decade it has grown into the search and recommendation platform behind Taobao, Tmall, Youku, and overseas e‑commerce, driving the majority of the group’s GMV. In the intelligent era, the platform has evolved into a big‑data deep‑learning online service system: it keeps end‑to‑end update latency under 10 seconds, allows deep‑learning networks to be split flexibly, and supports multi‑terabyte models, heterogeneous and real‑time computing, and large‑scale training.
AI·OS Overview
AI·OS (Online Serving) is a big‑data deep‑learning online service framework that Alibaba’s engineering, algorithm, and efficiency teams have built over ten years. It underpins all search and recommendation workloads across the Alibaba Group, serving e‑commerce, cloud, video, logistics, and more. Its cloud product portfolio is offered to developers worldwide and brings in tens of millions in revenue.
Core Service Components
The system comprises five key service components:
TPP – Recommendation business platform
RTP – Deep‑learning prediction engine
HA3 – Search recall engine
DII – Recommendation recall engine
iGraph – Graph query engine
These components enable rapid composition and deployment of algorithmic flows via graph‑based operator pipelines, allowing online services to keep up with model training without lag.
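The idea of composing an algorithmic flow from operators arranged in a graph can be sketched in a few lines. This is a minimal illustration only, not the AI·OS API: the `OperatorGraph` class, the operator names (`recall`, `score`, `rank`), and the toy catalog are all hypothetical, and the execution order comes from Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Iterable, List


class OperatorGraph:
    """Toy DAG of named operators (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._ops: Dict[str, Callable[..., object]] = {}
        self._deps: Dict[str, List[str]] = {}

    def add(self, name: str, fn: Callable[..., object],
            deps: Iterable[str] = ()) -> None:
        # Register an operator and the upstream results it consumes.
        self._ops[name] = fn
        self._deps[name] = list(deps)

    def run(self, request: dict) -> Dict[str, object]:
        # "request" is a source node: it has a value but no operator.
        results: Dict[str, object] = {"request": request}
        for name in TopologicalSorter(self._deps).static_order():
            if name in results:
                continue
            args = [results[dep] for dep in self._deps[name]]
            results[name] = self._ops[name](*args)
        return results


# Hypothetical data standing in for a recall index and a scoring model.
CATALOG = {"shoes": ["sneaker", "boot", "sandal"]}
SCORES = {"sneaker": 0.9, "boot": 0.4, "sandal": 0.7}


def recall(request):
    return CATALOG.get(request["query"], [])


def score(candidates):
    return {item: SCORES.get(item, 0.0) for item in candidates}


def rank(candidates, scores):
    return sorted(candidates, key=lambda item: scores[item], reverse=True)


graph = OperatorGraph()
graph.add("recall", recall, deps=["request"])
graph.add("score", score, deps=["recall"])
graph.add("rank", rank, deps=["recall", "score"])

ranked = graph.run({"query": "shoes"})["rank"]
print(ranked)
```

Swapping in a new scoring operator here only means registering a different function under `score`; the rest of the flow is untouched, which is the property that makes this style convenient for fast algorithm iteration.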
Suez Framework and Hippo Scheduler
The Suez framework provides a unified abstraction for big‑data online services, guaranteeing that data updates become visible within seconds and with strong consistency. It standardizes three dimensions: index storage (full‑text, graph, model), index management (full, incremental, real‑time), and service management (consistency, traffic shaping, scaling).
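To make the three index management modes concrete, here is a minimal sketch of a toy inverted index with a full build, an incremental batch, and a single real‑time update. The `ToyIndex` class and its method names are hypothetical and are not Suez interfaces. In this toy the three paths differ only in granularity; a production system would also differ in where the data is staged and how quickly it becomes visible.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Doc = Tuple[int, str]  # (doc_id, whitespace-separated text)


class ToyIndex:
    """In-memory inverted index (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._docs: Dict[int, str] = {}
        self._postings: Dict[str, Set[int]] = defaultdict(set)

    def _remove(self, doc_id: int) -> None:
        old = self._docs.pop(doc_id, None)
        if old is not None:
            for term in old.split():
                self._postings[term].discard(doc_id)

    def _put(self, doc_id: int, text: str) -> None:
        # Replace any previous version so stale terms stop matching.
        self._remove(doc_id)
        self._docs[doc_id] = text
        for term in text.split():
            self._postings[term].add(doc_id)

    def full_build(self, docs: Iterable[Doc]) -> None:
        # Full: discard everything and rebuild from a complete snapshot.
        self._docs.clear()
        self._postings.clear()
        for doc_id, text in docs:
            self._put(doc_id, text)

    def apply_incremental(self, docs: Iterable[Doc]) -> None:
        # Incremental: merge a batch of changes into the existing index.
        for doc_id, text in docs:
            self._put(doc_id, text)

    def update_realtime(self, doc_id: int, text: str) -> None:
        # Real-time: apply one change as soon as it arrives.
        self._put(doc_id, text)

    def search(self, term: str) -> List[int]:
        return sorted(self._postings.get(term, set()))


index = ToyIndex()
index.full_build([(1, "red shoes"), (2, "blue shoes")])
index.apply_incremental([(3, "red hat")])
index.update_realtime(2, "red shoes")  # doc 2 changes colour

print(index.search("red"))
print(index.search("blue"))
```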
Hippo, the cluster resource scheduler, allocates resources from mixed pools shared by model training (PAI‑TF) and real‑time computation (Blink). At peak, it has scheduled more than 2,000 machines with thousands of CPU cores, providing a large amount of compute capacity at no additional cost.
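The notion of a mixed resource pool, where different kinds of jobs are placed on the same machines, can be illustrated with a first‑fit placement sketch. This is an assumption‑laden toy and says nothing about Hippo's actual scheduling policy: the `Machine` class, the `schedule` function, the job names, and the core counts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Machine:
    name: str
    free_cores: int
    jobs: List[str] = field(default_factory=list)


def schedule(machines: List[Machine],
             requests: List[Tuple[str, int]]) -> Dict[str, Optional[str]]:
    """First-fit placement: map each job to a machine name, or None."""
    placement: Dict[str, Optional[str]] = {}
    for job, cores in requests:
        placement[job] = None  # stays None if no machine has room
        for machine in machines:
            if machine.free_cores >= cores:
                machine.free_cores -= cores
                machine.jobs.append(job)
                placement[job] = machine.name
                break
    return placement


# One shared pool hosting online serving, streaming, and training jobs.
pool = [Machine("m1", free_cores=16), Machine("m2", free_cores=8)]
placement = schedule(pool, [
    ("online-search", 8),
    ("stream-job", 6),
    ("training-job", 8),
    ("oversized-job", 32),
])
print(placement)
```

First‑fit is chosen here only because it is short to read. A real scheduler would weigh priorities, preemption, and resources other than CPU cores.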
Integration with Blink and PAI
Blink, a general‑purpose real‑time computation engine, originated from AI·OS and now delivers data updates within seconds with eventual consistency. PAI‑TF, after being adapted to Hippo’s resource constraints, now handles all model training tasks for search and recommendation and integrates with AI·OS’s graph execution engine.
Graph Computing and Future Directions
iGraph provides graph query capabilities, and the system’s graph‑based operator pipelines enable rapid experimentation. While classic offline graph computation is well studied, AI·OS pushes graph concepts into online services, demanding strict consistency and low‑latency updates.
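As a rough illustration of what an online graph query with immediately visible updates looks like, the sketch below keeps an adjacency map in memory and answers a two‑hop query. It is not the iGraph query interface: the `ToyGraph` class, its methods, and the user and item identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class ToyGraph:
    """In-memory directed graph (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._out: Dict[str, Set[str]] = defaultdict(set)

    def add_edge(self, src: str, dst: str) -> None:
        # An online update: visible to the very next query.
        self._out[src].add(dst)

    def neighbors(self, node: str) -> Set[str]:
        # Return a copy so callers cannot mutate the graph.
        return set(self._out.get(node, ()))

    def two_hop(self, node: str) -> Set[str]:
        # Nodes reachable in exactly two steps, excluding the start
        # node and its direct neighbours.
        first = self.neighbors(node)
        second: Set[str] = set()
        for mid in first:
            second |= self.neighbors(mid)
        return second - first - {node}


g = ToyGraph()
g.add_edge("user:u1", "item:i1")   # u1 clicked i1
g.add_edge("item:i1", "item:i2")   # i1 is similar to i2
g.add_edge("item:i1", "item:i3")

before = g.two_hop("user:u1")
g.add_edge("item:i1", "item:i4")   # new similarity edge arrives online
after = g.two_hop("user:u1")
print(sorted(before), sorted(after))
```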
Future plans include expanding Hippo’s boundaries (e.g., merging with Yarn), enhancing Suez’s capabilities, and delivering AI·OS‑based cloud products such as OpenSearch, ES, and the upcoming AIRec recommendation service.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
