How Alibaba Built a Scalable Search Offline Platform for Billions of Records
This article explains how Alibaba's search offline platform combines massive batch and real‑time processing, leveraging Hadoop, HBase, Flink (Blink) and a unified full‑incremental model to handle tens of billions of daily records and millions of TPS for diverse business lines.
Background
Search offline data processing is a typical massive‑data batch/real‑time computing scenario. Alibaba Search middle‑platform team combined internal techniques with open‑source big‑data storage and compute systems to build a search offline platform that can process tens of billions of records per day and handle millions of TPS in real time.
What is Search Offline?
A typical product search architecture is shown below. The offline system processes data from various sources and feeds it into the online search engine. In Alibaba’s terminology, “online” services respond to user requests in milliseconds, while “offline” systems transform source data and load it into the online engine. Offline systems handle massive data volumes and complex business logic, featuring full‑ and incremental‑task models, support for diverse input/output data sources, and advanced data‑processing capabilities such as multi‑table joins and UDTFs.
Development Overview
From the early Taobao search stage (2008‑2012) using Hadoop and HBase, to the component‑and‑platform stage (since 2013) supporting dozens of business lines, the offline platform has evolved through several architectural iterations, improving performance, scalability and developer productivity.
Taobao Search Stage
During 2008‑2012 the platform introduced Hadoop/HBase for distributed processing, but the codebase was tightly coupled with Taobao business logic.
Component & Platform Stage
From 2013 onward the platform was refactored into reusable components (Maat, Bahamut, Blink, Soman, Catalog, Hippo, Swift) to serve many business units such as Fliggy, DingTalk, 1688, Lazada, etc., and to leverage advances in stream‑processing engines.
Offline Platform Architecture
Platform Components and Task Flow
The platform consists of:
Maat : a distributed task scheduler derived from Airflow, with performance optimizations, FaaS‑style executors, containerization and extended APIs.
Bahamut : the execution engine that creates, schedules and manages offline jobs.
Blink : Alibaba’s internal Flink fork, optimized for large‑scale batch, SQL and TableAPI workloads.
Soman : UI module for visualizing task status and creating applications.
Catalog : metadata service for table DDL and storage resource management.
Hippo : a Yarn‑like resource manager providing Docker management for online services.
Swift : a high‑throughput distributed message queue built on HDFS.
The overall data flow consists of three layers:
Data synchronization layer: mirrors full‑ and incremental‑data from source tables into internal HBase tables.
Data association layer: joins data across dimensions and applies UDTFs to produce output for the search engine.
Data interaction layer: stores the final full‑ and incremental‑datasets and interacts with the online build module.
Unified Full‑Incremental Compute Model
To hide platform details from users, the concept of a Business Table (an abstract table composed of a full‑table and/or an incremental stream table with identical schema) is introduced. Users design a Business Graph that connects Business Tables with processing components such as Join and UDTF. Bahamut translates the Business Graph → APP Graph → Job Graph → Blink/Maat jobs, performing correctness checks, task‑layer optimization, and generating the necessary Blink and Maat jobs.
Storage and Compute
HBase‑Based Storage
Since 2012 the platform uses HBase as the internal storage engine, leveraging its Scan/Get and bulk‑load capabilities to match the full‑/incremental model, its LSM‑Tree architecture for reliability, and its schema‑free design for evolving business data.
Flink‑Based Compute
From 2016 onward the platform adopted Flink (Blink) as the compute engine, unifying stream and batch processing via TableAPI and SQL. The latest Blink‑2.1.1 version supports native batch, SQL‑driven job description, and submission through the Bayes development platform, simplifying debugging and resource utilization.
-- Define a DRC source table (MySQL binlog stream)
CREATE TABLE DRCSource_1 (
`tag_id` VARCHAR,
`act_info_id` VARCHAR
) WITH (
tableFactoryClass='com.alibaba.xxx.xxx.DRCTableFactory',
-- other config
);
-- Define an HBase sink table
CREATE TABLE HbaseSink_1 (
`tag_id` VARCHAR,
`act_info_id` VARCHAR
) WITH (
class='com.alibaba.xxx.xxx.CombineSink',
hbase_tableName='bahamut_search_tmall_act',
-- other config
);
-- Insert logic
INSERT INTO HbaseSink_1 SELECT
`tag_id`,
`act_info_id`
FROM DRCSource_1;Conclusion
Alibaba’s search offline platform combines batch and real‑time processing to handle billions of records daily with million‑level TPS. It now serves over 200 internal business lines and is being extended to recommendation and advertising scenarios, with future plans to offer the capability as a cloud service on Alibaba Cloud.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
