How AnalyticDB Powers Petabyte-Scale Consumer Analytics in Alibaba’s Data Bank
The article details how Alibaba’s Data Bank leverages AnalyticDB’s cold‑hot tiered storage, high‑throughput real‑time writes, and low‑latency OLAP capabilities to handle petabyte‑scale consumer data, support flexible AIPL analysis, crowd profiling, and rapid audience selection while cutting costs and ensuring elasticity during peak events.
Introduction
Data Bank is a commercial consumer‑operations data product that must support analysis along arbitrary dimensions over massive datasets with strict response‑time guarantees. It stores more than 22 trillion rows (≈1.6 PB) in AnalyticDB and achieves an average query latency under 5 seconds.
Business Capabilities
Data Bank provides several core analytics functions:
Link‑flow analysis – uses the AIPL metric (Awareness, Interest, Purchase, Loyalty) to compare brand‑consumer relationships across any two dates within a 540‑day window, generating billions of possible dimension combinations.
Crowd profiling – builds over 200 industry‑specific tags for each consumer and can profile both static audiences and those derived from link‑flow changes.
Crowd selection – lets users build an audience within minutes from tags and behavioral touchpoints (purchase, search, live‑stream view, etc.), and shows the resulting audience size within seconds.
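The link‑flow comparison described above can be sketched as a self‑join of one brand's AIPL records on two dates. This is a minimal illustration, not Data Bank's actual query: an in‑memory SQLite database stands in for AnalyticDB, and the table name `aipl`, the status encoding, and the sample rows are all assumptions.

```python
import sqlite3

# In-memory SQLite stands in for AnalyticDB; table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aipl (customer_id INTEGER, brand_id INTEGER, "
    "aipl_status INTEGER, day INTEGER)"
)
# Assumed encoding: 1=Awareness, 2=Interest, 3=Purchase, 4=Loyalty
rows = [
    (1, 100, 1, 20200101), (1, 100, 3, 20200301),
    (2, 100, 2, 20200101), (2, 100, 3, 20200301),
    (3, 100, 1, 20200101), (3, 100, 1, 20200301),
    (4, 100, 3, 20200101),  # no record on the second date, so no flow row
]
conn.executemany("INSERT INTO aipl VALUES (?, ?, ?, ?)", rows)


def link_flow(conn, brand_id, day_from, day_to):
    """Count consumers by (status on day_from, status on day_to) for one brand."""
    sql = """
        SELECT a.aipl_status, b.aipl_status, COUNT(*)
        FROM aipl AS a
        JOIN aipl AS b
          ON a.customer_id = b.customer_id AND a.brand_id = b.brand_id
        WHERE a.brand_id = ? AND a.day = ? AND b.day = ?
        GROUP BY a.aipl_status, b.aipl_status
        ORDER BY a.aipl_status, b.aipl_status
    """
    return conn.execute(sql, (brand_id, day_from, day_to)).fetchall()


flows = link_flow(conn, 100, 20200101, 20200301)
for status_from, status_to, consumers in flows:
    print(status_from, "->", status_to, ":", consumers)
```

Because the two dates are chosen by the user at query time, the join has to run online, which is why response time depends on the engine rather than on precomputed results.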
Why AnalyticDB?
Traditional offline engines (Hadoop, Hive, MaxCompute) cannot meet the low‑latency needs of interactive analysis. Two typical solutions exist:
Pre‑compute all possible dimension combinations offline, then fetch results directly – suffers from dimension explosion and stale data.
OLAP‑style online computation on an MPP engine, keeping all dimensions available for ad‑hoc aggregation – requires a performant, cost‑effective storage layer.
Given the petabyte‑scale data volume and strict latency requirements, AnalyticDB was selected.
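To see why pre‑computation suffers from dimension explosion, a rough count of link‑flow result cells helps. The 540‑day window and the four AIPL states come from the article; the brand count below is purely an assumed figure for illustration, and categories, tags, and touchpoints would multiply the total further.

```python
from math import comb

WINDOW_DAYS = 540        # from the article
AIPL_STATES = 4          # Awareness, Interest, Purchase, Loyalty
ASSUMED_BRANDS = 10_000  # hypothetical; the article gives no brand count

date_pairs = comb(WINDOW_DAYS, 2)       # unordered pairs of distinct dates
transitions = AIPL_STATES ** 2          # (status on date 1, status on date 2)
cells_per_brand = date_pairs * transitions
total_cells = cells_per_brand * ASSUMED_BRANDS

print(f"date pairs:      {date_pairs:,}")
print(f"cells per brand: {cells_per_brand:,}")
print(f"total cells:     {total_cells:,}")
```

Even under these simplified assumptions the count reaches tens of billions of cells, all of which would have to be refreshed daily to avoid stale results.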
OLAP Engine Selection Challenges
Data volume: 1.6 PB of historical data, >22 trillion rows, with multiple tables exceeding a trillion rows.
Write throughput: 60 billion rows per day (≈10 TB) must be ingested within a 2‑hour window at ≥10 million TPS.
Complex query performance: Multi‑table joins on 20‑trillion‑row tables must return results within 10 seconds.
Export performance: Need to export millions of rows per minute to MaxCompute, supporting >20 concurrent export jobs.
Cost: Storing PB‑scale data on SSD is prohibitive; a cold‑hot tiered approach is required.
Stability: The system must handle mixed workloads with all challenges occurring simultaneously.
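The write‑throughput requirement can be checked with simple arithmetic. The sketch below uses only the figures stated above and assumes decimal units (1 TB = 10^12 bytes), which the article does not specify.

```python
ROWS_PER_DAY = 60_000_000_000   # 60 billion rows, from the article
BYTES_PER_DAY = 10 * 10**12     # ~10 TB, assuming decimal terabytes
WINDOW_SECONDS = 2 * 60 * 60    # 2-hour ingestion window
TARGET_TPS = 10_000_000         # stated minimum write rate

required_rate = ROWS_PER_DAY / WINDOW_SECONDS      # rows per second
avg_row_bytes = BYTES_PER_DAY / ROWS_PER_DAY       # bytes per row
bandwidth_gb_s = BYTES_PER_DAY / WINDOW_SECONDS / 10**9
minutes_at_target = ROWS_PER_DAY / TARGET_TPS / 60

print(f"required rate:     {required_rate:,.0f} rows/s")
print(f"average row size:  {avg_row_bytes:.0f} bytes")
print(f"sustained input:   {bandwidth_gb_s:.2f} GB/s")
print(f"time at 10M TPS:   {minutes_at_target:.0f} minutes")
```

Filling the window exactly needs about 8.3 million rows per second, so the 10 million TPS target finishes in roughly 100 minutes and leaves about 20 minutes of slack.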
Key AnalyticDB Features Adopted
Cold‑hot data tiering: Tables can be designated as hot (ESSD), cold (OSS), or warm (mixed). AnalyticDB’s SSD cache accelerates queries on cold tables, reducing cost while preserving performance.
High‑throughput real‑time writes: The parallel write architecture supports tens of millions to hundreds of millions of TPS, and batch load enables fast ingestion of pre‑aggregated tables.
High‑concurrency, low‑latency compute: Typical analytical queries (post‑selection aggregation and joins) average under 10 seconds thanks to caching and pre‑warming.
Data Model and Table Design
Four primary data categories are stored in AnalyticDB:
AIPL data – brand‑consumer relationship status.
Tag data – consumer attributes.
Touchpoint data – consumer behaviors.
Crowd data – persisted audience lists.
All tables are keyed by customer_id to minimize data shuffling.
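The benefit of keying every table by customer_id can be illustrated with a toy model: when two tables are hash‑distributed on the same key, matching rows land on the same shard and can be joined locally. The shard count, the modulo hash, and the sample rows are assumptions for illustration; the article does not describe AnalyticDB's internal hash function.

```python
NUM_SHARDS = 4  # hypothetical shard count


def shard_of(customer_id: int) -> int:
    # Stand-in for the engine's hash function, which the article does not describe.
    return customer_id % NUM_SHARDS


def distribute(rows):
    """Place each row on the shard chosen by its customer_id."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for row in rows:
        shards[shard_of(row["customer_id"])].append(row)
    return shards


def colocated_join(left_shards, right_shards):
    """Join shard i of the left table only with shard i of the right table."""
    joined = []
    for left, right in zip(left_shards, right_shards):
        right_by_id = {r["customer_id"]: r for r in right}
        for left_row in left:
            match = right_by_id.get(left_row["customer_id"])
            if match is not None:
                joined.append({**left_row, **match})
    return joined


aipl_rows = [{"customer_id": c, "aipl_status": s} for c, s in [(1, 1), (2, 3), (5, 2), (8, 4)]]
tag_rows = [{"customer_id": c, "tag1": t} for c, t in [(1, 10), (5, 20), (8, 30), (9, 40)]]

result = colocated_join(distribute(aipl_rows), distribute(tag_rows))
print(sorted(r["customer_id"] for r in result))
```

No row crosses a shard boundary during the join, which is the data shuffling the shared key is meant to avoid.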
AIPL Tables
AIPL tables are partitioned by day. Because each day generates >50 billion rows, the brand dimension is split into 20 sub‑tables to improve both ingestion and query performance.
Two designs were considered for supporting the second‑level category (cate_id) dimension:
Extend the existing AIPL table with a cate_id column.
Create a separate set of tables that include cate_id.
The second approach was chosen for better query performance despite a modest storage increase.
-- AIPL table without second‑level category
CREATE TABLE `aipl_[001-020]` (
  `customer_id` bigint,
  `brand_id` bigint,
  `aipl_status` int,
  `day` bigint
) DISTRIBUTE BY HASH(`customer_id`)
PARTITION BY VALUE(day)
CLUSTERED BY (`brand_id`,`aipl_status`);

-- AIPL table with second‑level category
CREATE TABLE `aipl_cate_[001-020]` (
  `customer_id` bigint,
  `brand_id` bigint,
  `cate_id` bigint,
  `aipl_status` int,
  `day` bigint
) DISTRIBUTE BY HASH(`customer_id`)
PARTITION BY VALUE(day)
CLUSTERED BY (`brand_id`,`cate_id`,`aipl_status`);
Tag Tables
Initially a key‑value model was used, but intersecting multiple tags in memory proved inefficient. AnalyticDB’s multivalue (or JSON) columns allow a wide‑column design where each tag becomes a separate column, enabling native AND/OR operations.
CREATE TABLE `tag` (
  `customer_id` bigint,
  `tag1` int,
  `tag2` int,
  `tag3` multivalue,
  ...
) DISTRIBUTE BY HASH(`customer_id`);
Because the table contains over 200 columns, it is split into several thematic sub‑tables to avoid import bottlenecks and data bloat.
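With a wide‑column layout, combining tags becomes an ordinary WHERE clause instead of an in‑memory intersection of per‑tag ID lists. The sketch below uses SQLite as a stand‑in for AnalyticDB, with two hypothetical integer tags and made‑up rows; multivalue columns are not modeled because SQLite has no equivalent type.

```python
import sqlite3

# SQLite stands in for AnalyticDB; column values and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tag (customer_id INTEGER PRIMARY KEY, tag1 INTEGER, tag2 INTEGER)"
)
conn.executemany(
    "INSERT INTO tag VALUES (?, ?, ?)",
    [(1, 1, 5), (2, 1, 7), (3, 2, 5), (4, 2, 7)],
)


def select_customers(conn, where_clause, params=()):
    """Return customer IDs matching a tag predicate, in ascending order."""
    sql = f"SELECT customer_id FROM tag WHERE {where_clause} ORDER BY customer_id"
    return [row[0] for row in conn.execute(sql, params)]


# AND: consumers carrying both tag values
both = select_customers(conn, "tag1 = ? AND tag2 = ?", (1, 5))
# OR: consumers carrying either tag value
either = select_customers(conn, "tag1 = ? OR tag2 = ?", (1, 5))

print(both, either)
```

In the earlier key‑value model, each of these conditions would have produced its own ID list to be intersected or unioned outside the engine.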
Crowd Selection Acceleration
Audience selection often involves dozens of sub‑queries combined with set operations. The selection is split into multiple query shards that run on AnalyticDB, and the intermediate consumer‑ID lists are then merged in an ETL step. AnalyticDB’s indexing and cloud‑native elasticity let this approach absorb Double 11 peak loads.
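The merge step can be pictured as evaluating a small set expression over the ID lists returned by each query shard. This is a sketch under stated assumptions: the expression format, the operator names, and the shard names are hypothetical, and the article does not describe how the real ETL step is implemented.

```python
def merge(expr, shard_results):
    """Evaluate a nested set expression over per-shard consumer-ID lists.

    expr is either a shard name (str) or a tuple (op, operand, operand, ...),
    where op is "and" (intersection), "or" (union), or "minus" (difference).
    """
    if isinstance(expr, str):
        return set(shard_results[expr])
    op, *operands = expr
    sets = [merge(operand, shard_results) for operand in operands]
    if op == "and":
        return set.intersection(*sets)
    if op == "or":
        return set.union(*sets)
    if op == "minus":
        return sets[0].difference(*sets[1:])
    raise ValueError(f"unknown operator: {op}")


# Hypothetical intermediate results, one ID list per query shard
shard_results = {
    "bought": [1, 2, 3, 4],
    "searched": [2, 3, 5],
    "watched_live": [3, 6],
}

# Consumers who bought AND searched, excluding those who watched a live stream
audience = merge(("minus", ("and", "bought", "searched"), "watched_live"), shard_results)
print(sorted(audience), len(audience))
```

Splitting the work this way keeps each shard query simple enough to benefit from indexes, while the set algebra runs outside the interactive query path.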
Business Value
AnalyticDB delivers three major benefits to Data Bank:
High‑performance OLAP engine: Handles 22 trillion rows (≈1.6 PB) with average query latency of 3‑5 seconds.
Significant cost reduction: Cold‑hot tiered storage and pay‑as‑you‑go pricing cut operational costs by ~46% compared with the previous generation.
Elastic capacity for large promotions: Cloud‑native scaling enables rapid resource expansion during peak events, ensuring stable, fast crowd‑selection and analysis.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
