Scaling Alibaba TCC to Millions of RPS with a High‑Availability Real‑Time Data Warehouse
This article details how the real‑time data warehouse behind Alibaba's TCC platform evolved in phases—from a legacy database‑centric stack to a high‑availability architecture built on Flink and Hologres—and walks through the challenges, solutions, and cost‑governance measures that enabled tens of millions of RPS in writes, hundreds of terabytes of storage, and second‑level query latency.
Introduction
Alibaba TCC (Taocaicai) is a community group‑buying service whose real‑time data warehouse has been evolving for over six years. To support rapid business growth, TCC adopted a Hologres‑based real‑time data warehouse in 2020 and upgraded it to a high‑availability architecture (Hologres 2.0) in 2021, handling tens of millions of RPS in writes, hundreds of terabytes of data, and second‑level query responses.
1 TCC Business Overview
TCC originated from Hema Select and Retail‑Tong, merging in early 2021. It serves fresh produce and daily goods via next‑day delivery across multiple channels (WeChat, Taobao, Alipay). Daily active users reach tens of millions, and data volume is on the order of hundreds of terabytes.
2 Evolution Timeline: From Database to High‑Availability Real‑Time Warehouse
The architecture progressed through four stages:
Stage 1 (pre‑2016): jstorm‑based real‑time screen and reporting.
Stage 2 (2016‑2020): Flink SQL with growing real‑time tasks (500+).
Stage 3 (2020): Flink + Hologres for unified real‑time storage.
Stage 4 (2021): Flink + Hologres with read‑write separation and high availability.
Stage 1 – jstorm
Initial real‑time warehouse for internal dashboards, supporting fewer than 10 real‑time jobs with ~3,000 CU resources.
Stage 2 – Flink SQL
Scaled to support real‑time screens, marketing analysis, and channel analytics, increasing real‑time jobs to over 500 and resource usage by 30%.
Stage 3 – Flink + Hologres
Addressed architecture bloat and operational issues by adopting Hologres, enabling unified storage, faster debugging, and longer data lifecycles.
Stage 4 – High‑Availability (2021)
Focused on stability and cost governance, introducing primary‑secondary write paths, multi‑instance isolation, and resource‑saving measures that saved over 2 million CU in 2021.
3 Typical Real‑Time Warehouse Use Cases
Executive real‑time reporting: low‑latency, high‑reliability dashboards for senior management.
Logistics real‑time operations: continuous online queries and long‑term consumption.
Metric center & self‑service analysis: ad‑hoc multi‑dimensional reporting.
Growth platform (experiments & audience selection): multi‑dimensional analysis on billions of rows.
Daily operational reports (ShuLaiBao): fast delivery of historical and real‑time data.
4 Review of the Legacy Architecture and Its Problems
The pre‑2020 architecture was fragmented, leading to data inconsistency, low development efficiency, high O&M cost, long rollback times, and wasted real‑time resources.
5 First Architecture Upgrade: Flink + Hologres 1.0
5.1 Unified Data Warehouse Design
Data sources were unified via TT (Datahub). CDM and ADS layers were replaced with Hologres row‑store tables, enabling direct binlog subscription, simplified debugging, and longer data retention.
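As a hedged sketch of what such a CDM/ADS row‑store table could look like (the table and column names here are illustrative, not from the article), Hologres lets you set row orientation and enable binlog via standard table properties so downstream Flink jobs can subscribe to changes:

```sql
-- Illustrative Hologres DDL (names are hypothetical).
BEGIN;
CREATE TABLE dwd_order_rt (
    order_id    BIGINT NOT NULL,
    user_id     BIGINT,
    amount      NUMERIC(18, 2),
    update_time TIMESTAMPTZ,
    PRIMARY KEY (order_id)
);
-- Row-oriented storage favors high-QPS point lookups and upserts.
CALL set_table_property('dwd_order_rt', 'orientation', 'row');
-- Enable binlog so Flink jobs can subscribe to row-level changes.
CALL set_table_property('dwd_order_rt', 'binlog.level', 'replica');
-- Retain binlog for 7 days (value is in seconds).
CALL set_table_property('dwd_order_rt', 'binlog.ttl', '604800');
COMMIT;
```

Setting `orientation` inside the same transaction as `CREATE TABLE` is required in Hologres; a longer `binlog.ttl` is what enables the longer data retention the article mentions.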
5.2 Challenges and Solutions
Performance bottlenecks from unrestricted Hologres usage were mitigated by optimizing table groups, shard counts, and indexes, and by isolating critical workloads.
Metric consistency issues caused by post‑processing were addressed through data monitoring, a centralized metric center, and pushing high‑frequency logic into Hologres.
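The table‑group, shard‑count, and index tuning described above can be sketched as follows. This is a minimal example assuming Hologres' standard table properties; the group name, shard count, and columns are illustrative assumptions:

```sql
-- Illustrative: give critical tables a dedicated table group with a fixed shard count.
CALL hg_create_table_group('tg_core_16', 16);

BEGIN;
CREATE TABLE ads_channel_metrics (
    stat_date  DATE NOT NULL,
    channel    TEXT NOT NULL,
    city       TEXT,
    gmv        NUMERIC(18, 2),
    pv         BIGINT
);
-- Placing the table in the dedicated group isolates critical workloads.
CALL set_table_property('ads_channel_metrics', 'table_group', 'tg_core_16');
-- Column orientation suits OLAP scans and aggregations.
CALL set_table_property('ads_channel_metrics', 'orientation', 'column');
-- Distribution key aligns shards with the most common join/filter column.
CALL set_table_property('ads_channel_metrics', 'distribution_key', 'channel');
-- Clustering key speeds range filters on the date column.
CALL set_table_property('ads_channel_metrics', 'clustering_key', 'stat_date');
-- Bitmap index accelerates equality filters on low-cardinality columns.
CALL set_table_property('ads_channel_metrics', 'bitmap_columns', 'city');
COMMIT;
```

Choosing a shard count that matches instance cores, and keeping hot tables in their own table group, are the kinds of "restrictions" the article says replaced unrestricted usage.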
5.3 Benefits
Unified architecture covering all real‑time scenarios.
Higher development efficiency with SQL‑first approach.
Reduced layers, faster rollbacks, and longer data lifecycles.
Strong scalability for OLAP and point queries.
Seamless integration of real‑time and offline data.
6 Second Architecture Upgrade: Flink + Hologres 2.0 (High‑Availability)
6.1 Enterprise‑Level Stability and Cost Governance Needs
Issues included configuration drift, unstandardized table designs, resource waste, and long recovery times.
6.2 Solutions
Implemented a primary‑secondary chain for the public layer (Shanghai TT primary, Zhangbei backup) and Hologres row‑store instances for the application layer (1 primary, 3 replicas) with same‑city disaster recovery.
Adopted strict development standards, rapid incident response (3‑5‑10‑30 minute SLA), and continuous post‑mortem improvements.
Cost‑saving measures involved cleaning unused tables, setting appropriate lifecycles, and optimizing binlog usage.
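The lifecycle and binlog optimizations above map directly onto Hologres table properties. A hedged sketch, with hypothetical table names and retention periods:

```sql
-- Illustrative lifecycle settings for cost governance (table names hypothetical).
-- Expire application-layer data after 90 days (value is in seconds).
CALL set_table_property('ads_daily_report', 'time_to_live_in_seconds', '7776000');
-- Shorten binlog retention on tables whose downstream jobs only replay 1 day.
CALL set_table_property('dwd_click_log', 'binlog.ttl', '86400');
```

Auditing tables with no recent queries and dropping or shortening their TTLs is the storage side of the multi‑million‑RMB savings described later.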
6.3 Benefits
Achieved true read‑write separation, rapid failover, and significant stability improvements, allowing teams to focus on business development.
7 Next‑Generation Real‑Time Warehouse: Efficiency and Cost Reduction
Supports >1 billion rows with stable query performance.
Row‑store for point queries, column‑store for OLAP, materialized views, and binlog for downstream consumption.
Rapid delivery via view‑based or FBI data sets.
Monthly Hologres incidents dropped from ~5 to <1, and instance restart time reduced from >50 minutes to ~20 minutes, achieving 99% continuous availability.
Cost reductions of several million RMB were realized by eliminating unused tables and optimizing storage.
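The binlog‑for‑downstream‑consumption pattern above is typically wired up on the Flink side. A hedged sketch using the Alibaba Cloud Flink Hologres connector; the endpoint, credentials, and table names are placeholders:

```sql
-- Illustrative Flink SQL: subscribe to a Hologres table's binlog as a CDC source.
CREATE TEMPORARY TABLE dwd_order_rt_src (
    order_id    BIGINT,
    user_id     BIGINT,
    amount      NUMERIC(18, 2),
    update_time TIMESTAMP(3)
) WITH (
    'connector' = 'hologres',
    'endpoint'  = '<host>:<port>',      -- placeholder
    'dbname'    = '<database>',
    'tablename' = 'dwd_order_rt',
    'username'  = '<access-key-id>',
    'password'  = '<access-key-secret>',
    'binlog'    = 'true',               -- read the binlog rather than a snapshot
    'cdcMode'   = 'true'                -- interpret binlog records as changelog events
);
```

With `cdcMode` enabled, inserts, updates, and deletes on the row‑store table flow to downstream jobs as changelog rows, which is what lets the warehouse serve both point queries and streaming consumers from one table.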
8 Future Outlook
Further stability through stricter Hologres governance, multi‑replica shards, and enhanced risk assessment.
Higher efficiency by exploring materialized views, schemaless designs, and row‑column coexistence to cover 100% of OLAP use cases.
Continued cost control via dynamic scaling, storage‑compute governance, and lifecycle management.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.