Scaling Alibaba TCC to Millions of RPS with a High‑Availability Real‑Time Data Warehouse
This article details how the real‑time data warehouse behind Alibaba's TCC platform evolved in phases—from a legacy database‑centric stack to a high‑availability architecture built on Flink and Hologres—and walks through the challenges, solutions, and cost‑governance measures that enabled tens of millions of RPS in writes, hundreds of terabytes of storage, and second‑level query latency.
Introduction
Alibaba TCC (Taocaicai) is a community group‑buying service whose real‑time data warehouse has been evolving for over six years. To support rapid business growth, TCC adopted a Hologres‑based real‑time data warehouse in 2020 and upgraded it to a high‑availability architecture (Hologres 2.0) in 2021, handling tens of millions of RPS in writes, hundreds of terabytes of data, and second‑level query responses.
1 TCC Business Overview
TCC originated from Hema Select and Retail‑Tong, merging in early 2021. It serves fresh produce and daily goods via next‑day delivery across multiple channels (WeChat, Taobao, Alipay). Daily active users reach tens of millions, and data volume is on the order of hundreds of terabytes.
2 Evolution Timeline: From Database to High‑Availability Real‑Time Warehouse
The architecture progressed through four stages:
Stage 1 (pre‑2016): jstorm‑based real‑time screen and reporting.
Stage 2 (2016‑2020): Flink SQL with growing real‑time tasks (500+).
Stage 3 (2020): Flink + Hologres for unified real‑time storage.
Stage 4 (2021): Flink + Hologres with read‑write separation and high availability.
Stage 1 – jstorm
Initial real‑time warehouse for internal dashboards, supporting fewer than 10 real‑time jobs with ~3,000 CU resources.
Stage 2 – Flink SQL
Scaled to support real‑time screens, marketing analysis, and channel analytics, increasing real‑time jobs to over 500 and resource usage by 30%.
Stage 3 – Flink + Hologres
Addressed architecture bloat and operational issues by adopting Hologres, enabling unified storage, faster debugging, and longer data lifecycles.
Stage 4 – High‑Availability (2021)
Focused on stability and cost governance, introducing primary‑secondary write paths, multi‑instance isolation, and resource‑saving measures that saved over 2 million CU in 2021.
3 Typical Real‑Time Warehouse Use Cases
Executive real‑time reporting: low‑latency, high‑reliability dashboards for senior management.
Logistics real‑time operations: continuous online queries and long‑term consumption.
Metric center & self‑service analysis: ad‑hoc multi‑dimensional reporting.
Growth platform (experiments & audience selection): multi‑dimensional analysis on billions of rows.
Daily operational reports (ShuLaiBao): fast delivery of historical and real‑time data.
4 Review of the Legacy Architecture and Its Problems
The pre‑2020 architecture was fragmented, leading to data inconsistency, low development efficiency, high O&M cost, long rollback times, and wasted real‑time resources.
5 First Architecture Upgrade: Flink + Hologres 1.0
5.1 Unified Data Warehouse Design
Data sources were unified via TT (Datahub). CDM and ADS layers were replaced with Hologres row‑store tables, enabling direct binlog subscription, simplified debugging, and longer data retention.
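As a hedged sketch of what such a CDM/ADS row‑store table could look like (the table and column names here are illustrative, not from the article), Hologres lets you set row orientation and enable binlog via standard table properties so downstream Flink jobs can subscribe to changes:

```sql
-- Illustrative Hologres DDL (names are hypothetical).
BEGIN;
CREATE TABLE dwd_order_rt (
    order_id    BIGINT NOT NULL,
    user_id     BIGINT,
    amount      NUMERIC(18, 2),
    update_time TIMESTAMPTZ,
    PRIMARY KEY (order_id)
);
-- Row-oriented storage favors high-QPS point lookups and upserts.
CALL set_table_property('dwd_order_rt', 'orientation', 'row');
-- Enable binlog so Flink jobs can subscribe to row-level changes.
CALL set_table_property('dwd_order_rt', 'binlog.level', 'replica');
-- Retain binlog for 7 days (value is in seconds).
CALL set_table_property('dwd_order_rt', 'binlog.ttl', '604800');
COMMIT;
```

Setting `orientation` inside the same transaction as `CREATE TABLE` is required in Hologres; a longer `binlog.ttl` is what enables the longer data retention the article mentions.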
5.2 Challenges and Solutions
Performance bottlenecks from unrestricted Hologres usage were mitigated by optimizing table groups, shard counts, and indexes, and by isolating critical workloads.
Metric consistency issues caused by post‑processing were addressed through data monitoring, a centralized metric center, and pushing high‑frequency logic into Hologres.
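The table‑group, shard‑count, and index tuning described above can be sketched as follows. This is a minimal example assuming Hologres' standard table properties; the group name, shard count, and columns are illustrative assumptions:

```sql
-- Illustrative: give critical tables a dedicated table group with a fixed shard count.
CALL hg_create_table_group('tg_core_16', 16);

BEGIN;
CREATE TABLE ads_channel_metrics (
    stat_date  DATE NOT NULL,
    channel    TEXT NOT NULL,
    city       TEXT,
    gmv        NUMERIC(18, 2),
    pv         BIGINT
);
-- Placing the table in the dedicated group isolates critical workloads.
CALL set_table_property('ads_channel_metrics', 'table_group', 'tg_core_16');
-- Column orientation suits OLAP scans and aggregations.
CALL set_table_property('ads_channel_metrics', 'orientation', 'column');
-- Distribution key aligns shards with the most common join/filter column.
CALL set_table_property('ads_channel_metrics', 'distribution_key', 'channel');
-- Clustering key speeds range filters on the date column.
CALL set_table_property('ads_channel_metrics', 'clustering_key', 'stat_date');
-- Bitmap index accelerates equality filters on low-cardinality columns.
CALL set_table_property('ads_channel_metrics', 'bitmap_columns', 'city');
COMMIT;
```

Choosing a shard count that matches instance cores, and keeping hot tables in their own table group, are the kinds of "restrictions" the article says replaced unrestricted usage.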
5.3 Benefits
Unified architecture covering all real‑time scenarios.
Higher development efficiency with SQL‑first approach.
Reduced layers, faster rollbacks, and longer data lifecycles.
Strong scalability for OLAP and point queries.
Seamless integration of real‑time and offline data.
6 Second Architecture Upgrade: Flink + Hologres 2.0 (High‑Availability)
6.1 Enterprise‑Level Stability and Cost Governance Needs
Issues included configuration drift, unstandardized table designs, resource waste, and long recovery times.
6.2 Solutions
Implemented a primary‑secondary chain for the public layer (Shanghai TT primary, Zhangbei backup) and Hologres row‑store instances for the application layer (1 primary, 3 replicas) with same‑city disaster recovery.
Adopted strict development standards, rapid incident response (3‑5‑10‑30 minute SLA), and continuous post‑mortem improvements.
Cost‑saving measures involved cleaning unused tables, setting appropriate lifecycles, and optimizing binlog usage.
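The lifecycle and binlog optimizations above map directly onto Hologres table properties. A hedged sketch, with hypothetical table names and retention periods:

```sql
-- Illustrative lifecycle settings for cost governance (table names hypothetical).
-- Expire application-layer data after 90 days (value is in seconds).
CALL set_table_property('ads_daily_report', 'time_to_live_in_seconds', '7776000');
-- Shorten binlog retention on tables whose downstream jobs only replay 1 day.
CALL set_table_property('dwd_click_log', 'binlog.ttl', '86400');
```

Auditing tables with no recent queries and dropping or shortening their TTLs is the storage side of the multi‑million‑RMB savings described later.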
6.3 Benefits
Achieved true read‑write separation, rapid failover, and significant stability improvements, allowing teams to focus on business development.
7 Next‑Generation Real‑Time Warehouse: Efficiency and Cost Reduction
Supports >1 billion rows with stable query performance.
Row‑store for point queries, column‑store for OLAP, materialized views, and binlog for downstream consumption.
Rapid delivery via view‑based or FBI data sets.
Monthly Hologres incidents dropped from ~5 to <1, and instance restart time reduced from >50 minutes to ~20 minutes, achieving 99% continuous availability.
Cost reductions of several million RMB were realized by eliminating unused tables and optimizing storage.
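The binlog‑for‑downstream‑consumption pattern above is typically wired up on the Flink side. A hedged sketch using the Alibaba Cloud Flink Hologres connector; the endpoint, credentials, and table names are placeholders:

```sql
-- Illustrative Flink SQL: subscribe to a Hologres table's binlog as a CDC source.
CREATE TEMPORARY TABLE dwd_order_rt_src (
    order_id    BIGINT,
    user_id     BIGINT,
    amount      NUMERIC(18, 2),
    update_time TIMESTAMP(3)
) WITH (
    'connector' = 'hologres',
    'endpoint'  = '<host>:<port>',      -- placeholder
    'dbname'    = '<database>',
    'tablename' = 'dwd_order_rt',
    'username'  = '<access-key-id>',
    'password'  = '<access-key-secret>',
    'binlog'    = 'true',               -- read the binlog rather than a snapshot
    'cdcMode'   = 'true'                -- interpret binlog records as changelog events
);
```

With `cdcMode` enabled, inserts, updates, and deletes on the row‑store table flow to downstream jobs as changelog rows, which is what lets the warehouse serve both point queries and streaming consumers from one table.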
8 Future Outlook
Further stability through stricter Hologres governance, multi‑replica shards, and enhanced risk assessment.
Higher efficiency by exploring materialized views, schemaless designs, and row‑column coexistence to cover 100% of OLAP use cases.
Continued cost control via dynamic scaling, storage‑compute governance, and lifecycle management.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.