Real-Time Data Warehouse Construction at TAL Using DorisDB
This article details TAL's transition from offline to real-time data warehousing, describing business drivers, pain points, architectural evolution through Hive, Flink+Kudu, and DorisDB, and outlining the system design, data flow, scheduling, monitoring, and the resulting business and cost benefits.
TAL (NYSE: TAL) is a technology-driven education company that, by the end of November 2020, operated 990 teaching sites across 102 cities and served over 331 prefecture-level cities in China as well as more than 20 overseas regions.
Rapid growth in real‑time analytics needs for marketing recommendation, attribution analysis, and operational decision‑making led TAL to adopt DorisDB for a unified, fast, efficient, and flexible real‑time data warehouse, achieving sub‑second query responses.
Business Background and Scenarios
Data analysis in education is divided into offline (batch) and real‑time (minute‑level) processing. Offline handles eight data domains (logs, marketing, transactions, services, teaching, content, learning, profiles) for historical, multi‑dimensional analysis. Real‑time processing focuses on high‑volume, structured or semi‑structured logs and binlog‑derived business data, crucial for marketing, sales, student renewal, classroom behavior analysis, and algorithm model updates.
Business Pain Points
Marketing: need immediate insight into PV/UV and rapid optimization of ad strategies.
Sales: require real‑time tracking of intent users and daily sign‑ups.
Student renewal: need instant detection of renewing students.
Classroom behavior: real‑time interaction and assessment analysis.
Algorithm models: timely feature updates for accurate predictions.
Real‑Time Warehouse Goals
Provide flexible, minute‑level data with high richness, accuracy, and timeliness, preserving offline data model dimensions while ensuring data freshness.
Technical Evolution
Hive (2018‑2019) : Hour‑level tasks built on Hive DAGs reduced latency from days to hours but still suffered from high latency, MapReduce overhead, and resource contention.
Flink + Kudu (2019) : Stream processing with Flink and incremental storage in Kudu addressed hour‑level limitations but introduced complex stream semantics, lower query performance, high maintenance cost, and limited community support.
DorisDB (2020‑present) : Chosen after evaluating ClickHouse, Kylin, and others for its strong join capability, cost‑based optimizer, and ability to develop real‑time pipelines using familiar offline Hive‑style SQL, delivering minute‑level visibility with low operational overhead.
System Architecture
The overall system comprises three layers: data sources (business binlog via Flink to Kafka, log data via Flume to Kafka), data storage (DorisDB Routine Load, Broker Load, and Stream Load), and data applications (DataX export, DorisDB Export, MySQL connector for BI tools).
Airflow schedules DAG tasks for incremental data loading from ODS to DWD (5‑minute intervals) and further aggregation to DWS and BI layers. Monitoring uses Grafana + Prometheus to track DorisDB health, with three alert types for Kafka consumption, task execution errors, and schema changes.
Benefits
Business: Real‑time features improve model accuracy (e.g., AUC gains of 2‑3% for intent‑user identification) and accelerate decision‑making.
Cost: SQL‑based development reduces complexity versus stream semantics; wide‑table and star‑schema designs boost query performance and flexibility; deployment and maintenance are simplified to a single DorisDB cluster.
Performance: Query speed increased tenfold after migrating BI dashboards to DorisDB.
Future Outlook
DorisDB continues to support a growing number of real‑time analytics scenarios across TAL’s business lines, with plans to further unify real‑time and offline analysis for stronger data‑driven decision support.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.