How Kuaishou Guarantees Real‑Time Data Warehouse Reliability During Billion‑Scale Events
This article describes Kuaishou’s real-time data warehouse architecture and the assurance framework built around it: forward assurance standards across the development lifecycle, reverse assurance through stress testing and fault injection, and the practices used for the Spring Festival event. It covers the challenges of massive traffic, high timeliness, accuracy, and stability, and outlines future plans for automation, batch-stream integration, and cost reduction.
Abstract: This article compiles the talk by Kuaishou real‑time computing data team expert Li Tianshuo at Flink Forward Asia 2021, covering business characteristics and real‑time warehouse challenges, Kuaishou’s assurance architecture, Spring Festival event practice, and future plans.
1. Business Characteristics and Real‑Time Warehouse Pain Points
Massive data volume. This is Kuaishou’s most prominent business characteristic. Daily inbound traffic reaches the trillion level, which requires efficient model design as well as performance-optimized reading and standardization of data sources.
Diverse demands. Scenarios include activity dashboards, B2B and B2C applications, internal core dashboards, and real-time search support. Each has different reliability requirements, so links must be prioritized and dimensions and models unified.
Frequent activity scenarios with high requirements. Activities need to reflect their impact on company KPIs, support real-time analysis of participation, and allow strategy adjustments (e.g., monitoring red-packet cost) within a few weeks, all of which demands high stability.
Core Kuaishou scenarios. Real-time metrics for executives and consumer-facing (C-end) applications such as Kuaishou Shop and Creator Center require extremely high data precision and immediate issue detection.
Together, these factors make building and safeguarding a real-time warehouse a necessity at Kuaishou.
In the initial assurance stage, Kuaishou borrowed offline assurance processes and divided the lifecycle into three phases: development, production, and service.
Development phase: Established model design standards, development standards, and a release checklist.
Production phase: Built low-level monitoring for timeliness, stability, and accuracy, and optimized SLA based on monitoring.
Service phase: Defined upstream service standards, assurance levels, and value evaluation.
Compared with offline, real‑time faces higher learning costs and several unresolved issues:
Development phase: Flink SQL has a steeper learning curve than Hive SQL, and the behavior of real-time jobs under traffic spikes is hard to predict. Repeated consumption of DWD-layer data also puts pressure on resources.
Production phase: State lacks cleanup mechanisms, which leads to state growth and frequent job failures. High- and low-priority deployments require data-center isolation, which increases cost.
Service phase: Real-time job failures or restarts cause data duplication or loss, so standardized solutions are needed. In offline processing, by contrast, consistency is easier to guarantee.
Key real‑time assurance challenges include high timeliness, complexity, massive data flow, random problem occurrence, and uneven development capabilities.
2. Kuaishou Real‑Time Warehouse Assurance Architecture
To address the above difficulties, two complementary approaches are designed: a forward assurance based on the development lifecycle and a reverse assurance based on fault injection and scenario simulation.
2.1 Forward Assurance
Overall forward assurance flow:
Development stage: Conduct requirement research and standardize both foundational and application-level development, covering 80% of common needs; the remaining 20% are handled via solution reviews and gradually standardized.
Testing stage: Perform quality verification, offline-real-time consistency comparison, and stress-test resource estimation.
Release stage: Prepare pre-release plans, confirm pre-deployment actions, deployment methods, and post-deployment inspection mechanisms.
Service stage: Implement monitoring and alarm mechanisms to keep services within SLA.
Decommission stage: Reclaim resources and restore deployments.
Kuaishou’s real‑time warehouse consists of three layers:
DWD layer: Stable logic with three data formats (client, server, binlog). Operations include scene splitting (generating sub-topics), field standardization (dimension cleaning, dirty data filtering, IP-lat/long mapping), and dimension association (using KV storage with secondary cache).
DWS layer: Provides dimension- and minute-window aggregation for downstream reuse and single-entity granularity data to reduce DWD pressure; also performs dimension expansion.
ADS layer: Multi-dimensional aggregation based on DWD and DWS outputs.
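The DWD-layer dimension association (KV storage with a secondary cache) can be illustrated with a minimal sketch. It assumes the cache is an in-process LRU cache placed in front of the KV store; the talk summary does not describe the actual implementation. The names `TwoLevelDimLookup`, `enrich`, `remote_get`, and the `user_id` field are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Optional


class TwoLevelDimLookup:
    """In-process LRU cache in front of a slower KV lookup (both hypothetical)."""

    def __init__(self, remote_get: Callable[[str], Optional[dict]], capacity: int = 10_000):
        self._remote_get = remote_get      # stands in for the KV storage client
        self._capacity = capacity
        self._local = OrderedDict()        # key -> dimension record (or None)
        self.remote_calls = 0              # counts cache misses, for observability

    def get(self, key: str) -> Optional[dict]:
        if key in self._local:
            self._local.move_to_end(key)   # mark as most recently used
            return self._local[key]
        self.remote_calls += 1
        value = self._remote_get(key)
        self._local[key] = value           # misses (None) are cached as well
        if len(self._local) > self._capacity:
            self._local.popitem(last=False)  # evict the least recently used entry
        return value


def enrich(event: dict, dims: TwoLevelDimLookup) -> dict:
    """Attach dimension attributes to an event, prefixed with 'user_'."""
    dim = dims.get(event["user_id"]) or {}
    return {**event, **{f"user_{k}": v for k, v in dim.items()}}
```

A production version would also need an expiry policy so that cached dimensions do not go stale; that is omitted here for brevity.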
To meet activity metric requirements (e.g., per-minute cumulative curves), Kuaishou developed a progressive window solution that combines a day-level window with minute-level output. Data are partitioned by key, and as watermarks advance, small windows emit bitmap and PV calculations. Because the day-level window stays open, late or out-of-order data are retained rather than dropped. This reduced data discrepancy from 1% to 0.5% while keeping latency within one minute.
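The idea of a day-level window with minute-level output can be sketched in plain Python. This is an illustrative simulation, not Kuaishou's Flink implementation: a `set` stands in for the bitmap, event times are seconds since the start of the day, and the class name `ProgressiveDayWindow` is hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple


class ProgressiveDayWindow:
    """Day-level state made of minute buckets; emits a cumulative PV/UV curve."""

    def __init__(self):
        self._users = defaultdict(set)  # minute index -> user ids (bitmap stand-in)
        self._pv = defaultdict(int)     # minute index -> page views in that minute

    def add(self, event_time_s: int, user_id: int) -> None:
        # A late or out-of-order event still lands in the bucket of its own minute.
        minute = event_time_s // 60
        self._users[minute].add(user_id)
        self._pv[minute] += 1

    def cumulative_curve(self, up_to_minute: int) -> List[Tuple[int, int, int]]:
        """Return (minute, cumulative PV, cumulative UV) for minutes 0..up_to_minute."""
        seen = set()
        pv_total = 0
        curve = []
        for minute in range(up_to_minute + 1):
            seen |= self._users.get(minute, set())
            pv_total += self._pv.get(minute, 0)
            curve.append((minute, pv_total, len(seen)))
        return curve
```

Because the minute buckets live for the whole day, an event that arrives late is counted in the next emission of the curve instead of being discarded when a small window closes.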
2.2 Reverse Assurance
Since online activity testing cannot fully simulate production load, reverse assurance focuses on stress testing and fault injection.
Stress testing involves single‑job benchmarks to determine resource distribution and full‑link tests to ensure cluster resources stay within safe thresholds while handling peak traffic.
Fault injection covers single‑job failures (Kafka topic, Flink job, Flink job checkpoint) and systemic failures (link failure, data‑center outage, extreme traffic spikes, prolonged lag). Disaster recovery includes dual‑data‑center Kafka replication, hot‑standby Flink deployments, and automatic traffic switching.
Capacity assurance uses benchmarked maximum ingress rates as throttling limits; when traffic exceeds expectations, source throttling protects downstream jobs, and lag recovery time is estimated from observed lag and ingress rates.
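A minimal sketch of the capacity logic follows. The source says only that recovery time is estimated from observed lag and ingress rates; the specific formula used here (lag divided by the surplus of consumption rate over ingress rate) is an assumption, as are the function names.

```python
def throttled_rate(incoming_rate: float, max_benchmarked_rate: float) -> float:
    """Cap the source read rate at the maximum rate measured in benchmarks."""
    return min(incoming_rate, max_benchmarked_rate)


def lag_recovery_seconds(lag_records: float, consume_rate: float, ingress_rate: float) -> float:
    """Estimate the time to drain a backlog (assumed formula, see lead-in).

    The backlog shrinks only by the surplus of consumption over new ingress.
    Returns infinity when the job cannot keep up with ingress at all.
    """
    headroom = consume_rate - ingress_rate
    if headroom <= 0:
        return float("inf")
    return lag_records / headroom
```

For example, a backlog of 6,000,000 records with a consumption rate of 500,000 records/s and ingress of 400,000 records/s drains in about 60 seconds under this model.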
3. Spring Festival Real‑Time Assurance Practice
Key requirements: high stability, high timeliness (second‑level dashboard latency, minute‑level curve latency), high accuracy (offline‑real‑time deviation ≤0.5%), and high flexibility for multi‑dimensional analysis.
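The accuracy requirement can be expressed as a simple check. This sketch assumes deviation is measured relative to the offline value, which the source does not state explicitly; the function names are hypothetical.

```python
def relative_deviation(realtime: float, offline: float) -> float:
    """Deviation of the real-time metric relative to the offline baseline."""
    if offline == 0:
        raise ValueError("offline baseline must be non-zero")
    return abs(realtime - offline) / abs(offline)


def within_tolerance(realtime: float, offline: float, tolerance: float = 0.005) -> bool:
    """True when the real-time value is within the tolerance (default 0.5%)."""
    return relative_deviation(realtime, offline) <= tolerance
```

A check like this could be run per metric and per dimension after the offline result becomes available, with failures feeding the accuracy alarms.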
Forward measures include a monitoring‑alarm system covering timeliness, accuracy, and stability SLA targets, plus link‑level monitoring of tasks, services, and cluster resources. Standardized development (80% via templates), testing (offline‑real‑time consistency), and staged deployment with task inspection are applied.
Reverse measures consist of progressive window development, single‑job and full‑link stress tests, and disaster‑recovery mechanisms (dual‑data‑center deployment, throttling, retry, degradation). Fault drills simulate component failures and traffic peaks to validate the effectiveness of the safeguards.
The Spring Festival deployment achieved second‑level dashboard latency, minute‑level curve latency, ≤0.5% accuracy deviation, and seamless failover, demonstrating the robustness of the assurance framework.
4. Future Plans
Assurance capability building: Standardize stress-test and fault-injection scripts, automate execution via platform, and apply intelligent diagnosis to capture expert knowledge.
Batch-stream integration: Consolidate batch and streaming pipelines through unified SQL, enabling platform-wide development efficiency and load balancing.
Real-time warehouse expansion: Enrich content layers, mature development components, and promote SQL-based solutions to improve productivity and reduce costs.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
