How Kuaishou Guarantees Real‑Time Data Warehouse Performance at Billion‑Scale Events
This article details Kuaishou's real‑time data warehouse architecture, the business challenges of massive traffic and diverse requirements, and the forward‑ and reverse‑assurance strategies (lifecycle standards, monitoring, fault‑injection testing, and a Spring Festival case study) that together deliver high stability, low latency, and an accuracy deviation below 0.5% for billion‑scale streaming workloads.
Business Characteristics and Real‑Time Warehouse Challenges
Kuaishou processes trillions of events per day. Its data pipeline must therefore handle massive volume, diverse B2B/B2C use cases, and frequent activity windows, while delivering sub‑second latency for executive dashboards and consumer‑facing applications.
Large data volume : Model design and source‑side performance must be optimized to avoid excessive reads.
Diverse requirements : Activities, dashboards, B2B/B2C services and search each have distinct SLA targets; a unified model reduces duplicated effort.
Frequent activity scenarios : Hundreds of metrics must be delivered within 2‑3 weeks of development, demanding high reliability.
Core real‑time scenarios : Executive and consumer metrics need instant, highly accurate visibility.
Real‑Time Warehouse Assurance Architecture
The assurance system mirrors offline processes but is adapted for streaming and is divided into three lifecycle stages.
Development stage : Define model‑design standards, development guidelines, and release checklists.
Production stage : Implement low‑level monitoring for timeliness, stability and accuracy; apply SLA‑driven optimizations.
Service stage : Specify upstream service standards, guarantee levels and value‑assessment metrics.
Key technical difficulties include the steep learning curve of Flink SQL, state‑size management, resource contention and unpredictable failure patterns.
Forward (Positive) Assurance
Forward assurance standardizes roughly 80% of routine requirements across the lifecycle; the remaining 20% are handled case by case through solution reviews.
Lifecycle Practices
Development : Conduct requirement analysis and provide template‑based SDKs for base‑layer and application‑layer development.
Testing : Perform quality verification, offline‑real‑time consistency checks and stress‑test resource estimation.
Release : Prepare pre‑deployment plans, verify deployment steps and define post‑release inspection procedures.
Service : Deploy monitoring and alerting to keep services within SLA.
Decommission : Reclaim resources and roll deployments back to their previous state.
Warehouse Layer Design
DWD layer : Stable logical processing of three data formats (client, server, binlog). Operations include scene splitting (sub‑topic generation), field standardization (dimension cleaning, dirty‑data filtering, IP‑lat/long mapping) and dimension association (KV + cache).
DWS layer : Minute‑level windowed aggregation and entity‑level aggregation to offload DWD pressure. Supports both dimension‑based minute windows and per‑entity (user/device) aggregates.
ADS layer : Multi‑dimensional aggregation for final output (e.g., PV/UV, ranking, KPI tables).
Standardized SDKs encapsulate common logic, eliminating code duplication and consolidating best practices.
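The DWD steps described above (dirty‑data filtering, field standardization, and dimension association through a KV store fronted by a cache) can be sketched roughly as follows. This is a minimal illustration, not Kuaishou's SDK: the record fields, the `DEVICE_DIM` table, and the function names are all hypothetical, and `functools.lru_cache` merely stands in for whatever local cache sits in front of the real KV service.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical dimension table standing in for an external KV store.
DEVICE_DIM = {"d1": {"os": "android"}, "d2": {"os": "ios"}}
kv_lookups = 0  # counts simulated remote reads, to show the cache effect


@lru_cache(maxsize=10_000)
def lookup_os(device_id: str) -> Optional[str]:
    """Simulated KV read; lru_cache plays the role of the local cache."""
    global kv_lookups
    kv_lookups += 1
    dim = DEVICE_DIM.get(device_id)
    return dim["os"] if dim else None


def standardize(raw: dict) -> Optional[dict]:
    """Return a cleaned DWD record, or None when the input is dirty."""
    device_id = (raw.get("device_id") or "").strip()
    ts = raw.get("ts")
    if not device_id or not isinstance(ts, int) or ts <= 0:
        return None  # dirty-data filtering
    return {
        "device_id": device_id,
        "ts": ts,
        "event": str(raw.get("event", "")).lower(),  # field standardization
        "os": lookup_os(device_id) or "unknown",     # dimension association
    }


raw_events = [
    {"device_id": " d1 ", "ts": 1, "event": "PLAY"},
    {"device_id": "d1", "ts": 2, "event": "Like"},
    {"device_id": "", "ts": 3, "event": "play"},    # dropped: empty device_id
    {"device_id": "d3", "ts": 4, "event": "play"},  # kept: unknown dimension
    {"device_id": "d2", "ts": None},                # dropped: missing timestamp
]
cleaned = [r for r in map(standardize, raw_events) if r is not None]
```

Packaging this logic once, rather than re‑implementing it per job, is the kind of consolidation the standardized SDKs are meant to provide.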
Progressive Window Solution
To compute per‑minute curves with high accuracy under out‑of‑order data, Kuaishou introduced a progressive window with two parameters: a day‑level window and a minute‑level step.
-- Pseudo‑SQL illustration
SELECT
    device_id,
    TUMBLE_START(event_time, INTERVAL '1' DAY) AS day_window,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS minute_step,
    COUNT(*) AS pv,
    BITMAP_UNION(pv_bitmap) AS bitmap
FROM source_table
GROUP BY device_id, day_window, minute_step;
The algorithm partitions records by device_id into the same task, advances watermarks per minute step, and merges partial results in a global window. Late or out‑of‑order events are retained, reducing data loss and shrinking the overall error from ~1 % to <0.5 % while keeping curve latency under one minute.
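To make the two parameters concrete, the sketch below simulates a progressive window in plain Python: events are bucketed by event time into minute steps inside a day‑level window, and each step reports the cumulative PV and UV since the start of the day. It is a batch simulation under simplifying assumptions (integer timestamps in seconds, an in‑memory set instead of a bitmap, no watermarks or distributed state), so it shows the windowing semantics rather than the production implementation. The function and variable names are illustrative.

```python
from collections import defaultdict

DAY = 86_400   # day-level window size, in seconds
STEP = 60      # minute-level step, in seconds


def progressive_curve(events):
    """events: iterable of (device_id, event_time_seconds), in any order.

    Returns {day_start: [(step_end, cumulative_pv, cumulative_uv), ...]}.
    """
    pv_by_step = defaultdict(lambda: defaultdict(int))
    devices_by_step = defaultdict(lambda: defaultdict(set))
    for device_id, ts in events:
        day_start = ts - ts % DAY
        step_end = ts - ts % STEP + STEP  # end of the minute step holding ts
        pv_by_step[day_start][step_end] += 1
        devices_by_step[day_start][step_end].add(device_id)

    curves = {}
    for day_start, steps in pv_by_step.items():
        pv, seen, curve = 0, set(), []
        for step_end in sorted(steps):
            pv += steps[step_end]                        # accumulate page views
            seen |= devices_by_step[day_start][step_end]  # accumulate devices
            curve.append((step_end, pv, len(seen)))
        curves[day_start] = curve
    return curves


# ("c", 10) arrives after events from the second minute, yet it is still
# counted in the first step because bucketing uses event time.
events = [("a", 5), ("b", 70), ("a", 65), ("c", 10), ("a", 130)]
curve = progressive_curve(events)[0]
# curve == [(60, 2, 2), (120, 4, 3), (180, 5, 3)]
```

Because every step is computed from the accumulated state of the whole day, an out‑of‑order event only changes the values from its own step onward instead of being dropped, which is the property the accuracy improvement relies on.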
Monitoring & SLA Metrics
Accuracy : Offline‑real‑time consistency, OLAP‑API consistency, and metric‑logic error alerts.
Timeliness : Input latency (ms), processing latency, output latency; thresholds include sub‑second for dashboards and ≤1 min for curves.
Stability : Service and OLAP engine uptime, Flink job recovery time, CPU/IO/Memory usage.
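A check over the accuracy and timeliness metrics might look like the following sketch. The 0.5% deviation bound and the latency limits come from the figures quoted in this article; the function names, the metric kinds, and the alert format are assumptions made for illustration, and the sub‑second dashboard target is approximated as an upper bound of one second.

```python
MAX_DEVIATION = 0.005  # 0.5 % accuracy-deviation target quoted in the article
LATENCY_LIMIT_S = {"dashboard": 1.0, "curve": 60.0}  # seconds


def relative_deviation(realtime: float, offline: float) -> float:
    """Deviation of the real-time value from the offline reference value."""
    if offline == 0:
        return 0.0 if realtime == 0 else float("inf")
    return abs(realtime - offline) / abs(offline)


def check_metric(name, kind, realtime, offline, latency_s):
    """Return a list of alert strings; an empty list means within SLA."""
    alerts = []
    deviation = relative_deviation(realtime, offline)
    if deviation > MAX_DEVIATION:
        alerts.append(f"{name}: accuracy deviation {deviation:.2%} exceeds 0.50%")
    limit = LATENCY_LIMIT_S[kind]
    if latency_s > limit:
        alerts.append(f"{name}: latency {latency_s:.1f}s exceeds {limit:.1f}s")
    return alerts


ok = check_metric("pv", "dashboard", realtime=1004, offline=1000, latency_s=0.3)
bad = check_metric("uv", "curve", realtime=990, offline=1000, latency_s=75)
```

Here `ok` is empty, while `bad` carries one accuracy alert and one timeliness alert.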
Reverse (Negative) Assurance
Stress‑test the full pipeline and inject failures to validate resilience.
Stress‑Test Criteria
Single‑job stress test to determine resource distribution and cluster orchestration.
Full‑link stress test to verify:
Input latency remains in milliseconds.
CPU utilization ≤ 60 % (leaving buffer for spikes).
Result consistency with expected population‑pack data.
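The three full‑link criteria above can be expressed as a simple pass/fail gate. The sketch below is one possible shape for such a gate, not a description of Kuaishou's tooling: the function name and inputs are hypothetical, and "remains in milliseconds" is interpreted here as staying below 1000 ms.

```python
def evaluate_stress_test(input_latency_ms, cpu_utilization, actual, expected,
                         max_latency_ms=1000.0, max_cpu=0.60):
    """Return a list of failed criteria; an empty list means the run passed.

    input_latency_ms: latency samples collected during the run
    cpu_utilization:  CPU samples as fractions (0.58 means 58 %)
    actual/expected:  result metrics keyed by name
    """
    failures = []
    worst_latency = max(input_latency_ms)
    if worst_latency >= max_latency_ms:
        failures.append(f"input latency reached {worst_latency:.0f} ms")
    peak_cpu = max(cpu_utilization)
    if peak_cpu > max_cpu:
        failures.append(f"CPU peaked at {peak_cpu:.0%}, above {max_cpu:.0%}")
    mismatched = sorted(k for k in expected if actual.get(k) != expected[k])
    if mismatched:
        failures.append(f"results differ for: {', '.join(mismatched)}")
    return failures


passed = evaluate_stress_test([12, 40, 85], [0.41, 0.58],
                              {"pv": 100}, {"pv": 100})
failed = evaluate_stress_test([12, 1500], [0.72],
                              {"pv": 99}, {"pv": 100})
```

The first run passes all three criteria; the second fails on latency, CPU headroom, and result consistency.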
Fault‑Injection Scenarios
Kafka topic failure.
Flink job crash.
Checkpoint failure.
Data‑center outage (multi‑room failover).
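What a fault‑injection test tries to establish for the job‑crash and checkpoint‑failure scenarios is that the job recovers to the same result it would have produced without the fault. The toy simulation below illustrates that idea with a hypothetical `CountingJob` class: it keeps per‑key counts, snapshots state together with its source offset, and on a crash restores the last successful snapshot and replays from there. It is a conceptual sketch only and does not use or model the actual Flink or Kafka APIs.

```python
class CountingJob:
    """Toy stand-in for a stateful streaming job that counts events per key."""

    def __init__(self, source, checkpoint_every=3):
        self.source = source
        self.checkpoint_every = checkpoint_every
        self.state = {}
        self.offset = 0
        self.checkpoint = ({}, 0)  # (state snapshot, source offset)
        self.processed = 0         # total events processed, including replays

    def run(self, crash_at=None, fail_checkpoints=frozenset()):
        while self.offset < len(self.source):
            if crash_at is not None and self.offset == crash_at:
                crash_at = None  # inject the crash only once
                snapshot, offset = self.checkpoint
                self.state, self.offset = dict(snapshot), offset  # restore
                continue
            key = self.source[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            self.processed += 1
            due = self.offset % self.checkpoint_every == 0
            if due and self.offset not in fail_checkpoints:
                self.checkpoint = (dict(self.state), self.offset)
        return self.state


source = list("abcabca")
baseline = CountingJob(source).run()

crash_job = CountingJob(source)
after_crash = crash_job.run(crash_at=5)

double_fault_job = CountingJob(source)
after_double_fault = double_fault_job.run(crash_at=5, fail_checkpoints={3})
```

Both faulty runs end with the same counts as the baseline. The difference is cost: when the checkpoint at offset 3 also fails, the job has to replay from the beginning, so it processes more events before catching up, which is the kind of recovery‑time effect these tests are meant to surface.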
Spring Festival Activity Real‑Time Assurance Practice
Requirements : sub‑second latency for dashboard metrics, minute‑level latency for curves, ≤0.5 % accuracy deviation, and flexible multi‑dimensional analysis.
Measures :
Forward‑facing monitoring & alerting (SLA, chain‑level, cluster‑level).
Reverse‑facing stress testing and capacity planning.
Disaster‑recovery: dual‑data‑center Kafka, hot‑standby Flink clusters, automated throttling.
Results : Core metrics achieved sub‑second latency, accuracy deviation <1 %, seamless failover, and successful handling of trillion‑scale data during peak traffic.
Future Planning
Standardize stress‑test and fault‑injection playbooks with automated execution and intelligent diagnosis.
Integrate batch and streaming pipelines via unified SQL to improve development efficiency and balance cluster load.
Expand real‑time warehouse capabilities, component libraries and SQL‑based development to reduce cost and increase productivity.
ITPUB
