
How Kuaishou Guarantees Real‑Time Data Warehouse Performance at Billion‑Scale Events

This article describes Kuaishou's real‑time data warehouse architecture, the business challenges posed by massive traffic and diverse requirements, and the forward‑ and reverse‑assurance strategies (lifecycle standards, monitoring, fault‑injection testing, and a Spring Festival case study) that together deliver high stability, low latency, and an accuracy deviation below 0.5 % for billion‑scale streaming workloads.

ITPUB

Business Characteristics and Real‑Time Warehouse Challenges

Kuaishou processes trillions of events per day. Its data pipeline must handle massive volume, diverse B2B/B2C use cases, frequent activity windows, and sub‑second latency for executive dashboards and consumer‑facing applications.

Large data volume : Model design and source‑side performance must be optimized to avoid excessive reads.

Diverse requirements : Activities, dashboards, B2B/B2C services and search each have distinct SLA targets; a unified model reduces duplicated effort.

Frequent activity scenarios : Hundreds of metrics must be delivered within 2‑3 weeks of development, demanding high reliability.

Core real‑time scenarios : Executive and consumer metrics need instant, highly accurate visibility.

Real‑Time Warehouse Assurance Architecture

The assurance system mirrors offline processes but is adapted for streaming and is divided into three lifecycle stages.

Development stage : Define model‑design standards, development guidelines, and release checklists.

Production stage : Implement low‑level monitoring for timeliness, stability and accuracy; apply SLA‑driven optimizations.

Service stage : Specify upstream service standards, guarantee levels and value‑assessment metrics.

Key technical difficulties include the steep learning curve of Flink SQL, state‑size management, resource contention and unpredictable failure patterns.

Forward (Positive) Assurance

Standardize ~80 % of routine requirements across the lifecycle; the remaining 20 % are handled via review‑driven solutions.

Lifecycle Practices

Development : Conduct requirement analysis and provide template‑based SDKs for base‑layer and application‑layer development.

Testing : Perform quality verification, offline‑real‑time consistency checks and stress‑test resource estimation.

Release : Prepare pre‑deployment plans, verify deployment steps and define post‑release inspection procedures.

Service : Deploy monitoring and alerting to keep services within SLA.

Decommission : Reclaim resources and revert the deployment to its pre‑launch state.

Warehouse Layer Design

DWD layer : Stable logical processing of three data formats (client, server, binlog). Operations include scene splitting (sub‑topic generation), field standardization (dimension cleaning, dirty‑data filtering, IP‑lat/long mapping) and dimension association (KV + cache).

DWS layer : Minute‑level windowed aggregation and entity‑level aggregation to offload DWD pressure. Supports both dimension‑based minute windows and per‑entity (user/device) aggregates.

ADS layer : Multi‑dimensional aggregation for final output (e.g., PV/UV, ranking, KPI tables).

Standardized SDKs encapsulate common logic, eliminating code duplication and consolidating best practices.
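
The layering above can be illustrated with a small, self-contained Python sketch. It is not Kuaishou's SDK: KV_STORE, the field names (device_id, ts, city) and all function names are hypothetical, and an in-memory dictionary stands in for the KV service and its cache.

```python
from collections import defaultdict

# Hypothetical dimension data standing in for the external KV service.
KV_STORE = {"d1": {"city": "Beijing"}, "d2": {"city": "Shanghai"}}
_dim_cache = {}  # local cache in front of the KV store


def lookup_dim(device_id):
    """Dimension association: check the cache first, then the KV store."""
    if device_id not in _dim_cache:
        _dim_cache[device_id] = KV_STORE.get(device_id, {"city": "unknown"})
    return _dim_cache[device_id]


def dwd_clean(raw_events):
    """DWD: filter dirty records, standardize fields, attach dimensions."""
    for ev in raw_events:
        device_id = str(ev.get("device_id") or "").strip().lower()
        if not device_id or ev.get("ts") is None:
            continue  # dirty-data filtering
        yield {
            "device_id": device_id,
            "ts": int(ev["ts"]),  # epoch seconds
            "city": lookup_dim(device_id)["city"],
        }


def dws_minute_agg(dwd_events):
    """DWS: PV and the set of devices per (minute, city) bucket."""
    agg = defaultdict(lambda: {"pv": 0, "devices": set()})
    for ev in dwd_events:
        minute = ev["ts"] // 60 * 60  # floor to the start of the minute
        bucket = agg[(minute, ev["city"])]
        bucket["pv"] += 1
        bucket["devices"].add(ev["device_id"])
    return agg


def ads_pv_uv(dws):
    """ADS: roll the minute buckets up to per-city PV/UV for output."""
    out = defaultdict(lambda: {"pv": 0, "devices": set()})
    for (_minute, city), bucket in dws.items():
        out[city]["pv"] += bucket["pv"]
        out[city]["devices"] |= bucket["devices"]
    return {c: {"pv": v["pv"], "uv": len(v["devices"])} for c, v in out.items()}
```

Chaining the three functions, ads_pv_uv(dws_minute_agg(dwd_clean(raw_events))), mirrors the DWD, DWS and ADS order. Keeping device sets in the DWS buckets, rather than per-minute UV counts, is what lets the ADS step deduplicate across minutes.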

Progressive Window Solution

To compute per‑minute curves with high accuracy under out‑of‑order data, Kuaishou introduced a progressive window with two parameters: a day‑level window and a minute‑level step.

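-- Illustrative Flink SQL using the CUMULATE window function (Flink 1.13+),
-- which takes a step and a maximum window size. source_table and its columns
-- are placeholders; COUNT(DISTINCT ...) stands in for the bitmap-based UV.
SELECT
  window_start AS day_window_start,
  window_end AS minute_step_end,
  COUNT(*) AS pv,
  COUNT(DISTINCT device_id) AS uv
FROM TABLE(
  CUMULATE(TABLE source_table, DESCRIPTOR(event_time),
           INTERVAL '1' MINUTE, INTERVAL '1' DAY))
GROUP BY window_start, window_end;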

The algorithm partitions records by device_id into the same task, advances watermarks per minute step, and merges partial results in a global window. Late or out‑of‑order events are retained, reducing data loss and shrinking the overall error from ~1 % to <0.5 % while keeping curve latency under one minute.
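
To make the window semantics concrete, here is a minimal Python simulation of a progressive (cumulative) window. It is a batch re-computation for illustration only, not the streaming implementation described above; the constants, the function name and the (event_ts, device_id) tuple layout are assumptions.

```python
from collections import defaultdict

DAY = 86_400  # day-level window size, in seconds
STEP = 60     # minute-level step, in seconds


def progressive_curve(events, up_to_ts):
    """Return [(step_end, cumulative_pv, cumulative_uv), ...] for one day.

    events: iterable of (event_ts, device_id) in arrival order, which may
    differ from event-time order. Each point covers [day_start, step_end),
    so a late event is still counted in every step that ends after its
    event time.
    """
    per_step_pv = defaultdict(int)
    per_step_devices = defaultdict(set)
    for ts, device_id in events:
        step_end = (ts // STEP + 1) * STEP  # first step boundary after ts
        per_step_pv[step_end] += 1
        per_step_devices[step_end].add(device_id)

    day_start = up_to_ts // DAY * DAY
    curve, pv, seen = [], 0, set()
    for step_end in range(day_start + STEP, up_to_ts + 1, STEP):
        pv += per_step_pv.get(step_end, 0)
        seen |= per_step_devices.get(step_end, set())
        curve.append((step_end, pv, len(seen)))
    return curve


# Events arrive out of order: (20, "a") and (65, "a") show up late.
events = [(10, "a"), (70, "b"), (20, "a"), (130, "c"), (65, "a")]
print(progressive_curve(events, up_to_ts=180))
# -> [(60, 2, 1), (120, 4, 2), (180, 5, 3)]
```

In this sample, summing independent per-minute UV values would give 1 + 2 + 1 = 4, although only 3 distinct devices exist. Accumulating from the start of the day avoids that double counting.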

Monitoring & SLA Metrics

Accuracy : Offline‑real‑time consistency, OLAP‑API consistency, and metric‑logic error alerts.

Timeliness : Input latency (ms), processing latency, output latency; thresholds include sub‑second for dashboards and ≤1 min for curves.

Stability : Service and OLAP engine uptime, Flink job recovery time, CPU/IO/Memory usage.
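
A minimal sketch of how the accuracy and timeliness thresholds might be checked. The thresholds come from the targets quoted in this article; the SLA dictionary layout and the function names are assumptions, not part of any Kuaishou tooling.

```python
def relative_deviation(realtime_value, offline_value):
    """Relative deviation of the real-time result from the offline baseline."""
    if offline_value == 0:
        return 0.0 if realtime_value == 0 else float("inf")
    return abs(realtime_value - offline_value) / abs(offline_value)


# Thresholds quoted in the article; the structure itself is hypothetical.
SLA = {
    "max_deviation": 0.005,      # 0.5 % accuracy deviation
    "dashboard_latency_s": 1.0,  # sub-second for dashboards
    "curve_latency_s": 60.0,     # at most one minute for curves
}


def check_sla(metric_kind, realtime_value, offline_value, latency_s):
    """Return a list of alert strings; an empty list means within SLA."""
    alerts = []
    dev = relative_deviation(realtime_value, offline_value)
    if dev > SLA["max_deviation"]:
        alerts.append(f"accuracy: deviation {dev:.2%} exceeds target")
    if metric_kind == "dashboard":
        on_time = latency_s < SLA["dashboard_latency_s"]
    else:
        on_time = latency_s <= SLA["curve_latency_s"]
    if not on_time:
        alerts.append(f"timeliness: {metric_kind} latency {latency_s:.1f}s out of SLA")
    return alerts
```

For example, check_sla("dashboard", 1003, 1000, 0.4) returns an empty list, while check_sla("curve", 990, 1000, 61) returns one accuracy alert and one timeliness alert.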

Reverse (Negative) Assurance

Stress‑test the full pipeline and inject failures to validate resilience.

Stress‑Test Criteria

Single‑job stress test to determine resource distribution and cluster orchestration.

Full‑link stress test to verify:

Input latency remains in milliseconds.

CPU utilization ≤ 60 % (leaving buffer for spikes).

Results consistent with the expected crowd‑package (audience segment) data.
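
The criteria above can be turned into a rough sizing and pass/fail check. The sketch below assumes CPU usage grows linearly with throughput, which is a simplification, and treats "milliseconds" as "under 1000 ms"; the function and parameter names are hypothetical.

```python
import math


def required_parallelism(peak_eps, per_slot_eps_at_full_cpu, cpu_ceiling=0.60):
    """Slots needed so peak traffic keeps CPU at or below cpu_ceiling.

    peak_eps: expected peak events per second for the job.
    per_slot_eps_at_full_cpu: throughput of one slot at roughly 100 % CPU,
        as measured in a single-job stress test.
    """
    if peak_eps < 0 or per_slot_eps_at_full_cpu <= 0 or not 0 < cpu_ceiling <= 1:
        raise ValueError("invalid input")
    usable_eps_per_slot = per_slot_eps_at_full_cpu * cpu_ceiling
    return max(1, math.ceil(peak_eps / usable_eps_per_slot))


def stress_test_passed(input_latency_ms, cpu_utilization, results_match,
                       max_latency_ms=1000, cpu_ceiling=0.60):
    """Apply the three full-link criteria: latency, CPU headroom, consistency."""
    return (input_latency_ms < max_latency_ms
            and cpu_utilization <= cpu_ceiling
            and results_match)


# 1,000,000 events/s at peak, 50,000 events/s per slot at full CPU.
print(required_parallelism(1_000_000, 50_000))  # -> 34
```

With a 60 % ceiling each slot is budgeted for 30,000 events/s, so the example needs 34 slots rather than the 20 that full utilization would suggest.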

Fault‑Injection Scenarios

Kafka topic failure.

Flink job crash.

Checkpoint failure.

Data‑center outage (multi‑data‑center failover).
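
As a toy illustration of what a crash or checkpoint-failure drill verifies, the sketch below models a counting job that checkpoints its offset and state, injects a crash, restores, and replays. It is a simplified model under stated assumptions and does not use Flink's API; CountingJob and run_with_crash are hypothetical names.

```python
class CountingJob:
    """Toy stand-in for a streaming job: sums events and checkpoints state."""

    def __init__(self, checkpoint_every=3):
        self.checkpoint_every = checkpoint_every
        self.count = 0
        self.offset = 0
        self.checkpoint = (0, 0)  # (offset, count) of the last good checkpoint

    def process(self, log, crash_at=None, fail_checkpoints=False):
        """Consume the log from the current offset, optionally injecting faults."""
        while self.offset < len(log):
            if crash_at is not None and self.offset == crash_at:
                raise RuntimeError("injected crash")
            self.count += log[self.offset]
            self.offset += 1
            if self.offset % self.checkpoint_every == 0 and not fail_checkpoints:
                self.checkpoint = (self.offset, self.count)

    def restore(self):
        """Roll back to the last checkpoint; the source replays from there."""
        self.offset, self.count = self.checkpoint


def run_with_crash(log, crash_at, fail_checkpoints=False):
    """Return (final_count, replayed_events) after one injected crash."""
    job = CountingJob()
    replayed = 0
    try:
        job.process(log, crash_at=crash_at, fail_checkpoints=fail_checkpoints)
    except RuntimeError:
        replayed = job.offset - job.checkpoint[0]
        job.restore()
        job.process(log)  # restart without the injected fault
    return job.count, replayed


log = [1] * 10
print(run_with_crash(log, crash_at=7))                         # -> (10, 1)
print(run_with_crash(log, crash_at=7, fail_checkpoints=True))  # -> (10, 7)
```

Both runs end with the correct count, but when checkpoints fail the job replays 7 events instead of 1. That is why checkpoint failure is worth drilling on its own: it lengthens recovery even when the result stays correct.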

Spring Festival Activity Real‑Time Assurance Practice

Requirements : sub‑second latency for dashboard metrics, minute‑level latency for curves, ≤0.5 % accuracy deviation, and flexible multi‑dimensional analysis.

Measures :

Forward‑facing monitoring & alerting (SLA, chain‑level, cluster‑level).

Reverse‑facing stress testing and capacity planning.

Disaster‑recovery: dual‑data‑center Kafka, hot‑standby Flink clusters, automated throttling.

Results : Core metrics achieved sub‑second latency, accuracy deviation <0.5 %, seamless failover, and successful handling of trillion‑scale data during peak traffic.

Future Planning

Standardize stress‑test and fault‑injection playbooks with automated execution and intelligent diagnosis.

Integrate batch and streaming pipelines via unified SQL to improve development efficiency and balance cluster load.

Expand real‑time warehouse capabilities, component libraries and SQL‑based development to reduce cost and increase productivity.
