How Kuaishou Scaled Its Big Data Platform to Handle EB‑Level Data and Millions of Daily Tasks

This article details Kuaishou's one‑stop big data development platform, covering its massive scale, low‑code and real‑time capabilities, multi‑layer architecture, SLA guarantees, diagnostic tools, and future plans to further lower development barriers and democratize data engineering.

Data Thinking Notes

Jan 16, 2023

0 Introduction

Big data is a core productive asset for an enterprise, and data development is where most data assets are built. As Kuaishou's business has grown rapidly, users in product, technology, operations, and data engineering roles increasingly do data development themselves. This raises several challenges: keeping the entry barrier low, handling EB‑scale data and millions of daily tasks, reducing operational cost, and delivering data on time.

1 Kuaishou Big Data Platform

Kuaishou is a short‑video community with 347 million daily active users, 586.7 million monthly active users, and an average usage time of 125.2 minutes. Over 25% of monthly active users are content creators, and the platform has accumulated more than 20 billion mutual connections.

The platform supports short video, live streaming, e‑commerce, and advertising, operating at a scale of tens of thousands of machines, EB‑level total data, PB‑level daily growth, and hundreds of thousands of daily tasks.

2 Kuaishou Big Data Development Platform Practice

2.1 Platform Positioning and Evolution

The big data lifecycle includes six stages: data reporting, data collection, data synchronization, data processing, data distribution, and data service.

Kuaishou's one‑stop platform covers data synchronization, processing, and distribution, providing a unified development and operation experience.

Development history is divided into three phases:

Original era (2016): Development relied on open‑source tools such as Airflow and Sqoop, with a few hundred users and thousands of tasks.

1.0 era (2019‑2021): A unified data tool team built a one‑stop platform covering sync, offline, and real‑time development, scaling to thousands of users and hundreds of thousands of tasks.

2.0 era (2022 onward): With diverse users (over 90% non‑data engineers), the platform introduced low‑code and intelligent features to lower the development threshold.

2.2 Overall Architecture

The architecture consists of three layers:

Engine layer: Unified scheduling engine Kwaiflow and data catalog Kwaicat.

Service layer: Separate services for data sync, offline development, and real‑time development.

Product layer: Unified UI offering data query, development center, operation center, and system management.

2.2.1 Data Synchronization

Typical scenarios include real‑time sync from Kafka to Hive/ClickHouse and offline batch sync between heterogeneous stores.

Architecture layers:

API layer: Task creation, update, and monitoring.

Storage layer: Stores metadata such as task info and checkpoints.

Scheduling layer: Real‑time sync uses a coordinator for scheduling and an executor for data pulling, parsing, and writing; offline sync relies on Kwaiflow.

Challenges such as massive traffic and tail‑task delays are addressed by multithreaded processing, hierarchical throttling, task re‑distribution, and benchmark‑based resource balancing, reducing average tail time to about 2 minutes.
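The idea of benchmark‑based balancing and task re‑distribution can be illustrated with a small sketch. It is not Kuaishou's implementation: the function name `rebalance`, the partition backlog counts, and the per‑executor throughput figures are all hypothetical. It greedily places the largest backlogs first on whichever executor would finish them soonest, which tends to shorten the slowest (tail) executor.

```python
def rebalance(backlogs, throughput):
    """Greedy sketch: spread partitions so estimated finish times stay even.

    backlogs:   partition name -> pending records (hypothetical input)
    throughput: executor name  -> benchmarked records per second (hypothetical)
    Returns (assignment, finish) where finish is seconds per executor.
    """
    finish = {name: 0.0 for name in throughput}
    assignment = {name: [] for name in throughput}
    # Largest backlogs first; ties broken by partition name for determinism.
    ordered = sorted(backlogs.items(), key=lambda kv: (-kv[1], kv[0]))
    for partition, pending in ordered:
        # Pick the executor whose finish time would be lowest after taking it.
        best = min(sorted(throughput),
                   key=lambda n: finish[n] + pending / throughput[n])
        assignment[best].append(partition)
        finish[best] += pending / throughput[best]
    return assignment, finish


if __name__ == "__main__":
    plan, eta = rebalance(
        {"p0": 900, "p1": 300, "p2": 300, "p3": 300},
        {"fast": 30.0, "slow": 10.0},
    )
    print(plan, eta)
```

A production scheduler would also have to account for data locality, throttling limits, and the cost of moving a running task, none of which this sketch models.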

2.2.2 Offline Development

Offline development provides scheduling via Kwaiflow, code assistance, syntax validation, environment isolation, and comprehensive operation capabilities including task diagnostics and health checks.

Key issues include low development efficiency, poor quality, and complex task dependency chains. The platform standardizes the development workflow into five steps: task writing, configuration, debugging, review, and deployment, with SQLScan for parsing and validation, and SQL rendering for environment isolation.
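One way to picture SQL rendering for environment isolation is a template whose database and partition placeholders are filled in per environment, so a debug run never touches production tables. The sketch below assumes this interpretation; the environment names, the `dw_dev`/`dw` databases, and the `render` helper are hypothetical and not taken from the platform.

```python
from string import Template

# Hypothetical per-environment parameters.
ENVIRONMENTS = {
    "dev": {"db": "dw_dev", "ds": "20230101"},
    "prod": {"db": "dw", "ds": "20230116"},
}


def render(sql_template, env, **overrides):
    """Fill ${...} placeholders with the chosen environment's values.

    Template.substitute raises KeyError if a placeholder has no value,
    which doubles as a cheap check before a task is submitted.
    """
    params = {**ENVIRONMENTS[env], **overrides}
    return Template(sql_template).substitute(params)


TASK_SQL = (
    "INSERT OVERWRITE TABLE ${db}.user_daily "
    "SELECT user_id, COUNT(*) AS pv FROM ${db}.events "
    "WHERE ds = '${ds}' GROUP BY user_id"
)

if __name__ == "__main__":
    print(render(TASK_SQL, "dev"))
    print(render(TASK_SQL, "prod"))
```

The same task text is written once and reviewed once; only the rendered output differs between debugging and deployment.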

2.2.3 Real‑Time Development

Real‑time development supports Flink SQL/Jar tasks, offering API, service, and monitoring layers. To lower the barrier, logical tables abstract heterogeneous sources, allowing users to write SQL that is automatically optimized into Flink SQL, improving development efficiency by over 70% compared to Jar mode.
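As a rough illustration of the logical‑table idea, the sketch below keeps a table's schema separate from its physical binding and renders both into a Flink SQL `CREATE TABLE` statement. The `LogicalTable` class and `to_flink_ddl` function are hypothetical, and the article does not describe how Kuaishou's optimizer works, so this shows only the translation step. The connector options in the usage example follow the open‑source Flink Kafka SQL connector; the topic and broker address are placeholders.

```python
from dataclasses import dataclass


@dataclass
class LogicalTable:
    """Hypothetical logical table: schema plus a physical binding."""
    name: str
    columns: dict   # column name -> Flink SQL type
    options: dict   # connector options for the underlying store


def to_flink_ddl(table):
    """Render a logical table as a Flink SQL CREATE TABLE statement."""
    cols = ",\n".join(f"  `{c}` {t}" for c, t in table.columns.items())
    opts = ",\n".join(f"  '{k}' = '{v}'" for k, v in table.options.items())
    return f"CREATE TABLE `{table.name}` (\n{cols}\n) WITH (\n{opts}\n)"


if __name__ == "__main__":
    clicks = LogicalTable(
        name="clicks",
        columns={"user_id": "BIGINT", "url": "STRING", "ts": "TIMESTAMP(3)"},
        options={
            "connector": "kafka",
            "topic": "clicks",                                 # placeholder
            "properties.bootstrap.servers": "localhost:9092",  # placeholder
            "format": "json",
        },
    )
    print(to_flink_ddl(clicks))
```

Users would then write queries against `clicks` without knowing whether it is backed by Kafka or another store; swapping the binding changes only the `options` dictionary.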

2.2.4 SLA Assurance

Given the growing task volume and complex dependencies, Kuaishou adopts a hierarchical SLA strategy combining organization, standards, and tooling. Tasks are prioritized from P0 to P3, with resources allocated accordingly. Tools provide priority management, progress prediction, and alerting, enabling early detection of potential delays (typically 90 minutes before impact).
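Progress prediction over a dependency chain can be sketched as follows. This is an assumed approach, not the platform's documented algorithm: each task's predicted finish is the latest predicted finish among its upstream tasks plus its own expected duration, and any task predicted to finish after its deadline is flagged. The task names, durations (in minutes), and deadlines are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def predict_finish(deps, duration, started_at=0):
    """Predict finish time (minutes) for every task in a dependency graph.

    deps:     task -> list of upstream tasks
    duration: task -> expected run time in minutes (e.g. a historical average)
    """
    finish = {}
    for task in TopologicalSorter(deps).static_order():
        upstream = deps.get(task, ())
        ready = max((finish[u] for u in upstream), default=started_at)
        finish[task] = ready + duration[task]
    return finish


def at_risk(finish, deadlines):
    """Return task -> minutes late for tasks predicted to miss a deadline."""
    return {t: finish[t] - d for t, d in deadlines.items() if finish[t] > d}


if __name__ == "__main__":
    deps = {"ods": [], "dim": [], "dwd": ["ods"], "dws": ["dwd"],
            "report": ["dws", "dim"]}
    duration = {"ods": 30, "dim": 10, "dwd": 40, "dws": 50, "report": 20}
    finish = predict_finish(deps, duration)
    print(at_risk(finish, {"report": 120}))
```

Because the prediction is recomputed from upstream state, a slow `ods` run would surface as a risk on `report` well before `report` itself starts, which is the kind of early warning the section describes.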

2.3 Low‑Code Development Practice

2.3.1 Background

Client‑side event‑tracking data falls into two kinds: business events and technical events. Analysis of technical events has suffered from long scheduling cycles and a high entry barrier.

2.3.2 Solution Approach

For business events, the focus is on quality and complex processing; for technical events, the focus is on efficiency and simple processing. Kuaishou builds layered capabilities: standard templated tasks, custom chained tasks, and scenario‑specific solutions.

2.3.3 Technical Architecture

The low‑code service leverages metadata, offline, and real‑time capabilities, using a “configuration‑as‑production” model where users configure tasks via forms, automatically generating and executing optimized code.
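A minimal sketch of the "configuration‑as‑production" idea: a form submission is represented as a plain dictionary and turned into an aggregation query. The field names (`source`, `dimensions`, `metrics`, `filter`), the allowed aggregate list, and the `generate_sql` function are assumptions made for illustration; the article does not specify the form schema or the generated code.

```python
# Aggregates a form is allowed to request (hypothetical allowlist).
ALLOWED_AGGS = {"COUNT", "SUM", "AVG", "MIN", "MAX"}


def generate_sql(config):
    """Turn a form-style config dict into a SQL aggregation query."""
    dims = list(config.get("dimensions", []))
    metrics = []
    for alias, (agg, column) in config["metrics"].items():
        if agg.upper() not in ALLOWED_AGGS:
            raise ValueError(f"unsupported aggregate: {agg}")
        metrics.append(f"{agg.upper()}({column}) AS {alias}")
    sql = f"SELECT {', '.join(dims + metrics)} FROM {config['source']}"
    if config.get("filter"):
        sql += f" WHERE {config['filter']}"
    if dims:
        sql += " GROUP BY " + ", ".join(dims)
    return sql


if __name__ == "__main__":
    form = {
        "source": "app_tech_events",
        "dimensions": ["app_version", "os"],
        "metrics": {"crashes": ("count", "event_id")},
        "filter": "event_type = 'crash'",
    }
    print(generate_sql(form))
```

Restricting the form to a fixed set of building blocks is what lets the platform validate and optimize the generated task, at the cost of expressiveness compared with hand‑written SQL.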

2.3.4 Benefits

Benefits include lowered entry barriers, streamlined operation paths (single‑platform workflow), cost reduction through resource‑efficient execution, and an average productivity increase of 70% for users.

3 Future Planning

Beyond the 2.0 era, Kuaishou aims to further democratize data development through scenario‑driven low‑code interfaces, automatic logical‑to‑physical model generation, intelligent scheduling and diagnostics, and a unified batch‑stream architecture based on Hudi + Flink.
