Big Data 22 min read

Kuaishou Big Data Task Scheduling System: Architecture, Challenges, and Key Technologies

This article presents Kuaishou's large‑scale big‑data task scheduling system, describing its evolution from Airflow to the self‑developed Kwaiflow, the performance and reliability challenges of handling hundreds of thousands of tasks, and the design decisions that achieve low latency, high availability, and strong open capabilities.

DataFunTalk

Apr 12, 2022

Kuaishou Big Data Task Scheduling System: Architecture, Challenges, and Key Technologies

The big‑data task scheduling system is a core component of Kuaishou's data middle platform, responsible for orchestrating all offline data processing jobs. As daily task volume grew to hundreds of thousands with millions of dependencies, the system faced performance degradation and stability issues.

The presentation is divided into four parts: background and challenges, overall system design, key technologies, and applications & results.

Background and Challenges

Massive task volume: tens of thousands of tasks and over a million dependency edges, stressing scheduler performance.

Complex dependency graph: tasks may have tens of thousands of upstream or downstream dependencies, affecting stability.

Diverse scenarios: multiple execution modes, task types, and scheduling requirements demand rich functionality and extensibility.

Goals

High performance: support million‑scale tasks with scheduling latency at second or sub‑second level.

High availability: 99.99% uptime, rapid fault detection and recovery.

Rich functionality: extensive scheduling capabilities, open APIs, and ecosystem integration.

System Evolution

2016 – Airflow used as the initial engine, but could not meet growing demands.

2019 – Self‑developed Kwaiflow 1.0 launched to support million‑scale tasks.

2020 – Kwaiflow 2.0 added routine, trigger, backfill, and block scheduling, integrating quality and security.

2021 H2 – Kwaiflow 3.0 built for second‑level scheduling and ten‑million‑scale tasks.

Overall Design

Kwaiflow follows a two‑layer entity model: Task (execution template, e.g., HiveTask, BashTask) and DAG (a collection of tasks with scheduling attributes). DAGs can depend on each other across cycles.

Key components:

API Server – unified entry point for all requests.

Scheduler – multiple specialized schedulers (routine, trigger, backfill, block) that generate task instances, perform time, dependency, and resource checks, then push to the Queue Service.

Queue Service – ack‑enabled message queue with isolated channels.

Worker – executes tasks either locally (process‑based) or in containers (k8s), supporting fast start‑up and resource isolation.

Supporting services – logging, alerting, event handling, instance lineage, etc.

Key Technologies

Low Scheduling Latency

Four stages from instance creation to execution: time‑ready, dependency‑ready, resource‑ready, and environment‑ready. Optimizations include:

High‑throughput timer with millisecond precision.

Database access tuned with indexes, read/write splitting, and sharding to keep a single query within 1‑3 ms.

Event‑driven state transitions using Akka actors, eliminating polling and keeping latency in milliseconds.

Environment preparation via image pre‑loading and warm‑up; container start‑up reduced from minutes to seconds, local execution in sub‑second range.

High Availability

Design includes master‑slave architecture, multiple worker groups, and robust communication protocols (scheduler‑executor, executor‑runner, external‑job). Fault‑tolerance mechanisms:

Automatic failover with HA master and multi‑worker redundancy.

Message ack to avoid loss and periodic reconciliation to prevent missed executions.

State‑machine‑driven execution to avoid duplicate runs.

Tiered protection prioritizes high‑priority tasks when resources are scarce, ensuring critical jobs receive sufficient resources during peak events.

Open Capability

Open RESTful APIs with multi‑language SDKs and detailed integration guides.

Standardized events for task operations, instance state changes, and diagnostics, enabling full lifecycle reconstruction.

Plugin framework for custom task types.

Pre‑hook and post‑hook mechanisms for code rendering, validation, and post‑execution quality checks.

Applications, Results, and Future

The Integrated Development Platform uses Kwaiflow as its orchestration engine, providing one‑stop offline data development, data quality, and service‑oriented APIs.

Compared with Airflow, Kwaiflow achieves:

Task capacity increased from ~10 k to >100 k (million‑scale design).

P99 scheduling latency reduced from 5 minutes to < 5 seconds.

Start‑rate of tens of thousands of tasks per minute versus a few thousand.

Availability of 99.99% versus Airflow's 99.5% and annual failures reduced from ~8 to ~1.

It now serves dozens of platforms across Kuaishou (metrics, AB testing, user profiling, ML, etc.) and supports all major business units.

Future directions focus on:

More generic support for trigger‑based scheduling and complex online workflows.

Further latency reduction to sub‑second and scaling to tens of millions of instances.

Enhanced container resource management, automated testing, large‑scale simulation, and faster deployment pipelines.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems High Availability Task scheduling Kuaishou Kwaiflow

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.