Kuaishou Big Data Task Scheduling System: Architecture, Challenges, and Key Technologies
This article presents Kuaishou's large‑scale big‑data task scheduling system, describing its evolution from Airflow to the self‑developed Kwaiflow, the performance and reliability challenges of handling hundreds of thousands of tasks, and the design decisions that achieve low latency, high availability, and strong open capabilities.
The big‑data task scheduling system is a core component of Kuaishou's data middle platform, responsible for orchestrating all offline data processing jobs. As daily task volume grew to hundreds of thousands with millions of dependencies, the system faced performance degradation and stability issues.
The presentation is divided into four parts: background and challenges, overall system design, key technologies, and applications & results.
Background and Challenges
Massive task volume: tens of thousands of tasks and over a million dependency edges, stressing scheduler performance.
Complex dependency graph: tasks may have tens of thousands of upstream or downstream dependencies, affecting stability.
Diverse scenarios: multiple execution modes, task types, and scheduling requirements demand rich functionality and extensibility.
Goals
High performance: support million‑scale tasks with scheduling latency at second or sub‑second level.
High availability: 99.99% uptime, rapid fault detection and recovery.
Rich functionality: extensive scheduling capabilities, open APIs, and ecosystem integration.
System Evolution
2016 – Airflow used as the initial engine, but could not meet growing demands.
2019 – Self‑developed Kwaiflow 1.0 launched to support million‑scale tasks.
2020 – Kwaiflow 2.0 added routine, trigger, backfill, and block scheduling, integrating quality and security.
2021 H2 – Kwaiflow 3.0 built for second‑level scheduling and ten‑million‑scale tasks.
Overall Design
Kwaiflow follows a two‑layer entity model: Task (execution template, e.g., HiveTask, BashTask) and DAG (a collection of tasks with scheduling attributes). DAGs can depend on each other across cycles.
Key components:
API Server – unified entry point for all requests.
Scheduler – multiple specialized schedulers (routine, trigger, backfill, block) that generate task instances, perform time, dependency, and resource checks, then push to the Queue Service.
Queue Service – ack‑enabled message queue with isolated channels.
Worker – executes tasks either locally (process‑based) or in containers (k8s), supporting fast start‑up and resource isolation.
Supporting services – logging, alerting, event handling, instance lineage, etc.
Key Technologies
Low Scheduling Latency
Four stages from instance creation to execution: time‑ready, dependency‑ready, resource‑ready, and environment‑ready. Optimizations include:
High‑throughput timer with millisecond precision.
Database access tuned with indexes, read/write splitting, and sharding to keep a single query within 1‑3 ms.
Event‑driven state transitions using Akka actors, eliminating polling and keeping latency in milliseconds.
Environment preparation via image pre‑loading and warm‑up; container start‑up reduced from minutes to seconds, local execution in sub‑second range.
High Availability
Design includes master‑slave architecture, multiple worker groups, and robust communication protocols (scheduler‑executor, executor‑runner, external‑job). Fault‑tolerance mechanisms:
Automatic failover with HA master and multi‑worker redundancy.
Message ack to avoid loss and periodic reconciliation to prevent missed executions.
State‑machine‑driven execution to avoid duplicate runs.
Tiered protection prioritizes high‑priority tasks when resources are scarce, ensuring critical jobs receive sufficient resources during peak events.
Open Capability
Open RESTful APIs with multi‑language SDKs and detailed integration guides.
Standardized events for task operations, instance state changes, and diagnostics, enabling full lifecycle reconstruction.
Plugin framework for custom task types.
Pre‑hook and post‑hook mechanisms for code rendering, validation, and post‑execution quality checks.
Applications, Results, and Future
The Integrated Development Platform uses Kwaiflow as its orchestration engine, providing one‑stop offline data development, data quality, and service‑oriented APIs.
Compared with Airflow, Kwaiflow achieves:
Task capacity increased from ~10 k to >100 k (million‑scale design).
P99 scheduling latency reduced from 5 minutes to < 5 seconds.
Start‑rate of tens of thousands of tasks per minute versus a few thousand.
Availability of 99.99% versus Airflow's 99.5% and annual failures reduced from ~8 to ~1.
It now serves dozens of platforms across Kuaishou (metrics, AB testing, user profiling, ML, etc.) and supports all major business units.
Future directions focus on:
More generic support for trigger‑based scheduling and complex online workflows.
Further latency reduction to sub‑second and scaling to tens of millions of instances.
Enhanced container resource management, automated testing, large‑scale simulation, and faster deployment pipelines.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
