Tencent Unified Big Data Scheduling Platform – Architecture, Design, and Operations
This article gives an overview of Tencent's self-developed Unified Scheduling Platform: its system architecture, design challenges, performance optimizations, resource-fair scheduling mechanism, operational metrics, and future roadmap, followed by Q&A highlights. Together these show how the platform supports offline data processing at massive scale.
01 System Overview
The Unified Scheduling Platform is Tencent's self‑developed distributed offline task scheduler that coordinates massive data collection, analysis, and export workflows, acting as a bridge between data‑application interfaces and downstream compute/storage resources.
Typical Scenario
Data is collected from user‑generated sources (text, relational, message data), ingested into an ODS layer, processed via Hive or Spark into DWD/DWS layers, and finally exported to downstream services; the platform orchestrates these steps as a DAG of dependent tasks.
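As a minimal sketch of what "a DAG of dependent tasks" means in practice, the snippet below derives a valid run order for the scenario above using Python's standard `graphlib` module. The task names are hypothetical and the platform's real task model is not described in the source.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
pipeline = {
    "ingest_ods": set(),
    "build_dwd": {"ingest_ods"},
    "build_dws": {"build_dwd"},
    "export_downstream": {"build_dws"},
}


def run_order(dag):
    """Return an execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)
```

A scheduler would not usually run the whole order up front; it would release each task once all of its upstream tasks have finished, but the dependency rule is the same.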
02 System Design
1. First‑generation architecture challenges
The original design could not keep up with growing task volume, suffering from core module scaling limits, database bottlenecks, insufficient resource control, priority mis‑management, and low‑throughput instance generation.
2. Open‑source scheduling solutions
Existing open‑source tools such as Airflow struggle with million‑task scales and high maintenance costs; Tencent required a horizontally scalable solution tailored to its massive, cross‑region workloads.
3. Next‑generation architecture
The new design emphasizes scalability, high availability, high throughput, and flexibility. It introduces a BaseMaster to eliminate single points of failure, replaces MySQL with the distributed Tbase database, adopts HBase for cold‑data storage, and implements a resource‑fair scheduling algorithm to dynamically control concurrency.
Instance Generation
Performance bottlenecks were addressed by sharding tasks, sorting them into hash-based buckets, and writing instances in mini-batches. The result was a 30× speedup in instance generation, which enables minute-level scheduling for millions of daily tasks.
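The three techniques can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: the function names, the CRC32 hash, the shard count, and the `write_batch` callback (standing in for a database write) are all hypothetical.

```python
import zlib
from itertools import islice


def shard_of(task_id, num_shards):
    # crc32 is stable across processes, so a task always maps to the same shard.
    return zlib.crc32(task_id.encode("utf-8")) % num_shards


def bucket_tasks(task_ids, num_shards):
    """Distribute task ids into hash buckets and sort each bucket."""
    buckets = [[] for _ in range(num_shards)]
    for task_id in task_ids:
        buckets[shard_of(task_id, num_shards)].append(task_id)
    for bucket in buckets:
        bucket.sort()
    return buckets


def mini_batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def generate_instances(task_ids, run_date, num_shards, batch_size, write_batch):
    """Create one (task_id, run_date) instance per task and write them in mini-batches."""
    written = 0
    for bucket in bucket_tasks(task_ids, num_shards):
        instances = [(task_id, run_date) for task_id in bucket]
        for batch in mini_batches(instances, batch_size):
            write_batch(batch)  # assumed to be one round trip to the database
            written += len(batch)
    return written
```

Batching trades a little latency per instance for far fewer round trips to the database, which is the usual reason mini-batch writes raise throughput.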
Task Dispatch
Two major issues, report latency and service pressure, were mitigated in two ways. Dynamic priority calculation ranks tasks by deadline proximity, business criticality, frequency, job type, and dependency depth. Dynamic resource control abstracts every service as a resource governed by a quota.
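A minimal sketch of both ideas is shown below. The source lists the five priority factors but not how they are combined, so the weighted sum, the weights, the value ranges, and the names `TaskInfo`, `priority`, and `ResourceQuota` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    minutes_to_deadline: float  # time left before the report is due
    criticality: int            # 0 (low) to 3 (critical); hypothetical scale
    runs_per_day: int           # scheduling frequency
    job_type_weight: float      # 0.0 to 1.0; hypothetical per-type weight
    dependency_depth: int       # levels of downstream tasks waiting on this one


# Hypothetical weights; the real formula is not given in the source.
WEIGHTS = {"deadline": 0.4, "criticality": 0.3, "frequency": 0.1,
           "job_type": 0.1, "depth": 0.1}


def priority(task):
    """Score in [0, 1]; recomputed on each pass, so it rises as the deadline nears."""
    deadline = 1.0 / (1.0 + max(task.minutes_to_deadline, 0.0) / 60.0)
    criticality = min(max(task.criticality, 0), 3) / 3.0
    frequency = min(max(task.runs_per_day, 0), 1440) / 1440.0
    job_type = min(max(task.job_type_weight, 0.0), 1.0)
    depth = min(max(task.dependency_depth, 0), 10) / 10.0
    return (WEIGHTS["deadline"] * deadline
            + WEIGHTS["criticality"] * criticality
            + WEIGHTS["frequency"] * frequency
            + WEIGHTS["job_type"] * job_type
            + WEIGHTS["depth"] * depth)


class ResourceQuota:
    """Treats every downstream service as a named resource with a concurrency limit."""

    def __init__(self, limits):
        self.limits = dict(limits)
        self.in_use = {name: 0 for name in self.limits}

    def try_acquire(self, resource):
        # Unknown resources are refused rather than run without a limit.
        if self.in_use.get(resource, 0) >= self.limits.get(resource, 0):
            return False
        self.in_use[resource] += 1
        return True

    def release(self, resource):
        if self.in_use.get(resource, 0) > 0:
            self.in_use[resource] -= 1
```

Because the score is a pure function of the task's current state, it can be recalculated on every dispatch pass, which is what makes the priority dynamic rather than fixed at submission time.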
The resource‑fair scheduler selects the highest‑priority tasks that satisfy concurrency quotas through a five‑step process: shard loading, hash bucketing, intra‑bucket sorting, intra‑bucket concurrency check, and final cross‑bucket ordering.
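The five steps can be sketched as below. The source does not say what the bucketing key is, so this example assumes tasks are bucketed by the resource they consume, with a dict keyed by resource name standing in for hash bucketing. The function name, the tuple layout, and the `quota_left` mapping are hypothetical.

```python
from collections import defaultdict


def select_runnable(shard_tasks, quota_left):
    """Pick the tasks to dispatch from one shard.

    shard_tasks: list of (task_id, resource, priority) tuples, assumed to be
                 already loaded for this shard (step 1).
    quota_left:  remaining concurrency slots per resource.
    """
    # Step 2: hash bucketing (assumed key: the resource each task consumes).
    buckets = defaultdict(list)
    for task in shard_tasks:
        buckets[task[1]].append(task)

    admitted = []
    for resource, tasks in buckets.items():
        # Step 3: intra-bucket sort, highest priority first, task id as tie-breaker.
        tasks.sort(key=lambda t: (-t[2], t[0]))
        # Step 4: intra-bucket concurrency check against the remaining quota.
        slots = max(quota_left.get(resource, 0), 0)
        admitted.extend(tasks[:slots])

    # Step 5: final cross-bucket ordering of the admitted tasks.
    admitted.sort(key=lambda t: (-t[2], t[0]))
    return [task_id for task_id, _, _ in admitted]


tasks = [
    ("a", "hive", 5),
    ("b", "hive", 9),
    ("c", "hive", 7),
    ("d", "spark", 8),
    ("e", "export", 10),
]
print(select_runnable(tasks, {"hive": 2, "spark": 1}))  # prints ['b', 'd', 'c']
```

Task `e` has the highest priority but is held back because its resource has no free slots, which is how quota enforcement protects an overloaded service without stalling the other buckets.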
03 Operational Status
Instance generation is 30× faster, and minute-level scheduling is supported.
Resource‑fair scheduling ensures timely report delivery and eliminates crashes due to task overload.
Current scale: tens of millions of daily tasks, thousands of clusters, supporting 80+ task types across multiple business units.
04 Future Plans
Short‑term: continue performance tuning, enhance self‑diagnosis, and improve operational automation.
Long‑term: build a one‑stop data development platform.
Q&A Highlights
Q1: Scheduling uses both polling and trigger modes; implementation language is Java.
Q2: Hot tasks reside in Tbase, cold completed tasks are stored in HBase.
Q3: Automatic scaling is achieved via BaseMaster‑driven shard redistribution.
Q4: Intermediate outputs are stored in temporary Hive tables or HDFS directories and cleaned up after job completion.
Q5: Failure handling includes retry‑backoff, DB write fallback to disk, and horizontal migration of scheduling cores.
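The first two failure-handling measures in Q5 can be combined into one small routine. This is a sketch under assumptions: the function name, the exponential delay schedule, the JSON-lines fallback file, and the `db_write` callback are hypothetical, and the source does not describe how spooled records are replayed.

```python
import json
import time


def write_with_fallback(record, db_write, fallback_path,
                        max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Try the database with exponential backoff, then spool the record to disk.

    Returns "db" if a write succeeded, or "disk" if the record was spooled.
    """
    for attempt in range(max_attempts):
        try:
            db_write(record)
            return "db"
        except Exception:
            # Broad catch keeps the sketch short; real code would catch
            # only the database client's transient error types.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))

    # Every attempt failed: append the record to a local file for later replay.
    with open(fallback_path, "a", encoding="utf-8") as spool:
        spool.write(json.dumps(record) + "\n")
    return "disk"
```

Passing `sleep` as a parameter lets a caller or a test substitute a no-op, so the backoff schedule can be checked without waiting.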
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
