Tencent Unified Big Data Scheduling Platform – Architecture, Design, and Operations

This article gives an overview of Tencent's self‑developed Unified Scheduling Platform: its system architecture, design challenges, performance optimizations, resource‑fair scheduling mechanism, operational metrics, and roadmap, followed by Q&A highlights. Together these show how the platform supports massive offline data processing at scale.

DataFunTalk

Jan 3, 2023

01 System Overview

The Unified Scheduling Platform is Tencent's self‑developed distributed offline task scheduler that coordinates massive data collection, analysis, and export workflows, acting as a bridge between data‑application interfaces and downstream compute/storage resources.

Typical Scenario

Data is collected from user‑generated sources (text, relational, message data), ingested into an ODS layer, processed via Hive or Spark into DWD/DWS layers, and finally exported to downstream services; the platform orchestrates these steps as a DAG of dependent tasks.
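As an illustration of the DAG idea, the sketch below orders a small pipeline so that every task runs after the tasks it depends on. The task names are hypothetical, and Python's standard `graphlib` module stands in for the platform's own dependency resolution, which the talk does not describe in detail.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the ODS -> DWD -> DWS -> export flow.
# Each key maps to the set of tasks it depends on.
dependencies = {
    "ingest_ods": set(),
    "build_dwd": {"ingest_ods"},
    "build_dws": {"build_dwd"},
    "export_downstream": {"build_dws"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A real scheduler would also track per-instance state (waiting, running, failed) and release a task only once all upstream instances for the same data date have succeeded.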

02 System Design

1. First‑generation architecture challenges

The original design could not keep up with growing task volume, suffering from core module scaling limits, database bottlenecks, insufficient resource control, priority mis‑management, and low‑throughput instance generation.

2. Open‑source scheduling solutions

Existing open‑source tools such as Airflow struggle at million‑task scale and carry high maintenance costs; Tencent required a horizontally scalable solution tailored to its massive, cross‑region workloads.

3. Next‑generation architecture

The new design emphasizes scalability, high availability, high throughput, and flexibility. It introduces a BaseMaster to eliminate single points of failure, replaces MySQL with the distributed Tbase database, adopts HBase for cold‑data storage, and implements a resource‑fair scheduling algorithm to dynamically control concurrency.

Instance Generation

Performance bottlenecks were addressed by sharding tasks, applying hash‑based bucket sorting, and writing instances in mini‑batches. Together these changes achieved a 30× speedup and enabled minute‑level scheduling for millions of daily tasks.
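The combination of sharding, hash bucketing, and mini‑batch writes can be sketched roughly as follows. This is one possible reading of the approach, not the platform's actual code: the function names, the CRC32 hash, and the `write_batch` callback are all assumptions made for illustration.

```python
import zlib
from itertools import islice


def shard_of(task_id: str, num_shards: int) -> int:
    """Stable hash so the same task always lands in the same shard."""
    return zlib.crc32(task_id.encode("utf-8")) % num_shards


def mini_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch


def generate_instances(task_ids, run_date, num_shards, batch_size, write_batch):
    """Group tasks by shard, then write instance rows in small batches."""
    shards = {}
    for task_id in task_ids:
        shards.setdefault(shard_of(task_id, num_shards), []).append(task_id)
    writes = 0
    for shard_id in sorted(shards):
        # Sort inside each bucket so rows are written in a predictable order.
        rows = [(task_id, run_date) for task_id in sorted(shards[shard_id])]
        for batch in mini_batches(rows, batch_size):
            write_batch(shard_id, batch)  # hypothetical stand-in for a batched DB insert
            writes += 1
    return writes


batches = []
writes = generate_instances(
    [f"task_{i}" for i in range(10)],
    "2023-01-03",
    num_shards=2,
    batch_size=4,
    write_batch=lambda shard_id, batch: batches.append((shard_id, batch)),
)
```

Batching trades a little latency per row for far fewer database round trips, which is typically where the bulk of such a speedup comes from.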

Task Dispatch

Two major issues, delayed reports and excessive pressure on downstream services, were mitigated by two mechanisms. Dynamic priority calculation ranks tasks by deadline proximity, business criticality, frequency, job type, and dependency depth. Dynamic resource control abstracts every service as a resource and manages it through quotas.
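A dynamic priority of this kind could be computed along the following lines. The talk names the five factors but not the formula, so the field names, the weights, and the direction in which each factor pushes the score are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    minutes_to_deadline: float  # time left before the output is due
    criticality: int            # hypothetical scale: 0 (low) to 3 (core business)
    runs_per_day: int           # scheduling frequency
    job_type_weight: float      # hypothetical per-type weight
    dependency_depth: int       # assumed here: levels of downstream tasks waiting


def dynamic_priority(t: TaskInfo) -> float:
    """Combine the five factors into one score; higher means run sooner."""
    # Urgency grows as the deadline approaches; clamp so overdue tasks max out.
    urgency = 1.0 / max(t.minutes_to_deadline, 1.0)
    return (
        1000.0 * urgency
        + 10.0 * t.criticality
        + 0.1 * t.runs_per_day
        + 5.0 * t.job_type_weight
        + 2.0 * t.dependency_depth
    )


urgent = TaskInfo(5, 1, 24, 1.0, 2)
relaxed = TaskInfo(600, 1, 24, 1.0, 2)
print(dynamic_priority(urgent) > dynamic_priority(relaxed))
```

Because the deadline term changes over time, the score has to be recomputed on each scheduling pass rather than fixed when the task is created.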

The resource‑fair scheduler selects the highest‑priority tasks that satisfy concurrency quotas through a five‑step process: shard loading, hash bucketing, intra‑bucket sorting, intra‑bucket concurrency check, and final cross‑bucket ordering.
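The five steps can be sketched as below. This is a simplified interpretation under stated assumptions: tasks are bucketed by the resource they consume, quotas are plain concurrency limits, and step 1 (shard loading) is represented by the `shard_tasks` argument already being in memory. None of these names come from the platform itself.

```python
from collections import defaultdict


def select_runnable(shard_tasks, quotas, running):
    """
    shard_tasks: list of (task_id, resource, priority) loaded for one shard (step 1)
    quotas:      resource -> maximum concurrent tasks
    running:     resource -> tasks currently running
    """
    # Step 2: hash-bucket tasks by the resource they consume.
    buckets = defaultdict(list)
    for task in shard_tasks:
        buckets[task[1]].append(task)

    admitted = []
    for resource, tasks in buckets.items():
        # Step 3: highest priority first inside each bucket.
        tasks.sort(key=lambda t: t[2], reverse=True)
        # Step 4: admit only as many tasks as the remaining quota allows.
        free_slots = max(quotas.get(resource, 0) - running.get(resource, 0), 0)
        admitted.extend(tasks[:free_slots])

    # Step 5: order the admitted tasks across all buckets.
    admitted.sort(key=lambda t: t[2], reverse=True)
    return [t[0] for t in admitted]


tasks = [
    ("a", "hive", 5),
    ("b", "hive", 9),
    ("c", "hive", 7),
    ("d", "spark", 8),
    ("e", "spark", 3),
]
picked = select_runnable(tasks, quotas={"hive": 3, "spark": 1}, running={"hive": 1})
print(picked)
```

Checking quotas per bucket before the global sort means a flood of high-priority tasks on one saturated resource cannot crowd out runnable tasks on other resources.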

03 Operational Status

Instance generation performance improved 30×; minute‑level scheduling supported.

Resource‑fair scheduling ensures timely report delivery and eliminates crashes due to task overload.

Current scale: tens of millions of daily tasks, thousands of clusters, supporting 80+ task types across multiple business units.

04 Future Plans

Short‑term: continue performance tuning, enhance self‑diagnosis, and improve operational automation.

Long‑term: build a one‑stop data development platform.

Q&A Highlights

Q1: Scheduling uses both polling and trigger modes; implementation language is Java.

Q2: Hot tasks reside in Tbase, cold completed tasks are stored in HBase.

Q3: Automatic scaling is achieved via BaseMaster‑driven shard redistribution.

Q4: Intermediate outputs are stored in temporary Hive tables or HDFS directories and cleaned up after job completion.

Q5: Failure handling includes retry‑backoff, DB write fallback to disk, and horizontal migration of scheduling cores.
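The retry and disk‑fallback behavior mentioned in Q5 might look roughly like this. It is a minimal sketch: the function name, the JSON‑lines spool file, and the assumption that a separate process later replays spooled records into the database are illustrative and not taken from the talk.

```python
import json
import os
import tempfile
import time


def write_with_fallback(record, db_write, spool_path,
                        retries=3, base_delay=0.01, sleep=time.sleep):
    """Try db_write with exponential backoff; spool to a local file if all attempts fail."""
    for attempt in range(retries):
        try:
            db_write(record)
            return "db"
        except Exception:
            if attempt < retries - 1:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(base_delay * (2 ** attempt))
    # Assumption: spooled records are replayed into the database later.
    with open(spool_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return "disk"


def always_down(_record):
    raise ConnectionError("database unavailable")


spool = os.path.join(tempfile.mkdtemp(), "spool.jsonl")
result = write_with_fallback(
    {"instance": "task_1"}, always_down, spool, sleep=lambda seconds: None
)
print(result)
```

Passing `sleep` as a parameter keeps the backoff testable without real delays.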
