Designing a High‑Availability, High‑Efficiency Distributed Scheduling Platform for Big Data

This article examines the principles, features, and implementation details of distributed scheduling for big‑data ETL pipelines, covering decentralised schedulers, host selection strategies, fault‑tolerance, operator abstraction, elasticity, trigger mechanisms, visual monitoring, alarm handling, data fan‑in/fan‑out, parameter consistency, real‑time quality checks, lineage tracking, and field‑level traceability.

Architecture Digest

Sep 2, 2017

Introduction

Big‑data distributed scheduling plays a pivotal role in the ETL process, linking data production, delivery, and consumption. The article outlines scheduling fundamentals, distributed scheduling characteristics, and practical guidance for building a highly available, efficient, and flexible big‑data scheduling platform.

1. Scheduling

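Since the 1950s, scheduling has been studied in mathematics, operations research, and engineering as the allocation of resources to tasks so as to optimise execution time or cost. Traditional OS-level triggers (e.g., Linux crontab) fire a job at each scheduled time on a single host, but they do not retry failed runs or recover runs missed while the host was down. Business-level designs that add retries to reach at-least-once execution must tolerate repeated executions, typically through idempotent tasks, in order to achieve eventual consistency.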

When upstream/downstream collaboration, multi‑host execution, or high‑availability requirements arise, a distributed scheduling architecture becomes necessary.

2. Distributed Scheduling

Distributed scheduling evolves from single‑node scheduling to collaborative multi‑node coordination, introducing features absent in single‑node setups.

2.1 Decentralised Scheduler & High Availability

Multiple scheduler nodes are deployed in a master-slave or active-passive configuration, so that a standby node can take over when the active node fails and the scheduler itself is no longer a single point of failure.
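One common way to realise active-passive takeover is a time-limited lease: the active scheduler keeps renewing it, and a standby can claim it only after it expires. The sketch below is a minimal in-memory illustration, not the platform's implementation. `LeaseStore`, `try_acquire`, and `active` are hypothetical names, and a real deployment would need the check-and-write to be atomic in a shared store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """In-memory stand-in for a shared store (assumption: a real one offers atomic updates)."""

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None

    def try_acquire(self, node: str, now: float, ttl: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already held by `node`."""
        lease = self._lease
        if lease is None or lease.expires_at <= now or lease.holder == node:
            self._lease = Lease(node, now + ttl)
            return True
        return False

    def active(self, now: float) -> Optional[str]:
        """Return the node that currently holds a valid lease, if any."""
        lease = self._lease
        return lease.holder if lease and lease.expires_at > now else None
```

Each scheduler node would call `try_acquire` periodically with an interval shorter than `ttl`; only the node for which it returns `True` dispatches tasks.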

2.2 Host Selection

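A task can run on all N available hosts or on a subset of M of them (N ≥ M ≥ 1). Hosts are chosen either passively, where the scheduler assigns them (random, round-robin, or resource-aware), or actively, where the hosts compete for the task, for example through a ZooKeeper-based lock.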
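The three passive strategies can be sketched as small selection functions that pick M of N hosts. This is an illustrative sketch only: the function and class names are hypothetical, and the resource-aware variant assumes each host reports a single numeric load figure.

```python
import random
from typing import Dict, List


def _check(n: int, m: int) -> None:
    # Enforce N >= M >= 1.
    if not 1 <= m <= n:
        raise ValueError(f"need 1 <= M <= N, got M={m}, N={n}")


def pick_random(hosts: List[str], m: int, rng: random.Random) -> List[str]:
    """Choose M distinct hosts uniformly at random."""
    _check(len(hosts), m)
    return rng.sample(hosts, m)


class RoundRobin:
    """Hand out M consecutive hosts per call, continuing where the last call stopped."""

    def __init__(self, hosts: List[str]) -> None:
        self._hosts = list(hosts)
        self._next = 0

    def pick(self, m: int) -> List[str]:
        n = len(self._hosts)
        _check(n, m)
        chosen = [self._hosts[(self._next + i) % n] for i in range(m)]
        self._next = (self._next + m) % n
        return chosen


def pick_least_loaded(load: Dict[str, float], m: int) -> List[str]:
    """Resource-aware: choose the M hosts with the lowest reported load (ties by name)."""
    _check(len(load), m)
    return sorted(load, key=lambda h: (load[h], h))[:m]
```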

2.3 Fault Transfer

When a task fails on its original host, the scheduler retries or migrates the task to a backup host to maximise successful execution.
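A minimal sketch of this retry-then-migrate behaviour follows. It assumes a task is a callable that receives a host name and returns `True` on success. `run_with_failover` and `retries_per_host` are hypothetical names, and a production scheduler would also add back-off and logging.

```python
from typing import Callable, List, Optional


def run_with_failover(
    task: Callable[[str], bool],
    hosts: List[str],
    retries_per_host: int = 2,
) -> Optional[str]:
    """Try the task on each host in order, retrying locally before migrating.

    Returns the host on which the task succeeded, or None if every attempt failed.
    """
    for host in hosts:  # the first entry is the original host, the rest are backups
        for _ in range(retries_per_host):
            try:
                if task(host):
                    return host
            except Exception:
                # An exception counts as a failed attempt; move on to the next try.
                pass
    return None
```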

2.4 Operator Abstraction

Operators expose a standard interface (REST, RPC, etc.) allowing scripts, web services, HDFS commands, or database operations to be invoked uniformly across heterogeneous environments.
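To make the idea of a uniform interface concrete, the sketch below defines a hypothetical `Operator` base class with two implementations: one wrapping an arbitrary Python callable and one running a SQL statement. SQLite is used only because it ships with Python; all class and function names here are illustrative assumptions, not the platform's API.

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class OperatorResult:
    ok: bool
    output: Any


class Operator(ABC):
    """Uniform interface: every operator takes parameters and returns a result."""

    @abstractmethod
    def run(self, params: Dict[str, Any]) -> OperatorResult:
        ...


class CallableOperator(Operator):
    """Wraps any Python function, e.g. a client call to a web service."""

    def __init__(self, fn: Callable[[Dict[str, Any]], Any]) -> None:
        self.fn = fn

    def run(self, params: Dict[str, Any]) -> OperatorResult:
        try:
            return OperatorResult(True, self.fn(params))
        except Exception as exc:
            return OperatorResult(False, str(exc))


class SqliteOperator(Operator):
    """Runs one SQL statement with named placeholders (:name) filled from params."""

    def __init__(self, conn: sqlite3.Connection, sql: str) -> None:
        self.conn = conn
        self.sql = sql

    def run(self, params: Dict[str, Any]) -> OperatorResult:
        try:
            rows = self.conn.execute(self.sql, params).fetchall()
            self.conn.commit()
            return OperatorResult(True, rows)
        except sqlite3.Error as exc:
            return OperatorResult(False, str(exc))


def run_all(operators: List[Operator], params: Dict[str, Any]) -> List[OperatorResult]:
    """Run operators in order and stop at the first failure."""
    results: List[OperatorResult] = []
    for op in operators:
        result = op.run(params)
        results.append(result)
        if not result.ok:
            break
    return results
```

Because the scheduler only ever calls `run`, new operator types can be added without changing the scheduling logic.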

2.5 Elastic Scaling

Hosts are added or removed via IP‑list or BNS (Baidu NameServer) to meet load and availability demands.

2.6 Trigger Mechanisms

Beyond simple Crontab intervals, distributed workflows support dependency‑driven triggers, fine‑grained time windows, and hierarchical activation.
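A dependency-driven trigger can be illustrated as follows: a task becomes runnable only once all of its upstream tasks have finished. This is a simplified sketch with hypothetical names (`ready_tasks`, `run_waves`); it ignores time windows and failures and simply groups tasks into successive waves.

```python
from typing import Dict, List, Set


def ready_tasks(deps: Dict[str, List[str]], done: Set[str]) -> List[str]:
    """Return unfinished tasks whose upstream dependencies have all finished."""
    return sorted(
        task
        for task, upstream in deps.items()
        if task not in done and all(u in done for u in upstream)
    )


def run_waves(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Simulate triggering: each wave holds the tasks released by the previous one."""
    done: Set[str] = set()
    waves: List[List[str]] = []
    while len(done) < len(deps):
        wave = ready_tasks(deps, done)
        if not wave:
            raise ValueError("dependency cycle or undefined upstream task")
        waves.append(wave)
        done.update(wave)
    return waves


deps = {
    "extract": [],
    "clean": ["extract"],
    "dim": ["extract"],
    "report": ["clean", "dim"],
}
print(run_waves(deps))  # [['extract'], ['clean', 'dim'], ['report']]
```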

2.7 Blocking Mechanisms

Three modes (regular, discard-later, and discard-earlier) control what happens when runs of the same task would overlap, and they also provide replay capabilities.
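The article does not define the three modes in detail, so the sketch below rests on an assumed reading: "regular" queues a new run behind the one in progress, "discard-later" drops the new run, and "discard-earlier" abandons the run in progress in favour of the new one. `BlockMode` and `JobSlot` are hypothetical names.

```python
from enum import Enum
from typing import List, Optional


class BlockMode(Enum):
    REGULAR = "regular"                  # assumed: queue the later run
    DISCARD_LATER = "discard-later"      # assumed: drop the later run
    DISCARD_EARLIER = "discard-earlier"  # assumed: drop the run in progress


class JobSlot:
    """Tracks the single allowed in-flight run of one task."""

    def __init__(self, mode: BlockMode) -> None:
        self.mode = mode
        self.running: Optional[str] = None
        self.queue: List[str] = []
        self.discarded: List[str] = []

    def trigger(self, run_id: str) -> None:
        if self.running is None:
            self.running = run_id
        elif self.mode is BlockMode.REGULAR:
            self.queue.append(run_id)
        elif self.mode is BlockMode.DISCARD_LATER:
            self.discarded.append(run_id)
        else:  # DISCARD_EARLIER
            self.discarded.append(self.running)
            self.running = run_id

    def finish(self) -> None:
        """Mark the current run as finished and start the next queued run, if any."""
        self.running = self.queue.pop(0) if self.queue else None
```

In every mode at most one run occupies the slot at a time, which is what prevents overlapping executions.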

2.8 Visual Progress Monitoring

Tree‑structured graphs display job, transformer, and operator status, including real‑time logs and progress bars.

2.9 Alarms

Email or SMS alerts fire when tasks deviate from expectations, with configurable thresholds and silencing for maintenance windows.
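As a rough illustration of configurable thresholds and silencing, the sketch below evaluates one alarm rule. The two thresholds (maximum runtime and consecutive failures) are assumptions chosen for the example, and `AlarmRule` is a hypothetical name; delivery by email or SMS is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AlarmRule:
    max_runtime_s: float
    max_failures: int
    # Maintenance windows as [start, end) pairs of epoch seconds.
    silence_windows: List[Tuple[float, float]] = field(default_factory=list)

    def silenced(self, now: float) -> bool:
        return any(start <= now < end for start, end in self.silence_windows)

    def evaluate(self, runtime_s: float, consecutive_failures: int, now: float) -> List[str]:
        """Return the reasons to alert; an empty list means no alert is sent."""
        if self.silenced(now):
            return []
        reasons: List[str] = []
        if runtime_s > self.max_runtime_s:
            reasons.append("runtime")
        if consecutive_failures >= self.max_failures:
            reasons.append("failures")
        return reasons
```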

3. Big‑Data Distributed Scheduling

Building on generic scheduling, big‑data platforms add data‑centric features.

3.1 Data Fan‑In/Fan‑Out

Open‑SQL abstracts cross‑engine data movement (e.g., HBase → MySQL/Elasticsearch) and defines primary keys, columns, partition fields, update ranges, multi‑step processes, operation types, and black‑box exposure.

3.2 Collaborative Parameter Consistency

Global and local parameters (timestamps, IPs, task IDs) are propagated through the workflow, with scope and mutability rules governing upstream‑downstream transmission.
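One possible set of scope and mutability rules is sketched below, purely as an assumption for illustration: global parameters are fixed when the workflow starts and cannot be overridden, while local parameters belong to a task and are visible only to the downstream tasks that name it as upstream. `WorkflowParams` and its methods are hypothetical.

```python
from types import MappingProxyType
from typing import Dict, Iterable


class WorkflowParams:
    def __init__(self, global_params: Dict[str, str]) -> None:
        # Read-only view: globals are immutable for the whole workflow run.
        self._global = MappingProxyType(dict(global_params))
        self._local: Dict[str, Dict[str, str]] = {}

    def set_local(self, task: str, key: str, value: str) -> None:
        """Set a task-scoped parameter; shadowing a global is rejected."""
        if key in self._global:
            raise KeyError(f"{key!r} is a global parameter and cannot be overridden")
        self._local.setdefault(task, {})[key] = value

    def resolve(self, task: str, upstream: Iterable[str] = ()) -> Dict[str, str]:
        """Merge upstream locals, then the task's own locals, then globals."""
        merged: Dict[str, str] = {}
        for up in upstream:
            merged.update(self._local.get(up, {}))
        merged.update(self._local.get(task, {}))
        merged.update(self._global)
        return merged
```

For example, a batch timestamp set globally is seen identically by every task, while a row count produced by an extract step reaches only the tasks that depend on it.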

3.3 Real‑Time Data Quality Check

Check operators validate data against historical baselines or external sources, supporting full‑sample and sampled comparisons, and trigger alarms on failure.
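Two of the checks mentioned here can be sketched in a few lines: comparing a metric against the mean of its history, and comparing a sample of keys against an external source. The 20% tolerance, the use of a simple mean, and the function names are assumptions made for the example.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence


def within_baseline(value: float, history: Sequence[float], tolerance: float = 0.2) -> bool:
    """True if value deviates from the historical mean by at most `tolerance` (relative).

    `history` must be non-empty.
    """
    baseline = mean(history)
    if baseline == 0:
        return value == 0
    return abs(value - baseline) / abs(baseline) <= tolerance


def sampled_mismatches(
    target: Dict[str, object],
    source: Dict[str, object],
    sample_size: int,
    rng: random.Random,
) -> List[str]:
    """Compare up to `sample_size` keys of the source with the target.

    A sample size at least as large as the source gives a full comparison.
    """
    keys = sorted(source)
    sample = keys if sample_size >= len(keys) else rng.sample(keys, sample_size)
    return sorted(k for k in sample if target.get(k) != source[k])
```

A check operator would raise an alarm when `within_baseline` returns `False` or when `sampled_mismatches` returns a non-empty list.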

3.4 Data Lineage

Open‑SQL enables tracing of upstream dependencies and downstream impacts at table‑ and field‑level, essential for debugging and impact analysis.
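Once field-to-field edges are known, upstream and downstream tracing reduces to graph traversal. The sketch below assumes the edges have already been extracted as `(source_field, target_field)` pairs named `table.field`; how Open-SQL yields them is not covered here, and `Lineage` and `tables` are hypothetical names.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Dict, Iterable, Set, Tuple


class Lineage:
    def __init__(self, edges: Iterable[Tuple[str, str]]) -> None:
        self._down: DefaultDict[str, Set[str]] = defaultdict(set)
        self._up: DefaultDict[str, Set[str]] = defaultdict(set)
        for src, dst in edges:
            self._down[src].add(dst)
            self._up[dst].add(src)

    @staticmethod
    def _walk(start: str, graph: Dict[str, Set[str]]) -> Set[str]:
        """Breadth-first traversal collecting every node reachable from start."""
        seen: Set[str] = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, field: str) -> Set[str]:
        """All fields this field is derived from, directly or indirectly."""
        return self._walk(field, self._up)

    def downstream(self, field: str) -> Set[str]:
        """All fields affected if this field changes."""
        return self._walk(field, self._down)


def tables(fields: Iterable[str]) -> Set[str]:
    """Collapse field-level results to table level."""
    return {f.rsplit(".", 1)[0] for f in fields}
```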

3.5 Transformer Evolution

Jobs, Transformers, and Operators represent coarse‑to‑fine granularity in data pipelines; Transformers correspond to tables, Operators to atomic actions.

3.6 Field‑Level Traceability

Combining lineage information with target engine updates enables precise back‑tracking of individual fields.

3.7 Signal Lamp

A messaging middleware (producer/consumer) publishes signals after data production and quality checks, decoupling downstream consumption and indicating data readiness.
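The publish-after-check pattern can be sketched with Python's in-process `queue.Queue` standing in for the messaging middleware. This is an assumption made for illustration: `ReadySignal`, `produce`, and `consume` are hypothetical names, and a real deployment would use a durable broker.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ReadySignal:
    table: str
    partition: str


def produce(
    table: str,
    partition: str,
    build: Callable[[], int],
    check: Callable[[int], bool],
    bus: queue.Queue,
) -> bool:
    """Build the data, run the quality check, and publish a signal only if it passes."""
    rows = build()
    if not check(rows):
        return False  # no signal: downstream keeps waiting instead of reading bad data
    bus.put(ReadySignal(table, partition))
    return True


def consume(bus: queue.Queue, timeout: float = 0.1) -> Optional[ReadySignal]:
    """Wait briefly for a readiness signal; None means the data is not ready yet."""
    try:
        return bus.get(timeout=timeout)
    except queue.Empty:
        return None
```

The consumer never needs to know which job produced the data; it reacts only to the signal, which is what decouples the two sides.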

Conclusion

Big‑data distributed scheduling tightly couples ETL definitions, data engines, and business scenarios. Baidu Takeaway’s V2.0 scheduler demonstrates iterative improvements addressing data explainability and fine‑grained update latency, with an eye toward future open‑source release.


