How Maat Transforms Distributed Workflow Scheduling for Alibaba Search

Maat is a DAG‑based distributed task scheduling platform built on Airflow. It centralizes workflow management and adds visual editing, template handling, Drogo‑based deployment, and monitoring to meet the complex, asynchronous needs of Alibaba's search middle‑platform.

Alibaba Cloud Developer

Background

In building Alibaba's search middle‑platform, a single system can no longer satisfy complex business requirements. Multiple subsystems must cooperate asynchronously to complete a given function, with steps such as configuration sync, monitoring, resource updates, smoke testing, and engine creation. These workflows also need branching, context passing, and retry mechanisms.
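
The branching, context passing, and retry requirements can be made concrete with a small sketch. This is a minimal pure-Python illustration, not Maat's or Airflow's implementation: the runner `run_workflow` and the node functions are hypothetical names invented here, loosely modelled on the steps listed above.

```python
from typing import Callable, Dict, List, Optional

# A node receives the shared context, may update it, and returns the name
# of the next node to run (or None to end the workflow).
Node = Callable[[dict], Optional[str]]


def run_workflow(nodes: Dict[str, Node], start: str, context: dict,
                 max_retries: int = 2) -> List[str]:
    """Run nodes in sequence, retrying a failing node up to max_retries times."""
    executed: List[str] = []
    current: Optional[str] = start
    while current is not None:
        node = nodes[current]
        next_name: Optional[str] = None
        for attempt in range(max_retries + 1):
            try:
                next_name = node(context)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: surface the failure
        executed.append(current)
        current = next_name
    return executed


# Hypothetical nodes for illustration only.
def sync_config(ctx: dict) -> Optional[str]:
    ctx["config_version"] = 42  # value handed to later nodes via the context
    return "smoke_test"


def smoke_test(ctx: dict) -> Optional[str]:
    ctx["attempts"] = ctx.get("attempts", 0) + 1
    if ctx["attempts"] < 2:
        raise RuntimeError("transient failure")  # first attempt fails
    ctx["smoke_ok"] = ctx["config_version"] == 42
    # Branch on the test result.
    return "create_engine" if ctx["smoke_ok"] else "rollback"


def create_engine(ctx: dict) -> Optional[str]:
    ctx["engine"] = "created"
    return None


def rollback(ctx: dict) -> Optional[str]:
    ctx["engine"] = "rolled back"
    return None


NODES: Dict[str, Node] = {
    "sync_config": sync_config,
    "smoke_test": smoke_test,
    "create_engine": create_engine,
    "rollback": rollback,
}
```

Calling `run_workflow(NODES, "sync_config", {})` runs the configuration sync, retries the smoke test once after its simulated failure, and then follows the success branch. A real scheduler would persist node state between attempts rather than keep it in memory.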

What is Maat?

Maat is a workflow scheduling system based on the open‑source project Airflow. Users assemble custom workflow nodes and trigger a workflow either manually or at specified times (in crontab format). All nodes run in a distributed fashion on Hippo and are scheduled by Drogo. Users can also create their own scheduler and executor nodes for resource isolation, and configure the execution environment and replica count.
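
To illustrate what a crontab-format trigger means, here is a simplified matcher written for this article. It is a sketch under stated assumptions, not the parser used by Maat or Airflow: the function names are hypothetical, only `*`, `*/n`, `a-b`, comma lists, and plain numbers are supported, and all five fields must match (standard cron treats day-of-month and day-of-week as alternatives when both are restricted).

```python
from datetime import datetime


def _field_matches(field: str, value: int, low: int) -> bool:
    """Check one crontab field against a value; low is the field's minimum."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Step values count from the field's minimum (0 or 1).
            if (value - low) % int(part[2:]) == 0:
                return True
        elif "-" in part:
            start, end = (int(x) for x in part.split("-"))
            if start <= value <= end:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if the five-field expression fires at the given minute."""
    minute, hour, dom, month, dow = expr.split()
    # crontab uses 0 = Sunday, while datetime.weekday() uses 0 = Monday.
    cron_dow = (when.weekday() + 1) % 7
    return (
        _field_matches(minute, when.minute, 0)
        and _field_matches(hour, when.hour, 0)
        and _field_matches(dom, when.day, 1)
        and _field_matches(month, when.month, 1)
        and _field_matches(dow, cron_dow, 0)
    )
```

For example, `cron_matches("30 2 * * *", datetime(2018, 8, 13, 2, 30))` is true, so a workflow with that schedule would fire at 02:30 every day.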

Why Build Maat?

Business code and scheduling code are tightly coupled, making workflow changes invasive.

Lack of a unified management system for scheduling tasks.

Complex multi‑branch workflows and context passing are poorly supported.

No user‑friendly visual UI.

Technical Selection

Common scheduling solutions include the internal products D2 and Workflow and the open‑source tools Airflow and Quartz.

D2

The DAG is recomputed daily, so new or modified tasks only take effect the next day.

No support for workflow context passing.

Limited integration with the search ecosystem (Hippo, Drogo, Kmon).

Workflow

Supports manual triggering and HSF calls, but its configuration is complex and its support for external calls is limited.

Quartz

Java‑based scheduler supporting distributed execution and persistence, but lacks workflow capabilities and requires code coupling.

Airflow

Decouples business code from scheduling via Python DAG scripts.

Supports crontab scheduling.

Handles complex branching and conditional triggers.

Offers a complete UI for task status and history.

Has lightweight dependencies (a database and RabbitMQ).

After evaluation, Airflow was chosen as the prototype for a distributed task scheduling system due to its comprehensive features and ease of integration with the search ecosystem.

Issues with Native Airflow

Cannot be directly deployed on Drogo due to local state dependencies.

Lacks proper monitoring, requiring integration with Kmon.

No user‑friendly editing interface.

Performance degrades with large task volumes.

Has existing bugs in distributed mode.

Maat Architecture

Business Layer

Any workflow or timed‑trigger requirement can be implemented through Maat, which provides a visual editor and rich APIs for building templates with complex branching logic. Current applications include Tisplus, Hawkeye, Kmon, capacity platform, offline component platform, and Opensearch.

Control Layer

Maat adds a control system (Maat Console) on top of native Airflow to lower operational and learning costs. It includes template management, application management, and queue management to isolate resources and control concurrency.
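
The idea of using queues to isolate resources and cap concurrency can be sketched as follows. This is an illustrative assumption about how such a limit might work, not Maat Console's actual code; the class `QueueManager` and the queue names are hypothetical.

```python
from typing import Dict


class QueueManager:
    """Tracks running tasks per queue and refuses work beyond each queue's limit."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self.limits = dict(limits)
        self.running = {name: 0 for name in limits}

    def try_start(self, queue: str) -> bool:
        """Reserve a slot in the queue; return False if the queue is full."""
        if self.running[queue] >= self.limits[queue]:
            return False
        self.running[queue] += 1
        return True

    def finish(self, queue: str) -> None:
        """Release a slot once a task in the queue has completed."""
        if self.running[queue] == 0:
            raise ValueError(f"no running task in queue {queue!r}")
        self.running[queue] -= 1
```

With `QueueManager({"tisplus": 2, "hawkeye": 1})`, a third concurrent task on the `tisplus` queue is rejected while the `hawkeye` queue remains available, so one application's backlog does not consume another's capacity.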

Core Modules

Web API Service: Exposes APIs for task CRUD, status queries, triggering, and retries; also provides the native Airflow web UI.

Scheduler: Determines when tasks run and which nodes are executable, then dispatches tasks via MQ or FaaS. Scheduler load is reduced by running a separate scheduler per business.

Worker: Executes task nodes; multiple replicas provide scalability and can be configured per queue.

Distributor: Sends tasks from the scheduler to workers using Celery + RabbitMQ or the search ecosystem's FaaS framework.

Celery + RabbitMQ

Provides distributed messaging and persistence but conflicts with Drogo’s stateless deployment model, prompting a shift to FaaS.

FaaS

Function‑as‑a‑Service framework from the search ecosystem; tasks are abstracted as functions, enabling lightweight execution, dynamic resource allocation, and automatic scaling.
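
The scheduler's job of deciding which nodes are executable and handing them to a distributor can be sketched in a few lines. This is a simplified illustration under assumptions, not the Airflow or Maat scheduler: the DAG, the state names, the queue mapping, and the functions `ready_nodes` and `dispatch` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical DAG: each node maps to the set of upstream nodes it depends on.
DAG: Dict[str, Set[str]] = {
    "sync_config": set(),
    "update_resources": {"sync_config"},
    "smoke_test": {"sync_config"},
    "create_engine": {"update_resources", "smoke_test"},
}

# Hypothetical queue assignment; unlisted nodes go to the "default" queue.
QUEUE_OF: Dict[str, str] = {"smoke_test": "testing"}


def ready_nodes(dag: Dict[str, Set[str]], state: Dict[str, str]) -> List[str]:
    """Nodes still pending whose upstream nodes have all succeeded."""
    return sorted(
        name for name, upstream in dag.items()
        if state.get(name, "pending") == "pending"
        and all(state.get(u) == "success" for u in upstream)
    )


def dispatch(dag: Dict[str, Set[str]], state: Dict[str, str],
             queue_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Group ready nodes by queue and mark them queued so they are sent once."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for name in ready_nodes(dag, state):
        batches[queue_of.get(name, "default")].append(name)
        state[name] = "queued"
    return dict(batches)
```

Each call to `dispatch` corresponds to one scheduling pass: the returned batches are what a distributor (a message queue or a function invocation) would deliver to the workers listening on each queue, and workers would later set a node's state to `success` or a failure state.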

Base Components

DB: Stores task metadata, history, and node states.

OSS: Persists logs remotely to survive machine migrations.

Kmon: Monitors cluster health and sends alerts.

Drogo: Handles containerized deployment of all Maat nodes.

Platform Advantages

Visual editing and generic node types (Bash, Http, resource‑aware Bash, branch nodes).

Drogo‑based deployment simplifies adding new nodes, scaling, and handling machine migrations.

Cluster management separates workloads by queue, allowing dedicated schedulers and workers per task type.

Comprehensive monitoring and alerting via Kmon and task‑level alerts.

Current Status

Maat serves many internal and cloud scenarios, handling over 3,000 daily scheduled tasks and more than 24,000 daily task executions (as of 2018‑08‑13). The platform continues to scale with new applications.

Future Outlook

Deep integration with Aflow for one‑stop cluster creation, configuration, and deployment.

Richer alerting options and stronger error feedback.

Further optimization of scheduling bottlenecks as task volume grows.

Deeper collaboration with FaaS to create dedicated services for various tasks, improving resource utilization.
