How Maat Transforms Distributed Workflow Scheduling for Alibaba Search
Maat is a DAG‑based distributed task scheduling platform built on Airflow. It centralizes workflow management and adds visual editing, template management, Drogo‑based deployment, and monitoring to meet the complex, asynchronous needs of Alibaba's search middle‑platform.
Background
In building Alibaba's search middle‑platform, a single system can no longer satisfy complex business requirements. Multiple subsystems must cooperate asynchronously to complete tasks such as configuration sync, monitoring, resource updates, smoke testing, and engine creation, and these flows need branching, context passing, and retry mechanisms.
What is Maat?
Maat is a workflow scheduling system based on the open‑source project Airflow. Users assemble custom workflow nodes and trigger a workflow either manually or on a schedule given in crontab format. All nodes run in a distributed fashion on Hippo and are scheduled by Drogo. Users can also create their own scheduler and executor nodes for resource isolation, and configure the execution environment and replica count for each.
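To make the idea of assembling nodes into a workflow concrete, here is a minimal sketch in plain Python (3.9 or later, for `graphlib`). It is not Maat or Airflow code: the function `run_workflow` and the three node names are hypothetical, chosen only to show dependency ordering and context passing between nodes.

```python
from graphlib import TopologicalSorter


def run_workflow(nodes, deps, context=None):
    """Run each node once, after all of its upstream nodes.

    nodes: dict mapping node name -> callable(context) returning a dict
    deps:  dict mapping node name -> set of upstream node names
    """
    context = dict(context or {})
    order = []
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        result = nodes[name](context) or {}
        context.update(result)  # pass this node's outputs to downstream nodes
        order.append(name)
    return order, context


# Hypothetical three-step workflow: sync config, update resources, smoke test.
nodes = {
    "sync_config": lambda ctx: {"config_version": 7},
    "update_resources": lambda ctx: {
        "resources": f"built-for-v{ctx['config_version']}"
    },
    "smoke_test": lambda ctx: {"smoke_ok": ctx["resources"].endswith("v7")},
}
deps = {
    "sync_config": set(),
    "update_resources": {"sync_config"},
    "smoke_test": {"update_resources"},
}

order, final_context = run_workflow(nodes, deps)
print(order)
print(final_context)
```

Each node reads what it needs from the shared context and returns only its own outputs, so the business logic inside a node stays separate from the ordering logic in `run_workflow`.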
Why Build Maat?
Business code and scheduling code are tightly coupled, making workflow changes invasive.
Lack of a unified management system for scheduling tasks.
Complex multi‑branch workflows and context passing are poorly supported.
No user‑friendly visual UI.
Technical Selection
The candidate scheduling solutions were the internal products D2 and Workflow and the open‑source tools Airflow and Quartz.
D2
The DAG is recomputed daily, so new or modified tasks take effect only the next day.
No support for workflow context passing.
Limited integration with the search ecosystem (Hippo, Drogo, Kmon).
Workflow
Provides manual triggering and HSF calls but has complex configuration and limited external call support.
Quartz
Java‑based scheduler supporting distributed execution and persistence, but lacks workflow capabilities and requires code coupling.
Airflow
Decouples business code from scheduling via Python DAG scripts.
Supports crontab scheduling.
Handles complex branching and conditional triggers.
Offers a complete UI for task status and history.
Lightweight dependencies (DB, RabbitMQ).
After evaluation, Airflow was chosen as the prototype for a distributed task scheduling system due to its comprehensive features and ease of integration with the search ecosystem.
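The branching behaviour listed above can be sketched without Airflow itself. In the plain-Python example below, a branch node returns the name of the downstream node to follow and the other candidates are marked as skipped, which is similar in spirit to how Airflow's BranchPythonOperator selects a task ID. The function `run_with_branch` and the node names are hypothetical.

```python
def run_with_branch(nodes, downstream, start, context):
    """Walk a workflow from `start`, following only the branch each node selects.

    nodes:      dict name -> callable(context) returning the next node name or None
    downstream: dict name -> list of allowed next node names
    """
    executed, skipped = [], []
    current = start
    while current is not None:
        chosen = nodes[current](context)
        executed.append(current)
        options = downstream.get(current, [])
        if chosen is not None and chosen not in options:
            raise ValueError(f"{current} chose unknown branch {chosen!r}")
        # Every candidate that was not chosen is skipped, not failed.
        skipped.extend(name for name in options if name != chosen)
        current = chosen
    return executed, skipped


# Hypothetical flow: create the engine only if the smoke test passed.
nodes = {
    "smoke_test": lambda ctx: "create_engine" if ctx["smoke_ok"] else "rollback",
    "create_engine": lambda ctx: None,
    "rollback": lambda ctx: None,
}
downstream = {"smoke_test": ["create_engine", "rollback"]}

executed, skipped = run_with_branch(
    nodes, downstream, "smoke_test", {"smoke_ok": False}
)
print(executed, skipped)
```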
Issues with Native Airflow
Cannot be directly deployed on Drogo due to local state dependencies.
Lacks proper monitoring, requiring integration with Kmon.
No user‑friendly editing interface.
Performance degrades with large task volumes.
Existing bugs in distributed mode.
Maat Architecture
Business Layer
All workflow and timed‑trigger requirements can be created via Maat, which provides a visual editor and rich APIs for building templates with complex branching logic. Current applications include Tisplus, Hawkeye, Kmon, capacity platform, offline component platform, and Opensearch.
Control Layer
Maat adds a control system (Maat Console) on top of native Airflow to lower operational and learning costs. It includes template management, application management, and queue management to isolate resources and control concurrency.
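One way to picture queue management is as a per-queue cap on the number of tasks running at once. The sketch below is an assumption about how such a cap could work, not Maat Console's actual implementation; the class `QueueManager`, its methods, and the use of application names as queue names are all hypothetical.

```python
from collections import deque


class QueueManager:
    """Tracks per-queue concurrency limits (hypothetical, for illustration)."""

    def __init__(self, limits):
        self.limits = dict(limits)                   # queue name -> max running tasks
        self.running = {q: set() for q in limits}    # tasks currently running
        self.waiting = {q: deque() for q in limits}  # tasks waiting for a slot

    def submit(self, queue, task_id):
        """Start the task if the queue has a free slot, otherwise park it."""
        if len(self.running[queue]) < self.limits[queue]:
            self.running[queue].add(task_id)
            return "running"
        self.waiting[queue].append(task_id)
        return "waiting"

    def finish(self, queue, task_id):
        """Release a slot and promote the oldest waiting task, if any."""
        self.running[queue].remove(task_id)
        if self.waiting[queue]:
            promoted = self.waiting[queue].popleft()
            self.running[queue].add(promoted)
            return promoted
        return None


manager = QueueManager({"tisplus": 2, "hawkeye": 1})
states = [manager.submit("tisplus", task) for task in ("t1", "t2", "t3")]
hawkeye_state = manager.submit("hawkeye", "h1")  # unaffected by the full tisplus queue
promoted = manager.finish("tisplus", "t1")       # frees a slot for t3
print(states, hawkeye_state, promoted)
```

Because each queue keeps its own counters, a burst of tasks in one application's queue cannot consume the slots reserved for another.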
Core Modules
Web API Service: Exposes APIs for task CRUD, status queries, triggering, and retries; also serves the native Airflow web UI.
Scheduler: Determines when tasks run and which nodes are executable, then dispatches tasks via MQ or FaaS. Scheduler load is reduced by running a separate scheduler per business.
Worker: Executes task nodes; multiple replicas provide scalability, and replica counts can be configured per queue.
Distributor: Sends tasks from the scheduler to workers using either Celery + RabbitMQ or the search ecosystem's FaaS framework.
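The scheduler's job of deciding which nodes are executable can be sketched as a pure function over the dependency graph and the current node states. This is a simplified illustration, not Airflow's scheduler logic: the function `executable_nodes`, the state strings, and the node names are assumptions.

```python
def executable_nodes(upstream, states):
    """Return nodes that are still pending and whose upstream nodes all succeeded.

    upstream: dict node name -> list of upstream node names
    states:   dict node name -> "pending" | "running" | "success" | "failed"
              (nodes missing from the dict are treated as pending)
    """
    ready = []
    for node, parents in upstream.items():
        if states.get(node, "pending") != "pending":
            continue  # already running or finished
        if all(states.get(parent) == "success" for parent in parents):
            ready.append(node)
    return sorted(ready)


upstream = {
    "sync_config": [],
    "update_resources": ["sync_config"],
    "monitor": ["sync_config"],
    "smoke_test": ["update_resources", "monitor"],
}

# sync_config is done and update_resources is in flight: only monitor can start.
first_pass = executable_nodes(
    upstream, {"sync_config": "success", "update_resources": "running"}
)

# Once both parents succeed, smoke_test becomes executable.
second_pass = executable_nodes(
    upstream,
    {"sync_config": "success", "update_resources": "success", "monitor": "success"},
)
print(first_pass, second_pass)
```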
Celery + RabbitMQ
Provides distributed messaging and persistence but conflicts with Drogo’s stateless deployment model, prompting a shift to FaaS.
FaaS
Function‑as‑a‑Service framework from the search ecosystem; tasks are abstracted as functions, enabling lightweight execution, dynamic resource allocation, and automatic scaling.
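The idea of abstracting tasks as functions can be illustrated with a small in-process registry. The sketch below does not use the search ecosystem's FaaS framework, whose API is not described here; the registry `FUNCTIONS`, the decorator `function`, the helper `invoke`, and the retry policy are all hypothetical.

```python
FUNCTIONS = {}  # function name -> callable; a stand-in for a function registry


def function(name):
    """Register a task implementation under a name a scheduler can dispatch to."""
    def decorator(fn):
        FUNCTIONS[name] = fn
        return fn
    return decorator


def invoke(name, payload, max_attempts=3):
    """Call a registered function, retrying on failure up to max_attempts times."""
    fn = FUNCTIONS[name]  # an unknown name raises KeyError and is not retried
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(payload), attempt
        except Exception as exc:  # sketch only: real code would catch narrower errors
            last_error = exc
    raise RuntimeError(f"{name} failed after {max_attempts} attempts") from last_error


calls = {"count": 0}


@function("http_check")
def http_check(payload):
    """Hypothetical task that fails once before succeeding."""
    calls["count"] += 1
    if calls["count"] < 2:
        raise ConnectionError("transient failure")
    return {"url": payload["url"], "status": "ok"}


result, attempts = invoke("http_check", {"url": "http://example.invalid/health"})
print(result, attempts)
```

Because the caller only needs a function name and a payload, the process that decides what to run holds no local state about where or how the function executes.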
Base Components
DB: Stores task metadata, history, and node states.
OSS: Persists logs remotely to survive machine migrations.
Kmon: Monitors cluster health and sends alerts.
Drogo: Handles containerized deployment of all Maat nodes.
Platform Advantages
Visual editing and generic node types (Bash, Http, resource‑aware Bash, branch nodes).
Drogo‑based deployment simplifies adding new nodes, scaling, and handling machine migrations.
Cluster management separates workloads by queue, allowing dedicated schedulers and workers per task type.
Comprehensive monitoring and alerting via Kmon and task‑level alerts.
Current Status
Maat serves many internal and cloud scenarios, handling over 3,000 daily scheduled tasks and more than 24,000 daily task executions (as of 2018‑08‑13). The platform continues to scale with new applications.
Future Outlook
Deep integration with Aflow for one‑stop cluster creation, configuration, and deployment.
Richer alerting options and stronger error feedback.
Further optimization of scheduling bottlenecks as task volume grows.
Deeper collaboration with FaaS to create dedicated services for various tasks, improving resource utilization.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
