
How Alibaba Cloud Log Service Scales Billion‑Task Scheduling: Design and Practice

This article explains how Alibaba Cloud Log Service implements a billion‑scale task scheduling framework for its observability platform, covering background, task types, design goals, architecture, key design points, and practical examples such as aggregation jobs and various scheduling scenarios.

ITPUB

Background

Alibaba Cloud Log Service is a cloud‑native observability and analysis platform that provides one‑stop data collection, processing, query, visualization, alerting, and delivery for development, operations, security, and other teams. This article introduces the design and practice of its large‑scale task scheduling framework.

Task Scheduling Basics

Scheduling is a common technique in operating systems, big‑data engines, and custom task frameworks. It typically assigns tasks (processes, pods, sub‑tasks) to resources (CPU, thread‑pool, nodes) and must handle failures, migrations, and load balancing.

Operating systems and container orchestration: the Linux kernel schedules processes using time slicing and priorities; Kubernetes places Pods on Nodes by filtering and scoring candidate Nodes.

Big‑data analysis: Hadoop MapReduce can share a cluster through a fair scheduler, Presto assigns query stages to workers, and Spark splits jobs into stages and tasks.

Task framework: ETL and cron‑like jobs require orchestration and state consistency.

Task Types

Two main categories are highlighted:

Timed tasks – executed at fixed intervals or on a cron schedule (e.g., hourly log monitoring, dashboard subscriptions).

Dependency tasks – require ordering, such as pipeline or DAG execution where later tasks depend on earlier ones.
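The two categories can be illustrated with a minimal Python sketch. It is not taken from Log Service; the function names and the three-step pipeline are hypothetical and only show what "fixed interval" and "dependency order" mean in practice.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def fixed_interval_times(start: datetime, interval: timedelta, count: int) -> list[datetime]:
    """Return the first `count` trigger times of a fixed-interval timed task."""
    return [start + i * interval for i in range(count)]


def dependency_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order for a DAG given as task -> prerequisites."""
    return list(TopologicalSorter(deps).static_order())


# Timed task: an hourly job starting at midnight.
hourly = fixed_interval_times(datetime(2024, 1, 1, 0, 0), timedelta(hours=1), 3)

# Dependency task (hypothetical pipeline): "alert" needs "transform", which needs "ingest".
order = dependency_order({"transform": {"ingest"}, "alert": {"transform"}})
```

A cron expression would replace the fixed interval with a calendar rule, but the scheduler's job is the same: compute the next trigger time and create an instance when it arrives.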

Observability Platform Data Flow

The platform can be abstracted into four generic stages:

Data ingestion – logs, Kafka, OSS, cloud monitoring, databases, etc.

Data processing – cleaning, transformation, rolling‑up, enrichment.

Data monitoring – metric collection, anomaly detection, alerting.

Data export – long‑term storage or downstream delivery.

These stages map to concrete tasks such as log ingestion, ETL, alert generation, dashboard subscription, and data export.

Design Goals for the Scheduling Framework

Support heterogeneous tasks (alerts, dashboard subscriptions, data processing, aggregation, etc.).

Handle massive task volumes (millions of executions per day) with linear horizontal scaling.

Provide high availability during upgrades, restarts, or failures.

Offer simple, efficient operations and visual dashboards.

Ensure strict multi‑tenant isolation.

Maintain extensibility for new task types via plugins.

Expose API/SDK/Terraform for programmatic task management.

Overall Architecture

The system consists of several layers:

Storage layer – relational database for task metadata; distributed file system for runtime state and snapshots.

Service layer – master‑worker framework. The master partitions tasks, assigns partitions to workers, and the workers schedule and execute tasks via a scheduler and job executor. State is persisted in a redo log and periodic snapshots.

Business layer – UI features such as alert monitoring, data processing, index rebuilding, data import, intelligent inspection, and log delivery.

Access layer – Nginx/CGI entry points with high availability and regional deployment.

API/SDK/Terraform/Console – user‑facing interfaces for task CRUD, custom UI per task type, and automation.

Task visualization – dashboards showing execution history, status, and built‑in alerts.

Key Design Points

Heterogeneous task model abstraction.

Scheduling service framework based on master‑worker and partitions.

Large‑scale task support through dynamic partition‑to‑worker mapping.

High‑availability design using redo logs and snapshot merging.

Stability engineering covering the release workflow, gray‑scale rollout, internal ops APIs, on‑call inspection, and comprehensive task monitoring.

Task Model Abstraction

Timed tasks – support cron expressions and periodic execution.

Resident tasks – long‑running jobs such as continuous data processing or index rebuilding.

Dry‑run / on‑demand tasks – created for preview or ad‑hoc execution and terminate after completion.
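One way to picture this abstraction is a single task description with a kind field that the scheduler branches on. The sketch below is an assumption for illustration only: `TaskKind`, `TaskSpec`, `after_run`, and the returned action names are hypothetical and not part of any Log Service API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskKind(Enum):
    TIMED = "timed"          # cron or fixed-interval execution
    RESIDENT = "resident"    # long-running, expected to stay up
    ON_DEMAND = "on_demand"  # dry-run or ad-hoc, ends after one run


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    kind: TaskKind
    interval_seconds: Optional[int] = None  # only meaningful for TIMED tasks

    def __post_init__(self) -> None:
        if self.kind is TaskKind.TIMED and not self.interval_seconds:
            raise ValueError("timed tasks need an interval")


def after_run(spec: TaskSpec) -> str:
    """Decide what a scheduler might do once one run of the task ends."""
    if spec.kind is TaskKind.TIMED:
        return "wait_for_next_trigger"
    if spec.kind is TaskKind.RESIDENT:
        return "restart"  # assumption: a resident job that exits is restarted
    return "terminate"


preview = TaskSpec("preview-1", TaskKind.ON_DEMAND)
hourly_alert = TaskSpec("alert-1", TaskKind.TIMED, interval_seconds=3600)
```

Keeping the differences in one small decision function lets the rest of the framework (partitioning, persistence, monitoring) treat every task type the same way.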

Scheduling Service Framework

The master creates partitions and distributes them to workers. Each worker pulls the tasks that belong to its partitions; on the worker, a scheduler decides when each timed, resident, or on‑demand job should run, and a job executor runs it. Execution state is written to a redo log, and snapshots are periodically merged to keep the log size manageable.
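The master and worker roles can be sketched in a few lines of Python. The article does not say how tasks are mapped to partitions or how partitions are distributed, so the stable hash and the round-robin assignment below are assumptions, and all function names are hypothetical.

```python
import hashlib


def partition_of(task_id: str, num_partitions: int) -> int:
    """Map a task to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def assign_partitions(num_partitions: int, workers: list[str]) -> dict[int, str]:
    """Master side: spread partitions over workers round-robin."""
    return {p: workers[p % len(workers)] for p in range(num_partitions)}


def tasks_for_worker(worker: str, task_ids: list[str], assignment: dict[int, str]) -> list[str]:
    """Worker side: keep only the tasks whose partition this worker owns."""
    n = len(assignment)
    return [t for t in task_ids if assignment[partition_of(t, n)] == worker]


workers = ["worker-a", "worker-b", "worker-c"]
assignment = assign_partitions(16, workers)
all_tasks = [f"task-{i}" for i in range(100)]
per_worker = {w: tasks_for_worker(w, all_tasks, assignment) for w in workers}
```

Because a task's partition depends only on its ID, the master never has to track individual tasks; it only manages the much smaller partition-to-worker table.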

Large‑Scale Support

Partitions act as an indirection layer between tasks and workers: when workers are added, partitions are rebalanced across the larger pool, so capacity grows roughly linearly without redesigning the system.
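A rebalancing step of this kind might look like the sketch below. It is an assumption about how such a mapping could be updated, not the Log Service algorithm: `rebalance` is a hypothetical helper that keeps a partition on its current worker whenever that worker is still present and under its quota, and moves only the rest.

```python
from collections import Counter


def rebalance(assignment: dict[int, str], workers: list[str]) -> dict[int, str]:
    """Return a balanced partition -> worker map, moving only partitions whose
    owner is gone or already holds its full share."""
    base, extra = divmod(len(assignment), len(workers))
    # The first `extra` workers may hold one partition more than the others.
    quota = {w: base + (1 if i < extra else 0) for i, w in enumerate(workers)}
    result: dict[int, str] = {}
    orphans: list[int] = []
    for p in sorted(assignment):
        owner = assignment[p]
        if quota.get(owner, 0) > 0:
            quota[owner] -= 1
            result[p] = owner      # partition stays where it is
        else:
            orphans.append(p)      # owner removed or over quota
    for w in workers:
        while quota[w] > 0:
            result[orphans.pop()] = w
            quota[w] -= 1
    return result


# Eight partitions on two workers, then the pool doubles to four workers.
before = {p: ("w1" if p < 4 else "w2") for p in range(8)}
after = rebalance(before, ["w1", "w2", "w3", "w4"])
moved = [p for p in before if before[p] != after[p]]
load = Counter(after.values())
```

In this example each worker ends up with two partitions and only four of the eight partitions change owner, which keeps the amount of state that must be reloaded small.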

High‑Availability Mechanism

Redo logs persist task state across worker restarts and migrations. Snapshots combine older redo entries, reducing load time when a worker takes over a partition.
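The interplay of redo log and snapshot can be shown with a small in-memory model. This is a sketch under stated assumptions: the `PartitionState` class, its JSON line format, and the status strings are hypothetical, and plain Python containers stand in for the distributed file system mentioned earlier.

```python
import json


class PartitionState:
    """Task state for one partition, backed by a snapshot plus a redo log."""

    def __init__(self) -> None:
        self.snapshot: dict[str, str] = {}  # compacted task_id -> status
        self.redo_log: list[str] = []       # JSON lines appended since the snapshot

    def record(self, task_id: str, status: str) -> None:
        """Append one state change to the redo log."""
        self.redo_log.append(json.dumps({"task": task_id, "status": status}))

    def recover(self) -> dict[str, str]:
        """Rebuild the latest state: start from the snapshot, replay the log."""
        state = dict(self.snapshot)
        for line in self.redo_log:
            entry = json.loads(line)
            state[entry["task"]] = entry["status"]
        return state

    def compact(self) -> None:
        """Fold the redo log into a new snapshot so later recovery replays less."""
        self.snapshot = self.recover()
        self.redo_log.clear()


state = PartitionState()
state.record("job-1", "running")
state.record("job-2", "running")
state.record("job-1", "succeeded")
before_compaction = state.recover()
state.compact()
state.record("job-2", "failed")
after_takeover = state.recover()  # what a new owner of the partition would load
```

Recovery cost is proportional to the snapshot size plus the number of log entries written since the last compaction, which is why periodic merging shortens partition takeover.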

Stability Measures

Web‑based release pipeline with template‑driven version tracking and rollback.

Gray‑scale rollout per cluster or task type for safe deployments.

Internal ops APIs for automatic job repair and minimal manual intervention.

On‑call inspection tools that detect long‑running or stalled tasks and generate alerts.

User‑side monitoring dashboards with configurable alerts; service‑side dashboards for cluster‑wide task health.

Example: Aggregation Task

An aggregation job periodically queries a time window of data and writes the results to another store, condensing the source data into denser, summarized information. The article shows execution status (rows processed, data volume, success/failure) and describes five scheduling scenarios:

Instance delay execution – catch‑up logic reduces lag over time.

Backfill from a historical point – creates supplemental instances until current progress is reached.

Fixed‑time‑window execution – stops generating instances after the end time.

Configuration change impact – the next instance follows the new schedule; it is recommended to change the SQL time window and the scheduling frequency together so that they stay consistent.

Retry mechanisms – automatic retries for permission, missing source/target, or syntax errors; manual retries with a configurable maximum.
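The first three scenarios share one idea: instances are generated window by window from a starting point until either the current time or a configured end time is reached. The sketch below illustrates that idea only; `due_windows` and its parameters are hypothetical and do not reflect the actual Log Service implementation.

```python
from datetime import datetime, timedelta
from typing import Optional


def due_windows(
    start: datetime,
    interval: timedelta,
    now: datetime,
    end: Optional[datetime] = None,
) -> list[tuple[datetime, datetime]]:
    """List the [from, to) windows that are complete at `now` and not past `end`."""
    windows = []
    cursor = start
    while True:
        window_end = cursor + interval
        if end is not None and window_end > end:
            break  # fixed time window: stop generating instances after the end time
        if window_end > now:
            break  # this window is not complete yet
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


now = datetime(2024, 1, 1, 3, 30)
hour = timedelta(hours=1)

# Backfill or catch-up: three complete hourly windows exist since midnight.
backfill = due_windows(datetime(2024, 1, 1, 0, 0), hour, now)

# Fixed time window: generation stops at the configured end time of 02:00.
bounded = due_windows(datetime(2024, 1, 1, 0, 0), hour, now, end=datetime(2024, 1, 1, 2, 0))
```

A delayed scheduler and a backfill from a historical point look the same to this function: both leave several complete windows behind the current time, and running them back to back is what reduces the lag.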

Command‑Line Example for Log Analysis

cat nginx_access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -10 | more

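The pipeline above counts requests per URL in a local nginx access log and prints the ten most frequent. On a distributed log platform, the same query can be expressed succinctly with SQL: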

select url, count(1) as cnt from log group by url order by cnt desc limit 10
