How Alibaba’s Custom Three‑Layer Distribution Boosts Scheduled Task Efficiency

This article walks through Alibaba's evolution from single‑machine scheduled jobs to a customized three‑layer distributed task framework, detailing classifications, Spring scheduling examples, batch processing integration, cluster distribution mechanics, and optimization techniques that maximize resource utilization and achieve smooth, balanced task execution.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
How Alibaba’s Custom Three‑Layer Distribution Boosts Scheduled Task Efficiency

The article introduces Alibaba's "Five Fortune" customized three‑layer distribution task processing framework, starting from single‑machine scheduling and progressively explaining its design and implementation.

Background Introduction

Scheduled tasks are commonly used for batch business processing such as coupon expiration reminders and large‑scale promotional reward distribution. The author participated in the 2023 "Rabbit Year" promotion, leading the development of a system that evenly split a 50‑million‑yuan prize pool using this framework.

Classification of Scheduled Tasks

Scheduled tasks are divided into single‑machine and cluster categories. Single‑machine tasks include simple timed scheduling and timed scheduling combined with batch processing, while cluster tasks encompass three‑layer distribution and the customized Five Fortune three‑layer distribution.

Single‑Machine Tasks

Single‑machine scheduled tasks run on one server and are suitable for low‑volume workloads without sharding. They can be simple timed scheduling or timed scheduling plus batch processing.

Timed Scheduling

In Spring, @Scheduled enables timed tasks using cron expressions or fixed‑rate configurations. Example:

// cron expression
@Scheduled(cron="0 0/30 9-17 * * ?") // every half hour during 9‑17
@Scheduled(cron="0 0 12 ? * WED") // every Wednesday at noon

// fixed‑rate configuration
@Scheduled(fixedRate=5000) // every 5 seconds after previous start
@Scheduled(fixedDelay=3000) // 3 seconds after previous finish
@Scheduled(initialDelay=1000, fixedDelay=2000) // start after 1 s, then every 2 s

Timed scheduling suits simple scenarios like report generation or notifications but struggles with complex, time‑consuming workloads.

Timed Scheduling + Batch Processing

Combining timed scheduling with a batch framework (e.g., Spring Batch) improves efficiency for heavy tasks. Spring Batch splits a job into multiple Steps, each containing an itemReader , itemProcessor , and itemWriter , enabling parallel processing and higher throughput.

Cluster Tasks

When data volume grows and sharding is required, single‑machine scheduling becomes insufficient, prompting the use of a distributed scheduler (Antscheduler) and a three‑layer distribution framework.

Three‑Layer Distribution

The process involves:

Antscheduler injects scheduled messages into the message center.

The message center delivers each message to one machine per Zone.

The receiving machine enters the first layer (Splitor) to obtain the Zone's eid shard range.

Splitor triggers the second layer (Loader) via TR oneway calls, potentially contacting up to 25 machines for a 00‑24 shard range.

Loader fetches data for its assigned eid shard (e.g., 100 records for eid 20).

Loader invokes the third layer (Executor) via TR oneway calls, possibly contacting up to 100 machines.

Executor performs the actual business logic such as state changes or message sending.

While effective, this design has drawbacks: difficulty matching schedule intervals with processing capacity, inability to smooth task execution rates, and under‑utilization of machines due to A/B group restrictions.

Five Fortune Customized Three‑Layer Distribution

To address the three‑layer shortcomings, the Five Fortune team customized the framework with two main goals: maximize cluster resource utilization and achieve smooth task processing.

Maximizing Cluster Resource Utilization

Both A and B groups in each Zone are enabled, and tasks are configured so A‑group machines handle odd‑indexed eids while B‑group machines handle even‑indexed eids. The steps are:

Enable A/B group scheduling in Antscheduler.

Add task configuration to assign odd/even eids to respective groups.

Splitor partitions the full eid range accordingly.

Result: all machines participate in processing, eliminating the previous 50% idle capacity.

/**
 * Filter eids based on dataFlag from configuration
 * 1. ALL – no filtering
 * 2. ODD – only odd table numbers
 * 3. EVEN – only even table numbers
 */
public void filteByDataFlag(List<String> eidList) {
    String dataFlag = SchedulerConfigDrmUtil.getIndexFilterFlag();
    int strategy = -1;
    if (StringUtil.equalsIgnoreCase("ODD", dataFlag)) {
        strategy = 1;
    } else if (StringUtil.equalsIgnoreCase("EVEN", dataFlag)) {
        strategy = 0;
    }
    if (strategy == -1) {
        return; // ALL
    }
    Iterator<String> it = eidList.iterator();
    while (it.hasNext()) {
        String str = it.next();
        int index = NumberUtils.toInt(str, -1);
        if (index % 2 != strategy) {
            it.remove();
        }
    }
}

Smooth Task Processing

Three steps are introduced:

Add task configuration specifying total cluster QPS, schedule interval, and participating machine count.

Calculate per‑machine QPS and the number of tasks each Loader should fetch.

Apply per‑machine rate limiting (Guava RateLimiter) during execution.

/**
 * Compute scheduling configuration for a task
 */
public static SchedulerConfig calculateScheduleConfig(SchedulerConfig scheduleConfig) {
    final int qpsLimit = scheduleConfig.getScheduleWholeLimit();
    final int machineCounts = scheduleConfig.getScheduleMachineCounts();
    final int scheduleRate = scheduleConfig.getScheduleRatePerSec();
    long totalLoaderCounts = qpsLimit * TimeUnit.SECONDS.toSeconds(scheduleRate);
    long loaderCountPerTask = totalLoaderCounts / 1000;
    if (loaderCountPerTask < 1) {
        loaderCountPerTask = 1;
    }
    scheduleConfig.setScheduleLoaderCount((int) loaderCountPerTask);
    final double singleQps = ((double) qpsLimit / machineCounts);
    scheduleConfig.setScheduleSingleLimit(RateLimiter.create(singleQps, 1, TimeUnit.SECONDS));
    return scheduleConfig;
}

The configuration yields two key values: the number of tasks each Loader fetches and the per‑machine QPS limit for Executors. In a real promotion, the cluster fetched 40 000 tasks every 5 seconds, while Executors maintained a stable QPS around 8 000, demonstrating smooth execution.

Conclusion

From single‑machine to cluster and finally to the Five Fortune customized framework, the article outlines the principles, advantages, and limitations of each approach. While the customized three‑layer design improves resource utilization and smoothness, it still relies on manually configured machine counts, which may affect stability in environments with frequent scaling.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsspringAntscheduler
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.