
High Availability for Elastic Job Lite: Active‑Standby and Dual‑Data‑Center Design

This article explains how to transform single‑node Elastic Job Lite deployments into highly available solutions, covering Zookeeper‑based sharding, active‑standby strategies for dual‑data‑center setups, custom sharding implementations, and priority scheduling to ensure tasks run reliably across both primary and backup sites.

Programmer DD

Aug 2, 2021


When using Elastic Job Lite for scheduled tasks, many teams deploy a single instance. That is a single point of failure and is too risky for jobs that must not be missed, such as interest updates in a financial system. Elastic Job Lite does support HA, but most public articles focus on the cloud version. This guide explains the core principles and extends them to a same-city dual-data-center architecture.

From Single Node to High Availability

Typical deployments look like the diagram below:

Developers often worry that a scheduled job might be triggered multiple times simultaneously, but this stems from a misunderstanding of the framework. The official documentation already states that one of the basic features is:

Job sharding consistency, ensuring that the same shard is executed by only one instance in a distributed environment.

Elastic Job relies on Zookeeper to coordinate the instances and assign each shard to exactly one of them, so a given shard never runs on two instances at once. A job that is not split has a total shard count of 1 (its only shard item is 0) and therefore runs on a single instance.

The architecture shown above is therefore valid: a job shard is executed by only one instance at a time, and if that instance fails, the remaining instances take over its shards, which is what makes the deployment highly available.
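To make the single-owner guarantee concrete, the sketch below mimics an even split of shards using plain Java collections. It does not use Elastic Job or Zookeeper: the class ShardAssignmentSketch, its assign method, and the instance names are hypothetical stand-ins. It only illustrates that every shard has exactly one owner and that re-running the allocation after an instance drops out hands its shards to the survivors.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not Elastic Job code: spreads shard items evenly over instances.
final class ShardAssignmentSketch {

    static Map<String, List<Integer>> assign(List<String> instances, int shardingTotalCount) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        if (instances.isEmpty()) {
            return result;
        }
        int base = shardingTotalCount / instances.size();
        int next = 0;
        // Every instance first receives a contiguous block of `base` shard items.
        for (String instance : instances) {
            List<Integer> items = new ArrayList<>();
            for (int i = 0; i < base; i++) {
                items.add(next++);
            }
            result.put(instance, items);
        }
        // Leftover shard items go one by one to the first instances.
        for (String instance : instances) {
            if (next >= shardingTotalCount) {
                break;
            }
            result.get(instance).add(next++);
        }
        return result;
    }
}
```

With instances a, b, c and four shards this yields {a=[0, 3], b=[1], c=[2]}; if a drops out, re-running the allocation over b and c yields {b=[0, 1], c=[2, 3]}, so no shard is left without an owner and none has two.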

Dual‑Data‑Center High Availability

As internet services grow, higher HA requirements emerge. A same‑city two‑data‑center deployment might look like this:

If A-site becomes completely unavailable, B-site can take over. Because the instances in both sites register with the same cluster, Elastic Job still ensures that a shard runs on only one instance across both sites.

Note: this article does not discuss how Zookeeper itself achieves cross‑site HA; a two‑site Zookeeper cluster alone cannot guarantee it.

Priority Scheduling?

In production, scheduled jobs often depend on a data source that has a primary‑secondary relationship (e.g., primary in A‑site, replica in B‑site). Read‑only jobs can simply connect to the local replica, but write‑heavy jobs suffer latency if they are scheduled to B‑site while writing to the primary in A‑site.

We therefore want to meet two goals:

Both data centers remain available at any time; if one fails, the other provides equivalent service.

A specific job can be forced to run in A‑site.

Elastic Job Sharding Strategy

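Elastic Job provides several built-in sharding strategies: average allocation, ordering instances by IP ascending or descending depending on whether the hash of the job name is odd or even, and rotating the instance list by the hash of the job name. It also allows custom strategies: implement the JobShardingStrategy interface and its sharding method.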

public Map<JobInstance, List<Integer>> sharding(List<JobInstance> jobInstances, String jobName, int shardingTotalCount)

By creating a custom strategy that knows which instances belong to A‑site and which to B‑site, we can prioritize A‑site during sharding, effectively achieving locality to the data source.

Implementing Same‑City Dual‑Data‑Center Priority Scheduling

Below is a simple example where instances whose IP is in a whitelist are considered active; all others are treated as standby.

1. Extend the decorator strategy to specify standby instances

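import com.dangdang.ddframe.job.lite.api.strategy.JobInstance;
import com.dangdang.ddframe.job.lite.api.strategy.JobShardingStrategy;
import com.dangdang.ddframe.job.lite.api.strategy.impl.AverageAllocationJobShardingStrategy;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// The imports assume Elastic Job Lite 2.x package names and SLF4J for logging; adjust them to your version.
public abstract class JobShardingStrategyActiveStandbyDecorator implements JobShardingStrategy {
    private static final Logger log = LoggerFactory.getLogger(JobShardingStrategyActiveStandbyDecorator.class);

    // The actual allocation is delegated to the built-in average-allocation strategy.
    private final JobShardingStrategy inner = new AverageAllocationJobShardingStrategy();

    /**
     * Determine whether an instance is standby. If both active and standby instances exist,
     * the standby instances are removed before sharding.
     */
    protected abstract boolean isStandby(JobInstance jobInstance, String jobName);

    @Override
    public Map<JobInstance, List<Integer>> sharding(List<JobInstance> jobInstances, String jobName, int shardingTotalCount) {
        List<JobInstance> candidates = new ArrayList<>(jobInstances);
        for (JobInstance jobInstance : jobInstances) {
            boolean standby = false;
            try {
                standby = isStandby(jobInstance, jobName);
            } catch (Exception e) {
                log.warn("isStandby threw an exception; treating the instance as not standby", e);
            }
            if (standby) {
                candidates.remove(jobInstance);
            }
        }
        // No active instance is alive: fall back to all instances so the job still runs.
        if (candidates.isEmpty()) {
            candidates = new ArrayList<>(jobInstances); // copy, so the caller's list is not reordered
        }
        // Sort by instance id so the result does not depend on the order of the input list.
        candidates.sort(Comparator.comparing(JobInstance::getJobInstanceId));
        return inner.sharding(candidates, jobName, shardingTotalCount);
    }
}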
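The filtering step is the part of the decorator that matters for failover, so it may help to see it in isolation. The sketch below is a hypothetical stand-in that needs no Elastic Job dependency: instances are plain IP strings, and ActiveStandbyFilterSketch and selectCandidates are names invented for illustration. It assumes, as in the later examples, that B-site addresses start with 10.11.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the filtering step of the decorator; instances are plain IP strings.
final class ActiveStandbyFilterSketch {

    static List<String> selectCandidates(List<String> instances, Predicate<String> isStandby) {
        List<String> candidates = new ArrayList<>();
        for (String instance : instances) {
            if (!isStandby.test(instance)) {
                candidates.add(instance);
            }
        }
        // No active instance is alive: fall back to every instance so the job still runs.
        if (candidates.isEmpty()) {
            candidates.addAll(instances);
        }
        // A stable order keeps the result independent of the input order.
        Collections.sort(candidates);
        return candidates;
    }
}
```

While any A-site instance is alive, only A-site instances are passed on to the inner strategy. Once A-site is gone entirely, the candidate list falls back to the B-site instances, which is the failover behavior the decorator is meant to provide.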

2. Define a concrete strategy that marks non‑whitelisted IPs as standby

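import com.dangdang.ddframe.job.lite.api.strategy.JobInstance; // assumes Elastic Job Lite 2.x package names

import java.util.Arrays;

public class ActiveStandbyESJobStrategy extends JobShardingStrategyActiveStandbyDecorator {
    @Override
    protected boolean isStandby(JobInstance jobInstance, String jobName) {
        // Hard-coded for illustration: only these IPs are active, every other instance is standby.
        String activeIps = "10.10.10.1,10.10.10.2";
        return !Arrays.asList(activeIps.split(",")).contains(jobInstance.getIp());
    }
}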

This simple class makes the scheduler prefer the specified IPs.

3. Register the custom strategy before job startup

JobCoreConfiguration core = JobCoreConfiguration.newBuilder(jobClass.getName(), cron, shardingTotalCount)
    .shardingItemParameters(shardingItemParameters).build();
SimpleJobConfiguration jobConfig = new SimpleJobConfiguration(core, jobClass.getCanonicalName());
return LiteJobConfiguration.newBuilder(jobConfig)
    .jobShardingStrategyClass("com.xxx.yyy.job.ActiveStandbyESJobStrategy")
    .build();

Same‑City Active‑Active Mode

After applying the above changes, two problems are solved:

Scheduled jobs achieve high availability across both data centers.

Jobs can be prioritized to run in a designated data center.

In this mode, B‑site acts as a backup because A‑site is always preferred. However, if A‑site fails, B‑site must be able to handle the load, which may require additional validation (e.g., database permissions).

To move from active-standby to true active-active, we can let a portion of the work run in B-site (e.g., 10%). The sharding interface provides a full view of all instances, the job name, and the total shard count, which is enough for such fine-grained control. Two approaches are possible:

Assign specific tasks to B‑site as priority (e.g., read‑only tasks).

During sharding, allocate the last few shards (e.g., 1/10) to B‑site.

Both approaches allow traffic to be distributed across A and B, achieving an active‑active configuration.
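The second approach, giving the last few shards to B-site, can be sketched as follows. This is a minimal illustration under stated assumptions, not Elastic Job code: SiteSplitSketch, split, and the percentForB parameter are hypothetical, instances are plain strings, and shards within a site are simply dealt round-robin.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the last percentForB percent of shard items go to B-site, the rest to A-site.
final class SiteSplitSketch {

    // Deals the shard items in [from, to) round-robin over the given (non-empty) instances.
    private static void deal(List<String> instances, int from, int to, Map<String, List<Integer>> out) {
        for (String instance : instances) {
            out.putIfAbsent(instance, new ArrayList<>());
        }
        for (int item = from; item < to; item++) {
            out.get(instances.get((item - from) % instances.size())).add(item);
        }
    }

    static Map<String, List<Integer>> split(List<String> siteA, List<String> siteB,
                                            int shardingTotalCount, int percentForB) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        // If one site has no live instance, the other site takes every shard.
        if (siteA.isEmpty() || siteB.isEmpty()) {
            List<String> alive = siteA.isEmpty() ? siteB : siteA;
            if (!alive.isEmpty()) {
                deal(alive, 0, shardingTotalCount, result);
            }
            return result;
        }
        // Integer arithmetic rounds down, so small shard counts may leave B-site with no shard.
        int shardsForB = shardingTotalCount * percentForB / 100;
        int firstShardForB = shardingTotalCount - shardsForB;
        deal(siteA, 0, firstShardForB, result);
        deal(siteB, firstShardForB, shardingTotalCount, result);
        return result;
    }
}
```

With A-site instances a1 and a2, B-site instance b1, ten shards, and a 10% share, the result is {a1=[0, 2, 4, 6, 8], a2=[1, 3, 5, 7], b1=[9]}. In a real strategy the same idea would live inside the sharding method, with the site of each instance derived from its IP as in the whitelist examples.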

Example: Task‑Specific Active IPs

public class ActiveStandbyESJobStrategy extends JobShardingStrategyActiveStandbyDecorator {
    @Override
    protected boolean isStandby(JobInstance jobInstance, String jobName) {
        String activeIps = "10.10.10.1,10.10.10.2"; // default active IPs
        if ("TASK_B_FIRST".equals(jobName)) {
            activeIps = "10.11.10.1,10.11.10.2"; // prioritize B‑site for this task
        }
        return !Arrays.asList(activeIps.split(",")).contains(jobInstance.getIp());
    }
}

With this customization, each scheduled job can be directed to the most appropriate data center while still providing failover capability.
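As the number of jobs with their own preferred site grows, a chain of if statements becomes hard to maintain. One possible alternative, sketched here with hypothetical names (ActiveSiteTableSketch, ACTIVE_BY_JOB, DEFAULT_ACTIVE) and plain strings instead of JobInstance, is a lookup table from job name to its set of active IPs with a default for all other jobs. The IPs and the job name are the ones used in the example above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: per-job active IPs kept in a table instead of an if chain.
final class ActiveSiteTableSketch {
    // Default: A-site IPs are active.
    private static final Set<String> DEFAULT_ACTIVE =
            new HashSet<>(Arrays.asList("10.10.10.1", "10.10.10.2"));

    // Jobs that should prefer another site are listed explicitly.
    private static final Map<String, Set<String>> ACTIVE_BY_JOB = new HashMap<>();
    static {
        ACTIVE_BY_JOB.put("TASK_B_FIRST", new HashSet<>(Arrays.asList("10.11.10.1", "10.11.10.2")));
    }

    static boolean isStandby(String ip, String jobName) {
        // An instance is standby when its IP is not in the active set that applies to this job.
        return !ACTIVE_BY_JOB.getOrDefault(jobName, DEFAULT_ACTIVE).contains(ip);
    }
}
```

Inside a real strategy, the body of isStandby would call this lookup with jobInstance.getIp(). The table could presumably also be loaded from configuration rather than hard-coded, but that is outside what the original example shows.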
