Ditch the Heavyweight XXL‑Job: An Elegant Nacos‑Based Scheduling Solution

This article analyses the friction between XXL‑Job and a Nacos‑centric stack, then proposes JobFlow, a design that removes the redundant registry, adds a full‑link TraceId, strengthens sharding with distributed locks, and brings intelligent retries and cloud‑native configuration. It shows how these changes simplify operations and improve observability in microservice environments.


Background

XXL‑Job is a widely used task‑scheduling framework in China. When a system adopts Nacos for service discovery and configuration (the Spring Cloud Alibaba stack), running XXL‑Job alongside it introduces duplication and state‑inconsistency problems.

Challenges

Dual registration centers

Each executor reports its status to both XXL‑Job’s own registry and Nacos. If an instance is taken offline in Nacos, XXL‑Job still believes it is online, causing unexpected task execution. Real‑world cases include memory‑leak debugging, network glitches, and gray‑release weight adjustments where the two registries diverge.

Lack of observability

Task execution failures require manually correlating logs from the admin console, the executor, and timestamps because there is no unified TraceId. The logging chain is split across three components, making it difficult to pinpoint the failure point.

Weak sharding guarantees

XXL‑Job’s sharding is advisory; without a distributed lock, two executors may process the same data range, causing duplicate handling. A production case showed an executor restart, XXL‑Job reallocating the shard, and the restarted instance continuing to process the old range.

Core idea – middleware as a business capability

In a cloud‑native architecture, middleware should be embedded as a business capability rather than a separate platform. JobFlow embodies this philosophy by treating the scheduler as a microservice that shares the same deployment, monitoring, configuration, and logging infrastructure as other business services.

JobFlow architecture

JobFlow consists of three core components:

Nacos: unified service discovery and configuration.

JobFlow Scheduler: a lightweight scheduler microservice.

MySQL: stores job definitions, execution records, and audit logs (no service‑registry data).

[Figure: JobFlow overall architecture diagram]

Feature 1 – Full‑link TraceId

The scheduler generates a UUID TraceId and propagates it via HTTP headers to the executor, which stores it in MDC. Every log line, from scheduler to executor, then carries the same TraceId, so a single query in ELK/Loki retrieves the whole execution chain.

// Scheduler side: generate a TraceId and pass it, plus shard info, via HTTP headers
String traceId = UUID.randomUUID().toString();
HttpHeaders headers = new HttpHeaders();
headers.set("X-Trace-Id", traceId);
headers.set("X-Shard-Index", "0");   // this executor's shard number
headers.set("X-Shard-Total", "10");  // total number of shards
restTemplate.postForEntity(url, new HttpEntity<>(params, headers), JobResult.class);

Executor side:

@PostMapping("/internal/job/{jobName}")
public JobResult execute(@RequestHeader("X-Trace-Id") String traceId, ...) {
    MDC.put("traceId", traceId);
    try {
        log.info("Start executing job");
        // business logic
        return JobResult.success();
    } finally {
        MDC.clear();
    }
}
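For the TraceId to actually show up in log output, the logging pattern must reference the MDC key. A minimal Logback pattern, assuming the MDC key `traceId` used above (the appender name and pattern details are illustrative), might look like:

```xml
<!-- logback-spring.xml: include the MDC traceId in every log line -->
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
        <!-- %X{traceId} pulls the value stored via MDC.put("traceId", ...) -->
        <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level [%X{traceId}] %logger{36} - %msg%n</pattern>
    </encoder>
</appender>
```

With the same pattern on scheduler and executor, searching ELK/Loki for one TraceId returns both sides of the call in a single query.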

Feature 2 – Strong sharding with distributed lock

The scheduler calculates a concrete data range for each shard and creates a Redis lock key. Executors acquire the lock before processing; if the lock is already held, the task is skipped.

int totalRecords = 1000000;
int shardTotal = 10;
int rangeSize = totalRecords / shardTotal;
for (int i = 0; i < shardTotal; i++) {
    long startId = (long) i * rangeSize;
    // Last shard absorbs the remainder when totalRecords is not evenly divisible
    long endId = (i == shardTotal - 1) ? totalRecords - 1 : startId + rangeSize - 1;
    String lockKey = String.format("lock:job:order-sync:range:%d-%d", startId, endId);
    // send request with traceId, startId, endId, lockKey
}

Executor handling:

@PostMapping("/internal/job/order-sync")
public JobResult sync(@RequestHeader("X-Start-Id") Long startId,
                      @RequestHeader("X-End-Id") Long endId,
                      @RequestHeader("X-Lock-Key") String lockKey) {
    // Lock TTL should exceed the worst-case processing time for one shard
    boolean locked = redisLock.tryLock(lockKey, 60, TimeUnit.SECONDS);
    if (!locked) {
        log.warn("Shard {}-{} already locked", startId, endId);
        return JobResult.skip("Handled by another instance");
    }
    try {
        List<Order> orders = orderDao.findByIdBetween(startId, endId);
        // business processing
        return JobResult.success();
    } finally {
        redisLock.unlock(lockKey);
    }
}
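The `redisLock` helper itself is not shown; its contract is what matters. Below is a hypothetical in‑memory stand‑in that mimics the semantics of Redis `SET key value NX PX ttl` (acquire only if absent, with an expiry). A production version would issue that Redis command and release the lock via an ownership‑checking Lua script:

```java
import java.util.concurrent.ConcurrentHashMap;

public class InMemoryJobLock {
    // key -> expiry timestamp in millis; stand-in for Redis SET key value NX PX ttl
    private final ConcurrentHashMap<String, Long> locks = new ConcurrentHashMap<>();

    // Acquire the lock only if it is absent or expired; TTL is in seconds,
    // mirroring redisLock.tryLock(lockKey, 60, TimeUnit.SECONDS) in the text
    public boolean tryLock(String key, long ttlSeconds) {
        long now = System.currentTimeMillis();
        boolean[] acquired = {false};
        locks.compute(key, (k, expiry) -> {
            if (expiry == null || expiry < now) {
                acquired[0] = true;
                return now + ttlSeconds * 1000;
            }
            return expiry; // still held by another owner
        });
        return acquired[0];
    }

    public void unlock(String key) {
        // A real Redis unlock must verify ownership first (typically a Lua script),
        // so an instance whose lock expired cannot release another instance's lock
        locks.remove(key);
    }
}
```

The key property is that acquire is a single atomic check‑and‑set, which is exactly what prevents two executors from processing the same shard range.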

Feature 3 – Intelligent retry with exponential backoff

Retry configuration supports max attempts, exponential backoff, and dead‑letter queue handling.

retry:
  max: 5
  backoff: EXPONENTIAL
  initialDelay: 1s
  maxDelay: 5m

Scheduler retry logic:

public void scheduleRetry(JobExecution execution) {
    int retryCount = execution.getRetryCount();
    if (retryCount >= maxRetry) {
        // Retries exhausted: hand off to the dead-letter queue for manual handling
        deadLetterQueue.send(execution);
        return;
    }
    // Exponential backoff: initialDelay * 2^retryCount, capped at maxDelay (both in seconds)
    long delay = Math.min(initialDelay * (1L << retryCount), maxDelay);
    scheduler.schedule(() -> retry(execution), delay, TimeUnit.SECONDS);
}
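The backoff arithmetic can be checked in isolation. A self‑contained sketch (delays in seconds, mirroring the `initialDelay: 1s` / `maxDelay: 5m` settings above; the class and method names are illustrative):

```java
public class BackoffDemo {
    // min(initialDelay * 2^retryCount, maxDelay), all values in seconds
    static long backoffDelay(int retryCount, long initialDelaySec, long maxDelaySec) {
        return Math.min(initialDelaySec * (1L << retryCount), maxDelaySec);
    }

    public static void main(String[] args) {
        // With initialDelay=1s and maxDelay=300s (5m): 1s, 2s, 4s, 8s, 16s, ...
        for (int i = 0; i < 5; i++) {
            System.out.println("retry " + i + " -> " + backoffDelay(i, 1, 300) + "s");
        }
    }
}
```

With `max: 5` from the YAML above, the longest wait before the dead‑letter queue is reached is only 16 seconds; the 5‑minute cap matters when `max` is raised.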

Feature 4 – Cloud‑native config via Nacos

All scheduler parameters (thread pool size, timeouts, retry limits, lock timeout, compensation settings) are stored in a Nacos Config file, allowing dynamic updates without restarting instances.

# jobflow-scheduler.yaml
jobflow:
  scheduler:
    thread-pool-size: 20
    timeout: 300
    max-retry: 3
  executor:
    connect-timeout: 5000
    read-timeout: 30000
  redis:
    lock-timeout: 60
  compensation:
    enabled: true
    interval: 60000
    stuck-threshold: 600000

Feature 5 – Minimalist database schema

Only job definitions and execution records (including TraceId) are persisted. Service‑registry data and scheduler configuration remain in Nacos, reducing DB load and simplifying queries.

CREATE TABLE job_definition (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  job_name VARCHAR(100) UNIQUE,
  service_name VARCHAR(100),
  handler VARCHAR(100),
  cron VARCHAR(100),
  enabled BOOLEAN DEFAULT TRUE,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

CREATE TABLE job_execution (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  job_name VARCHAR(100) NOT NULL,
  trace_id VARCHAR(64) NOT NULL UNIQUE,  -- UNIQUE already creates the index for TraceId lookups
  trigger_time TIMESTAMP NOT NULL,
  finish_time TIMESTAMP,
  status VARCHAR(20) NOT NULL,
  retry_count INT DEFAULT 0,
  result_message TEXT,
  INDEX idx_job_time (job_name, trigger_time)
);

Operational FAQs

What if Nacos is unavailable?

The scheduler falls back to a locally cached snapshot of service instances (a Guava cache with a short TTL), allowing dispatch to continue while Nacos recovers.

@Service
public class ExecutorDiscovery {
    // Snapshot of the last successful lookup, kept as a fallback for 5 minutes
    private final Cache<String, List<Instance>> cache = CacheBuilder.newBuilder()
        .expireAfterWrite(5, TimeUnit.MINUTES)
        .build();

    public List<Instance> getInstances(String serviceName) throws NacosException {
        try {
            List<Instance> instances = namingService.getAllInstances(serviceName);
            cache.put(serviceName, instances);  // refresh the fallback snapshot
            return instances;
        } catch (NacosException e) {
            log.warn("Nacos unavailable, using cached instance list");
            List<Instance> cached = cache.getIfPresent(serviceName);
            if (cached == null) {
                throw e;  // no snapshot to fall back on
            }
            return cached;
        }
    }
}

How to handle DB write failures?

JobFlow writes a PENDING record first, then updates to SUCCESS/FAILED asynchronously. A scheduled compensation task scans for stuck executions (e.g., PENDING > 10 min) and reconciles them using TraceId lookup in the log system.

// Write PENDING record
jobExecutionDao.insert(new JobExecution()
    .setTraceId(traceId)
    .setStatus("PENDING")
    .setTriggerTime(now));

// Asynchronously invoke executor and update status
CompletableFuture.runAsync(() -> {
    try {
        JobResult result = executeJob(executor, request);
        jobExecutionDao.updateStatus(traceId, result.getStatus());
    } catch (Exception e) {
        log.error("Job invocation failed, traceId={}", traceId, e);
        jobExecutionDao.updateStatus(traceId, "FAILED");
    }
});

// Compensation task
@Scheduled(fixedDelay = 60000)
public void fixStuckExecutions() {
    // Executions still PENDING past the stuck threshold (e.g. 10 minutes)
    List<JobExecution> stuck = jobExecutionDao.findStuckExecutions();
    for (JobExecution execution : stuck) {
        // Reconcile via the log system using traceId; if no result is found, mark as TIMEOUT
        jobExecutionDao.updateStatus(execution.getTraceId(), "TIMEOUT");
    }
}

How to operate without a UI?

RESTful APIs provide task triggering, history query, detail lookup by TraceId, and retry endpoints. Swagger UI can be added later for a graphical console.

@RestController
@RequestMapping("/api/jobs")
public class JobController {
    @PostMapping("/{name}/trigger")
    public JobResult trigger(@PathVariable String name) {
        return jobService.triggerNow(name);
    }

    @GetMapping("/{name}/executions")
    public Page<JobExecution> history(@PathVariable String name,
                                      @RequestParam int page,
                                      @RequestParam int size) {
        return jobExecutionDao.findByJobName(name, PageRequest.of(page, size));
    }

    @GetMapping("/executions/{traceId}")
    public JobExecution detail(@PathVariable String traceId) {
        return jobExecutionDao.findByTraceId(traceId);
    }

    @PostMapping("/executions/{traceId}/retry")
    public JobResult retry(@PathVariable String traceId) {
        return jobService.retry(traceId);
    }
}

How is high availability achieved?

The scheduler is stateless; multiple instances can run concurrently. Each job acquires a Redis lock before execution, preventing duplicate scheduling. An alternative design uses consistent hashing to assign responsibility to a specific scheduler instance.

@Service
public class JobScheduler {
    @Scheduled(cron = "${job.cron}")
    public void scheduledTrigger() {
        List<JobConfig> jobs = getEnabledJobs();
        for (JobConfig job : jobs) {
            String lockKey = "lock:schedule:" + job.getName();
            boolean locked = redisLock.tryLock(lockKey, 10, TimeUnit.SECONDS);
            if (locked) {
                try {
                    trigger(job);
                } finally {
                    redisLock.unlock(lockKey);
                }
            }
        }
    }
}

// Consistent‑hash variant
public boolean isMyResponsibility(String jobName) {
    int hash = jobName.hashCode();
    List<String> instances = getSchedulerInstances();
    String responsible = consistentHash.get(instances, hash);
    return responsible.equals(myInstanceId);
}
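The `consistentHash` and `getSchedulerInstances()` helpers above are left abstract. A minimal sketch of the ring itself, using a `TreeMap` with virtual nodes (the class name, instance names, and node counts here are illustrative), might look like:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public ConsistentHashRing(List<String> instances, int virtualNodes) {
        // Virtual nodes smooth out the distribution across instances
        for (String instance : instances) {
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(instance + "#" + i), instance);
            }
        }
    }

    // FNV-1a; String.hashCode() also works but clusters similar keys
    private static int hash(String key) {
        int h = 0x811c9dc5;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * 0x01000193;
        }
        return h;
    }

    // The first ring entry at or after the key's hash owns the job; wrap around if none
    public String get(String jobName) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(jobName));
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }
}
```

With this, `isMyResponsibility` reduces to `ring.get(jobName).equals(myInstanceId)`, and when a scheduler instance joins or leaves, only the jobs hashed to its ring segments change owner.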

Conclusion

JobFlow is a design exploration rather than a production‑ready replacement for XXL‑Job. Its philosophy—treating middleware as an integral business capability—fits scenarios where teams have deeply integrated Nacos and demand strong observability, unified configuration, and low operational overhead. XXL‑Job remains a solid general‑purpose scheduler, while JobFlow offers a niche, cloud‑native alternative.

Tags: cloud-native, microservices, sharding, task scheduling, Nacos, XXL-Job, TraceId, JobFlow
Written by

java1234

Former senior programmer at a Fortune Global 500 company, dedicated to sharing Java expertise. Visit Feng's site: Java Knowledge Sharing, www.java1234.com
