Cloud Native · 19 min read

Ditch the Bulky XXL‑Job? Try This Elegant Nacos‑Based Scheduling Solution

This article analyzes the friction between XXL‑Job and Nacos in cloud‑native environments and proposes JobFlow, a design that removes redundant registration and configuration while adding full‑link tracing, true sharding backed by distributed locks, smart retries, and cloud‑native configuration. It then shows how these changes improve consistency, observability, and operational cost.

Java Companion

Challenges in the Nacos ecosystem

Inconsistent state across two registries – an executor reports its status to both XXL‑Job and Nacos. If an operator takes the instance offline in Nacos (e.g., for a JVM dump), XXL‑Job still sees it as online and continues dispatching tasks, causing unexpected execution. Network jitter can also make Nacos mark an instance unhealthy while XXL‑Job treats it as healthy, and a restarted instance may be re‑registered in Nacos but remain offline in XXL‑Job.

Observability gap – scheduling and execution are disconnected. Troubleshooting requires consulting admin logs, executor logs, and aligning timestamps manually because there is no unified TraceId.

Weak sharding guarantees – XXL‑Job’s sharding is advisory and lacks a distributed lock. Two executors can process the same data concurrently. A real‑world case: an executor restarts, XXL‑Job marks it offline and reassigns the shard, then the restarted executor resumes processing the original shard, resulting in duplicate handling.

Design principle: middleware as business

In a cloud‑native environment the scheduling capability should be embedded as a business module rather than a separate platform. All infrastructure – service discovery, configuration, monitoring, logging – is shared with the microservice ecosystem, eliminating the need for a standalone XXL‑Job deployment.

Implementation: subtraction and addition

Subtraction (remove redundancy) – discard XXL‑Job’s built‑in registry and use Nacos for service discovery; store only task definitions, execution records and audit logs in MySQL.

Addition (supplement missing capabilities)

Propagate a full‑link TraceId from scheduler to executor and write it into MDC so every log line carries the same identifier.

Provide true sharding: the scheduler computes explicit data ranges and generates a lock key; executors acquire a Redis lock before processing, guaranteeing exclusive handling.

Smart retry with exponential back‑off and a dead‑letter queue.

Scheduler configuration stored in Nacos Config, supporting dynamic updates, multi‑instance sharing and version rollback.

Reuse existing infrastructure (Actuator, Prometheus, alerting, log collection) and expose Prometheus metrics out‑of‑the‑box.

RESTful API for manual trigger, query and retry.

JobFlow architecture

Nacos – unified service discovery and configuration center.

JobFlow Scheduler – lightweight scheduler deployed as a microservice.

MySQL – persists task definitions, execution metadata and audit logs.

The scheduler generates a global traceId and passes it via HTTP headers to the executor, which puts it into MDC. Searching the traceId in ELK reveals the complete execution chain.

// Scheduler side: generate a traceId and propagate it via HTTP headers
String traceId = UUID.randomUUID().toString();
HttpHeaders headers = new HttpHeaders();
headers.set("X-Trace-Id", traceId);
headers.set("X-Shard-Index", "0");
headers.set("X-Shard-Total", "10");
restTemplate.postForEntity(url, new HttpEntity<>(params, headers), JobResult.class);

// Executor side: read the header and bind it to the logging context
@PostMapping("/internal/job/{jobName}")
public JobResult execute(@RequestHeader("X-Trace-Id") String traceId, ...) {
    MDC.put("traceId", traceId);
    try {
        log.info("Start executing job");
        // business logic
        return JobResult.success();
    } finally {
        MDC.clear();
    }
}

Feature 1 – Full‑link TraceId

All logs, scheduler events and executor actions share the same traceId, enabling end‑to‑end tracing in ELK.
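One subtlety worth noting: MDC is thread-bound, so a traceId set in the controller is lost as soon as work hops to another thread (for example the `CompletableFuture.runAsync` call used later for executor invocations). A minimal sketch of the copy-then-restore pattern, using a hypothetical `TraceContext` thread-local in place of SLF4J's MDC (with MDC itself, the same idea uses `MDC.getCopyOfContextMap()` / `MDC.setContextMap()`):

```java
// Hypothetical stand-in for SLF4J's MDC: a thread-local trace context.
public class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    public static void set(String traceId) { TRACE_ID.set(traceId); }
    public static String get() { return TRACE_ID.get(); }
    public static void clear() { TRACE_ID.remove(); }

    // Wrap a task so it carries the submitter's traceId onto the worker thread
    public static Runnable wrap(Runnable task) {
        String traceId = get(); // captured on the submitting thread
        return () -> {
            set(traceId);
            try {
                task.run();
            } finally {
                clear(); // never leak context into pooled threads
            }
        };
    }
}
```

Submitting `TraceContext.wrap(task)` instead of `task` keeps the traceId intact across the async boundary without touching the business code inside the task.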

Feature 2 – True sharding

The scheduler computes explicit ranges and lock keys, then dispatches requests containing startId, endId and lockKey. Executors acquire the Redis lock before processing.

int totalRecords = 1000000;
int shardTotal = 10;
int rangeSize = totalRecords / shardTotal;
for (int i = 0; i < shardTotal; i++) {
    long startId = (long) i * rangeSize;
    // Last shard absorbs the remainder when totalRecords is not evenly divisible
    long endId = (i == shardTotal - 1) ? totalRecords - 1 : startId + rangeSize - 1;
    String lockKey = String.format("lock:job:order-sync:range:%d-%d", startId, endId);
    JobRequest request = new JobRequest();
    request.setTraceId(traceId);
    request.setStartId(startId);
    request.setEndId(endId);
    request.setLockKey(lockKey);
    executeAsync(instance, request);
}
@PostMapping("/internal/job/order-sync")
public JobResult sync(@RequestHeader("X-Start-Id") Long startId,
                      @RequestHeader("X-End-Id") Long endId,
                      @RequestHeader("X-Lock-Key") String lockKey) {
    boolean locked = redisLock.tryLock(lockKey, 60, TimeUnit.SECONDS);
    if (!locked) {
        log.warn("Shard {}-{} already locked", startId, endId);
        return JobResult.skip("Already processed by another instance");
    }
    try {
        List<Order> orders = orderDao.findByIdBetween(startId, endId);
        // business processing
        return JobResult.success();
    } finally {
        redisLock.unlock(lockKey);
    }
}

Feature 3 – Smart retry

Failures trigger exponential back‑off retries up to a configurable maximum; beyond that the task is sent to a dead‑letter queue.

retry:
  max: 5
  backoff: EXPONENTIAL
  initialDelay: 1s
  maxDelay: 5m

public void scheduleRetry(JobExecution execution) {
    int retryCount = execution.getRetryCount();
    if (retryCount >= maxRetry) {
        // Retries exhausted: hand off to the dead-letter queue for manual handling
        deadLetterQueue.send(execution);
        return;
    }
    // Exponential back-off: initialDelay * 2^retryCount seconds, capped at maxDelay
    long delay = Math.min(initialDelay << retryCount, maxDelay);
    scheduler.schedule(() -> retry(execution), delay, TimeUnit.SECONDS);
}

Feature 4 – Cloud‑native scheduler configuration

All scheduler parameters are stored in Nacos Config, allowing dynamic updates without restart and version rollback.

# jobflow-scheduler.yaml
jobflow:
  scheduler:
    thread-pool-size: 20
    timeout: 300
    max-retry: 3
  executor:
    connect-timeout: 5000
    read-timeout: 30000
  redis:
    lock-timeout: 60
  compensation:
    enabled: true
    interval: 60000
    stuck-threshold: 600000

Operators can change thread-pool-size in the Nacos console; the new value takes effect immediately across all scheduler instances.
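Dynamic updates imply that scheduler instances must swap settings safely while scheduling ticks are running. A sketch of one way to hold the parsed configuration, using hypothetical `SchedulerSettings` / `SchedulerSettingsHolder` types (not JobFlow's actual API); a Nacos Config listener registered via `ConfigService.addListener` would call `refresh()` with the re-parsed YAML:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical immutable snapshot of the settings in jobflow-scheduler.yaml
final class SchedulerSettings {
    final int threadPoolSize;
    final int timeoutSeconds;
    final int maxRetry;

    SchedulerSettings(int threadPoolSize, int timeoutSeconds, int maxRetry) {
        this.threadPoolSize = threadPoolSize;
        this.timeoutSeconds = timeoutSeconds;
        this.maxRetry = maxRetry;
    }
}

// Holder the scheduler reads on every tick. The Nacos listener calls refresh()
// with a freshly parsed snapshot; the swap is atomic, so concurrent readers
// see either the old or the new configuration, never a mix of fields.
final class SchedulerSettingsHolder {
    private final AtomicReference<SchedulerSettings> current =
        new AtomicReference<>(new SchedulerSettings(20, 300, 3)); // defaults from the YAML above

    SchedulerSettings get() { return current.get(); }

    void refresh(SchedulerSettings updated) { current.set(updated); }
}
```

The immutable-snapshot-plus-atomic-swap shape avoids the half-updated state a mutable config bean would expose between field writes.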

Feature 5 – Streamlined database design

Only task definitions, execution metadata and audit logs are persisted. Service registration and scheduler configuration remain in Nacos; detailed execution logs are retrieved via traceId from ELK.

CREATE TABLE job_definition (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  job_name VARCHAR(100) UNIQUE,
  service_name VARCHAR(100),
  handler VARCHAR(100),
  cron VARCHAR(100),
  enabled BOOLEAN DEFAULT TRUE,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

CREATE TABLE job_execution (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  job_name VARCHAR(100) NOT NULL,
  trace_id VARCHAR(64) NOT NULL UNIQUE,
  trigger_time TIMESTAMP NOT NULL,
  finish_time TIMESTAMP,
  status VARCHAR(20) NOT NULL,
  retry_count INT DEFAULT 0,
  result_message TEXT,
  INDEX idx_trace (trace_id),
  INDEX idx_job_time (job_name, trigger_time)
);

FAQ

What if Nacos goes down?

If Nacos is unavailable, the entire microservice system is degraded, so scheduling is rarely the most urgent problem. A short‑lived fallback cache of instance lists keeps dispatching working in the meantime.

@Service
public class ExecutorDiscovery {
    private final NamingService namingService;

    // Fallback cache, refreshed on every successful Nacos lookup
    private final Cache<String, List<Instance>> cache = CacheBuilder.newBuilder()
        .expireAfterWrite(5, TimeUnit.MINUTES)
        .build();

    public ExecutorDiscovery(NamingService namingService) {
        this.namingService = namingService;
    }

    public List<Instance> getInstances(String serviceName) {
        try {
            List<Instance> instances = namingService.getAllInstances(serviceName);
            cache.put(serviceName, instances); // keep the fallback copy fresh
            return instances;
        } catch (NacosException e) {
            log.warn("Nacos unavailable, using cached instance list");
            List<Instance> cached = cache.getIfPresent(serviceName);
            return cached != null ? cached : Collections.emptyList();
        }
    }
}

How to handle DB write failures causing inconsistency?

A final‑consistency model is used: write a PENDING record first, invoke the executor asynchronously, then update the status to SUCCESS or FAILED. A background compensation task scans stale PENDING records and reconciles them via traceId.

// Insert PENDING record
jobExecutionDao.insert(new JobExecution()
    .setTraceId(traceId)
    .setStatus("PENDING")
    .setTriggerTime(now));

// Async executor call
CompletableFuture.runAsync(() -> {
    try {
        JobResult result = executeJob(executor, request);
        jobExecutionDao.updateStatus(traceId, result.getStatus());
    } catch (Exception e) {
        jobExecutionDao.updateStatus(traceId, "FAILED");
    }
});

// Compensation task
@Scheduled(fixedDelay = 60000)
public void fixStuckExecutions() {
    List<JobExecution> stuck = jobExecutionDao.findStuckExecutions();
    for (JobExecution exec : stuck) {
        // Reconcile via traceId: if ELK shows a final result, record it;
        // otherwise mark the execution TIMEOUT so it no longer stays PENDING
        jobExecutionDao.updateStatus(exec.getTraceId(), "TIMEOUT");
    }
}

No UI for operations?

Initially provide RESTful APIs for triggering, querying history and retrying tasks. Swagger UI or a custom front‑end can be added later.

@RestController
@RequestMapping("/api/jobs")
public class JobController {
    @PostMapping("/{name}/trigger")
    public JobResult trigger(@PathVariable String name) {
        return jobService.triggerNow(name);
    }
    @GetMapping("/{name}/executions")
    public Page<JobExecution> history(@PathVariable String name,
                                      @RequestParam int page,
                                      @RequestParam int size) {
        return jobExecutionDao.findByJobName(name, PageRequest.of(page, size));
    }
    @GetMapping("/executions/{traceId}")
    public JobExecution detail(@PathVariable String traceId) {
        return jobExecutionDao.findByTraceId(traceId);
    }
    @PostMapping("/executions/{traceId}/retry")
    public JobResult retry(@PathVariable String traceId) {
        return jobService.retry(traceId);
    }
}

How to ensure scheduler high availability?

The scheduler is stateless; multiple instances can run concurrently. Distributed locks prevent duplicate scheduling, and consistent hashing can assign responsibility for specific jobs to particular instances.

@Scheduled(cron = "${job.cron}")
public void scheduledTrigger() {
    List<JobConfig> jobs = getEnabledJobs();
    for (JobConfig job : jobs) {
        String lockKey = "lock:schedule:" + job.getName();
        boolean locked = redisLock.tryLock(lockKey, 10, TimeUnit.SECONDS);
        if (locked) {
            try {
                trigger(job);
            } finally {
                redisLock.unlock(lockKey);
            }
        }
    }
}

public boolean isMyResponsibility(String jobName) {
    int hash = jobName.hashCode();
    List<String> instances = getSchedulerInstances();
    String responsible = consistentHash.get(instances, hash);
    return responsible.equals(myInstanceId);
}
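The `consistentHash.get(...)` helper above is left unspecified. One way it could be implemented is a hash ring with virtual nodes backed by a `TreeMap`; this is a hedged sketch (class name and virtual-node count are illustrative, not JobFlow's actual code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring; virtual nodes smooth the distribution
// when only a handful of scheduler instances are on the ring.
public class ConsistentHashRing {
    private static final int VIRTUAL_NODES = 100;
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public ConsistentHashRing(List<String> instances) {
        for (String instance : instances) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(instance + "#" + i), instance);
            }
        }
    }

    // A job belongs to the instance owning the first ring position >= its hash,
    // wrapping around to the first entry when the tail of the ring is empty.
    public String responsibleFor(String jobName) {
        long h = hash(jobName);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
            // Fold the first 8 digest bytes into a long ring position
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because every instance builds the ring from the same Nacos instance list, all schedulers agree on which one owns a given job without any coordination; when an instance leaves, only the jobs it owned move.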
Tags: cloud-native, microservices, sharding, task scheduling, Nacos, XXL-Job, TraceId, JobFlow
Written by Java Companion, a highly professional Java public account.