Ditch the Heavyweight XXL‑Job: An Elegant Nacos‑Based Scheduling Solution
This article analyses the friction between XXL‑Job and a Nacos‑centric stack, then proposes JobFlow: a design that removes redundant registration, adds a full‑link TraceId, enforces sharding with distributed locks, and brings intelligent retries and cloud‑native configuration. It demonstrates how these changes simplify operations and improve observability in microservice environments.
Background
XXL‑Job is a widely used task‑scheduling framework in China. When a system adopts Nacos for service discovery and configuration (the Spring Cloud Alibaba stack), running XXL‑Job alongside it introduces duplication and state‑inconsistency problems.
Challenges
Dual registration centers
Each executor reports its status to both XXL‑Job’s own registry and Nacos. If an instance is taken offline in Nacos, XXL‑Job still believes it is online, causing unexpected task execution. Real‑world cases include memory‑leak debugging, network glitches, and gray‑release weight adjustments where the two registries diverge.
Lack of observability
Task execution failures require manually correlating logs from the admin console, the executor, and timestamps because there is no unified TraceId. The logging chain is split across three components, making it difficult to pinpoint the failure point.
Weak sharding guarantees
XXL‑Job’s sharding is advisory; without a distributed lock, two executors may process the same data range, causing duplicate handling. A production case showed an executor restart, XXL‑Job reallocating the shard, and the restarted instance continuing to process the old range.
Core idea – middleware as a business capability
In a cloud‑native architecture, middleware should be embedded as a business capability rather than a separate platform. JobFlow embodies this philosophy by treating the scheduler as a microservice that shares the same deployment, monitoring, configuration, and logging infrastructure as other business services.
JobFlow architecture
JobFlow consists of three core components:
Nacos : unified service discovery and configuration.
JobFlow Scheduler : a lightweight scheduler microservice.
MySQL : stores job definitions, execution records, and audit logs (no service‑registry data).
Feature 1 – Full‑link TraceId
The scheduler generates a UUID TraceId and propagates it via HTTP headers to the executor, which stores it in MDC. All logs, from scheduler to executor, contain the same TraceId, enabling a single‑step trace in ELK/Loki.
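For the TraceId to show up in every log line, the logging pattern must reference the MDC key. A minimal logback sketch, assuming the MDC key traceId used below and an otherwise default Spring Boot setup:

```xml
<!-- logback-spring.xml (sketch): %X{traceId} prints the value stored via MDC.put -->
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
  <encoder>
    <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level [%X{traceId}] %logger{36} - %msg%n</pattern>
  </encoder>
</appender>
```

With this pattern in place, searching the TraceId in ELK/Loki returns the full scheduler‑to‑executor chain.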
String traceId = UUID.randomUUID().toString();
HttpHeaders headers = new HttpHeaders();
headers.set("X-Trace-Id", traceId);
headers.set("X-Shard-Index", "0");
headers.set("X-Shard-Total", "10");
restTemplate.postForEntity(url, new HttpEntity<>(params, headers), JobResult.class);
Executor side:
@PostMapping("/internal/job/{jobName}")
public JobResult execute(@RequestHeader("X-Trace-Id") String traceId, ...) {
    MDC.put("traceId", traceId);
    try {
        log.info("Start executing job");
        // business logic
        return JobResult.success();
    } finally {
        MDC.clear();
    }
}
Feature 2 – Strong sharding with distributed lock
The scheduler calculates a concrete data range for each shard and creates a Redis lock key. Executors acquire the lock before processing; if the lock is already held, the task is skipped.
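One subtlety of the range split: when totalRecords is not evenly divisible by shardTotal, integer division leaves a remainder that no shard covers. A minimal sketch that lets the last shard absorb it (class and method names are hypothetical, not JobFlow APIs):

```java
// Sketch: split [0, totalRecords) into shardTotal contiguous inclusive ranges,
// letting the last shard absorb the remainder when the division is uneven.
public class ShardRanges {
    public static long[][] ranges(long totalRecords, int shardTotal) {
        long rangeSize = totalRecords / shardTotal;
        long[][] out = new long[shardTotal][2];
        for (int i = 0; i < shardTotal; i++) {
            out[i][0] = i * rangeSize;             // startId (inclusive)
            out[i][1] = (i == shardTotal - 1)
                    ? totalRecords - 1             // last shard takes the remainder
                    : (i + 1) * rangeSize - 1;     // endId (inclusive)
        }
        return out;
    }
}
```

For 1,000,003 records and 10 shards, shards 0–8 each cover 100,000 IDs and shard 9 covers the remaining 100,003.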
int totalRecords = 1000000;
int shardTotal = 10;
int rangeSize = totalRecords / shardTotal;
for (int i = 0; i < shardTotal; i++) {
    long startId = (long) i * rangeSize;        // cast avoids int overflow on large tables
    long endId = (long) (i + 1) * rangeSize - 1;
    String lockKey = String.format("lock:job:order-sync:range:%d-%d", startId, endId);
    // send request with traceId, startId, endId, lockKey
}
Executor handling:
@PostMapping("/internal/job/order-sync")
public JobResult sync(@RequestHeader("X-Start-Id") Long startId,
                      @RequestHeader("X-End-Id") Long endId,
                      @RequestHeader("X-Lock-Key") String lockKey) {
    boolean locked = redisLock.tryLock(lockKey, 60, TimeUnit.SECONDS);
    if (!locked) {
        log.warn("Shard {}-{} already locked", startId, endId);
        return JobResult.skip("Handled by another instance");
    }
    try {
        List<Order> orders = orderDao.findByIdBetween(startId, endId);
        // business processing
        return JobResult.success();
    } finally {
        redisLock.unlock(lockKey);
    }
}
Feature 3 – Intelligent retry with exponential backoff
Retry configuration supports max attempts, exponential backoff, and dead‑letter queue handling.
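With, say, 5 attempts, a 1‑second initial delay, and a 5‑minute cap, the delay for attempt n is min(initialDelay · 2ⁿ, maxDelay): 1s, 2s, 4s, 8s, 16s, capping at 300s on longer runs. A small sketch of that schedule (class and method names are illustrative):

```java
// Sketch of a capped exponential backoff schedule, all values in seconds.
public class BackoffSchedule {
    public static long[] delays(int maxAttempts, long initialDelaySec, long maxDelaySec) {
        long[] out = new long[maxAttempts];
        for (int n = 0; n < maxAttempts; n++) {
            // delay(n) = min(initialDelay * 2^n, maxDelay)
            out[n] = Math.min(initialDelaySec * (1L << n), maxDelaySec);
        }
        return out;
    }
}
```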
retry:
  max: 5
  backoff: EXPONENTIAL
  initialDelay: 1s
  maxDelay: 5m
Scheduler retry logic:
public void scheduleRetry(JobExecution execution) {
    int retryCount = execution.getRetryCount();
    if (retryCount >= maxRetry) {
        deadLetterQueue.send(execution);   // attempts exhausted: hand off to dead-letter queue
        return;
    }
    // Exponential backoff: initialDelay * 2^retryCount, capped at maxDelay
    long delay = Math.min(initialDelay * (1L << retryCount), maxDelay);
    scheduler.schedule(() -> retry(execution), delay, TimeUnit.SECONDS);
}
Feature 4 – Cloud‑native config via Nacos
All scheduler parameters (thread pool size, timeouts, retry limits, lock timeout, compensation settings) are stored in a Nacos Config file, allowing dynamic updates without restarting instances.
# jobflow-scheduler.yaml
jobflow:
  scheduler:
    thread-pool-size: 20
    timeout: 300
    max-retry: 3
  executor:
    connect-timeout: 5000
    read-timeout: 30000
  redis:
    lock-timeout: 60
  compensation:
    enabled: true
    interval: 60000
    stuck-threshold: 600000
Feature 5 – Minimalist database schema
Only job definitions and execution records (including TraceId) are persisted. Service‑registry data and scheduler configuration remain in Nacos, reducing DB load and simplifying queries.
CREATE TABLE job_definition (
    id           BIGINT PRIMARY KEY AUTO_INCREMENT,
    job_name     VARCHAR(100) UNIQUE,
    service_name VARCHAR(100),
    handler      VARCHAR(100),
    cron         VARCHAR(100),
    enabled      BOOLEAN DEFAULT TRUE,
    created_at   TIMESTAMP,
    updated_at   TIMESTAMP
);
CREATE TABLE job_execution (
    id             BIGINT PRIMARY KEY AUTO_INCREMENT,
    job_name       VARCHAR(100) NOT NULL,
    trace_id       VARCHAR(64) NOT NULL UNIQUE,  -- the UNIQUE constraint already indexes trace_id
    trigger_time   TIMESTAMP NOT NULL,
    finish_time    TIMESTAMP,
    status         VARCHAR(20) NOT NULL,
    retry_count    INT DEFAULT 0,
    result_message TEXT,
    INDEX idx_job_time (job_name, trigger_time)
);
Operational FAQs
What if Nacos is unavailable?
The scheduler falls back to a local Guava cache of service instances, allowing dispatch to continue while Nacos recovers.
@Service
public class ExecutorDiscovery {
    // A plain Cache rather than a LoadingCache: getAllInstances throws the checked
    // NacosException, so the cache is refreshed manually on each successful lookup.
    private final Cache<String, List<Instance>> cache = CacheBuilder.newBuilder()
        .expireAfterWrite(5, TimeUnit.MINUTES)
        .build();

    public List<Instance> getInstances(String serviceName) {
        try {
            List<Instance> instances = namingService.getAllInstances(serviceName);
            cache.put(serviceName, instances);  // keep the fallback copy fresh
            return instances;
        } catch (NacosException e) {
            log.warn("Nacos unavailable, serving instances from local cache");
            return cache.getIfPresent(serviceName);
        }
    }
}
How to handle DB write failures?
JobFlow writes a PENDING record first, then updates to SUCCESS/FAILED asynchronously. A scheduled compensation task scans for stuck executions (e.g., PENDING > 10 min) and reconciles them using TraceId lookup in the log system.
// Write PENDING record
jobExecutionDao.insert(new JobExecution()
    .setTraceId(traceId)
    .setStatus("PENDING")
    .setTriggerTime(now));
// Asynchronously invoke executor and update status
CompletableFuture.runAsync(() -> {
    try {
        JobResult result = executeJob(executor, request);
        jobExecutionDao.updateStatus(traceId, result.getStatus());
    } catch (Exception e) {
        jobExecutionDao.updateStatus(traceId, "FAILED");
    }
});
// Compensation task
@Scheduled(fixedDelay = 60000)
public void fixStuckExecutions() {
    List<JobExecution> stuck = jobExecutionDao.findStuckExecutions();
    // Use traceId to check logs or mark as TIMEOUT
}
How to operate without a UI?
RESTful APIs provide task triggering, history query, detail lookup by TraceId, and retry endpoints. Swagger UI can be added later for a graphical console.
@RestController
@RequestMapping("/api/jobs")
public class JobController {

    @PostMapping("/{name}/trigger")
    public JobResult trigger(@PathVariable String name) {
        return jobService.triggerNow(name);
    }

    @GetMapping("/{name}/executions")
    public Page<JobExecution> history(@PathVariable String name,
                                      @RequestParam int page,
                                      @RequestParam int size) {
        return jobExecutionDao.findByJobName(name, PageRequest.of(page, size));
    }

    @GetMapping("/executions/{traceId}")
    public JobExecution detail(@PathVariable String traceId) {
        return jobExecutionDao.findByTraceId(traceId);
    }

    @PostMapping("/executions/{traceId}/retry")
    public JobResult retry(@PathVariable String traceId) {
        return jobService.retry(traceId);
    }
}
How is high availability achieved?
The scheduler is stateless; multiple instances can run concurrently. Each job acquires a Redis lock before execution, preventing duplicate scheduling. An alternative design uses consistent hashing to assign responsibility to a specific scheduler instance.
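The consistentHash helper referenced in the variant below is left abstract; one common realization is a virtual‑node ring over a TreeMap. A minimal sketch (class name, virtual‑node count, and API shape are hypothetical, not JobFlow code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: each instance is placed on the ring at several
// virtual positions; a job maps to the first instance clockwise from its hash.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public ConsistentHashRing(List<String> instances, int virtualNodes) {
        for (String instance : instances) {
            for (int v = 0; v < virtualNodes; v++) {
                ring.put(hash(instance + "#" + v), instance);
            }
        }
    }

    public String responsibleFor(String jobName) {
        SortedMap<Long, String> tail = ring.tailMap(hash(jobName));
        // Wrap around to the first ring entry if we fell off the end
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // MD5-based hash: String.hashCode clusters too much for even ring spread
    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[3] & 0xFF) << 24) | ((d[2] & 0xFF) << 16)
                    | ((d[1] & 0xFF) << 8) | (d[0] & 0xFF);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Virtual nodes smooth the distribution, and when an instance joins or leaves, only the jobs on its ring segments move.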
@Service
public class JobScheduler {
    @Scheduled(cron = "${job.cron}")
    public void scheduledTrigger() {
        List<JobConfig> jobs = getEnabledJobs();
        for (JobConfig job : jobs) {
            String lockKey = "lock:schedule:" + job.getName();
            boolean locked = redisLock.tryLock(lockKey, 10, TimeUnit.SECONDS);
            if (locked) {
                try {
                    trigger(job);
                } finally {
                    redisLock.unlock(lockKey);
                }
            }
        }
    }
}
// Consistent‑hash variant
public boolean isMyResponsibility(String jobName) {
    int hash = jobName.hashCode();
    List<String> instances = getSchedulerInstances();
    String responsible = consistentHash.get(instances, hash);
    return responsible.equals(myInstanceId);
}
Conclusion
JobFlow is a design exploration rather than a production‑ready replacement for XXL‑Job. Its philosophy—treating middleware as an integral business capability—fits scenarios where teams have deeply integrated Nacos and demand strong observability, unified configuration, and low operational overhead. XXL‑Job remains a solid general‑purpose scheduler, while JobFlow offers a niche, cloud‑native alternative.
java1234
Former senior programmer at a Fortune Global 500 company, dedicated to sharing Java expertise. Visit Feng's site: Java Knowledge Sharing, www.java1234.com