Why Rebuild a Job Scheduler? Inside a Lightweight Distributed Timing Framework
This article explains the motivation, design choices, and implementation details of a custom distributed job scheduling framework, covering its architecture, load‑balancing strategy, message‑queue handling, persistence mechanisms, and key code snippets, while comparing it to existing solutions like Quartz, XXL‑Job, and PowerJob.
Project Background
Existing scheduling frameworks (Quartz, XXL‑Job, PowerJob) cannot satisfy a scenario that requires frequent creation and dynamic modification of tasks in a distributed environment. Specific limitations include:
MQ delay queues cannot adjust task parameters dynamically.
Redis expiration needs long‑lived keys and may cause big‑key issues.
XXL‑Job lacks a native OpenAPI and its DB‑lock scheduling focuses on high availability rather than performance.
PowerJob’s HTTP‑based synchronous API and group‑isolation design make high‑concurrency scheduling inefficient.
To retain full control, trim unnecessary features, and study mainstream design ideas, a new scheduling framework was built by refactoring PowerJob.
Positioning
The framework extends PowerJob with the following capabilities:
Lightweight API and an internal asynchronous message queue for frequent task creation and dynamic parameter changes.
Load‑balanced execution of massive concurrent tasks (group isolation + application‑level locks).
Targeted at small tasks, requiring minimal configuration and no direct manipulation of task instances.
Technical Selection
Communication : gRPC (Netty NIO)
Serialization : Protobuf
Load Balancing : Custom NameServer
├─ Strategy : Minimum schedule count per server
└─ Interaction : pull + push
Message Queue : Simple custom MQ
├─ Send : async + timeout retry
├─ Persistence : mmap + sync flush
└─ Retry : multi‑level delay queue + dead‑letter queue
Scheduling : Time‑wheel algorithmProject Structure
├── LICENSE
├── k-job-common // shared dependencies
├── k-job-nameServer // registration center for load balancing
├── k-job-producer // OpenAPI jar with async MQ sending
├── k-job-server // SpringBoot scheduling server
├── k-job-worker-boot-starter // Spring Boot starter for workers
├── k-job-worker // worker jar required by applications
└── pom.xmlKey Features
Load Balancing (massive concurrent execution)
The system introduces a NameServer that records each server’s schedule count and distributes tasks based on a “minimum‑schedule‑times” strategy. When a worker pulls tasks, it checks the NameServer; if the worker count exceeds a threshold or the server’s schedule count is more than twice the minimum, the task is reassigned to a less‑loaded server. This avoids OOM risks from pulling too many tasks, thread‑pool bottlenecks, and single‑server request concentration.
Future work may add weight‑based server selection to consider geographic latency.
Message Queue (handling frequent task creation/modification)
Instead of PowerJob’s synchronous HTTP API, the framework uses an internal asynchronous MQ. Producers send messages via a custom futureStub that returns immediately; the server acknowledges later, eliminating client‑side blocking.
Reliability is achieved by merging the broker and server: each server maintains its own persistent queue, removing the need for a separate broker layer.
Producer‑to‑server loss is mitigated by retrying across all servers.
Server‑side loss is prevented by mmap‑based persistence with synchronous flush.
Persistence
The commit log and consumer queue are memory‑mapped files similar to RocketMQ. Incoming messages are first written to a buffer, then flushed to disk before the response callback is executed.
<MappedByteBuffer> commitLogBuffer; // mmap of commit log file
<MappedByteBuffer> consumerQueueBuffer; // mmap of consumer queue file
AtomicLong commitLogBufferPosition = new AtomicLong(0);
AtomicLong commitLogCurPosition = new AtomicLong(0);
AtomicLong lastProcessedOffset = new AtomicLong(0);
AtomicLong currentConsumerQueuePosition = new AtomicLong(0);
AtomicLong consumerPosition = new AtomicLong(0);Message Retry
Producers retry on timeout or error by selecting the next server. Consumers use a multi‑level delay queue; failed messages move to a longer‑delay queue, and after exhausting retries they are placed in a dead‑letter queue for manual intervention.
static final Deque<MqCausa.Message> deadMessageQueue = new ArrayDeque<>();
static final List<DelayQueue<DelayedMessage>> delayQueueList = new ArrayList<>(2);
static final List<Long> delayTimes = Lists.newArrayList(10000L, 5000L);
public static void init(Consumer consumer) {
delayQueueList.add(new DelayQueue<>());
delayQueueList.add(new DelayQueue<>());
Thread consumerThread = new Thread(() -> {
try {
while (true) {
DelayQueue<DelayedMessage> delayQueue = delayQueueList.get(0);
if (!delayQueue.isEmpty()) {
DelayedMessage msg = delayQueue.take();
consumer.consume(msg.message);
delayQueue.remove(msg);
System.out.println("Consumed: " + msg.getMessage() + " at " + System.currentTimeMillis());
}
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
System.out.println("Consumer thread interrupted");
}
});
consumerThread.start();
}
public static void reConsume(MqCausa.Message msg) {
if (msg.getRetryTime() == 0) {
log.error("msg : {} is dead", msg);
deadMessageQueue.add(msg);
return;
}
MqCausa.Message next = msg.toBuilder().setRetryTime(msg.getRetryTime() - 1).build();
DelayedMessage delayed = new DelayedMessage(next, delayTimes.get(next.getRetryTime()));
delayQueueList.get(msg.getRetryTime() - 1).add(delayed);
}
static class DelayedMessage implements Delayed {
private final MqCausa.Message message;
private final long triggerTime;
DelayedMessage(MqCausa.Message message, long delay) {
this.message = message;
this.triggerTime = System.currentTimeMillis() + delay;
}
@Override public long getDelay(TimeUnit unit) {
return unit.convert(triggerTime - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
}
@Override public int compareTo(Delayed other) {
return Long.compare(this.triggerTime, ((DelayedMessage) other).triggerTime);
}
public MqCausa.Message getMessage() { return message; }
}Implementation Overview
The final architecture combines asynchronous task submission, load‑balanced server selection, and reliable persistence, achieving higher throughput and lower latency compared to XXL‑Job’s global lock and PowerJob’s single‑server scheduling.
Key functional outcomes:
Asynchronous task operation requests.
Polling‑based load‑balanced consumption.
Additional Diagrams
Service discovery and scheduling flowcharts illustrate worker‑server interaction and the NameServer‑driven load‑balancing process.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
