Why Rebuild a Job Scheduler? Inside a Lightweight Distributed Timing Framework

This article explains the motivation, design choices, and implementation details of a custom distributed job scheduling framework, covering its architecture, load‑balancing strategy, message‑queue handling, persistence mechanisms, and key code snippets, while comparing it to existing solutions like Quartz, XXL‑Job, and PowerJob.

Architect

Project Background

Existing scheduling frameworks (Quartz, XXL‑Job, PowerJob) and common workarounds such as MQ delay queues and Redis key expiration do not fit a scenario that requires frequent task creation and dynamic task modification in a distributed environment. Specific limitations include:

MQ delay queues cannot adjust task parameters dynamically.

Redis expiration needs long‑lived keys and may cause big‑key issues.

XXL‑Job lacks a native OpenAPI and its DB‑lock scheduling focuses on high availability rather than performance.

PowerJob’s HTTP‑based synchronous API and group‑isolation design make high‑concurrency scheduling inefficient.

To retain full control, trim unnecessary features, and study mainstream design ideas, a new scheduling framework was built by refactoring PowerJob.

Positioning

The framework extends PowerJob with the following capabilities:

Lightweight API and an internal asynchronous message queue for frequent task creation and dynamic parameter changes.

Load‑balanced execution of massive concurrent tasks (group isolation + application‑level locks).

Targeted at small tasks, requiring minimal configuration and no direct manipulation of task instances.

Technical Selection

Communication : gRPC (Netty NIO)
Serialization : Protobuf
Load Balancing : Custom NameServer
  ├─ Strategy : Minimum schedule count per server
  └─ Interaction : pull + push
Message Queue : Simple custom MQ
  ├─ Send : async + timeout retry
  ├─ Persistence : mmap + sync flush
  └─ Retry : multi‑level delay queue + dead‑letter queue
Scheduling : Time‑wheel algorithm
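
The list above names the time‑wheel algorithm without further detail. As a rough illustration of the general technique only, the following single‑level wheel is a minimal sketch and not the framework's implementation; the class name SimpleTimeWheel, its methods, and the manually driven tick() are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical single-level time wheel; ticks are advanced by the caller.
final class SimpleTimeWheel {
    private static final class Entry {
        final Runnable task;
        long rounds; // full wheel revolutions still to wait

        Entry(Runnable task, long rounds) {
            this.task = task;
            this.rounds = rounds;
        }
    }

    private final List<List<Entry>> slots = new ArrayList<>();
    private long currentTick = 0;

    SimpleTimeWheel(int slotCount) {
        for (int i = 0; i < slotCount; i++) {
            slots.add(new ArrayList<>());
        }
    }

    // Schedules the task to fire after delayTicks calls to tick().
    void schedule(Runnable task, long delayTicks) {
        if (delayTicks < 1) {
            throw new IllegalArgumentException("delayTicks must be >= 1");
        }
        int slot = (int) ((currentTick + delayTicks) % slots.size());
        // The slot is visited once per revolution; skip the visits that come too early.
        long rounds = (delayTicks - 1) / slots.size();
        slots.get(slot).add(new Entry(task, rounds));
    }

    // Advances the wheel by one tick and runs every task that is now due.
    void tick() {
        currentTick++;
        List<Entry> bucket = slots.get((int) (currentTick % slots.size()));
        List<Runnable> due = new ArrayList<>();
        Iterator<Entry> it = bucket.iterator();
        while (it.hasNext()) {
            Entry entry = it.next();
            if (entry.rounds == 0) {
                it.remove();
                due.add(entry.task);
            } else {
                entry.rounds--;
            }
        }
        // Run after iterating so a task may safely schedule new work.
        due.forEach(Runnable::run);
    }
}
```

In a real scheduler a background thread would call tick() at a fixed interval; driving it by hand here keeps the sketch deterministic. Scheduling costs O(1) per task regardless of how many tasks are pending, which is the usual reason to prefer a wheel over a sorted queue for many small timers.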

Project Structure

├── LICENSE
├── k-job-common            // shared dependencies
├── k-job-nameServer        // registration center for load balancing
├── k-job-producer          // OpenAPI jar with async MQ sending
├── k-job-server           // SpringBoot scheduling server
├── k-job-worker-boot-starter // Spring Boot starter for workers
├── k-job-worker           // worker jar required by applications
└── pom.xml

Key Features

Load Balancing (massive concurrent execution)

The system introduces a NameServer that records each server’s schedule count and distributes tasks based on a “minimum‑schedule‑times” strategy. When a worker pulls tasks, it checks the NameServer; if the worker count exceeds a threshold or the server’s schedule count is more than twice the minimum, the task is reassigned to a less‑loaded server. This avoids OOM risks from pulling too many tasks, thread‑pool bottlenecks, and single‑server request concentration.
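
The selection rule described above can be sketched roughly as follows. This is a minimal illustration that assumes the NameServer keeps an in‑memory map from server address to schedule count; the class MinScheduleBalancer and its method names are hypothetical, and the worker‑count threshold is left out because the text does not give its value.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical NameServer-side bookkeeping; not the framework's actual class.
final class MinScheduleBalancer {
    private final Map<String, AtomicLong> scheduleCounts = new ConcurrentHashMap<>();

    // Registers a server with a schedule count of zero if it is not known yet.
    void register(String server) {
        scheduleCounts.computeIfAbsent(server, key -> new AtomicLong());
    }

    // Records that the given server has scheduled one more task.
    void recordSchedule(String server) {
        scheduleCounts.computeIfAbsent(server, key -> new AtomicLong()).incrementAndGet();
    }

    // Returns the server with the fewest recorded schedules, if any is registered.
    Optional<String> pickLeastLoaded() {
        return scheduleCounts.entrySet().stream()
                .min((a, b) -> Long.compare(a.getValue().get(), b.getValue().get()))
                .map(Map.Entry::getKey);
    }

    // True when the server's count is more than twice the current minimum.
    boolean shouldReassign(String server) {
        AtomicLong current = scheduleCounts.get(server);
        if (current == null) {
            return false;
        }
        long min = scheduleCounts.values().stream()
                .mapToLong(AtomicLong::get)
                .min()
                .orElse(0L);
        return current.get() > 2 * min;
    }
}
```

With the rule taken literally, a freshly registered server with a count of zero makes every busy server eligible for reassignment, so a production version would probably add a floor or a warm‑up period.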

Future work may add weight‑based server selection to consider geographic latency.

Message Queue (handling frequent task creation/modification)

Instead of PowerJob’s synchronous HTTP API, the framework uses an internal asynchronous MQ. Producers send messages via a custom futureStub that returns immediately; the server acknowledges later, eliminating client‑side blocking.

Reliability is achieved by merging the broker and server: each server maintains its own persistent queue, removing the need for a separate broker layer.

Producer‑to‑server loss is mitigated by retrying across all servers.

Server‑side loss is prevented by mmap‑based persistence with synchronous flush.
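
The non‑blocking send and the retry across servers might look roughly like the sketch below. It stands in for the gRPC futureStub with a plain function that returns a CompletableFuture, so the RetryingProducer class, the transport parameter, and the string payloads are all assumptions made for illustration. It needs Java 9 or later for orTimeout and failedFuture.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.BiFunction;

// Hypothetical producer: tries each server in turn until one acknowledges.
final class RetryingProducer {
    private final List<String> servers;
    // (server, message) -> future ack; a stand-in for the real async RPC stub.
    private final BiFunction<String, String, CompletableFuture<String>> transport;
    private final long timeoutMillis;

    RetryingProducer(List<String> servers,
                     BiFunction<String, String, CompletableFuture<String>> transport,
                     long timeoutMillis) {
        this.servers = servers;
        this.transport = transport;
        this.timeoutMillis = timeoutMillis;
    }

    // Returns immediately; the future completes once some server has acknowledged.
    CompletableFuture<String> send(String message) {
        return attempt(message, 0);
    }

    private CompletableFuture<String> attempt(String message, int index) {
        if (index >= servers.size()) {
            return CompletableFuture.failedFuture(
                    new IllegalStateException("all servers failed for: " + message));
        }
        CompletableFuture<String> call = transport.apply(servers.get(index), message)
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
        // On an error or a timeout, fall through to the next server in the list.
        CompletableFuture<CompletableFuture<String>> next = call.handle((ack, error) ->
                error == null
                        ? CompletableFuture.completedFuture(ack)
                        : attempt(message, index + 1));
        return next.thenCompose(future -> future);
    }
}
```

A caller would typically attach a callback, for example producer.send("create-job").thenAccept(ack -> ...), rather than block on the result. Because a timed‑out request may still have reached the server, retries of this kind imply at‑least‑once delivery, and the server side would need to tolerate duplicates.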

Persistence

The commit log and consumer queue are memory‑mapped files similar to RocketMQ. Incoming messages are first written to a buffer, then flushed to disk before the response callback is executed.

MappedByteBuffer commitLogBuffer;       // mmap of commit log file
MappedByteBuffer consumerQueueBuffer;   // mmap of consumer queue file
AtomicLong commitLogBufferPosition = new AtomicLong(0);
AtomicLong commitLogCurPosition = new AtomicLong(0);
AtomicLong lastProcessedOffset = new AtomicLong(0);
AtomicLong currentConsumerQueuePosition = new AtomicLong(0);
AtomicLong consumerPosition = new AtomicLong(0);
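
To make the "mmap + sync flush" step concrete, here is a minimal sketch of an append‑only log backed by a memory‑mapped file. It is not the framework's commit log: the class MappedCommitLog, the length‑prefixed record layout, and the fixed capacity are assumptions chosen to keep the example small.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical append-only log; each record is [int length][UTF-8 payload].
final class MappedCommitLog implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;

    MappedCommitLog(Path file, int capacityBytes) {
        try {
            channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
            // Mapping in READ_WRITE mode grows the file to the requested size.
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, capacityBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends one record, forces it to disk, and returns its start offset.
    synchronized int append(String message) {
        byte[] payload = message.getBytes(StandardCharsets.UTF_8);
        if (buffer.remaining() < Integer.BYTES + payload.length) {
            throw new IllegalStateException("commit log is full");
        }
        int offset = buffer.position();
        buffer.putInt(payload.length);
        buffer.put(payload);
        buffer.force(); // synchronous flush: only acknowledge after this returns
        return offset;
    }

    // Reads the record that starts at the given offset.
    synchronized String read(int offset) {
        int length = buffer.getInt(offset);
        byte[] payload = new byte[length];
        ByteBuffer view = buffer.duplicate(); // independent position, same content
        view.position(offset + Integer.BYTES);
        view.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling force() on every append trades throughput for durability, which matches the synchronous flush described in the text; a consumer queue could then store the returned offsets instead of the payloads.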

Message Retry

Producers retry on timeout or error by selecting the next server. Consumers use a multi‑level delay queue; failed messages move to a longer‑delay queue, and after exhausting retries they are placed in a dead‑letter queue for manual intervention.

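// JDK and Guava imports used by this excerpt; MqCausa, Consumer and log are project types.
import com.google.common.collect.Lists;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

static final Deque<MqCausa.Message> deadMessageQueue = new ArrayDeque<>();
static final List<DelayQueue<DelayedMessage>> delayQueueList = new ArrayList<>(2);
// Indexed by the retries left after a failure: the last retry (index 0) waits longest.
static final List<Long> delayTimes = Lists.newArrayList(10000L, 5000L);

public static void init(Consumer consumer) {
    for (int level = 0; level < delayTimes.size(); level++) {
        DelayQueue<DelayedMessage> delayQueue = new DelayQueue<>();
        delayQueueList.add(delayQueue);
        // One thread per delay level so that every queue is drained, not only the first.
        Thread consumerThread = new Thread(() -> {
            try {
                while (true) {
                    // take() blocks until a delay expires and removes the element itself.
                    DelayedMessage msg = delayQueue.take();
                    consumer.consume(msg.getMessage());
                    log.info("Consumed: {} at {}", msg.getMessage(), System.currentTimeMillis());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                log.warn("Consumer thread interrupted");
            }
        }, "delay-consumer-" + level);
        consumerThread.start();
    }
}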

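public static void reConsume(MqCausa.Message msg) {
    // No retries left: park the message in the dead-letter queue for manual handling.
    if (msg.getRetryTime() <= 0) {
        log.error("msg : {} is dead", msg);
        synchronized (deadMessageQueue) { // ArrayDeque is not thread-safe
            deadMessageQueue.add(msg);
        }
        return;
    }
    // Spend one retry, then pick the delay level that matches the retries left.
    MqCausa.Message next = msg.toBuilder().setRetryTime(msg.getRetryTime() - 1).build();
    // Clamp so a retry count above the number of levels cannot index out of bounds.
    int level = Math.min(next.getRetryTime(), delayTimes.size() - 1);
    delayQueueList.get(level).add(new DelayedMessage(next, delayTimes.get(level)));
}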

static class DelayedMessage implements Delayed {
    private final MqCausa.Message message;
    private final long triggerTime;
    DelayedMessage(MqCausa.Message message, long delay) {
        this.message = message;
        this.triggerTime = System.currentTimeMillis() + delay;
    }
    @Override public long getDelay(TimeUnit unit) {
        return unit.convert(triggerTime - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }
    @Override public int compareTo(Delayed other) {
        return Long.compare(this.triggerTime, ((DelayedMessage) other).triggerTime);
    }
    public MqCausa.Message getMessage() { return message; }
}

Implementation Overview

The final architecture combines asynchronous task submission, load‑balanced server selection, and reliable persistence, achieving higher throughput and lower latency compared to XXL‑Job’s global lock and PowerJob’s single‑server scheduling.

Figure: Architecture Overview

Key functional outcomes:

Asynchronous task operation requests.

Polling‑based load‑balanced consumption.

Additional Diagrams

Service discovery and scheduling flowcharts illustrate worker‑server interaction and the NameServer‑driven load‑balancing process.

Figure: Service Discovery
Figure: Scheduling Flow