How to Build a Reliable Java Delayed Queue for Scalable Backend Systems
This article explains the design, architecture, and implementation details of a Java-based delayed queue, covering use cases, core components, message lifecycle, protocol, current topology, shortcomings, and future improvements for reliable backend processing.
A delayed queue is a message queue that delivers each message only after a specified delay. It is useful for scenarios such as automatically closing unpaid orders, periodically checking refund status, or sending an activation SMS to newly created stores that have not yet uploaded products.
Design Goals
Message reliability: each message must be consumed at least once.
Rich client support: at least PHP and Python.
High availability: support multi‑instance deployment with failover.
Timeliness: a small deviation between the scheduled time and the actual delivery time is acceptable.
Message deletion: users can delete specific messages at any time.
Overall Architecture
The delayed queue consists of four components:
Job Pool – stores metadata of all jobs.
Delay Bucket – ordered queues indexed by time, holding only job IDs.
Timer – continuously scans buckets and moves jobs whose delay has expired into the corresponding Ready Queue.
Ready Queue – holds jobs ready for consumption.
Key Concepts
Basic Concepts
Job – the basic unit representing an asynchronous task, associated with a specific Topic.
Topic – a collection of jobs of the same type, subscribed to by consumers.
Message Structure
Each job must contain the following fields:
Topic – the job type or business name.
Id – unique identifier used for lookup and deletion.
Delay – delay time in seconds (converted to absolute time by the server).
TTR (time‑to‑run) – execution timeout in seconds.
Body – JSON payload processed by the consumer.
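As a rough illustration, the fields above could be modeled as an immutable Java class. This is a minimal sketch rather than the article's actual implementation: the names DelayJob and executeAtMillis, and the validation rules (non-negative delay, positive TTR), are assumptions made for the example.

```java
import java.util.Objects;

/** Hypothetical value class mirroring the job fields listed above. */
final class DelayJob {
    final String topic;       // job type or business name
    final String id;          // unique identifier for lookup and deletion
    final long delaySeconds;  // relative delay supplied by the producer
    final long ttrSeconds;    // time-to-run: execution timeout
    final String body;        // JSON payload, opaque to the queue

    DelayJob(String topic, String id, long delaySeconds, long ttrSeconds, String body) {
        // Assumed validation rules; the article does not specify them.
        if (delaySeconds < 0 || ttrSeconds <= 0) {
            throw new IllegalArgumentException("delay must be >= 0 and TTR must be > 0");
        }
        this.topic = Objects.requireNonNull(topic, "topic");
        this.id = Objects.requireNonNull(id, "id");
        this.delaySeconds = delaySeconds;
        this.ttrSeconds = ttrSeconds;
        this.body = body;
    }

    /** Absolute execution time, as the server would compute it when the job arrives. */
    long executeAtMillis(long nowMillis) {
        return nowMillis + delaySeconds * 1000L;
    }

    public static void main(String[] args) {
        DelayJob job = new DelayJob("order_close", "order_close:1001", 1800, 60, "{\"orderId\":1001}");
        System.out.println(job.id + " is due at " + job.executeAtMillis(System.currentTimeMillis()));
    }
}
```

Passing the current time in as a parameter keeps the conversion from relative delay to absolute time easy to test.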
Message State Transitions
A job can be in one of four states:
ready – executable, waiting for consumption.
delay – not executable, waiting for its time slot.
reserved – fetched by a consumer but not yet finished or deleted.
deleted – consumption completed or explicitly removed.
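The four states can be sketched as a Java enum with an explicit transition table. The allowed transitions below are inferred from the state descriptions and the lifecycle in this article (for example, a reserved job returning to ready when its TTR expires); they are an assumption, not an official specification.

```java
import java.util.EnumSet;
import java.util.Set;

/** Job states with an assumed transition table (illustrative only). */
enum JobState {
    DELAY, READY, RESERVED, DELETED;

    /** States this state may move to, as inferred from the descriptions above. */
    Set<JobState> next() {
        switch (this) {
            case DELAY:    return EnumSet.of(READY, DELETED);     // delay expires, or job is deleted
            case READY:    return EnumSet.of(RESERVED, DELETED);  // consumer pops, or job is deleted
            case RESERVED: return EnumSet.of(READY, DELETED);     // TTR expires, or consumer finishes
            default:       return EnumSet.noneOf(JobState.class); // DELETED is terminal
        }
    }

    boolean canMoveTo(JobState target) {
        return next().contains(target);
    }

    public static void main(String[] args) {
        for (JobState s : values()) {
            System.out.println(s + " -> " + s.next());
        }
    }
}
```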
Message Storage
Job Pool uses a simple key/value map (job id → job struct). Delay Bucket is an ordered queue, implemented with Redis sorted sets (zset). Ready Queue can be a plain list or queue. Redis satisfies all these requirements.
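To make the sorted-set idea concrete, here is an in-memory stand-in for a Delay Bucket, where the score is the absolute execution time. It only imitates the behavior of the Redis commands named in the comments (ZADD, ZRANGEBYSCORE, ZREM) so that it runs without a Redis server; the class name InMemoryDelayBucket and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory imitation of a sorted set used as a Delay Bucket (illustrative only). */
final class InMemoryDelayBucket {
    // score (absolute execution time in ms) -> job IDs due at that time
    private final TreeMap<Long, TreeSet<String>> byScore = new TreeMap<>();
    // job ID -> current score, so that re-adding an ID replaces its old score
    private final Map<String, Long> scoreOf = new HashMap<>();

    /** Similar in effect to: ZADD bucket executeAtMillis jobId */
    void add(String jobId, long executeAtMillis) {
        remove(jobId);
        scoreOf.put(jobId, executeAtMillis);
        byScore.computeIfAbsent(executeAtMillis, k -> new TreeSet<>()).add(jobId);
    }

    /** Similar in effect to: ZREM bucket jobId */
    boolean remove(String jobId) {
        Long old = scoreOf.remove(jobId);
        if (old == null) return false;
        TreeSet<String> ids = byScore.get(old);
        ids.remove(jobId);
        if (ids.isEmpty()) byScore.remove(old);
        return true;
    }

    /** Similar in effect to: ZRANGEBYSCORE bucket -inf now, then ZREM for each member. */
    List<String> pollDue(long nowMillis) {
        List<String> due = new ArrayList<>();
        NavigableMap<Long, TreeSet<String>> head = byScore.headMap(nowMillis, true);
        for (TreeSet<String> ids : head.values()) due.addAll(ids);
        head.clear();                       // clearing the view removes entries from byScore
        scoreOf.keySet().removeAll(due);
        return due;
    }

    public static void main(String[] args) {
        InMemoryDelayBucket bucket = new InMemoryDelayBucket();
        bucket.add("order_close:1", 100);
        bucket.add("order_close:2", 200);
        System.out.println(bucket.pollDue(150)); // only the first job is due
    }
}
```

Because members are ordered by score, the Timer only needs to read the head of the set to find due jobs, instead of inspecting every job.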
Communication Protocol
HTTP with a JSON text protocol is used so that clients in multiple languages can be supported. Supported commands:
{"command":"add","topic":"xxx","id":"xxx","delay":30,"TTR":60,"body":"xxx"}
{"command":"pop","topic":"xxx"}
{"command":"finish","id":"xxx"}
{"command":"delete","id":"xxx"}
Responses are of the form:
{"success":true/false,"error":"error reason","id":"xxx","value":"job body"}
Job IDs must be globally unique, typically composed of the topic plus a business-unique identifier.
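A client could render these commands with a small helper like the one below. This is a sketch under assumptions: the class DelayQueueCommands and its methods are hypothetical, the output uses standard double-quoted JSON, and the HTTP transport itself is left out. A real client would more likely use a JSON library than hand-written escaping.

```java
/** Hypothetical helper that renders the four commands as JSON request bodies. */
final class DelayQueueCommands {
    private DelayQueueCommands() {}

    static String add(String topic, String id, long delaySeconds, long ttrSeconds, String body) {
        return "{\"command\":\"add\",\"topic\":" + quote(topic)
                + ",\"id\":" + quote(id)
                + ",\"delay\":" + delaySeconds
                + ",\"TTR\":" + ttrSeconds
                + ",\"body\":" + quote(body) + "}";
    }

    static String pop(String topic) {
        return "{\"command\":\"pop\",\"topic\":" + quote(topic) + "}";
    }

    static String finish(String id) {
        return "{\"command\":\"finish\",\"id\":" + quote(id) + "}";
    }

    static String delete(String id) {
        return "{\"command\":\"delete\",\"id\":" + quote(id) + "}";
    }

    /** Minimal JSON string escaping: quotes, backslashes, and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(add("order_close", "order_close:1001", 1800, 60, "{\"orderId\":1001}"));
        System.out.println(pop("order_close"));
    }
}
```

Note that the body is itself a JSON document carried as a string, so its quotes must be escaped when it is embedded in the command.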
Job Lifecycle Example
A user places an order; the system creates a job with topic “order_close”, a delay of 1800 seconds, TTR = 60, and the order data in the body.
The queue stores the job metadata in the Job Pool, calculates the absolute execution time, and places the job ID into a Delay Bucket.
The Timer scans buckets; when the delay expires, it moves the job ID to the Ready Queue.
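The consumer polls the Ready Queue and receives the job. When the job is handed out, the server re-inserts its ID into a Delay Bucket with a new execution time derived from the TTR, so the job becomes ready again if the consumer does not finish it within the TTR.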
After successful processing, the consumer sends a finish command, and the server deletes the job metadata.
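The lifecycle above can be traced end to end with a small in-memory model. The sketch below is an assumption-laden simplification: MiniDelayQueue and its methods are hypothetical names, the clock is passed in explicitly, and the Delay Bucket is a plain map rather than a sorted set. Its purpose is to show how re-delaying a popped job by its TTR yields at-least-once delivery.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal in-memory model of add, Timer scan, pop with TTR re-queue, and finish. */
final class MiniDelayQueue {
    private static final class Job {
        final String topic;
        final long ttrMillis;
        Job(String topic, long ttrMillis) { this.topic = topic; this.ttrMillis = ttrMillis; }
    }

    private final Map<String, Job> jobPool = new HashMap<>();               // id -> metadata
    private final Map<String, Long> delayBucket = new HashMap<>();          // id -> absolute due time
    private final Map<String, Deque<String>> readyQueues = new HashMap<>(); // topic -> ready ids

    void add(String topic, String id, long delaySec, long ttrSec, long nowMillis) {
        jobPool.put(id, new Job(topic, ttrSec * 1000L));
        delayBucket.put(id, nowMillis + delaySec * 1000L);
    }

    /** Timer step: move every due job ID into the Ready Queue of its topic. */
    void tick(long nowMillis) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Long> e : delayBucket.entrySet()) {
            if (e.getValue() <= nowMillis) due.add(e.getKey());
        }
        for (String id : due) {
            delayBucket.remove(id);
            Job job = jobPool.get(id);
            if (job != null) {
                readyQueues.computeIfAbsent(job.topic, t -> new ArrayDeque<>()).addLast(id);
            }
        }
    }

    /** pop: hand out a ready job and re-delay it by its TTR so it comes back unless finished. */
    String pop(String topic, long nowMillis) {
        Deque<String> queue = readyQueues.get(topic);
        while (queue != null && !queue.isEmpty()) {
            String id = queue.pollFirst();
            Job job = jobPool.get(id);
            if (job == null) continue;  // finished or deleted while waiting in the Ready Queue
            delayBucket.put(id, nowMillis + job.ttrMillis);
            return id;
        }
        return null;
    }

    /** finish (or delete): drop the metadata so the job is never delivered again. */
    void finish(String id) {
        jobPool.remove(id);
        delayBucket.remove(id);
    }

    public static void main(String[] args) {
        MiniDelayQueue q = new MiniDelayQueue();
        q.add("order_close", "order_close:1001", 1800, 60, 0L);
        q.tick(1_800_000L);
        String id = q.pop("order_close", 1_800_000L);
        System.out.println("popped " + id);
        q.finish(id);
    }
}
```

If the consumer crashes after pop and never sends finish, the next scan after the TTR moves the same job back to the Ready Queue, which is why consumers should be idempotent.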
Current Physical Topology
A centralized storage mechanism is used. When multiple Timer instances run, they may concurrently scan buckets, causing duplicate insertion into the Ready Queue. A simple distributed lock using Redis SETNX ensures only one Timer thread scans a bucket at a time.
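The per-bucket lock can be sketched as follows. To stay runnable without a Redis server, the example uses an in-memory stand-in; the LockStore interface, its method names, the "lock:" key prefix, and the 5-second lease are all assumptions for illustration. One deliberate difference from plain SETNX: the sketch gives the lock an expiry and an owner check, which in Redis would correspond to SET with the NX and PX options, so that a crashed Timer cannot hold a bucket forever.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class BucketLockDemo {
    /** Hypothetical minimal view of the lock store. */
    interface LockStore {
        boolean setIfAbsent(String key, String owner, long nowMillis, long ttlMillis);
        void releaseIfOwner(String key, String owner);
    }

    /** In-memory stand-in for Redis so the sketch runs on its own. */
    static final class InMemoryLockStore implements LockStore {
        private static final class Lease {
            final String owner;
            final long expiresAt;
            Lease(String owner, long expiresAt) { this.owner = owner; this.expiresAt = expiresAt; }
        }

        private final ConcurrentMap<String, Lease> leases = new ConcurrentHashMap<>();

        @Override
        public boolean setIfAbsent(String key, String owner, long nowMillis, long ttlMillis) {
            Lease fresh = new Lease(owner, nowMillis + ttlMillis);
            // compute() is atomic per key: keep the current holder unless its lease has expired.
            Lease winner = leases.compute(key,
                    (k, cur) -> (cur == null || cur.expiresAt <= nowMillis) ? fresh : cur);
            return winner == fresh;
        }

        @Override
        public void releaseIfOwner(String key, String owner) {
            // Only the holder may release; returning null removes the mapping.
            leases.computeIfPresent(key, (k, cur) -> cur.owner.equals(owner) ? null : cur);
        }
    }

    /** Runs the scan only if this Timer instance wins the bucket lock. */
    static boolean scanWithLock(LockStore store, String bucket, String timerId,
                                long nowMillis, Runnable scan) {
        String key = "lock:" + bucket;
        if (!store.setIfAbsent(key, timerId, nowMillis, 5_000L)) return false;
        try {
            scan.run();
            return true;
        } finally {
            store.releaseIfOwner(key, timerId);
        }
    }

    public static void main(String[] args) {
        LockStore store = new InMemoryLockStore();
        boolean ran = scanWithLock(store, "bucket-1", "timer-A", System.currentTimeMillis(),
                () -> System.out.println("timer-A scans bucket-1"));
        System.out.println("scan ran: " + ran);
    }
}
```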
Design Shortcomings
The Timer scans in an infinite loop, wasting CPU when no jobs are due.
Consumers use short‑polling HTTP and fetch only one job per request, increasing network I/O under heavy load.
Storage relies entirely on Redis, so durability is limited to what Redis persistence can offer.
Scale‑out depends on external components such as Nginx.
Future Architecture Directions
Implement Timer with wait/notify to avoid busy loops.
Provide a TCP long‑connection API for push or long‑polling job reservation.
Develop a custom storage solution (embedded DB or file‑based structures) to guarantee persistence.
Build an internal name‑server.
Offer native support for periodic tasks.
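As one possible shape for the first item, a Timer that blocks instead of busy-looping, the sketch below uses the JDK's java.util.concurrent.DelayQueue. This is a swapped-in technique, not the article's design: DelayQueue waits on a condition internally, which gives the same effect as hand-written wait/notify. The class names BlockingTimerDemo and DueJob are hypothetical, and the example is single-process only.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

final class BlockingTimerDemo {
    /** A job ID paired with the moment it becomes due. */
    static final class DueJob implements Delayed {
        final String id;
        final long dueAtNanos;

        DueJob(String id, long delayMillis) {
            this.id = id;
            this.dueAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(dueAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DelayQueue<DueJob> timer = new DelayQueue<>();
        timer.put(new DueJob("job-b", 200));
        timer.put(new DueJob("job-a", 50));
        // take() parks the thread until the earliest job is due, so no polling loop is needed.
        for (int i = 0; i < 2; i++) {
            System.out.println("ready: " + timer.take().id);
        }
    }
}
```

When a job with an earlier due time is added, the waiting thread is woken and recomputes how long to sleep, which is the behavior a wait/notify Timer would need to implement by hand.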
21CTO
