How to Build a Scalable Order Cancellation System: 3 Advanced Delayed‑Task Solutions
This article dissects a common interview question about automatically canceling unpaid orders after 30 minutes, explains why naive cron jobs fail at scale, and presents three robust backend designs—Redis ZSet polling, message‑queue delayed messages, and time‑wheel timers—along with practical code snippets and pitfalls to avoid.
Background and Problem
A candidate was asked in an interview how to automatically cancel orders that remain unpaid for 30 minutes. A simple answer using a scheduled task that scans the whole order table every minute is insufficient for high‑traffic systems because it causes heavy database load, latency, and reliability issues.
Why a Plain Cron Job Is Inadequate
Poor timeliness : a once-a-minute scan can cancel an order up to a minute late (plus the scan's own runtime), so second-level precision is out of reach.
Database pressure : full-table scans on large tables (e.g., millions of rows) consume excessive CPU and I/O and compete with live order traffic.
Resource waste : most scans find few or no expired orders, yet the task still runs every time.
The core idea is to stop polling the database and let overdue orders "find themselves" through event-driven mechanisms or a cheap time-ordered index.
Solution 1: Redis Expiration Listener (A Trap)
Some suggest storing order IDs in Redis with a 30‑minute TTL and relying on the expired‑event callback.
Drawbacks:
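Unreliable – keyspace notifications are delivered over Pub/Sub with fire-and-forget semantics, so if the service restarts or the network drops the event, the cancellation is lost and never redelivered.
Significant latency – Redis emits the expired event when it actually deletes the key, not when the TTL reaches zero. Deletion happens lazily on access or during periodic background sampling, so under load the event can lag the deadline noticeably, potentially by minutes.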
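To make the mechanism (and its weakness) concrete, here is a minimal sketch of such a listener. It assumes a redis-py style client created with decode_responses=True, database 0, and a hypothetical key prefix order:unpaid:; the function names are illustrative and not taken from the original article.

```python
EXPIRED_CHANNEL = "__keyevent@0__:expired"  # expired-key events for database 0
KEY_PREFIX = "order:unpaid:"                # hypothetical key naming scheme


def mark_unpaid(r, order_id, ttl_seconds=1800):
    """Store a marker key that Redis will expire after the payment window."""
    r.set(KEY_PREFIX + order_id, "1", ex=ttl_seconds)


def listen_for_expired(r, on_expire):
    """Call on_expire(order_id) for every expired-key event that is received.

    Events published while this loop is not connected are simply gone:
    Pub/Sub keeps no backlog, which is the reliability trap described above.
    """
    r.config_set("notify-keyspace-events", "Ex")  # E = keyevent, x = expired
    pubsub = r.pubsub()
    pubsub.subscribe(EXPIRED_CHANNEL)
    for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip the subscribe confirmation
        key = message["data"]
        if key.startswith(KEY_PREFIX):
            on_expire(key[len(KEY_PREFIX):])
```

Nothing in this loop can recover an event that was published during a restart, which is why the approach is treated as a trap rather than a solution.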
Solution 2: Redis ZSet + Polling (Recommended Standard)
This approach uses a Redis sorted set (ZSet) where the score stores the exact expiration timestamp and the value stores the order ID.
Production (enqueue) step :
ZADD delay_queue <timestamp_30min_later> <OrderId>
Consumption (polling) step :
ZRANGEBYSCORE delay_queue 0 <current_timestamp> LIMIT 0 10
A background thread runs every second, fetches expired entries, and processes cancellations. Advantages include high performance (in‑memory), second‑level accuracy, and low latency.
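The two steps can be sketched as a pair of small functions. This is a minimal illustration, assuming a redis-py style client passed in as r; the names enqueue, poll_once, and cancel_order are hypothetical. Using the return value of ZREM as a claim is one simple choice so that only the worker that actually removed the entry cancels the order.

```python
import time

DELAY_QUEUE = "delay_queue"
TIMEOUT_SECONDS = 30 * 60


def enqueue(r, order_id, now=None):
    """Production step: score = the moment the order should be cancelled."""
    now = time.time() if now is None else now
    r.zadd(DELAY_QUEUE, {order_id: now + TIMEOUT_SECONDS})


def poll_once(r, cancel_order, now=None, batch=10):
    """Consumption step: fetch up to `batch` due orders and cancel them.

    Returns the number of orders this call handled.
    """
    now = time.time() if now is None else now
    due = r.zrangebyscore(DELAY_QUEUE, 0, now, start=0, num=batch)
    handled = 0
    for order_id in due:
        # ZREM returns how many members it removed; 1 means this worker
        # won the entry, 0 means another worker already took it.
        if r.zrem(DELAY_QUEUE, order_id) == 1:
            cancel_order(order_id)
            handled += 1
    return handled
```

A background thread would call poll_once in a loop with a one-second sleep. Note that this sketch removes the entry before the cancellation runs, so a crash between the two steps loses the task.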
Advanced safeguard : Use an ACK mechanism or a two‑phase process. Instead of deleting directly, atomically move the order ID from delay_queue to a processing_queue via a Lua script, then delete it after successful business logic. A watchdog thread rescans processing_queue for stuck tasks, guaranteeing at‑least‑once processing.
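One way to sketch this two-phase flow is shown below. It assumes a redis-py style client with decode_responses=True, and it models processing_queue as a second sorted set whose score is the claim deadline; that modelling choice, the 60-second visibility timeout, and the function names are assumptions for illustration.

```python
# Lua runs atomically inside Redis: due members move from KEYS[1] to KEYS[2].
# ARGV[1] = max score to pick up, ARGV[2] = batch size, ARGV[3] = new score.
CLAIM_SCRIPT = """
local ids = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, ARGV[2])
for _, id in ipairs(ids) do
  redis.call('ZREM', KEYS[1], id)
  redis.call('ZADD', KEYS[2], ARGV[3], id)
end
return ids
"""

DELAY_QUEUE = "delay_queue"
PROCESSING_QUEUE = "processing_queue"


def claim_due(r, now, batch=10, visibility_timeout=60):
    """Phase 1: move due orders into processing_queue with a claim deadline."""
    return r.eval(CLAIM_SCRIPT, 2, DELAY_QUEUE, PROCESSING_QUEUE,
                  now, batch, now + visibility_timeout)


def ack(r, order_id):
    """Phase 2: drop the order once the cancellation has succeeded."""
    return r.zrem(PROCESSING_QUEUE, order_id)


def requeue_stuck(r, now, batch=10):
    """Watchdog: claims past their deadline go back to delay_queue, due now."""
    return r.eval(CLAIM_SCRIPT, 2, PROCESSING_QUEUE, DELAY_QUEUE,
                  now, batch, now)
```

A worker calls claim_due, cancels each returned order, then calls ack. If it dies in between, requeue_stuck hands the order to another worker after the deadline, which is where the at-least-once guarantee comes from. On Redis Cluster both keys would need to hash to the same slot.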
Solution 3: Message Queue / Time Wheel (Architect‑Level)
A. Message Queues (RocketMQ / RabbitMQ)
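RocketMQ : Versions 4.x support only a fixed set of delay levels (18 by default, from 1s to 2h; 30m is one of them), while 5.0 adds arbitrary delay support. For timeouts that do not match a preset level, use 5.0 or fall back to Redis ZSet.
RabbitMQ : Native TTL + dead‑letter queue suffers from “head‑of‑queue blocking”: a message is dead‑lettered only once it reaches the head of the queue, so a short‑TTL message stuck behind a long‑TTL one fires late. The rabbitmq_delayed_message_exchange plugin solves this.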
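Head-of-queue blocking is easier to see with numbers. The following is a deliberately simplified model of a FIFO queue that checks expiry only at its head; it is not RabbitMQ client code, and the function name is hypothetical.

```python
from collections import deque


def dead_letter_times(messages):
    """Return {msg_id: second at which the message is dead-lettered}.

    `messages` is a list of (msg_id, ttl_seconds), all published at t=0 in
    list order. Expiry is evaluated only for the message at the head.
    """
    queue = deque(messages)
    released = {}
    clock = 0
    while queue:
        msg_id, ttl = queue.popleft()
        # The head leaves when its own TTL has elapsed; everything behind
        # it has to wait at least that long, however short its TTL is.
        clock = max(clock, ttl)
        released[msg_id] = clock
    return released


# A 10-second message queued behind a 30-minute one is released late.
print(dead_letter_times([("long", 1800), ("short", 10)]))
```

When every message carries the same TTL, as in the 30-minute order case, the queue is already ordered by deadline and the problem does not appear; it shows up as soon as delays are mixed.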
B. Hashed Wheel Timer (Time Wheel)
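The algorithm divides time into 60 slots (one per second), arranged in a ring. Because 1800 seconds is longer than one 60‑second rotation, an order expiring in 30 minutes is placed in slot (current + 1800) mod 60 together with a count of the rotations it must still wait; 1800 seconds is exactly 30 trips around the wheel. A rotating pointer advances one slot each second and fires only the entries in the current slot whose rotation count is exhausted, providing pure in‑memory, ultra‑fast triggering.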
Pros : Extremely efficient memory operations.
Cons : Not durable; data is lost on restart.
Production practice : Combine Redis ZSet for persistence with an in‑memory time wheel for high‑frequency triggers. Load near‑term tasks into the wheel at startup.
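A single-level wheel of this kind fits in a few lines. The sketch below is illustrative only: the class and method names are hypothetical, time is advanced by calling tick() manually, and a real deployment would drive tick() from a timer thread and reload pending tasks from Redis on startup.

```python
class TimeWheel:
    """Single-level hashed wheel timer with one slot per second."""

    def __init__(self, slots=60):
        self.slots = slots
        self.current = 0
        # Each bucket holds [remaining_rounds, task_id] entries.
        self.buckets = [[] for _ in range(slots)]

    def add(self, task_id, delay_seconds):
        if delay_seconds < 1:
            raise ValueError("delay must be at least one tick")
        index = (self.current + delay_seconds) % self.slots
        # Number of times the pointer passes this slot before the task fires.
        rounds = (delay_seconds - 1) // self.slots
        self.buckets[index].append([rounds, task_id])

    def tick(self):
        """Advance one second and return the task IDs that are now due."""
        self.current = (self.current + 1) % self.slots
        due, waiting = [], []
        for entry in self.buckets[self.current]:
            if entry[0] == 0:
                due.append(entry[1])
            else:
                entry[0] -= 1
                waiting.append(entry)
        self.buckets[self.current] = waiting
        return due


wheel = TimeWheel()
wheel.add("order-1001", 1800)  # fires on the 1800th tick
```

Each tick touches only one bucket, so the cost per second depends on how many tasks share that slot rather than on the total number of pending orders.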
Defensive Q&A (Handling Edge Cases)
Q1: Prevent duplicate cancellations when multiple nodes poll the ZSet. Use a Lua script to run ZRANGEBYSCORE and ZREM atomically, ensuring only one consumer removes each entry. Also make the cancellation service idempotent, so a repeated attempt on the same order has no further effect.
Q2: Scaling ZSet to billions of orders. Shard the delay queue into multiple keys (e.g., delay_queue_0 … delay_queue_9) based on order‑ID hash, and run parallel pollers to multiply throughput.
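The shard selection for Q2 can be as small as the sketch below. It assumes string order IDs and 10 shards; crc32 is used here instead of Python's built-in hash() because the latter is randomized per process for strings, so different nodes would disagree on the shard.

```python
import zlib

SHARD_COUNT = 10  # matches delay_queue_0 ... delay_queue_9


def shard_key(order_id, shards=SHARD_COUNT):
    """Map an order ID to a stable delay-queue key."""
    return f"delay_queue_{zlib.crc32(order_id.encode('utf-8')) % shards}"


def all_shard_keys(shards=SHARD_COUNT):
    """Keys that the pollers iterate over, one poller per key if desired."""
    return [f"delay_queue_{i}" for i in range(shards)]
```

Producers call shard_key when enqueuing, while each poller is assigned one or more entries from all_shard_keys.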
Q3: Middleware outage. Keep a fallback offline scan (e.g., a T+1 job that runs nightly against a replica) to find unpaid orders the delay queue missed, ensuring eventual consistency.
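The idempotency from Q1 and the fallback scan from Q3 can share one primitive: a conditional update that changes an order only while it is still unpaid. The sketch below uses SQLite from the standard library as a stand-in for the production database; the orders table, its columns, and the status values are hypothetical.

```python
import sqlite3


def cancel_if_unpaid(conn, order_id):
    """Idempotent cancel: returns True only for the call that changed the row."""
    cur = conn.execute(
        "UPDATE orders SET status = 'CANCELLED' "
        "WHERE id = ? AND status = 'UNPAID'",
        (order_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def sweep_overdue(conn, now, timeout_seconds=1800):
    """Fallback scan: cancel unpaid orders older than the payment window."""
    rows = conn.execute(
        "SELECT id FROM orders WHERE status = 'UNPAID' AND created_at <= ?",
        (now - timeout_seconds,),
    ).fetchall()
    return [order_id for (order_id,) in rows if cancel_if_unpaid(conn, order_id)]
```

Because the WHERE clause checks the current status, a second cancellation of the same order, or a cancellation racing with a payment, matches zero rows and does nothing. An index covering status and created_at keeps the sweep from becoming the full-table scan the design set out to avoid.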
Standard Interview Answer Template
Architecture choice : Prefer Redis ZSet as a lightweight delayed queue (Score = expiration timestamp, Value = order ID).
Core flow : A background thread runs every second and uses ZRANGEBYSCORE to find expired orders; a Lua script removes them atomically so that only one node claims each order, and that node then performs the cancellation.
Reliability : Introduce a processing queue with ACK semantics; make the cancellation API idempotent.
Advanced optimization : For massive scale, switch to RocketMQ 5.0 arbitrary‑delay messages.
Fallback : Retain a low‑frequency database scan task to guarantee final consistency.
Key Takeaways
Designing delayed tasks for massive data requires moving away from database polling toward event‑driven or in‑memory structures that provide precise timing, low latency, and fault tolerance. Combining Redis ZSet with idempotent processing, optional MQ delay messages, and a fallback scan yields a robust solution applicable not only to order cancellation but also to coupon expiry, appointment reminders, and refund processing.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
