How to Build a Low‑Latency Timeout Center with Redis: Architecture and Design

This article explains the drawbacks of traditional high‑latency timeout centers and presents a Redis‑based low‑latency design, detailing task storage, scheduling, topic and queue structures, two‑phase consumption, retry control, and the resulting performance and reliability benefits.

Alibaba Cloud Developer

Background

Many products have lifecycle designs that require actions at specific points in time, such as closing an order that has gone unpaid for too long. The TimeOutCenter (TOC) stores and schedules these timeout tasks; when a task that needs low latency is dispatched late, the product behavior that depends on it is delayed as well.

Traditional High‑Latency Scheme

Overall Framework

In the traditional design, tasks are written to a timeout task database. A timer triggers a database scanner that loads expired tasks into an in‑memory queue, from which business processors handle them and update the task status.

Task Library Design

The task library uses sharding (e.g., 8 databases, 1024 tables). Key fields include:

job_id        bigint unsigned   // globally unique timeout task ID
gmt_create    datetime          // creation time
gmt_modified  datetime          // modification time
biz_id        bigint unsigned   // business ID (order ID)
biz_type      bigint unsigned   // business type
status        tinyint           // task status (0 pending, 2 processed, 3 cancelled)
action_time   datetime          // scheduled execution time
attribute     varchar           // extra data

Timer Scheduling Design

The timer fires every 10 seconds, obtains the cluster IP list from a config server, assigns tables to machines, and each machine scans its assigned tables for pending tasks. Tasks are enqueued only if the in‑memory queue has capacity.
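
The table assignment step can be sketched as follows. This is a minimal illustration, assuming a simple round-robin split over a sorted IP list; the function name `assign_tables` and the example IPs are hypothetical, and the article does not state which assignment rule the real system uses.

```python
def assign_tables(machine_ips, table_count):
    """Map each machine IP to the list of table indexes it should scan.

    Assumption: round-robin over the sorted IP list, so every machine
    that sees the same IP list computes the same plan independently.
    """
    if not machine_ips:
        raise ValueError("at least one machine is required")
    ordered = sorted(machine_ips)
    assignment = {ip: [] for ip in ordered}
    for table in range(table_count):
        owner = ordered[table % len(ordered)]
        assignment[owner].append(table)
    return assignment


# Three machines sharing 8 tables; each table has exactly one owner.
plan = assign_tables(["10.0.0.2", "10.0.0.1", "10.0.0.3"], 8)
```

Because each table has a single owner, adding machines beyond the table count would not add any scanning capacity, which is the concurrency limit discussed under Drawbacks.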

Drawbacks

The timer interval adds latency to task processing.

Database sharding limits concurrency; a table can be owned by only one machine.

Scanning large tables is time‑consuming.

Worst‑case latency is roughly the timer interval plus the scan time, which can be large under heavy load.

Low‑Latency Scheme

Overall Framework

Tasks are first stored in the same sharded task library. Then the job ID and action time are placed into a Redis cluster. When the action time arrives, the job ID is popped from Redis, the full task is fetched from the database, processed, and its status updated.

Redis Storage Design

Each topic defines a name, slot amount (power‑of‑two), and type. Messages are stored in Redis Sorted Sets distributed across slots.

StoreQueue Design

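Messages are stored in a Sorted Set where the score is the action timestamp. Popping retrieves the message with the smallest score, provided that score is less than or equal to the current time; messages with later scores are not yet due and stay in the set.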
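
The pop rule can be illustrated with a small in-memory stand-in for one Sorted Set. This is only a sketch of the semantics, not Redis code; the class name `StoreQueue` and its methods are hypothetical.

```python
import bisect


class StoreQueue:
    """In-memory model of one sorted set: (score, member) pairs kept sorted."""

    def __init__(self):
        self._items = []  # sorted list of (action_time, job_id)

    def add(self, job_id, action_time):
        bisect.insort(self._items, (action_time, job_id))

    def pop_due(self, now):
        """Remove and return the earliest job whose action_time <= now, else None."""
        if self._items and self._items[0][0] <= now:
            return self._items.pop(0)[1]
        return None


queue = StoreQueue()
queue.add(job_id=101, action_time=1_700_000_030)
queue.add(job_id=102, action_time=1_700_000_010)
first = queue.pop_due(now=1_700_000_020)   # job 102 is due
second = queue.pop_due(now=1_700_000_020)  # job 101 is not due yet
```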

PrepareQueue Design

To guarantee at‑least‑once delivery, a Lua script moves each message from StoreQueue to PrepareQueue before it is handed to a consumer. Successful processing deletes the message from PrepareQueue; a failure moves it back to StoreQueue. Consumption is therefore two‑phase: prepare first, then commit or roll back.
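
A move of this kind might look like the sketch below. The Lua text is an illustration written for this article, not the original system's script, and the key names and the `pop_to_prepare` helper are hypothetical. The only client call used is the redis-py style `eval(script, numkeys, *keys_and_args)`, so any object with that method can be passed in.

```python
# Lua executed inside Redis, so the read, remove, and add happen atomically.
# KEYS[1] = store set, KEYS[2] = prepare set
# ARGV[1] = current time, ARGV[2] = score to use in the prepare set
MOVE_TO_PREPARE = """
local due = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, 1)
if #due == 0 then
  return false
end
redis.call('ZREM', KEYS[1], due[1])
redis.call('ZADD', KEYS[2], ARGV[2], due[1])
return due[1]
"""


def pop_to_prepare(client, store_key, prepare_key, now, prepare_score):
    """Move one due message from the store set to the prepare set.

    Returns the moved member, or None when nothing is due
    (a Lua `false` is returned to the client as a nil reply).
    """
    return client.eval(MOVE_TO_PREPARE, 2, store_key, prepare_key, now, prepare_score)
```

In a Redis cluster, a script may only touch keys that live on the same node, so the two keys passed here would need to hash to the same cluster slot.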

DeadQueue Design

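After 16 retries, a message is moved to DeadQueue. The DeadQueue key shares a Redis hash tag (the part of the key inside braces) with the other queue keys of its slot, which keeps related keys on the same Redis node so that one script can operate on all of them.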

Message Production

Producers compute the slot key by taking the CRC32 hash of the message's slot basis and reducing it to the topic's slot amount, then add the message to that slot's Sorted Set with the action time as the score.
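
A minimal sketch of the slot computation follows. The key layout (`{topic:slot}:store` and so on) and the function names are assumptions made for illustration; the article does not give the real key format. Because the slot amount is a power of two, the modulo can be written as a bit mask.

```python
import zlib


def slot_index(slot_basis: str, slot_amount: int) -> int:
    """Pick a slot from the CRC32 of the slot basis."""
    if slot_amount <= 0 or slot_amount & (slot_amount - 1):
        raise ValueError("slot_amount must be a power of two")
    # For a power of two, `x & (n - 1)` equals `x % n`.
    return zlib.crc32(slot_basis.encode("utf-8")) & (slot_amount - 1)


def queue_keys(topic: str, slot: int) -> dict:
    """Build the three queue keys of one slot (hypothetical naming).

    The braces form a Redis Cluster hash tag: only the text inside them
    is hashed, so all three keys land on the same node.
    """
    tag = "{%s:%d}" % (topic, slot)
    return {name: f"{tag}:{name}" for name in ("store", "prepare", "dead")}


slot = slot_index("order-42", 8)
keys = queue_keys("toc", slot)
```

With redis-py, the write itself would then be along the lines of `client.zadd(keys["store"], {job_id: action_time})`.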

Message Consumption

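Workers (threads) are assigned slots via Zookeeper. Each worker repeatedly runs a script that uses ZRANGEBYSCORE to find messages whose score is at or below the current timestamp and moves them to PrepareQueue. ZRANGEBYSCORE on its own only reads the set, so the removal has to happen in the same script.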

At‑Least‑Once Guarantee

The two‑phase approach mirrors a bank transfer: resources are frozen in PrepareQueue, then either committed or rolled back based on consumer outcome.
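
The freeze, commit, and rollback steps can be modeled in memory as follows. This is a sketch of the control flow only, with plain dictionaries standing in for the Redis Sorted Sets; `SlotQueues`, `consume`, and the choice to retry immediately on rollback are all assumptions.

```python
class SlotQueues:
    """In-memory model of one slot's StoreQueue and PrepareQueue."""

    def __init__(self):
        self.store = {}    # job_id -> action_time
        self.prepare = {}  # job_id -> time the job was frozen

    def add(self, job_id, action_time):
        self.store[job_id] = action_time

    def prepare_one(self, now):
        """Phase 1: freeze the earliest due job by moving it to prepare."""
        due = [job for job, at in self.store.items() if at <= now]
        if not due:
            return None
        job_id = min(due, key=self.store.get)
        del self.store[job_id]
        self.prepare[job_id] = now
        return job_id

    def commit(self, job_id):
        """Phase 2, success: the job is done and disappears."""
        self.prepare.pop(job_id, None)

    def rollback(self, job_id, retry_at):
        """Phase 2, failure: the job returns to the store for another try."""
        if job_id in self.prepare:
            del self.prepare[job_id]
            self.store[job_id] = retry_at


def consume(queues, now, handler):
    """Run one prepare/commit-or-rollback cycle; None means nothing was due."""
    job_id = queues.prepare_one(now)
    if job_id is None:
        return None
    try:
        handler(job_id)
    except Exception:
        queues.rollback(job_id, retry_at=now)
        return False
    queues.commit(job_id)
    return True
```

A job is never dropped between the two phases: it is either in the store, in prepare, or finished, which is why a failed handler leads to redelivery rather than loss.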

Retry Control

PrepareQueue scores pack a millisecond timestamp and the remaining retry count into one number (timestamp*1000 + retry), so the retry count can be recovered as the score modulo 1000. On failure, the message is moved back to StoreQueue with a decremented retry count; after exhausting retries it goes to DeadQueue.
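
The score arithmetic can be sketched as below. The function names are hypothetical, and the rule that a message is dead once it fails with zero retries left is one reading of "after 16 retries", stated here as an assumption.

```python
MAX_RETRIES = 16  # retry budget taken from the article


def encode_score(timestamp_ms: int, retries_left: int) -> int:
    """Pack a timestamp and a retry count into one sorted-set score."""
    if not 0 <= retries_left < 1000:
        raise ValueError("retry count must fit in three decimal digits")
    return timestamp_ms * 1000 + retries_left


def decode_score(score: int) -> tuple:
    """Split a score back into (timestamp_ms, retries_left)."""
    return divmod(score, 1000)


def on_failure(score: int) -> tuple:
    """Decide where a failed message goes next.

    Assumption: a failure with no retries left sends the message to the
    dead queue; otherwise it returns to the store with one retry fewer.
    """
    _, retries_left = decode_score(score)
    if retries_left == 0:
        return ("dead", 0)
    return ("store", retries_left - 1)


score = encode_score(1_700_000_000_000, MAX_RETRIES)
```

Redis stores scores as double‑precision floats, which represent integers exactly only up to 2^53. A present‑day millisecond timestamp multiplied by 1000 is still below that limit, so the packed value survives the round trip.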

Advantages

Low latency: direct pop from Redis eliminates database scans.

Configurable concurrency: determined by slot count and worker number.

High performance: Redis can handle >100 k QPS, and finding the earliest due message in a Sorted Set costs O(log N) rather than a table scan.

High reliability: at‑least‑once delivery and controlled retries ensure that a task is either processed or ends up in DeadQueue, never silently lost.
