How Ctrip Built a High‑Availability Distributed Cache with Eventual and Strong Consistency

This article details Ctrip Finance's design and implementation of a unified high‑availability Redis‑based distributed cache, covering eventual and strong consistency scenarios, multi‑region deployment, lock mechanisms, message‑driven updates, failure handling, and performance results.

dbaplus Community

Sep 16, 2021

1. Introduction

Ctrip Finance evolved its architecture from a single instance to a multi‑layer system, using caching to relieve MySQL read pressure and improve response times. Introducing a cache inevitably creates consistency challenges, which vary by business scenario. The article examines two scenarios: eventual consistency and strong consistency distributed caches.

2. Eventual Consistency Distributed Cache

2.1 Scenario Description

The finance platform built a unified cache service (named utag) that stores full, near-real-time, permanently valid data such as user, product, and order information. Business systems call a single cache query interface when real-time consistency is not required, while paths that need high consistency still query the DB directly.

2.2 Overall Solution

The service is deployed across multiple data centers (regions A and B) to provide low-latency local access and disaster-recovery capability. Cache updates are triggered by several sources (scheduled scans, business-system MQ, and binlog-based MQ) so that no update is missed. Every update source publishes a unified MQ message; each region's cache instances consume the message and refresh their data. Messaging uses QMQ, Ctrip's open-source message queue, and Kafka.

Read operations use a Dubbo‑based cache query interface; the service deserializes Redis data into business models before returning them.

2.3 Data Accuracy Design

Cache updates follow four steps: (1) trigger update and query new DB data, (2) receive MQ and fetch old cache data, (3) compare new and old data, (4) update cache if needed. To handle concurrent updates, the service serializes steps 2‑4 with a Redis‑based distributed lock.
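The serialized part of the flow (steps 2 to 4) can be sketched as follows. This is a minimal illustration, not the service's actual code: the class and method names are hypothetical, a ConcurrentHashMap stands in for Redis, and an in-process ReentrantLock per key stands in for the Redis-based distributed lock.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of steps 2-4 of the cache update, serialized per key (hypothetical names). */
class CacheUpdater {
    // Assumption: in-memory stand-in for Redis.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Assumption: in-process stand-in for the Redis-based distributed lock.
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /**
     * Handles one unified MQ message. newValue is the non-null DB data read in step 1.
     * Returns true if the cache entry was written.
     */
    boolean onUpdateMessage(String key, String newValue) {
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            String oldValue = cache.get(key);          // step 2: fetch old cache data
            if (Objects.equals(oldValue, newValue)) {  // step 3: compare new and old
                return false;                          // nothing changed, skip the write
            }
            cache.put(key, newValue);                  // step 4: update the cache
            return true;
        } finally {
            lock.unlock();
        }
    }

    String get(String key) {
        return cache.get(key);
    }

    public static void main(String[] args) {
        CacheUpdater updater = new CacheUpdater();
        System.out.println(updater.onUpdateMessage("user:1", "v1")); // first write
        System.out.println(updater.onUpdateMessage("user:1", "v1")); // unchanged value
    }
}
```

Holding the lock across the read, compare, and write keeps two consumers of the same key from interleaving, which is the property the distributed lock provides across instances.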

Ordering between updates is decided by an update_time check: if the new record's update_time is newer than the cached one, the cache is overwritten. Two updates that fall within the same second cannot be ordered this way, so for same-second updates a delayed MQ message (1 s) triggers another refresh, which ensures the later value wins.
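One way to express this decision is sketched below. The names (UpdateTimeCheck, Row, Decision) are hypothetical, update_time is modeled as epoch seconds, and the sketch assumes the caller reacts to RECHECK_AFTER_DELAY by publishing the 1 s delayed message that re-reads the DB.

```java
/** Sketch of the update_time ordering check (all names are illustrative). */
class UpdateTimeCheck {
    enum Decision { OVERWRITE, SKIP, RECHECK_AFTER_DELAY }

    /** A cached or freshly read row: payload plus update_time in epoch seconds. */
    static final class Row {
        final String value;
        final long updateTimeSeconds;

        Row(String value, long updateTimeSeconds) {
            this.value = value;
            this.updateTimeSeconds = updateTimeSeconds;
        }
    }

    /** Decides what to do with a fresh DB row given the currently cached row (may be null). */
    static Decision decide(Row cached, Row fresh) {
        if (cached == null || fresh.updateTimeSeconds > cached.updateTimeSeconds) {
            return Decision.OVERWRITE;   // fresh row is strictly newer
        }
        if (fresh.updateTimeSeconds < cached.updateTimeSeconds) {
            return Decision.SKIP;        // stale message, keep the cached row
        }
        // Same second: the timestamp cannot order the two rows. If they differ,
        // the caller is assumed to publish a delayed message and re-read the DB.
        return fresh.value.equals(cached.value) ? Decision.SKIP : Decision.RECHECK_AFTER_DELAY;
    }

    public static void main(String[] args) {
        Row cached = new Row("status=NEW", 1_000L);
        System.out.println(decide(cached, new Row("status=PAID", 1_001L)));
        System.out.println(decide(cached, new Row("status=PAID", 1_000L)));
    }
}
```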

2.4 Data Completeness Design

To guarantee full‑table consistency, three mechanisms are employed:

Multiple update sources (scheduled tasks, business MQ, binlog MQ) act as redundant “eggs in different baskets”.

A weekly full‑table scan refreshes all cache entries.

Periodic validation jobs compare Redis and MySQL data, triggering compensation when mismatches are found.
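The validation and compensation mechanism can be illustrated with a small sketch. It assumes both stores can be viewed as key-to-value maps; the class name CacheValidator and its methods are hypothetical, and a real job would page through MySQL and Redis rather than hold everything in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Sketch of a validation job that compares DB rows with cached values (hypothetical names). */
class CacheValidator {
    /** Returns the keys whose cached value is missing or differs from the DB value. */
    static List<String> findMismatches(Map<String, String> dbRows, Map<String, String> cache) {
        List<String> mismatched = new ArrayList<>();
        for (Map.Entry<String, String> row : dbRows.entrySet()) {
            if (!Objects.equals(row.getValue(), cache.get(row.getKey()))) {
                mismatched.add(row.getKey());
            }
        }
        return mismatched;
    }

    /** Compensation: overwrites each mismatched cache entry with the DB value. */
    static int compensate(Map<String, String> dbRows, Map<String, String> cache) {
        List<String> mismatched = findMismatches(dbRows, cache);
        for (String key : mismatched) {
            cache.put(key, dbRows.get(key));
        }
        return mismatched.size();
    }

    public static void main(String[] args) {
        Map<String, String> db = new TreeMap<>();
        db.put("order:1", "PAID");
        db.put("order:2", "NEW");
        Map<String, String> cache = new HashMap<>();
        cache.put("order:1", "PAID");
        cache.put("order:2", "STALE");
        System.out.println(findMismatches(db, cache)); // keys needing compensation
        System.out.println(compensate(db, cache));     // number of repaired entries
    }
}
```

The sketch treats the DB as the source of truth and only checks keys present in the DB; detecting cache entries whose rows no longer exist would need a scan in the other direction.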

2.5 System Availability Design

The cache service is critical to many core systems, so high availability is achieved through:

Cross‑region deployment with automatic fail‑over to the other region’s cache service.

Dual‑messaging middleware (QMQ and Kafka) with runtime switchability.

Fast recovery mechanisms that can rebuild the entire cache within 30 minutes using parallel tasks.
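The cross-region fail-over idea can be sketched on the client side as below. This is only an assumption about how a caller might behave: the article does not say where fail-over is implemented, and RegionFailoverReader and its two query functions are hypothetical stand-ins for the per-region Dubbo query interfaces.

```java
import java.util.function.Function;

/** Sketch of reading from the local region first and failing over to the other region. */
class RegionFailoverReader {
    private final Function<String, String> localRegion;  // assumption: local cache query
    private final Function<String, String> otherRegion;  // assumption: remote cache query

    RegionFailoverReader(Function<String, String> localRegion,
                         Function<String, String> otherRegion) {
        this.localRegion = localRegion;
        this.otherRegion = otherRegion;
    }

    /** Queries the local region; on failure, retries once against the other region. */
    String query(String key) {
        try {
            return localRegion.apply(key);
        } catch (RuntimeException localFailure) {
            return otherRegion.apply(key);
        }
    }

    public static void main(String[] args) {
        Function<String, String> regionA = key -> {
            throw new IllegalStateException("region A unavailable");
        };
        Function<String, String> regionB = key -> "value-from-B:" + key;
        System.out.println(new RegionFailoverReader(regionA, regionB).query("user:1"));
    }
}
```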

3. Strong Consistency Distributed Cache

3.1 Scenario Description

In the loan‑pre‑service, query volume is massive, making a cache essential. Neither sharding nor read‑write splitting meets the real‑time consistency requirement, so a strong‑consistency cache is designed.

3.2 Overall Solution

The pattern is “update DB → delete cache” and “read DB → update cache”. Without proper coordination, race conditions cause stale data. The solution adds distributed locks (Redis‑based) around both the DB‑update‑delete flow and the DB‑read‑update‑cache flow.
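A minimal sketch of the two locked flows follows. It is illustrative only: in-memory maps stand in for MySQL and Redis, a per-key ReentrantLock stands in for the Redis-based lock, and serving cache hits without taking the lock is an assumption of this sketch rather than something the article states.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of "update DB, delete cache" and "read DB, update cache" under one lock per key. */
class StrongConsistencyCache {
    private final Map<String, String> db = new ConcurrentHashMap<>();    // stand-in for MySQL
    private final Map<String, String> cache = new ConcurrentHashMap<>(); // stand-in for Redis
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    /** Write path: modify the DB first, then delete the cache key, all under the lock. */
    void update(String key, String value) {
        ReentrantLock lock = lockFor(key);
        lock.lock();
        try {
            db.put(key, value);
            cache.remove(key);
        } finally {
            lock.unlock();
        }
    }

    /** Read path: return a cache hit, otherwise load from the DB and fill the cache under the lock. */
    String read(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        ReentrantLock lock = lockFor(key);
        lock.lock();
        try {
            cached = cache.get(key);      // another reader may have filled it meanwhile
            if (cached != null) {
                return cached;
            }
            String fromDb = db.get(key);
            if (fromDb != null) {
                cache.put(key, fromDb);
            }
            return fromDb;
        } finally {
            lock.unlock();
        }
    }

    boolean isCached(String key) {
        return cache.containsKey(key);
    }

    public static void main(String[] args) {
        StrongConsistencyCache store = new StrongConsistencyCache();
        store.update("loan:1", "APPROVED");
        System.out.println(store.read("loan:1"));
        store.update("loan:1", "REJECTED");
        System.out.println(store.read("loan:1"));
    }
}
```

Because the cache fill and the delete share one lock, a reader cannot write a value it loaded before an update back into the cache after that update's delete.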

3.3 Cache Deletion Strategy

During an update, the DB is modified first, then the cache key is deleted within the same Redis lock. If deletion fails (e.g., Redis outage), a cache_key_queue table records the pending key. A background task scans this table and retries deletion.

CREATE TABLE `cache_key_queue` (
    `id` bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'primary key',
    `cache_key` varchar(1024) NOT NULL DEFAULT '' COMMENT 'key to delete',
    `create_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
    PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=0 CHARSET=utf8 COMMENT='cache deletion queue';
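The delete-then-enqueue behavior and the background retry can be sketched as follows. The names are hypothetical, an in-memory queue stands in for the cache_key_queue table, and the CacheClient interface is an assumed abstraction over the Redis client that throws when Redis is unreachable.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Sketch of cache deletion with a fallback queue and a retry pass (hypothetical names). */
class CacheDeleter {
    /** Assumed abstraction over Redis; delete throws a RuntimeException on failure. */
    interface CacheClient {
        void delete(String key);
    }

    private final CacheClient cache;
    // Stand-in for the cache_key_queue table.
    private final Queue<String> cacheKeyQueue = new ConcurrentLinkedQueue<>();

    CacheDeleter(CacheClient cache) {
        this.cache = cache;
    }

    /** Deletes the key; if the delete fails, records the key for a later retry. */
    void deleteOrEnqueue(String key) {
        try {
            cache.delete(key);
        } catch (RuntimeException redisFailure) {
            cacheKeyQueue.add(key);
        }
    }

    /** One pass of the background task: retries each pending key once. Returns successes. */
    int retryPending() {
        int deleted = 0;
        int pending = cacheKeyQueue.size();
        for (int i = 0; i < pending; i++) {
            String key = cacheKeyQueue.poll();
            if (key == null) {
                break;
            }
            try {
                cache.delete(key);
                deleted++;
            } catch (RuntimeException stillFailing) {
                cacheKeyQueue.add(key); // keep it for the next pass
            }
        }
        return deleted;
    }

    int pendingCount() {
        return cacheKeyQueue.size();
    }

    public static void main(String[] args) {
        boolean[] redisDown = {true};
        CacheDeleter deleter = new CacheDeleter(key -> {
            if (redisDown[0]) {
                throw new IllegalStateException("redis unavailable");
            }
        });
        deleter.deleteOrEnqueue("loan:1");          // fails, key is queued
        redisDown[0] = false;
        System.out.println(deleter.retryPending()); // retried after recovery
    }
}
```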

3.4 Cache Circuit‑Breaker and Recovery

If Redis becomes unavailable, the service short-circuits cache operations to avoid extra latency. A simple counter trips the circuit breaker when 50 errors occur within 10 seconds. Recovery checks Redis health by repeatedly performing SET commands on different keys; once these succeed, write operations resume, and the keys pending in cache_key_queue are deleted from the cache before read operations are re-enabled.
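The error-count trigger can be sketched as below. The article only says "a simple counter"; this sketch assumes a sliding 10-second window over error timestamps, and the class and method names are hypothetical. Timestamps are passed in explicitly so the logic is easy to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a counter-based circuit breaker: opens after 50 errors within 10 seconds. */
class CacheCircuitBreaker {
    private static final int ERROR_THRESHOLD = 50;
    private static final long WINDOW_MILLIS = 10_000L;

    private final Deque<Long> errorTimes = new ArrayDeque<>();
    private boolean open;

    /** Records one Redis error at the given time and opens the breaker if the threshold is hit. */
    synchronized void recordError(long nowMillis) {
        errorTimes.addLast(nowMillis);
        // Drop errors that have fallen out of the 10-second window.
        while (!errorTimes.isEmpty() && nowMillis - errorTimes.peekFirst() >= WINDOW_MILLIS) {
            errorTimes.removeFirst();
        }
        if (errorTimes.size() >= ERROR_THRESHOLD) {
            open = true;
        }
    }

    /** While open, callers are expected to skip the cache and go straight to the DB. */
    synchronized boolean isOpen() {
        return open;
    }

    /** Called once the recovery checks (for example, successful SET probes) have passed. */
    synchronized void close() {
        open = false;
        errorTimes.clear();
    }

    public static void main(String[] args) {
        CacheCircuitBreaker breaker = new CacheCircuitBreaker();
        long start = System.currentTimeMillis();
        for (int i = 0; i < 50; i++) {
            breaker.recordError(start + i);
        }
        System.out.println(breaker.isOpen());
    }
}
```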

3.5 Summary

The strong-consistency design reduces core DB QPS by 80%, achieves a 92% cache hit rate, and improves average response time by about 10%. Although locking adds overhead, the high hit rate more than offsets that cost.

4. Overall Conclusions

Both eventual-consistency and strong-consistency cache solutions are presented. The former serves reads from a fully populated cache, tolerates brief staleness, and relies on redundant update sources and aggressive compensation; the latter updates the DB first, deletes the cache key, and guards both the write and the cache-fill paths with strict locking to guarantee consistency. Choosing the appropriate model depends on the business's tolerance for stale data and its performance requirements.
