Real-time Checking System for Data Consistency in Microservices

Shopee’s Real‑time Checking System (RCS) provides configurable, non‑intrusive data consistency verification for micro‑services. It captures change events via CDC, streams them through Kafka, applies flexible rules and expressions, and alerts on mismatches within seconds, while scaling to tens of thousands of checks per second.

Shopee Tech Team

As enterprise businesses grow and monolithic services are split into micro‑services, the amount of inter‑service communication has increased dramatically. Unlike monolithic applications, micro‑services often need additional mechanisms, such as transactional messages or asynchronous compensation tasks, to guarantee data consistency. Beyond these mechanisms, timely observation and detection of data inconsistencies are crucial.

This article introduces the Real‑time Checking System (RCS) designed and built by the Shopee Financial Products team. RCS can be integrated with minimal effort: users only need to configure checking rules according to their requirements. The system supports hot‑loading of rules and performs real‑time, non‑intrusive data monitoring, enabling rapid detection of inconsistencies across multiple Shopee product lines.

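Contents

1. Background
   1.1 Data inconsistency across systems
   1.2 Drawbacks of offline checking
2. Real-time data checking
   2.1 System architecture and checking flow
   2.2 Evolution of checking capabilities
3. Performance
4. Summary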

1. Background

1.1 Data inconsistency across systems

During normal development cycles, data may not change as expected. For example, a repayment request may be processed by System A, which then calls System B to unfreeze an account. If the unfreeze request or its response is lost due to failures or network partitions, the system may record a repayment without the corresponding unfreeze, or vice versa, leading to user complaints or financial loss.

Typical causes include code bugs, improper concurrency handling, component failures (network, database, middleware), and the lack of native consistency guarantees when monolithic applications are split into micro‑services. In distributed scenarios, the loss of database transaction support necessitates additional consistency solutions.

Common consistency solutions involve local transaction tables combined with compensation tasks, or Saga patterns with reliable messaging. However, for critical business flows, an extra layer of verification is essential, prompting the need for real‑time, non‑intrusive data checking.

1.2 Drawbacks of offline checking

Traditional offline checks run as scheduled jobs that fetch data from multiple sources based on filter conditions and compare them. A typical pseudo‑code implementation is:

// Check compares every upstream row updated in [a, b) with its downstream counterpart.
func Check(a, b time.Time) {
    // Get upstream rows whose update_time falls in [a, b)
    upstreamRows := QueryUpstreamDB(a, b)

    for uniqueKey, sourceData := range upstreamRows {
        // Find corresponding downstream data
        targetData := QueryDownstreamDB(uniqueKey)

        // Compare upstream and downstream data
        Compare(sourceData, targetData)
    }
}

The main drawbacks of this approach are low timeliness (checks run after a delay), additional scanning overhead on large tables, and the inability to capture every state change between checks.

To achieve better data verification, the following goals were defined:

Second‑level checking.

Minimize database queries.

Check data changes rather than snapshots.

Simple and flexible integration.

2. Real-time data checking

In mid‑2021, the Shopee Financial Products team designed and implemented the Real‑time Checking System (RCS). Its core advantages are:

Second‑level data checking.

Zero intrusion to business logic.

Configurable integration.

Since its launch, RCS has helped teams detect numerous data issues, which can be categorized into code‑logic bugs (idempotency, concurrency, business errors) and runtime environment problems (DB failures, network jitter, MQ anomalies).

2.1 System architecture and checking flow

RCS is organized into three layers:

Data Fetching Layer: Captures change data in real time using log‑based CDC (e.g., MySQL binlog, MongoDB oplog) and streams it to Kafka.

Data Checking Layer: Receives the change stream, applies configurable checking rules, and performs comparisons.

Result Handling Layer: Persists results, triggers alerts, and provides recovery awareness.

The Data Fetching Layer uses a log‑based CDC approach for high timeliness. MySQL binlog events are captured by a high‑availability component and delivered to Kafka, where they become the source of truth for checking. The system also supports custom Kafka messages for non‑MySQL sources (e.g., MongoDB, Redis).

2.1.1 Capturing change data

Log‑based CDC provides true change events rather than snapshots, avoiding the latency of timestamp‑based or table‑differencing methods. Triggers add write overhead, while log‑based capture directly streams changes with minimal impact.
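
To make the difference between a change event and a snapshot concrete, here is a minimal Go sketch of what one row-level change might look like once it reaches Kafka. The JSON layout, the ChangeEvent type, and the table and column names are assumptions made for illustration; the source does not describe the actual RCS message format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// ChangeEvent is a hypothetical shape for one row-level change read from a
// database log. The field names are illustrative, not the RCS wire format.
type ChangeEvent struct {
	Table  string                 `json:"table"`
	Op     string                 `json:"op"` // "insert", "update" or "delete"
	Before map[string]interface{} `json:"before"`
	After  map[string]interface{} `json:"after"`
}

// DecodeEvent parses one Kafka message payload into a ChangeEvent.
func DecodeEvent(payload []byte) (ChangeEvent, error) {
	var ev ChangeEvent
	err := json.Unmarshal(payload, &ev)
	return ev, err
}

// ChangedFields returns, in sorted order, the columns of the after image
// whose value is new or differs from the before image.
func (e ChangeEvent) ChangedFields() []string {
	var changed []string
	for col, newVal := range e.After {
		oldVal, existed := e.Before[col]
		if !existed || !reflect.DeepEqual(oldVal, newVal) {
			changed = append(changed, col)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	payload := []byte(`{"table":"loan_tab","op":"update",
		"before":{"id":1,"loan_status":1},
		"after":{"id":1,"loan_status":4}}`)
	ev, err := DecodeEvent(payload)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(ev.Table, ev.Op, ev.ChangedFields())
}
```

Because each event carries both images of the row, a checker can react to every intermediate state change, which a periodic snapshot query would miss.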

2.1.2 Data checking

The Data Checking Layer abstracts binlog data into a set of configurable rules. Users define rules via a UI or configuration file, and the system then applies them automatically. The original article shows an example rule configuration (an image, not reproduced here) that maps fields from upstream to downstream.

Data flow:

Upstream data arrives first and is temporarily stored in Redis and in a delay queue.

RCS waits for the downstream counterpart. If it arrives within the configured window, the two records are compared and the Redis key is removed.

If the downstream data does not arrive, the delay queue triggers a timeout check, generating an alert if the consistency window is exceeded.
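
The three steps above can be sketched in Go as follows. This is a simplified illustration, not the RCS implementation: an in-memory map stands in for both Redis and the delay queue, timestamps are plain Unix seconds, and the Matcher type and its method names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// pending is an upstream record waiting for its downstream counterpart.
type pending struct {
	amount   int64
	deadline int64 // Unix seconds after which the record counts as timed out
}

// Matcher is an in-memory stand-in for the Redis store and the delay queue;
// a real deployment would use those components instead of a map.
type Matcher struct {
	window  int64 // consistency window in seconds
	waiting map[string]pending
}

func NewMatcher(windowSeconds int64) *Matcher {
	return &Matcher{window: windowSeconds, waiting: make(map[string]pending)}
}

// OnUpstream stores the upstream record and starts its consistency window.
func (m *Matcher) OnUpstream(key string, amount, now int64) {
	m.waiting[key] = pending{amount: amount, deadline: now + m.window}
}

// OnDownstream compares the downstream record with the stored upstream one
// and removes the key. It returns "" when the pair is consistent, otherwise
// an alert message.
func (m *Matcher) OnDownstream(key string, amount int64) string {
	up, ok := m.waiting[key]
	if !ok {
		return "" // no upstream record waiting; ignored in this sketch
	}
	delete(m.waiting, key)
	if up.amount != amount {
		return fmt.Sprintf("mismatch for %s: upstream=%d downstream=%d", key, up.amount, amount)
	}
	return ""
}

// Expire plays the role of the delay queue firing: every record whose
// window has passed without a downstream counterpart yields a timeout alert.
func (m *Matcher) Expire(now int64) []string {
	var alerts []string
	for key, p := range m.waiting {
		if now >= p.deadline {
			alerts = append(alerts, "timeout for "+key)
			delete(m.waiting, key)
		}
	}
	sort.Strings(alerts)
	return alerts
}

func main() {
	m := NewMatcher(5)
	m.OnUpstream("order-1", 100, 0)
	m.OnUpstream("order-2", 200, 0)
	fmt.Printf("%q\n", m.OnDownstream("order-1", 100))
	fmt.Println(m.Expire(5))
}
```

Keeping only the waiting upstream records in the store means the checker never has to query the business databases, which matches the goal of minimizing database queries.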

2.1.3 Notification mechanism

RCS integrates with Shopee’s enterprise IM (SeaTalk) bot to send alerts. Four notification types are provided:

Mismatch Notice – immediate alert for a single inconsistency.

Aggregated Notice – aggregates many mismatches to avoid alert fatigue.

Recovery Notice – informs when a previously mismatched record becomes consistent again.

Statistical Notice – periodic reports on DB replication lag, success rates, etc.
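
As an illustration of the Aggregated Notice idea, the following Go sketch counts mismatches per rule and emits one summary line per rule when flushed, instead of one message per mismatch. The Aggregator type, the rule names, and the message wording are hypothetical; how RCS actually groups and sends notices to SeaTalk is not described in the source.

```go
package main

import (
	"fmt"
	"sort"
)

// Aggregator collects mismatch counts per rule between two flushes.
type Aggregator struct {
	counts map[string]int
}

func NewAggregator() *Aggregator {
	return &Aggregator{counts: make(map[string]int)}
}

// Add records one mismatch for the given rule.
func (a *Aggregator) Add(rule string) {
	a.counts[rule]++
}

// Flush returns one summary line per rule, sorted by rule name, and resets
// the counters. A caller would invoke it on a timer and send the result.
func (a *Aggregator) Flush() []string {
	rules := make([]string, 0, len(a.counts))
	for rule := range a.counts {
		rules = append(rules, rule)
	}
	sort.Strings(rules)

	lines := make([]string, 0, len(rules))
	for _, rule := range rules {
		lines = append(lines, fmt.Sprintf("%s: %d mismatches", rule, a.counts[rule]))
	}
	a.counts = make(map[string]int)
	return lines
}

func main() {
	agg := NewAggregator()
	for i := 0; i < 3; i++ {
		agg.Add("loan_vs_order")
	}
	agg.Add("repay_vs_unfreeze")
	for _, line := range agg.Flush() {
		fmt.Println(line)
	}
}
```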

2.2 Evolution of checking capabilities

Initially, RCS supported simple equality and mapping checks. As more teams adopted the system, richer capabilities were added:

2.2.1 Equality / mapping checks

Equality checks compare fields directly (e.g., loan_amount == order_amount), while mapping checks verify that statuses correspond (e.g., loan_status == 4 && order_status == 2).
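
A minimal Go sketch of these two rule kinds is shown below. The Row, EqualRule, and MappingRule types are assumptions for illustration only; the field names and the 4 to 2 status mapping come from the examples in the text.

```go
package main

import "fmt"

// Row is one changed record keyed by column name. Values are int64 to keep
// the sketch small (amounts in minor units, statuses as numeric codes).
type Row map[string]int64

// EqualRule requires an upstream field to equal a downstream field.
type EqualRule struct {
	UpField, DownField string
}

func (r EqualRule) Check(up, down Row) bool {
	return up[r.UpField] == down[r.DownField]
}

// MappingRule requires that when the upstream field holds a mapped value,
// the downstream field holds the value it maps to.
type MappingRule struct {
	UpField, DownField string
	Mapping            map[int64]int64 // upstream status -> expected downstream status
}

func (r MappingRule) Check(up, down Row) bool {
	want, mapped := r.Mapping[up[r.UpField]]
	if !mapped {
		return true // upstream status not covered by the rule; nothing to assert
	}
	return down[r.DownField] == want
}

func main() {
	loan := Row{"loan_amount": 5000, "loan_status": 4}
	order := Row{"order_amount": 5000, "order_status": 2}

	amount := EqualRule{UpField: "loan_amount", DownField: "order_amount"}
	status := MappingRule{
		UpField:   "loan_status",
		DownField: "order_status",
		Mapping:   map[int64]int64{4: 2},
	}
	fmt.Println(amount.Check(loan, order), status.Check(loan, order))
}
```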

2.2.2 Expression checks

To handle non‑one‑to‑one field relationships, RCS introduced expression evaluation. Users provide boolean expressions such as:

a.order_amount == b.paid_amount + b.loan_amount

This allows complex validations without writing new code.
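
One way to evaluate such an expression at run time is sketched below. It borrows the expression parser from the Go standard library (go/parser) as a stand-in, since the example expression happens to be valid Go syntax; the expression engine RCS actually uses is not named in the source. The Env type and the Check function are hypothetical, and only integer arithmetic and comparisons are supported.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// Env maps a record alias ("a", "b") to its fields. Amounts are int64 minor
// units so that equality is exact.
type Env map[string]map[string]int64

// evalNum evaluates the arithmetic part of an expression.
func evalNum(e ast.Expr, env Env) (int64, error) {
	switch n := e.(type) {
	case *ast.ParenExpr:
		return evalNum(n.X, env)
	case *ast.BasicLit:
		if n.Kind == token.INT {
			return strconv.ParseInt(n.Value, 0, 64)
		}
	case *ast.SelectorExpr: // alias.field, e.g. b.paid_amount
		if alias, ok := n.X.(*ast.Ident); ok {
			if v, found := env[alias.Name][n.Sel.Name]; found {
				return v, nil
			}
			return 0, fmt.Errorf("unknown field %s.%s", alias.Name, n.Sel.Name)
		}
	case *ast.BinaryExpr:
		l, err := evalNum(n.X, env)
		if err != nil {
			return 0, err
		}
		r, err := evalNum(n.Y, env)
		if err != nil {
			return 0, err
		}
		switch n.Op {
		case token.ADD:
			return l + r, nil
		case token.SUB:
			return l - r, nil
		case token.MUL:
			return l * r, nil
		}
	}
	return 0, fmt.Errorf("unsupported expression")
}

// Check parses a comparison such as
// "a.order_amount == b.paid_amount + b.loan_amount" and evaluates it.
func Check(expr string, env Env) (bool, error) {
	root, err := parser.ParseExpr(expr)
	if err != nil {
		return false, err
	}
	cmp, ok := root.(*ast.BinaryExpr)
	if !ok {
		return false, fmt.Errorf("expected a comparison")
	}
	l, err := evalNum(cmp.X, env)
	if err != nil {
		return false, err
	}
	r, err := evalNum(cmp.Y, env)
	if err != nil {
		return false, err
	}
	switch cmp.Op {
	case token.EQL:
		return l == r, nil
	case token.NEQ:
		return l != r, nil
	case token.LSS:
		return l < r, nil
	case token.GTR:
		return l > r, nil
	}
	return false, fmt.Errorf("unsupported comparison %s", cmp.Op)
}

func main() {
	env := Env{
		"a": {"order_amount": 1000},
		"b": {"paid_amount": 300, "loan_amount": 700},
	}
	ok, err := Check("a.order_amount == b.paid_amount + b.loan_amount", env)
	fmt.Println(ok, err)
}
```

Because the rule is plain text, it can be stored in configuration and hot-loaded, which is what lets new validations be added without a code change.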

2.2.3 Checks against dynamic configuration data

For dynamic data like rates or discounts stored in configuration tables, RCS can execute user‑defined SQL queries during checking. The result of the query is then used in an expression, e.g., a.order_rate == rate. JSON‑based fields can also be parsed with custom expressions.
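
The idea can be sketched in Go as a rule that fetches a value at check time and compares it with a field of the changed record. In this sketch a function value stands in for the user-defined SQL query, so no database is needed to run it; the DynamicRule type, the product_id key, and the use of integer basis points for rates are all assumptions for illustration.

```go
package main

import "fmt"

// Lookup stands in for running a user-defined SQL query against a
// configuration table. It returns the configured value for the given key.
type Lookup func(key int64) (int64, error)

// DynamicRule compares a field of the changed record with a value fetched
// at check time, as in the expression a.order_rate == rate.
type DynamicRule struct {
	KeyField   string // field passed to the query as its parameter
	ValueField string // field compared with the query result
	Query      Lookup
}

// Check runs the lookup and compares its result with the record's field.
func (r DynamicRule) Check(row map[string]int64) (bool, error) {
	rate, err := r.Query(row[r.KeyField])
	if err != nil {
		return false, fmt.Errorf("config lookup failed: %w", err)
	}
	return row[r.ValueField] == rate, nil
}

func main() {
	// Hypothetical configuration table: product ID -> rate in basis points.
	rates := map[int64]int64{7: 150}

	rule := DynamicRule{
		KeyField:   "product_id",
		ValueField: "order_rate",
		Query: func(productID int64) (int64, error) {
			rate, ok := rates[productID]
			if !ok {
				return 0, fmt.Errorf("no rate configured for product %d", productID)
			}
			return rate, nil
		},
	}

	ok, err := rule.Check(map[string]int64{"product_id": 7, "order_rate": 150})
	fmt.Println(ok, err)
}
```

Since this check does issue a query, caching the lookup result for a short period is a natural extension when configuration changes rarely.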

3. Performance

Performance is dominated by the Data Fetching and Data Checking layers. In Shopee’s environment, the system processes over 20 K messages per second from Kafka. The checking layer can handle more than 10 K checks per second on a single 48‑core machine.

Benchmark results:

Component                   Machine
Kafka                       3 × 48 Core 128 GB
Redis                       3 × 48 Core 128 GB
Real‑time Checking System   1 × 48 Core 128 GB

Check throughput per entry count:

Number of check entries   TPS     CPU Cost
1 entry                   14.3K   454%
2 entries                 12.0K   687%
3 entries                 10.4K   913%

The main bottleneck is the Redis cluster; each check takes roughly 0.5 ms. RCS can be deployed in a Kafka consumer group to scale horizontally.
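
As a rough worked example of how the 0.5 ms figure relates to the measured throughput, the Go sketch below applies Little's law (average concurrency equals throughput times latency). It assumes the 0.5 ms is the end-to-end latency of one check, which the source does not state explicitly, so the numbers are indicative only.

```go
package main

import "fmt"

// InFlight applies Little's law: the average number of checks in progress
// equals throughput (checks per second) times latency (in seconds).
func InFlight(checksPerSecond, latencyMillis float64) float64 {
	return checksPerSecond * latencyMillis / 1000
}

// SequentialRate is the throughput of a single worker that performs one
// check at a time with the given latency.
func SequentialRate(latencyMillis float64) float64 {
	return 1000 / latencyMillis
}

func main() {
	fmt.Printf("one worker at 0.5 ms: %.0f checks/s\n", SequentialRate(0.5))
	fmt.Printf("in flight at 14.3K TPS: %.2f checks\n", InFlight(14300, 0.5))
}
```

Under that assumption a single sequential worker tops out at about 2,000 checks per second, so reaching 14.3K TPS requires roughly seven checks in flight at once, whether through concurrent workers in one instance or additional consumers in the group.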

4. Summary

The Real‑time Checking System, launched in 2021, has been adopted across multiple Shopee product lines, addressing the high latency of traditional offline checks and reducing development overhead for new consistency requirements. By offering configurable rules, expression‑based checks, and log‑based CDC, RCS provides near‑real‑time data verification, mitigating risks related to financial loss and information security. Future work will focus on further performance improvements to support growing business volumes.
