Real-time Checking System for Data Consistency in Microservices
Shopee’s Real‑time Checking System provides configurable, non‑intrusive data consistency verification for micro‑services by capturing change events via CDC, streaming them through Kafka, applying flexible rules and expressions, and instantly alerting mismatches, delivering second‑level detection while scaling to tens of thousands of checks per second.
With the rapid development of enterprise business and the trend of splitting monolithic services into micro‑services, the amount of inter‑service communication has increased dramatically. Unlike monolithic applications, micro‑services often need additional mechanisms—such as transactional messages or asynchronous compensation tasks—to guarantee data consistency. Beyond these mechanisms, timely observation and detection of data inconsistencies are crucial.
This article introduces the Real‑time Checking System (RCS) designed and built by the Shopee Financial Products team. RCS can be integrated with minimal effort: users only need to configure checking rules according to their requirements. The system supports hot‑loading of rules and performs real‑time, non‑intrusive data monitoring, enabling rapid detection of inconsistencies across multiple Shopee product lines.
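Contents
1. Background
1.1 Data Inconsistency in Systems
1.2 Drawbacks of Offline Checking
2. Real-time Data Checking
2.1 System Architecture and Checking Flow
2.2 Evolution of Checking Capabilities
3. Performance
4. Summary

1. Background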
1.1 Data Inconsistency in Systems
During normal development cycles, data may not change as expected. For example, a repayment request may be processed by System A, which then calls System B to unfreeze an account. If the unfreeze request or its response is lost due to failures or network partitions, the system may record a payment without the corresponding unfreeze, or vice‑versa, leading to user complaints or financial loss.
Typical causes include code bugs, improper concurrency handling, component failures (network, database, middleware), and the lack of native consistency guarantees when monolithic applications are split into micro‑services. In distributed scenarios, the loss of database transaction support necessitates additional consistency solutions.
Common consistency solutions involve local transaction tables combined with compensation tasks, or Saga patterns with reliable messaging. However, for critical business flows, an extra layer of verification is essential, prompting the need for real‑time, non‑intrusive data checking.
1.2 Drawbacks of Offline Checking
Traditional offline checks run as scheduled jobs that fetch data from multiple sources based on filter conditions and compare them. A typical pseudo‑code implementation is:
// Check compares upstream rows updated in [a, b) with their downstream counterparts.
// QueryUpstreamDB, QueryDownstreamDB and Compare are placeholders for business code.
func Check(a, b time.Time) {
	// Get upstream rows whose update_time falls in [a, b)
	upstreamRows := QueryUpstreamDB(a, b)
	for uniqueKey, sourceData := range upstreamRows {
		// Find the corresponding downstream data
		targetData := QueryDownstreamDB(uniqueKey)
		// Compare upstream and downstream data
		Compare(sourceData, targetData)
	}
}

The main drawbacks of this approach are low timeliness (checks run after a delay), additional scanning overhead on large tables, and the inability to capture every state change between checks.
To achieve better data verification, the following goals were defined:
Second‑level checking.
Minimize database queries.
Check data changes rather than snapshots.
Simple and flexible integration.
2. Real-time Data Checking
In mid‑2021, the Shopee Financial Products team designed and implemented the Real‑time Checking System (RCS). Its core advantages are:
Second‑level data checking.
Zero intrusion to business logic.
Configurable integration.
Since its launch, RCS has helped teams detect numerous data issues, which can be categorized into code‑logic bugs (idempotency, concurrency, business errors) and runtime environment problems (DB failures, network jitter, MQ anomalies).
2.1 System Architecture and Checking Flow
RCS is organized into three layers:
Data Fetching Layer: Captures change data in real time using log‑based CDC (e.g., MySQL binlog, MongoDB oplog) and streams it to Kafka.
Data Checking Layer: Receives the change stream, applies configurable checking rules, and performs comparisons.
Result Handling Layer: Persists results, triggers alerts, and provides recovery awareness.
The Data Fetching Layer uses a log‑based CDC approach for high timeliness. MySQL binlog events are captured by a high‑availability component and delivered to Kafka, where they become the source of truth for checking. The system also supports custom Kafka messages for non‑MySQL sources (e.g., MongoDB, Redis).
2.1.1 Change Data Capture
Log‑based CDC provides true change events rather than snapshots, avoiding the latency of timestamp‑based or table‑differencing methods. Triggers add write overhead, while log‑based capture directly streams changes with minimal impact.
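To make the difference between a change event and a snapshot concrete, here is a minimal sketch of how a consumer might decode a row-level change and list the columns that changed. The JSON layout (`table`, `type`, `before`, `after`), the `ChangeEvent` type, and the `loan_tab` table name are assumptions for illustration only; the article does not describe the actual message format RCS consumes from Kafka.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// ChangeEvent is a hypothetical envelope for one row-level change.
// The field names are illustrative, not the RCS wire format.
type ChangeEvent struct {
	Table  string                 `json:"table"`
	Type   string                 `json:"type"` // "insert", "update" or "delete"
	Before map[string]interface{} `json:"before"`
	After  map[string]interface{} `json:"after"`
}

// DecodeEvent parses one JSON-encoded change event.
func DecodeEvent(payload []byte) (ChangeEvent, error) {
	var ev ChangeEvent
	err := json.Unmarshal(payload, &ev)
	return ev, err
}

// ChangedFields returns, in sorted order, the columns of the after image
// that are new or differ from the before image. Columns that exist only
// in the before image are not reported.
func ChangedFields(ev ChangeEvent) []string {
	var changed []string
	for col, newVal := range ev.After {
		oldVal, existed := ev.Before[col]
		if !existed || !reflect.DeepEqual(oldVal, newVal) {
			changed = append(changed, col)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	payload := []byte(`{"table":"loan_tab","type":"update",
		"before":{"id":1,"loan_status":1},
		"after":{"id":1,"loan_status":4}}`)
	ev, err := DecodeEvent(payload)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(ev.Table, ChangedFields(ev)) // loan_tab [loan_status]
}
```

Because the event carries both images of the row, a checker can react to the transition itself (status 1 to 4) instead of inferring it from two snapshots taken at different times.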
2.1.2 Data Checking
The Data Checking Layer abstracts checks over binlog data into a set of configurable rules. Users define rules via a UI or a configuration file, and the system applies them automatically. The original article illustrates this with a screenshot of a rule configuration that maps upstream fields to downstream fields.
Data flow:
Upstream data arrives first and is temporarily stored in Redis and in a delay queue.
RCS waits for the downstream counterpart. If it arrives within the configured window, the two records are compared and the Redis key is removed.
If the downstream data does not arrive, the delay queue triggers a timeout check, generating an alert if the consistency window is exceeded.
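The three steps above can be sketched as a small state machine. This is only an illustration under stated assumptions: an in-memory map stands in for both Redis and the delay queue, timestamps are plain milliseconds, and the `Checker`, `Record`, `OnUpstream`, `OnDownstream` and `Expire` names are hypothetical. The case where downstream data arrives before upstream data is not covered by the article, so the sketch merely reports it.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is a simplified change record identified by a business key.
type Record struct {
	Key    string
	Amount int64
}

// pending is an upstream record waiting for its downstream counterpart.
type pending struct {
	rec        Record
	deadlineMs int64
}

// Checker is an in-memory stand-in for the Redis store and the delay queue.
type Checker struct {
	windowMs int64
	waiting  map[string]pending
}

func NewChecker(windowMs int64) *Checker {
	return &Checker{windowMs: windowMs, waiting: make(map[string]pending)}
}

// OnUpstream stores the upstream record together with its deadline.
func (c *Checker) OnUpstream(r Record, nowMs int64) {
	c.waiting[r.Key] = pending{rec: r, deadlineMs: nowMs + c.windowMs}
}

// OnDownstream compares the record with the stored upstream record and
// removes the stored entry. It returns "match", "mismatch" or "no-upstream".
func (c *Checker) OnDownstream(r Record) string {
	p, ok := c.waiting[r.Key]
	if !ok {
		return "no-upstream"
	}
	delete(c.waiting, r.Key)
	if p.rec.Amount != r.Amount {
		return "mismatch"
	}
	return "match"
}

// Expire plays the role of the delay-queue timeout check: it removes and
// returns, in sorted order, every key whose window has elapsed.
func (c *Checker) Expire(nowMs int64) []string {
	var timedOut []string
	for key, p := range c.waiting {
		if nowMs >= p.deadlineMs {
			timedOut = append(timedOut, key)
			delete(c.waiting, key)
		}
	}
	sort.Strings(timedOut)
	return timedOut
}

func main() {
	c := NewChecker(5000) // 5-second consistency window
	c.OnUpstream(Record{Key: "order-1", Amount: 100}, 0)
	c.OnUpstream(Record{Key: "order-2", Amount: 200}, 0)
	fmt.Println(c.OnDownstream(Record{Key: "order-1", Amount: 100})) // match
	fmt.Println(c.Expire(6000))                                      // [order-2]
}
```

Keeping only unmatched upstream records in the store means the common case (downstream arrives in time) needs one write and one read per pair, with no query against the business databases.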
2.1.3 Notification Mechanism
RCS integrates with Shopee’s enterprise IM (SeaTalk) bot to send alerts. Four notification types are provided:
Mismatch Notice – immediate alert for a single inconsistency.
Aggregated Notice – aggregates many mismatches to avoid alert fatigue.
Recovery Notice – informs when a previously mismatched record becomes consistent again.
Statistical Notice – periodic reports on DB replication lag, success rates, etc.
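As an illustration of the Aggregated Notice idea, the sketch below collapses a batch of mismatches into one summary line per rule. The `Mismatch` type, the grouping key (rule name) and the message wording are assumptions; the article does not say how RCS groups or formats its aggregated alerts.

```go
package main

import (
	"fmt"
	"sort"
)

// Mismatch is a hypothetical record of one failed check.
type Mismatch struct {
	Rule string // name of the checking rule that failed
	Key  string // business key of the offending record
}

// Aggregate groups mismatches by rule and returns one summary line per
// rule, sorted by rule name, so a burst of failures yields a few messages.
func Aggregate(ms []Mismatch) []string {
	counts := make(map[string]int)
	for _, m := range ms {
		counts[m.Rule]++
	}
	rules := make([]string, 0, len(counts))
	for rule := range counts {
		rules = append(rules, rule)
	}
	sort.Strings(rules)
	lines := make([]string, 0, len(rules))
	for _, rule := range rules {
		lines = append(lines, fmt.Sprintf("rule %s: %d mismatch(es)", rule, counts[rule]))
	}
	return lines
}

func main() {
	batch := []Mismatch{
		{Rule: "amount", Key: "order-1"},
		{Rule: "status", Key: "order-2"},
		{Rule: "amount", Key: "order-3"},
	}
	for _, line := range Aggregate(batch) {
		fmt.Println(line)
	}
}
```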
2.2 Evolution of Checking Capabilities
Initially, RCS supported simple equality and mapping checks. As more teams adopted the system, richer capabilities were added:
2.2.1 Equality / Mapping Checks
Equality checks compare fields directly (e.g., loan_amount == order_amount), while mapping checks verify that status values on the two sides correspond (e.g., loan_status == 4 && order_status == 2).
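A minimal sketch of these two rule kinds follows. The `Row`, `EqualityRule` and `MappingRule` types are hypothetical, and treating an upstream status with no mapping entry as a failure is an assumption of this sketch rather than documented RCS behavior.

```go
package main

import "fmt"

// Row is a simplified record: column name to integer value.
// A missing column reads as 0, which a real checker would reject.
type Row map[string]int64

// EqualityRule requires up[UpField] == down[DownField].
type EqualityRule struct{ UpField, DownField string }

func (r EqualityRule) Check(up, down Row) bool {
	return up[r.UpField] == down[r.DownField]
}

// MappingRule requires the downstream status to be the one mapped
// from the upstream status, e.g. loan_status 4 maps to order_status 2.
type MappingRule struct {
	UpField, DownField string
	Mapping            map[int64]int64
}

func (r MappingRule) Check(up, down Row) bool {
	want, ok := r.Mapping[up[r.UpField]]
	return ok && down[r.DownField] == want // unmapped status fails (assumption)
}

func main() {
	loan := Row{"loan_amount": 500, "loan_status": 4}
	order := Row{"order_amount": 500, "order_status": 2}

	eq := EqualityRule{UpField: "loan_amount", DownField: "order_amount"}
	mp := MappingRule{
		UpField:   "loan_status",
		DownField: "order_status",
		Mapping:   map[int64]int64{4: 2},
	}
	fmt.Println(eq.Check(loan, order), mp.Check(loan, order)) // true true
}
```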
2.2.2 Expression Checks
To handle non‑one‑to‑one field relationships, RCS introduced expression evaluation. Users provide boolean expressions such as:
a.order_amount == b.paid_amount + b.loan_amount
This allows complex validations without writing new code.
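To show how such an expression can be evaluated without new business code, here is a small sketch that parses the expression with Go's standard `go/parser` package and walks the resulting syntax tree. This is a stand-in technique chosen for illustration: the article does not say which expression engine RCS uses. The `Env` type and the supported operator set (`==`, `!=`, `+`, `-`, `*` over integer fields) are assumptions of the sketch.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// Env maps a record alias ("a", "b") to its integer fields.
type Env map[string]map[string]int64

// evalInt evaluates the arithmetic side of a comparison.
func evalInt(e ast.Expr, env Env) (int64, error) {
	switch n := e.(type) {
	case *ast.ParenExpr:
		return evalInt(n.X, env)
	case *ast.BasicLit:
		if n.Kind != token.INT {
			return 0, fmt.Errorf("unsupported literal %s", n.Value)
		}
		return strconv.ParseInt(n.Value, 0, 64)
	case *ast.SelectorExpr: // alias.field
		alias, ok := n.X.(*ast.Ident)
		if !ok {
			return 0, fmt.Errorf("unsupported selector")
		}
		v, ok := env[alias.Name][n.Sel.Name]
		if !ok {
			return 0, fmt.Errorf("unknown field %s.%s", alias.Name, n.Sel.Name)
		}
		return v, nil
	case *ast.BinaryExpr:
		l, err := evalInt(n.X, env)
		if err != nil {
			return 0, err
		}
		r, err := evalInt(n.Y, env)
		if err != nil {
			return 0, err
		}
		switch n.Op {
		case token.ADD:
			return l + r, nil
		case token.SUB:
			return l - r, nil
		case token.MUL:
			return l * r, nil
		}
	}
	return 0, fmt.Errorf("unsupported expression")
}

// Eval evaluates a single == or != comparison against env.
func Eval(expr string, env Env) (bool, error) {
	node, err := parser.ParseExpr(expr)
	if err != nil {
		return false, err
	}
	cmp, ok := node.(*ast.BinaryExpr)
	if !ok || (cmp.Op != token.EQL && cmp.Op != token.NEQ) {
		return false, fmt.Errorf("expected an == or != comparison")
	}
	left, err := evalInt(cmp.X, env)
	if err != nil {
		return false, err
	}
	right, err := evalInt(cmp.Y, env)
	if err != nil {
		return false, err
	}
	if cmp.Op == token.EQL {
		return left == right, nil
	}
	return left != right, nil
}

func main() {
	env := Env{
		"a": {"order_amount": 100},
		"b": {"paid_amount": 60, "loan_amount": 40},
	}
	ok, err := Eval("a.order_amount == b.paid_amount + b.loan_amount", env)
	fmt.Println(ok, err) // true <nil>
}
```

Storing the rule as a string and interpreting it at check time is what makes hot-loading possible: a new validation becomes a configuration change, not a deployment.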
2.2.3 Checks Against Dynamic Configuration Data
For dynamic data like rates or discounts stored in configuration tables, RCS can execute user‑defined SQL queries during checking. The result of the query is then used in an expression, e.g., a.order_rate == rate. JSON‑based fields can also be parsed with custom expressions.
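The lookup-then-compare flow might look roughly like the sketch below. Everything here is hypothetical: `ConfigQuery` stands in for the user-defined SQL query (replaced by an in-memory map so the example runs without a database), and rates are expressed in basis points to avoid floating-point comparison.

```go
package main

import (
	"errors"
	"fmt"
)

// ConfigQuery stands in for the user-defined SQL query that fetches the
// currently configured rate for a product.
type ConfigQuery func(productID int64) (rateBps int64, err error)

// OrderChange is a simplified change record from the order table.
type OrderChange struct {
	ProductID    int64
	OrderRateBps int64
}

// CheckOrderRate resolves the configured rate first, then evaluates the
// equivalent of the expression a.order_rate == rate.
func CheckOrderRate(o OrderChange, query ConfigQuery) (bool, error) {
	rate, err := query(o.ProductID)
	if err != nil {
		return false, fmt.Errorf("config lookup failed: %w", err)
	}
	return o.OrderRateBps == rate, nil
}

func main() {
	// In-memory replacement for the configuration table.
	rates := map[int64]int64{1001: 150}
	query := func(productID int64) (int64, error) {
		rate, ok := rates[productID]
		if !ok {
			return 0, errors.New("no rate configured")
		}
		return rate, nil
	}

	ok, err := CheckOrderRate(OrderChange{ProductID: 1001, OrderRateBps: 150}, query)
	fmt.Println(ok, err) // true <nil>
}
```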
3. Performance
Performance is dominated by the Data Fetching and Data Checking layers. In Shopee’s environment, the system processes over 20 K messages per second from Kafka. The checking layer can handle more than 10 K checks per second on a single 48‑core machine.
Benchmark results:
Component                 | Machine
Kafka                     | 3 × 48 Core 128 GB
Redis                     | 3 × 48 Core 128 GB
Real‑time Checking System | 1 × 48 Core 128 GB
Check throughput per entry count:
Number of check entries | TPS   | CPU Cost
1 entry                 | 14.3K | 454%
2 entries               | 12.0K | 687%
3 entries               | 10.4K | 913%
The main bottleneck is the Redis cluster; each check takes roughly 0.5 ms. RCS can be deployed in a Kafka consumer group to scale horizontally.
4. Summary
The Real‑time Checking System, launched in 2021, has been adopted across multiple Shopee product lines, addressing the high latency of traditional offline checks and reducing development overhead for new consistency requirements. By offering configurable rules, expression‑based checks, and log‑based CDC, RCS provides near‑real‑time data verification, mitigating risks related to financial loss and information security. Future work will focus on further performance improvements to support growing business volumes.
Shopee Tech Team
How to innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team will explore cutting‑edge technology concepts and applications with you.
