TiDB Operational Practices and Performance Benchmarking at Beijing Shunfeng Tongcheng Technology
This article presents a comprehensive case study of TiDB deployment at Beijing Shunfeng Tongcheng Technology, covering application scenarios, TiDB features, detailed performance benchmarks, operational challenges, optimization techniques, ecosystem tools, and best‑practice recommendations for large‑scale distributed database management.
1. Application Scenario Introduction
TiDB is currently used at Beijing Shunfeng Tongcheng Technology for the SDS system, which relies on massive real‑time data synchronized from the group's Kafka. The system requires large storage capacity, flexible scalability, high efficiency, high stability, and high availability.
With rapid business growth, a 12‑node TiDB cluster now ingests about 243 million new rows per day. This article shares the operational practices and challenges of running TiDB at the company.
2. Why Choose TiDB
2.1 TiDB Features
TiDB combines the best of traditional RDBMS and NoSQL, is MySQL‑compatible, supports unlimited horizontal scaling, and provides strong consistency and high availability.
Key characteristics:
High MySQL compatibility – most applications can migrate without code changes.
Horizontal elastic scaling – add nodes to increase throughput or storage.
Distributed ACID transactions.
Financial‑grade high availability using Raft‑based majority election.
2.2 Surprising Benefits
With TiDB, there is no single primary node whose capacity you must plan around. Other pleasant surprises:
Native online DDL – column additions and modifications complete in seconds, with no table rebuild.
No primary‑replica replication lag.
Extensive monitoring metrics, plus ecosystem automation tools.
2.3 Performance Benchmark
Hardware Configuration

| Service Type | Instance Type | Instance Count |
| --- | --- | --- |
| PD | BMI5 (96 cores / 384 GB / 7 TB NVMe SSD) | 3 |
| TiKV | BMI5 (96 cores / 384 GB / 7 TB NVMe SSD) | 6 |
| TiDB | BMI5 (96 cores / 384 GB / 7 TB NVMe SSD) | 6 |
| Sysbench | BMI5 (96 cores / 384 GB / 7 TB NVMe SSD) | 1 |
Software Versions

| Service Type | Software Version |
| --- | --- |
| PD | 3.0.18 |
| TiKV | 3.0.18 |
| TiDB | 3.0.18 |
| Sysbench | 3.0.18 |
Write Test

| Threads | QPS | 95% Latency (ms) |
| --- | --- | --- |
| 16 | 7705 | 2.81 |
| 32 | 13338 | 3.82 |
| 64 | 21641 | 5.18 |
| 128 | 33155 | 7.84 |
| 256 | 44574 | 12.08 |
| 512 | 58604 | 17.32 |
| 768 | 67901 | 22.28 |
| 1024 | 75028 | 26.68 |
| 1536 | 86010 | 34.33 |
| 2048 | 92380 | 44.98 |
| 2500 | 96671 | 54.80 |
OLTP Read/Write Test

| Threads | QPS | 95% Latency (ms) |
| --- | --- | --- |
| 16 | 18000 | 22 |
| 32 | 35600 | 23.1 |
| 64 | 60648 | 26.68 |
| 128 | 92318 | 33.12 |
| 256 | 113686 | 55.82 |
| 512 | 138616 | 94.1 |
| 768 | 164364 | 134.9 |
| 1024 | 190981 | 167.44 |
| 1536 | 223237 | 204.11 |
| 2048 | 262098 | 231.53 |
| 2500 | 276107 | 272.27 |
Read‑Only Test

| Threads | QPS | 95% Latency (ms) |
| --- | --- | --- |
| 16 | 24235.51 | 15.27 |
| 32 | 45483.64 | 16.71 |
| 64 | 80193.6 | 17.95 |
| 128 | 123851.61 | 20.37 |
| 256 | 144999.89 | 34.30 |
| 512 | 174424.94 | 58.92 |
| 768 | 183365.72 | 86 |
| 1024 | 200460.98 | 108.68 |
| 1536 | 236120.82 | 153.02 |
| 2048 | 264444.73 | 204.11 |
| 2500 | 285103.48 | 253.35 |
3. Problems Encountered on TiDB and Solutions
3.1 High average latency and Raftstore thread CPU saturation
In massive‑data, limited‑resource scenarios a single TiKV holds many Regions, causing heavy Raftstore thread overhead. Version 2.x fixed the thread count at 2, creating a bottleneck.
Upgrading from 2.1 to 3.0 GA brought:
Stability: support for >150 storage nodes and >300 TB stable storage.
Ease of use: standardized slow‑query logs, EXPLAIN ANALYZE, SQL Trace, etc.
Performance: TPC‑C ~4.5×, Sysbench ~1.5× improvement; View support enables TPC‑H 50 GB Q15.
New features: window functions, experimental views, partitioned tables, plugin system, pessimistic lock (experimental), SQL Plan Management.
After the upgrade, the cluster's .999 latency dropped by more than 5×, to 400–500 ms.
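Version 3.0 also makes the Raftstore thread pool configurable rather than fixed. A hypothetical tikv.toml fragment (pool sizes here are illustrative, not our production values):

```toml
[raftstore]
# TiKV 3.0 replaces the fixed Raftstore thread count with configurable pools.
store-pool-size = 4   # threads driving Raft state machines
apply-pool-size = 4   # threads applying committed log entries
```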
3.2 Execution‑plan anomalies causing high load
Stale or inaccurate statistics can lead the optimizer to choose the wrong index, causing full‑table scans on tables with billions of rows. The solution is to keep statistics fresh by increasing the ANALYZE frequency.
Automatic Statistics Update
TiDB automatically updates total row count and modified rows; the update interval is controlled by stats‑lease (default 3 s). Setting it to 0 disables auto‑update.
| System Variable | Default | Function |
| --- | --- | --- |
| tidb_auto_analyze_ratio | 0.5 | Auto‑update threshold |
| tidb_auto_analyze_start_time | 00:00 +0000 | Start time of the daily auto‑analyze window |
| tidb_auto_analyze_end_time | 23:59 +0000 | End time of the daily auto‑analyze window |
When a table’s modify_count exceeds tidb_auto_analyze_ratio of its total rows and the current time falls within the configured window, TiDB runs ANALYZE TABLE tbl automatically.
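The trigger condition above can be sketched in a few lines (names and the exact comparison are illustrative; the real logic lives in TiDB's statistics worker):

```python
from datetime import time

def should_auto_analyze(modify_count, row_count, now,
                        ratio=0.5,
                        start=time(0, 0), end=time(23, 59)):
    """Return True when modified rows exceed `ratio` of the table's total
    rows and the current time falls inside the analyze window."""
    if row_count == 0:
        return False
    in_window = start <= now <= end
    return in_window and (modify_count / row_count > ratio)

# A table with 1M rows and 600k modifications at noon crosses the 0.5 threshold.
print(should_auto_analyze(600_000, 1_000_000, time(12, 0)))  # True
```

When the window or threshold is too conservative for a hot table, a manual `ANALYZE TABLE tbl` is still available.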
3.3 Write‑write and read‑write conflicts increasing latency
TiDB uses an optimistic lock model with a two‑phase commit (Prewrite → Commit). High concurrency on the same order record leads to many txnLock (write‑write) and txnLockFast (read‑write) conflicts, raising response time.
Mitigation strategies:
Pre‑check lock conflicts in TiDB – detect conflicts early, before requests are sent to TiKV.
Try pessimistic transaction locking (in our tests it performed no better).
Serialize operations on the same order record, either with a Redis distributed lock (which adds complexity) or by writing asynchronously through a message queue, routing each key to a single Kafka partition so the same record is never written concurrently.
After several analysis‑optimization cycles, conflict counts were reduced and overall average latency decreased, improving cluster stability.
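The message‑queue mitigation can be sketched as follows (the partition count and key format are illustrative; a real producer would hand the key to the Kafka client, which applies its own partitioner):

```python
import zlib

NUM_PARTITIONS = 16  # illustrative partition count for the orders topic

def partition_for(order_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the order key -> partition index.

    zlib.crc32 is deterministic across processes, unlike Python's salted
    built-in hash(), so the same order always maps to the same partition
    and its writes are consumed serially by that partition's consumer.
    """
    return zlib.crc32(order_id.encode("utf-8")) % num_partitions

# Every write for the same order lands on one partition, so no two
# concurrent transactions touch the same row.
assert partition_for("order-10086") == partition_for("order-10086")
```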
4. Ecosystem
4.1 DbKiller
Self‑developed tool that kills long‑running queries according to configurable policies; useful for handling abnormal queries and preventing cascading ("snowball") database failures.
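The policy core of such a killer can be sketched in a few lines (the threshold and the processlist shape are made up for illustration; DbKiller itself and its policies are internal):

```python
def queries_to_kill(processlist, max_seconds=60):
    """processlist: iterable of (connection_id, elapsed_seconds, sql) rows,
    e.g. sampled from information_schema.processlist. Any query running
    longer than the time budget is flagged for termination."""
    return [conn_id for conn_id, elapsed, _sql in processlist
            if elapsed > max_seconds]

procs = [(1, 5, "SELECT ..."), (2, 300, "SELECT ..."), (3, 90, "SELECT ...")]
print(queries_to_kill(procs))  # [2, 3]
```

Each flagged ID would then be killed on the TiDB server that owns the connection (e.g. with `KILL TIDB <id>`).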
4.2 DbCleaner
Self‑developed tool for window‑based archiving of table data; it is MySQL‑compatible and checks replica lag before cleaning.
4.3 Data Migration (DM)
Integrated data synchronization platform supporting full and incremental migration from MySQL or MariaDB to TiDB, simplifying error handling and reducing operational cost.
4.4 TiDB Lightning
Tool for fast bulk import of new data or full‑backup restoration, supporting multiple migration and upgrade scenarios.
5. Optimization Practices
5.1 Hotspot Issues
TiDB splits data by Region (default 96 MB). AUTO_INCREMENT primary keys cause writes to concentrate on a single Region, creating hotspots. Solutions include using the implicit _tidb_rowid with SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS to randomize row IDs and pre‑split regions.
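The effect of SHARD_ROW_ID_BITS can be illustrated with a small, simplified sketch (the bit layout here is illustrative; TiDB's actual rowid encoding differs in details):

```python
# Simplified illustration of SHARD_ROW_ID_BITS: a shard value placed in the
# high bits of the 64-bit _tidb_rowid scatters consecutive inserts across
# key ranges (Regions), instead of appending them all to one hot Region.
SHARD_BITS = 4  # like SHARD_ROW_ID_BITS = 4 -> up to 16 shards

def sharded_rowid(seq: int, shard: int, shard_bits: int = SHARD_BITS) -> int:
    """Place the shard value in the top bits below the sign bit."""
    return (shard << (63 - shard_bits)) | seq

# Consecutive sequence numbers in different shards are far apart in key
# space, so they land in different Regions.
a = sharded_rowid(1, shard=3)
b = sharded_rowid(2, shard=9)
assert abs(a - b) > 1_000_000
```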
5.2 Archiving Historical Data
Introduce partitioned tables with order‑time keys; drop partitions to delete old data efficiently, avoiding performance impact of massive DELETE operations.
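The archiving pattern can be sketched with range partitioning on the order time (table, column, and partition names are illustrative, not the real schema):

```sql
-- Hypothetical order table partitioned by month on the order time.
CREATE TABLE orders_archive (
    id BIGINT NOT NULL,
    order_time DATETIME NOT NULL,
    payload VARCHAR(255)
)
PARTITION BY RANGE (TO_DAYS(order_time)) (
    PARTITION p202001 VALUES LESS THAN (TO_DAYS('2020-02-01')),
    PARTITION p202002 VALUES LESS THAN (TO_DAYS('2020-03-01'))
);

-- Dropping a month of history is a metadata operation,
-- far cheaper than a massive DELETE.
ALTER TABLE orders_archive DROP PARTITION p202001;
```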
5.3 Data Backup
Backup & Restore (BR) provides efficient full and incremental backups for a ~14 TB cluster, reducing backup time from days to a manageable window.
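A minimal BR invocation might look like the following sketch; the PD address, storage path, and rate limit are placeholders:

```shell
# Hypothetical full backup of the cluster to shared storage,
# throttled per TiKV so online traffic is not starved.
br backup full \
    --pd "127.0.0.1:2379" \
    --storage "local:///data/backup/full" \
    --ratelimit 120
```

Restores use the symmetric `br restore full` with the same storage path.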
5.4 Cluster State Diagnosis
TiDB Dashboard (v4.0) offers comprehensive metrics and diagnostic reports, enabling quick identification of abnormal states and performance bottlenecks.
6. Conclusion
TiDB, as a next‑generation high‑performance distributed database, is a strong choice for massive data storage scenarios. Ongoing community engagement and internal operational improvements will continue to guide its adoption in suitable use cases.
Beijing SF i-TECH City Technology Team
Official tech channel of Beijing SF i-TECH City. A publishing platform for technology innovation, practical implementation, and frontier tech exploration.