Optimizing Write Hotspots in TiDB: Architecture Overview, Problem Analysis, and Solution
This article details a real‑world case of write‑hotspot performance degradation in a TiDB 3.0.9 cluster, explains TiDB’s architecture, analyzes monitoring data and hot‑region tables, and presents a redesign using varchar primary keys and SHARD_ROW_ID_BITS to disperse writes and dramatically improve latency.
1. Background
The content‑master data project stores its data in a distributed relational database TiDB (version 3.0.9 at the time of writing). This article uses a business‑level optimization case that solved a write‑hotspot problem to briefly introduce TiDB architecture, the cluster’s basic situation, the encountered issues and their solutions, and finally the importance of database development standards.
Content‑master data is a core system of the unified middle‑platform, providing a consistent data source for all automotive content across platforms. Future plans include ingesting data from all content systems, defining unified standards, and offering a centralized, consistent data outlet for front‑end services.
2. TiDB Introduction
TiDB is an open‑source distributed HTAP (Hybrid Transactional and Analytical Processing) database that combines the best features of traditional RDBMS and NoSQL. It is MySQL‑compatible, supports unlimited horizontal scaling, offers strong consistency and high availability, and aims to provide a one‑stop solution for both OLTP and OLAP workloads.
TiDB Server
TiDB Server receives SQL requests, processes SQL logic, locates the required TiKV addresses via PD, interacts with TiKV to fetch data, and returns results. It is stateless, does not store data, and can be horizontally scaled behind load balancers such as LVS, HAProxy, or F5.
PD Server
Placement Driver (PD) manages the cluster metadata, performs scheduling and load balancing for TiKV, and allocates globally unique, incrementing transaction IDs. PD uses the Raft protocol for safety and recommends deploying an odd number of PD nodes.
TiKV Server
TiKV Server provides a distributed transactional key‑value storage engine. Data is divided into Regions (key ranges). Each Region is replicated via Raft groups across multiple TiKV nodes, and PD schedules Regions to balance load.
3. Cluster Information
3.1 Basic Information
The TiDB cluster consists of the following components (all version 3.0.9):
Component
Version
Count
tidb
v3.0.9
5
pd
v3.0.9
3
tikv
v3.0.9
12 (multiple instances per machine)
pump
v3.0.9
4
drainer
v3.0.9
3
Pump records TiDB binlogs in real time, orders them by transaction commit time, and provides them to Drainer. Drainer aggregates binlogs from all Pumps and synchronizes them downstream to Kafka.
3.2 Cluster Topology
Business data is written to TiDB via two CDC pipelines: (1) SQL Server CDC, (2) MySQL Otter. Multiple TiDB nodes behind LVS serve business traffic. Two Drainers stream TiDB data to a downstream Kafka cluster for incremental consumption, while a third Drainer writes data to disk for offline analysis (e.g., using the reparo tool to parse binlogs).
4. Problem Description
The write path is: Drainer → Kafka → consumer → TiDB summary database.
Business teams reported that writes to the TiDB summary database were slow and Kafka messages were piling up. Monitoring showed SQL‑99 response times around 200‑220 ms, while insert QPS was only about 4,500 per second, indicating a clear performance bottleneck.
5. Problem Analysis
5.1 Slow‑Log Inspection
Using pt-query-digest on the slow‑log, the top slow SQL statements were simple INSERTs with an average execution time of 8 ms, suggesting that the latency was not caused by complex queries.
Insert IGNORE Into table_name (ciz_id,data_type,field,`value`) values(37163712,'club','club_is_jinghua','0') \G5.2 Monitoring Indicators
Hot‑Write panel revealed three anomalies on a single TiKV node:
gRPC poll CPU usage significantly higher than other nodes.
QPS on that TiKV much higher than the rest.
Overall CPU usage on that node markedly higher.
These symptoms indicate a write hotspot concentrating load on one TiKV instance.
5.3 Hot‑Region Table Identification
TiDB 3.0 provides the system view information_schema.TiDB_HOT_REGIONS to locate hot tables or indexes:
select * from information_schema.TiDB_HOT_REGIONS where TYPE='write';The query revealed that four summary tables of the content data were the hot spots. Their schema (simplified) used an auto‑increment id as the primary key:
CREATE TABLE `table_name` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key',
`data_type` varchar(100) NOT NULL COMMENT 'data type',
`ciz_id` bigint(11) NOT NULL COMMENT 'data identifier',
`content` mediumtext DEFAULT NULL COMMENT 'content',
`update_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'ingest time',
PRIMARY KEY (`id`),
UNIQUE KEY `uniq_type_cizid` (`data_type`,`ciz_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='XXX table';Because TiDB splits data by primary‑key values into Regions, an auto‑increment key causes most recent rows to land in the same Region, creating a write hotspot on a single TiKV node.
6. Solution
To eliminate the hotspot, the table design was changed:
Remove the auto‑increment integer primary key.
Use a varchar primary key.
Configure SHARD_ROW_ID_BITS to scatter rows across Regions.
Revised DDL example:
CREATE TABLE `table_name` (
`global_id` varchar(20) NOT NULL COMMENT 'primary key',
`data_type` varchar(100) NOT NULL COMMENT 'data type',
`ciz_id` bigint(11) NOT NULL COMMENT 'data identifier',
`content` mediumtext DEFAULT NULL COMMENT 'content',
`update_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'ingest time',
PRIMARY KEY (`global_id`),
UNIQUE KEY `uniq_type_cizid` (`data_type`,`ciz_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 SHARD_ROW_ID_BITS=4 COMMENT='XXX table';After applying the new schema and re‑running performance tests, the write latency dropped dramatically: SQL‑99 response time fell from ~200 ms to around 20 ms.
The improvement was achieved thanks to close collaboration between developers and testers.
7. Summary
Although TiDB automatically rebalances data, business characteristics or sudden load spikes can still create hotspots that become performance bottlenecks. Designing tables for a distributed database differs from traditional single‑node databases; developers must assume that a table’s rows will be spread across many machines.
Based on the experience, the following database development guidelines were established for TiDB 3.0:
Primary keys must be varchar and configured with SHARD_ROW_ID_BITS to avoid write hotspots.
Do not use auto‑increment integer keys. SHARD_ROW_ID_BITS=4 creates 16 (2^4) random ranges for the internal row ID.
From TiDB 4.0 onward, the AUTO_RANDOM syntax can replace AUTO_INCREMENT for integer primary keys, automatically scattering IDs and eliminating hotspots.
CREATE TABLE `tb_example` (
`user_id` varchar(20) NOT NULL COMMENT 'user id',
`name` varchar(10) NOT NULL DEFAULT '' COMMENT 'name',
`created_stime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
`modified_stime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'modification time',
PRIMARY KEY (`user_id`),
KEY `idx_modified_stime` (`modified_stime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 SHARD_ROW_ID_BITS=4 COMMENT='TiDB standard table';8. Conclusion
This case study of write‑hotspot optimization leads to a set of database development standards that apply not only to TiDB but also to other databases such as SQL Server, MySQL, and MongoDB. Following these standards helps improve database performance, efficiency, and ultimately better serves business needs.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
