Scaling TiDB to 200TB at Zhuanzhuan: Key Performance and Management Lessons
Zhuanzhuan adopted TiDB to escape the complexity of sharding and to store very large data volumes. The team describes six issues it met in large-scale online deployments (performance diagnosis, cluster management, inconsistent log formats, the impact of slow SQL, optimizer limitations, and transaction conflicts) and the standardized practices it built for deployment, monitoring, alerting, and business rollout.
Why Zhuanzhuan Adopted TiDB
Zhuanzhuan introduced TiDB primarily to eliminate the complexity of sharding and to store data volumes that exceed the capacity of a single-node database. By moving to a distributed SQL engine, the company aimed to simplify business logic, reduce development overhead, and improve cost efficiency.
Problems Faced in Large‑Scale Online Use
The team identified six recurring issues when operating dozens of TiDB clusters serving over 200 TB of data and billions of daily requests:
Performance diagnosis – Sudden increases in query latency required pinpointing offending SQL statements via TiDB monitoring dashboards, log inspection, and TiKV logs.
Cluster management – Managing dozens of clusters introduced challenges in deployment, version upgrades, and configuration compatibility, especially when PD instances conflicted under Systemd.
Inconsistent log formats – Different TiDB versions produced divergent slow‑query log schemas, making ETL and analysis costly.
Impact of slow SQL on stability – Heavy analytical or ETL queries could degrade overall cluster responsiveness.
Optimizer failing to hit indexes – Identical queries sometimes produced vastly different execution plans, leading to unpredictable performance.
Transaction conflicts – High contention on the same rows caused retries (retry_limit = 3) that dramatically slowed the cluster.
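To see why retries on contended rows add so much latency, consider a minimal Python sketch of an optimistic retry loop. It is not TiDB's internal implementation: `WriteConflict`, `run_with_retry`, and `make_contended_txn` are hypothetical names, and the only value taken from the text is the retry limit of 3.

```python
class WriteConflict(Exception):
    """Raised when an optimistic commit loses a race on the same row."""


def run_with_retry(txn, retry_limit=3):
    """Run txn(); on WriteConflict, retry up to retry_limit more times.

    Returns (result, attempts). Every attempt re-executes the whole
    transaction, so latency grows with the number of conflicts.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return txn(), attempts
        except WriteConflict:
            if attempts > retry_limit:
                raise  # retries exhausted; surface the conflict to the caller


def make_contended_txn(conflicts_before_success):
    """Hypothetical transaction that conflicts a fixed number of times."""
    state = {"remaining": conflicts_before_success}

    def txn():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise WriteConflict("row was modified by another transaction")
        return "committed"

    return txn


if __name__ == "__main__":
    result, attempts = run_with_retry(make_contended_txn(2))
    print(result, "after", attempts, "attempts")
```

With a retry limit of 3, a transaction runs at most four times before the conflict reaches the application, so a hot row can multiply the work behind a single logical write.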
TiDB Cluster Standardization
1. Deployment Standardization
Zhuanzhuan deploys at least three TiDB servers for high availability, three PD servers, and six TiKV servers. Recommended hardware includes 10 GbE NICs for TiDB, 1 GbE or 10 GbE for PD, and SSDs for TiKV (single‑node capacity ≤ 400 GB).
Do not use TiDB for workloads smaller than 500 GB to avoid resource waste.
Limit each TiKV instance to ≤ 400 GB to shorten recovery time after failures.
Use 10 GbE NICs for TiDB/TiKV under high concurrency to prevent network bottlenecks.
Run multiple TiKV instances per machine with separate disks to isolate I/O.
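The sizing rules above can be turned into a rough planning calculation. The sketch below is only an illustration: it assumes three replicas of every piece of data, ignores compression, and treats the 400 GB cap as applying to the replicated data held by each instance. The function names are hypothetical.

```python
import math

MIN_DATA_GB = 500          # below this, the text advises against TiDB
MAX_PER_INSTANCE_GB = 400  # per-TiKV-instance cap from the text


def worth_using_tidb(logical_data_gb):
    """Apply the 'not for workloads smaller than 500 GB' rule."""
    return logical_data_gb >= MIN_DATA_GB


def tikv_instances_needed(logical_data_gb, replicas=3,
                          max_per_instance_gb=MAX_PER_INSTANCE_GB):
    """Estimate the TiKV instance count for a given logical data size.

    Assumes every byte is stored `replicas` times and that no instance
    should hold more than `max_per_instance_gb`. At least `replicas`
    instances are needed so each copy lives on a different instance.
    """
    raw_gb = logical_data_gb * replicas
    return max(replicas, math.ceil(raw_gb / max_per_instance_gb))


if __name__ == "__main__":
    for size_gb in (100, 2000, 10000):
        print(size_gb, worth_using_tidb(size_gb), tikv_instances_needed(size_gb))
```

Under these assumptions, 2 TB of logical data needs 15 instances, which is why running several instances per machine on separate disks becomes attractive.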
2. Information Collection
The team standardizes log collection by using TiDB versions ≥ 2.1.8, where slow‑query logs follow a MySQL‑compatible format. Logs are ingested via Flume, processed centrally, and visualized with real‑time slow‑query dashboards.
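A uniform log format is what makes central processing cheap. The following Python sketch parses a MySQL-style slow-query log in which each entry is a run of `# Key: value` comment lines followed by the SQL statement. The sample text and its field names are illustrative assumptions modeled on that layout, not a capture from Zhuanzhuan's clusters, and the real pipeline described here uses Flume rather than this script.

```python
import re

# Matches "Key: value" pairs; one comment line may carry several of them.
FIELD_RE = re.compile(r"(\w+): (\S+)")

SAMPLE = """\
# Time: 2019-04-28T15:24:04.309074+08:00
# User: app@10.0.0.1
# Query_time: 1.527
# DB: orders
select * from t_order where user_id = 42;
# Time: 2019-04-28T15:24:09.100000+08:00
# User: etl@10.0.0.2
# Query_time: 12.250
# Process_time: 11.9 Total_keys: 5000000
# DB: orders
select count(*) from t_order
  where created_at > '2019-01-01';
"""


def parse_slow_log(text):
    """Return one dict per entry: the header fields plus the 'sql' text."""
    entries, fields, sql_lines = [], {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields.update(FIELD_RE.findall(line))
        else:
            sql_lines.append(line)
            if line.endswith(";"):  # statement complete, close the entry
                entries.append({**fields, "sql": " ".join(sql_lines)})
                fields, sql_lines = {}, []
    return entries


def slowest(entries, n):
    """Return the n entries with the largest Query_time."""
    return sorted(entries, key=lambda e: float(e["Query_time"]), reverse=True)[:n]


if __name__ == "__main__":
    for entry in slowest(parse_slow_log(SAMPLE), 1):
        print(entry["Query_time"], entry["User"], entry["sql"])
```

Because every cluster emits the same fields, one parser like this can feed a shared dashboard instead of one ETL job per TiDB version.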
3. Monitoring & Alerting
TiDB’s native alerts are curated to reduce noise and focus on actionable events. Alerts are aggregated, prioritized, and presented in a simplified form for developers who may not understand raw TiDB metrics.
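One way to picture this curation step is a small rule table that keeps only actionable alerts, assigns each a priority, and rewrites it in plain language. The Python sketch below is purely illustrative: the alert names, priorities, and messages are hypothetical and do not come from TiDB's alert rules or from Zhuanzhuan's system.

```python
from collections import Counter

# Hypothetical rule table: raw alert name -> (priority, developer-facing text).
# Alerts that are not listed are treated as noise and dropped.
RULES = {
    "example_node_down": (1, "A database node is unreachable; the DBA on call is handling it."),
    "example_latency_high": (2, "Queries are slower than usual; check recent releases."),
}


def curate(raw_alerts):
    """Filter, aggregate, and prioritize raw alerts.

    raw_alerts is a list of dicts with 'cluster' and 'name' keys.
    Repeated alerts for the same cluster and name collapse into one
    record with a count; the result is sorted by priority, then cluster.
    """
    counts = Counter(
        (a["cluster"], a["name"]) for a in raw_alerts if a["name"] in RULES
    )
    curated = [
        {
            "cluster": cluster,
            "priority": RULES[name][0],
            "message": RULES[name][1],
            "count": count,
        }
        for (cluster, name), count in counts.items()
    ]
    return sorted(curated, key=lambda a: (a["priority"], a["cluster"]))


if __name__ == "__main__":
    raw = [
        {"cluster": "im", "name": "example_latency_high"},
        {"cluster": "im", "name": "example_latency_high"},
        {"cluster": "orders", "name": "example_node_down"},
        {"cluster": "im", "name": "example_debug_metric"},
    ]
    for alert in curate(raw):
        print(alert)
```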
4. Business Release Process
Before production rollout, the DBA team reviews table schemas, indexes, and the list of SQL statements, and requires FORCE INDEX hints where the optimizer might otherwise choose a poor plan. The team is also evaluating SQL Plan Management, which is experimental in TiDB 3.0. Performance targets are 99% of queries under 10 ms and 99.9% under 100 ms. For high-contention updates, Zhuanzhuan uses a custom distributed lock (ZZLock) to emulate pessimistic locking and reduce retry latency. Data extraction relies on TiDB Binlog (the Pump and Drainer components) to stream changes to downstream systems.
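The latency targets can be checked mechanically against a sample of query times collected during pre-release testing. The Python sketch below assumes latencies are available as a list of milliseconds; the function names and the sample numbers are hypothetical, and only the two thresholds come from the text.

```python
# (threshold in ms, minimum fraction of queries that must be faster)
TARGETS = [(10, 0.99), (100, 0.999)]


def fraction_under(latencies_ms, threshold_ms):
    """Fraction of queries that finished in less than threshold_ms."""
    return sum(1 for x in latencies_ms if x < threshold_ms) / len(latencies_ms)


def meets_targets(latencies_ms, targets=TARGETS):
    """True only if every (threshold, fraction) target is satisfied."""
    return all(
        fraction_under(latencies_ms, threshold) >= required
        for threshold, required in targets
    )


if __name__ == "__main__":
    # Made-up sample: mostly fast queries with a small slow tail.
    sample = [3] * 9950 + [50] * 45 + [300] * 5
    print(fraction_under(sample, 10), fraction_under(sample, 100))
    print(meets_targets(sample))
```

A release candidate whose sample fails either threshold would go back for index or SQL review before rollout.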
Future Plans
Zhuanzhuan is piloting TiDB containers with PingCAP and plans to migrate workloads to the cloud to further reduce operational costs.
Q&A Highlights
Q1: Largest cluster runs the IM service with hundreds of nodes, data volume in the hundreds of billions, and peak QPS in the high‑hundreds of thousands.
Q2: Production runs versions 2.0, 2.0.5, 2.1.7, and 2.1.8. New clusters default to 2.1.8; clusters on earlier versions stay where they are because they remain stable, and upgrades are scheduled around business needs.
Q3: Each cluster hosts a single logical database; isolation is handled at the application level.
dbaplus Community
