
Scaling TiDB to 200TB at Zhuanzhuan: Key Performance and Management Lessons

Zhuanzhuan’s adoption of TiDB addressed sharding challenges and massive data storage needs, and the team shares six common issues encountered in large‑scale online deployments—including performance diagnosis, cluster management, log inconsistencies, slow‑SQL impact, optimizer limitations, and transaction conflicts—along with their standardized solutions for deployment, monitoring, alerting, and business rollout.

dbaplus Community

Why Zhuanzhuan Adopted TiDB

Zhuanzhuan introduced TiDB primarily to eliminate the complexity of sharding and to handle petabyte‑scale data storage that exceeded the capacity of single‑node databases. By moving to a distributed SQL engine, the company aimed to simplify business logic, reduce development overhead, and improve cost efficiency.

Problems Faced in Large‑Scale Online Use

The team identified six recurring issues when operating dozens of TiDB clusters serving over 200 TB of data and billions of daily requests:

Performance diagnosis – Sudden increases in query latency required pinpointing offending SQL statements via TiDB monitoring dashboards, log inspection, and TiKV logs.

Cluster management – Managing dozens of clusters introduced challenges in deployment, version upgrades, and configuration compatibility, especially when PD instances conflicted under systemd.

Inconsistent log formats – Different TiDB versions produced divergent slow‑query log schemas, making ETL and analysis costly.

Impact of slow SQL on stability – Heavy analytical or ETL queries could degrade overall cluster responsiveness.

Optimizer failing to hit indexes – Identical queries sometimes produced vastly different execution plans, leading to unpredictable performance.

Transaction conflicts – High contention on the same rows caused retries (retry_limit = 3) that dramatically slowed the cluster.
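To make the transaction-conflict issue concrete, here is a minimal in-memory sketch of optimistic commits with a bounded retry count. It is not TiDB code: `Row`, `commit`, and `increment_with_retry` are hypothetical names, and the only value taken from the article is the retry limit of 3. Each conflicting attempt repeats the whole read-modify-write cycle, which is why hot rows become expensive.

```python
from dataclasses import dataclass


@dataclass
class Row:
    value: int = 0
    version: int = 0


class ConflictError(Exception):
    """Raised when a commit sees a newer version than the one it read."""


def commit(row: Row, read_version: int, new_value: int) -> None:
    # Optimistic check: succeed only if nobody wrote since our read.
    if row.version != read_version:
        raise ConflictError("row changed since it was read")
    row.value = new_value
    row.version += 1


def increment_with_retry(row: Row, interfering_writes: int, retry_limit: int = 3) -> int:
    """Increment row.value and return the number of attempts used.

    `interfering_writes` simulates how many times another writer slips in
    between our read and our commit (at most once per attempt).
    """
    attempts = 0
    while attempts <= retry_limit:  # one initial try plus `retry_limit` retries
        attempts += 1
        read_version, read_value = row.version, row.value
        if interfering_writes > 0:
            # A competing transaction commits first and invalidates our read.
            commit(row, row.version, row.value + 1)
            interfering_writes -= 1
        try:
            commit(row, read_version, read_value + 1)
            return attempts
        except ConflictError:
            continue
    raise ConflictError(f"gave up after {attempts} attempts")
```

With two interfering writers the increment lands on the third attempt; with four, the retry budget is exhausted and the caller sees an error after doing the work four times.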

TiDB Cluster Standardization

1. Deployment Standardization

Zhuanzhuan deploys at least three TiDB servers for high availability, three PD servers, and six TiKV servers. Recommended hardware includes 10 GbE NICs for TiDB, 1 GbE or 10 GbE for PD, and SSDs for TiKV (single‑node capacity ≤ 400 GB).

Do not use TiDB for workloads smaller than 500 GB to avoid resource waste.

Limit each TiKV instance to ≤ 400 GB to shorten recovery time after failures.

Use 10 GbE NICs for TiDB/TiKV under high concurrency to prevent network bottlenecks.

Run multiple TiKV instances per machine with separate disks to isolate I/O.
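The sizing guidelines above lend themselves to quick arithmetic. The sketch below is a rough planning aid, not a tool from the article: `plan_tikv` is a hypothetical helper, the replica count of 3 is an assumption, and compression and headroom are ignored. Only the 500 GB and 400 GB thresholds come from the text.

```python
import math

MIN_DATASET_GB = 500        # from the guideline: smaller workloads waste resources
MAX_TIKV_INSTANCE_GB = 400  # from the guideline: per-instance cap for fast recovery


def plan_tikv(dataset_gb: float, replicas: int = 3) -> dict:
    """Estimate how many TiKV instances a dataset needs.

    Assumes every byte is stored `replicas` times (3 is an assumption here)
    and that instances are filled right up to the cap.
    """
    stored_gb = dataset_gb * replicas
    return {
        "suitable_for_tidb": dataset_gb >= MIN_DATASET_GB,
        "stored_gb": stored_gb,
        "tikv_instances": math.ceil(stored_gb / MAX_TIKV_INSTANCE_GB),
    }


if __name__ == "__main__":
    print(plan_tikv(2000))
```

For a 2,000 GB dataset this gives 6,000 GB of stored data and 15 instances at the 400 GB cap; a real plan would add headroom for growth and rebalancing.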

2. Information Collection

The team standardizes log collection by using TiDB versions ≥ 2.1.8, where slow‑query logs follow a MySQL‑compatible format. Logs are ingested via Flume, processed centrally, and visualized with real‑time slow‑query dashboards.
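A consistent log format is what makes a single parser possible. The following is a minimal sketch of parsing a MySQL-style slow log, where metadata arrives as `# Key: value` comment lines followed by the statement itself. The sample text and field names are illustrative rather than copied from a real cluster, and `parse_slow_log` is a hypothetical helper, not part of TiDB or Flume.

```python
import re
from typing import Dict, Iterable, Iterator, List

# One or more "Key: value" pairs may share a single comment line.
FIELD_RE = re.compile(r"(\w+): (\S+)")

# Illustrative sample only; real logs carry more fields.
SAMPLE_LOG = """\
# Time: 2019-04-28T15:24:04.309074+08:00
# User: root@127.0.0.1
# Query_time: 4.895492
# Process_time: 0.161 Total_keys: 100001
# DB: test
select * from t_slim;
# Time: 2019-04-28T15:25:10.000000+08:00
# Query_time: 0.512000
# DB: test
update t set c = 1 where id = 7;
"""


def parse_slow_log(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per slow-query entry, with the statement under 'sql'."""
    entry: Dict[str, str] = {}
    sql_parts: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# "):
            # Metadata line: collect every Key: value pair on it.
            for key, value in FIELD_RE.findall(line[2:]):
                entry[key] = value
        else:
            # Statement line; a trailing semicolon closes the entry.
            sql_parts.append(line)
            if line.endswith(";"):
                entry["sql"] = " ".join(sql_parts)
                yield entry
                entry, sql_parts = {}, []


if __name__ == "__main__":
    for e in parse_slow_log(SAMPLE_LOG.splitlines()):
        if float(e["Query_time"]) >= 1.0:
            print(e["DB"], e["Query_time"], e["sql"])
```

Because every entry becomes a flat dictionary, downstream steps such as filtering by `Query_time` or grouping by database need no per-version handling.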

3. Monitoring & Alerting

TiDB’s native alerts are curated to reduce noise and focus on actionable events. Alerts are aggregated, prioritized, and presented in a simplified form for developers who may not understand raw TiDB metrics.
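One way to picture the curation step is a small filter that drops noisy alerts, collapses duplicates, orders by priority, and swaps metric names for plain-language messages. The alert names, priorities, and wording below are invented for illustration; they are not TiDB's actual alert rules or Zhuanzhuan's configuration.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical alert names mapped to a priority (0 is most urgent).
SEVERITY = {"tikv_down": 0, "disk_almost_full": 1, "slow_query_spike": 2}

# Hypothetical developer-facing wording for each alert.
MESSAGES = {
    "tikv_down": "A storage node is unreachable",
    "disk_almost_full": "Disk space is running low",
    "slow_query_spike": "Slow queries are increasing",
}


def summarize(raw_alerts: List[Tuple[str, str]]) -> List[str]:
    """Turn (cluster, alert_name) pairs into a short, prioritized digest.

    Alerts without a known severity are treated as noise and dropped.
    """
    counts = Counter((c, name) for c, name in raw_alerts if name in SEVERITY)
    ordered = sorted(counts.items(), key=lambda kv: (SEVERITY[kv[0][1]], kv[0][0]))
    return [
        f"[P{SEVERITY[name]}] {cluster}: {MESSAGES[name]} (x{n})"
        for (cluster, name), n in ordered
    ]


if __name__ == "__main__":
    raw = [
        ("im", "slow_query_spike"),
        ("im", "slow_query_spike"),
        ("order", "tikv_down"),
        ("im", "heartbeat_jitter"),  # unknown alert, filtered out
    ]
    for line in summarize(raw):
        print(line)
```

The allow-list approach errs on the side of silence: anything not explicitly classified never reaches developers, so the mapping has to be reviewed whenever new alert rules are introduced.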

4. Business Release Process

Before production rollout, DBA teams review table schemas, indexes, and SQL lists, enforcing the use of FORCE INDEX to mitigate optimizer issues. SQL Plan Management (experimental in TiDB 3.0) is evaluated.

Performance targets are set to 99.9% of queries under 100 ms and 99% under 10 ms.

For high-contention updates, Zhuanzhuan employs a custom distributed lock (ZZLock) to emulate pessimistic locking and reduce retry latency.

Data extraction relies on TiDB's binlog, Pump, and Drainer components to stream changes to downstream systems.
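The performance targets mentioned above (99.9% of queries under 100 ms, 99% under 10 ms) can be checked against a batch of measured latencies with a few lines of code. This is a sketch of that check only; `fraction_under` and `meets_targets` are hypothetical helpers, and how latencies are collected is outside its scope.

```python
from typing import Dict, Sequence


def fraction_under(latencies_ms: Sequence[float], limit_ms: float) -> float:
    """Share of queries that finished strictly faster than `limit_ms`."""
    return sum(1 for x in latencies_ms if x < limit_ms) / len(latencies_ms)


def meets_targets(latencies_ms: Sequence[float]) -> Dict[str, bool]:
    """Evaluate both targets stated in the article for one sample."""
    return {
        "99% under 10 ms": fraction_under(latencies_ms, 10) >= 0.99,
        "99.9% under 100 ms": fraction_under(latencies_ms, 100) >= 0.999,
    }


if __name__ == "__main__":
    # 990 fast queries, 9 moderate ones, 1 outlier.
    sample = [2.0] * 990 + [50.0] * 9 + [150.0]
    print(meets_targets(sample))
```

In this sample both targets are met exactly at the boundary; a second query above 100 ms would break the 99.9% target while leaving the 99% one intact.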

Future Plans

Zhuanzhuan is piloting TiDB containers with PingCAP and plans to migrate workloads to the cloud to further reduce operational costs.

Q&A Highlights

Q1: Largest cluster runs the IM service with hundreds of nodes, data volume in the hundreds of billions, and peak QPS in the high‑hundreds of thousands.

Q2: Production runs versions 2.0, 2.0.5, 2.1.7, and 2.1.8. The default is 2.1.8; clusters on earlier versions are kept as they are because they remain stable, and upgrades are timed to business needs.

Q3: Each cluster hosts a single logical database; isolation is handled at the application level.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
