TiDB Architecture, Deployment, and Monitoring Practices at Qunar
This article explains Qunar's transition from MySQL, Redis, and HBase to TiDB, detailing the background of distributed databases, TiDB's architecture, hardware selection, deployment automation, monitoring setup, and real‑world usage scenarios to address scalability and high‑availability challenges.
Qunar's DBA team, with extensive experience in MySQL and HBase, investigated TiDB and InnoDB memcached to address the limitations of traditional relational databases as data volume and latency requirements grew.
1. Background of Distributed Databases
Rapid internet growth caused data sizes to explode from hundreds of GB to hundreds of TB, making single‑node databases unsuitable for scalability and cost. Distributed databases emerged, with two major families: Google Spanner‑style shared‑nothing systems (e.g., TiDB, CockroachDB, OceanBase) and AWS Aurora‑style compute‑storage separation systems (e.g., PolarDB).
2. Current Data Storage at Qunar
Qunar uses three primary storage solutions: MySQL for most core data (limited by lack of horizontal scaling), Redis as a cache, and HBase for large‑scale logs and snapshots (offering linear write scalability but suffering from read latency, lack of SQL, JVM GC issues, and no cross‑row transactions).
To overcome these drawbacks, the DBA team began evaluating distributed databases in early 2017 and ultimately selected TiDB.
3. TiDB Architecture Overview
TiDB consists of three components:
(1) TiDB Server – stateless SQL layer that receives queries, resolves data locations via PD, and can be horizontally scaled behind load balancers.
(2) PD Server – Placement Driver that stores metadata, performs scheduling and load balancing, and allocates globally unique transaction IDs.
(3) TiKV Server – key‑value storage engine that manages data in Region units, replicates via Raft, and provides consistency and fault tolerance.
TiDB’s core strengths are horizontal scalability and high availability: multiple TiDB servers handle SQL traffic, while PD and TiKV ensure data redundancy through Raft.
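PD's role of allocating globally unique transaction IDs can be pictured as a timestamp oracle that hands out monotonically increasing (physical, logical) pairs. The following is a minimal sketch of that idea, not PD's actual implementation; the class and method names are illustrative, though the 18-bit logical suffix mirrors TiDB's documented TSO layout.

```python
import threading
import time

class TimestampOracle:
    """Toy TSO: hands out globally unique, monotonically increasing
    timestamps composed of a physical part (milliseconds) and a
    logical counter, similar in spirit to what PD does for transactions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._physical = 0  # last physical time seen, in milliseconds
        self._logical = 0   # counter within one millisecond

    def get_ts(self) -> int:
        with self._lock:
            now = int(time.time() * 1000)
            if now > self._physical:
                self._physical, self._logical = now, 0
            else:
                self._logical += 1  # same millisecond: bump the logical part
            # pack into one integer: low 18 bits hold the logical counter
            return (self._physical << 18) | self._logical

tso = TimestampOracle()
a, b = tso.get_ts(), tso.get_ts()
assert b > a  # strictly increasing, so transaction order is unambiguous
```

Because every timestamp is strictly greater than the previous one, any two transactions in the cluster can be ordered without coordination between TiDB servers.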
4. TiDB Principles and Implementation
The system separates the SQL layer from the KV storage layer. TiKV uses RocksDB (a high‑performance single‑node engine) with Raft replication to achieve durability. Multi‑version concurrency control and distributed transactions are built on top of this KV store, allowing MySQL‑compatible access.
Data is sharded into Regions (the smallest scheduling unit). Regions can be partitioned by hash or range; TiKV adopts range partitioning. Each Region forms a Raft group with multiple replicas, enabling load balancing and fault tolerance.
5. Hardware Selection and Deployment Plan
TiDB and PD have modest disk I/O requirements, so ordinary disks suffice, while TiKV is I/O‑intensive and SSDs are recommended. The recommended configuration uses four 600 GB SAS disks per TiKV machine, running four TiKV instances per host, with location labels so that PD never places multiple replicas of the same Region on one physical machine. A 10 GbE network is strongly advised.
Typical deployment: three servers host both TiDB and PD, and at least three separate servers host TiKV (three replicas per Region). Deployment is automated with TiDB‑Ansible, which handles OS initialization, component installation, rolling upgrades, data cleanup, environment cleanup, and monitoring configuration.
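The deployment topology above maps onto TiDB‑Ansible's inventory file. A sketch of what that inventory might look like for this layout (hostnames, IPs, and label values are illustrative; group names follow TiDB‑Ansible's conventions):

```ini
# inventory.ini (sketch; adjust hosts and labels to your environment)
[tidb_servers]
10.0.0.1
10.0.0.2
10.0.0.3

[pd_servers]
10.0.0.1
10.0.0.2
10.0.0.3

# four TiKV instances per host, distinguished by port and deploy_dir;
# the "host" label tells PD which instances share a machine
[tikv_servers]
tikv1-1 ansible_host=10.0.0.4 tikv_port=20171 labels="host=tikv4"
tikv1-2 ansible_host=10.0.0.4 tikv_port=20172 labels="host=tikv4"
tikv2-1 ansible_host=10.0.0.5 tikv_port=20171 labels="host=tikv5"
tikv2-2 ansible_host=10.0.0.5 tikv_port=20172 labels="host=tikv5"

[monitoring_servers]
10.0.0.1

[grafana_servers]
10.0.0.1
```

With the inventory in place, a single playbook run initializes the machines, installs all components, and wires up monitoring.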
6. Monitoring Solution
PingCAP provides a full monitoring stack based on Prometheus for metric collection and Grafana for visualization. Metrics are exported from client programs, collected via Pushgateway, scraped by Prometheus, and alerted through Alertmanager, with dashboards displayed in Grafana.
7. TiDB Usage at Qunar
Qunar adopted a cautious rollout: after extensive testing, TiDB clusters were first deployed for non‑critical workloads in August. Two clusters are now in production:
(1) An offline ticket‑statistics cluster replacing a 1.6 TB MySQL database, handling 10 GB daily growth and heavy OLAP queries.
(2) A financial‑payment cluster consolidating monthly‑sharded MySQL tables, enabling full‑table analytics and offline reporting without impacting online performance.
Data is synchronized from MySQL to TiDB via Syncer, allowing merge queries, joins, and OLAP workloads on TiDB.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.