Databases 15 min read

How Zhihu Scaled to Trillions of Rows with TiDB: Architecture and Lessons

This article details Zhihu's Moneta service handling over a trillion rows, the scalability challenges it faced, the evaluation of MySQL sharding and MHA, and how adopting the open‑source HTAP database TiDB enabled millisecond‑level query latency, high availability, and easy horizontal expansion.

Code Ape Tech Column
Code Ape Tech Column
Code Ape Tech Column
How Zhihu Scaled to Trillions of Rows with TiDB: Architecture and Lessons

Background

Moneta, the service that records which posts a user has already read on Zhihu, stores about 1.3 trillion rows and generates roughly 100 billion new rows each month. The volume is expected to reach 3 trillion rows within two years. The system must keep query latency below 90 ms while supporting massive write throughput.

System Requirements

High availability : Users must never see already‑read posts in the recommendation feed.

Outstanding performance : The service must handle peak write rates of >40 k rows/s and query loads of >30 k QPS across 1.2 million posts.

Easy scalability : Architecture should grow horizontally as data volume expands.

Problems with Existing MySQL Architecture

Sharding

Application code becomes complex and hard to maintain.

Changing the shard key is cumbersome.

Upgrading application logic impacts availability.

Master High Availability (MHA)

Requires custom scripts or third‑party tools for virtual IP configuration.

Only monitors the primary node; no read‑load‑balancing for replicas.

Needs password‑less SSH, introducing security risks.

TiDB Overview

TiDB is an open‑source, MySQL‑compatible NewSQL database that provides Hybrid Transaction/Analytical Processing (HTAP). Its core components are:

TiDB Server : Stateless SQL layer that parses MySQL‑compatible queries and forwards them to the storage layer.

TiKV Server : Distributed transactional key‑value store that uses the Raft consensus protocol for strong consistency and automatic failover.

Placement Driver (PD) : Etcd‑backed metadata service that schedules TiKV regions and manages cluster topology.

TiSpark : Apache Spark plugin that enables complex OLAP queries on top of TiKV.

TiDB also provides an ecosystem of tools such as Ansible deployment scripts, Syncer for MySQL migration, TiDB Binlog for incremental backup, TiDB Data Migration (DM) for change‑data‑capture, and TiDB Lightning for fast bulk import.

TiDB Deployment in Moneta

The architecture was reorganized into three logical layers:

Top layer : Stateless, horizontally scalable client API and proxy.

Middle layer : Soft‑state services and a layered Redis cache; data persisted in TiDB enables rapid recovery after failures.

Bottom layer : TiDB cluster storing all stateful data; Raft replication provides self‑healing on node loss.

Kubernetes orchestrates the entire stack, providing global health monitoring and automated failover.

Performance After Migration

Peak write throughput: 40 000 rows per second.

Peak query load: 30 000 queries per second across 1.2 million posts.

99th‑percentile latency ≈ 25 ms; 999th‑percentile latency ≈ 50 ms, well under the 90 ms target.

Key charts:

Write rows per second
Write rows per second
Query volume per second
Query volume per second
99th percentile latency
99th percentile latency
999th percentile latency
999th percentile latency

Key Learnings

Fast data import : Using TiDB DM to capture MySQL binlog and TiDB Lightning for bulk load, 1.1 trillion records were imported in four days (versus an estimated month with traditional methods).

Reduced query latency : Separate TiDB instances for latency‑sensitive and non‑sensitive workloads prevent large analytical queries from affecting short‑response paths.

SQL hints and low‑precision timestamps : Custom optimizer hints guide plan selection; TiDB’s TSO timestamps and prepared statements cut round‑trip latency.

Resource evaluation : Raft‑based replication requires at least three replicas, increasing hardware needs compared with a single‑master MySQL setup, but provides self‑healing and horizontal scalability.

TiDB 3.0 Enhancements

Titan engine : A key‑value storage engine that reduces write amplification for large values, dramatically lowering write and query latency compared with the default RocksDB engine.

Table partitioning : Allows time‑range partitions, enabling queries to scan only recent partitions and improving performance for recent data.

gRPC batch messaging & multi‑threaded Raftstore : Batch Raft messages over gRPC and process Raft regions on multiple threads, boosting write throughput and concurrency.

SQL plan management : Binds a query to a specific execution plan without modifying the SQL text, simplifying optimizer tuning.

TiFlash : Column‑store extension that provides high‑compression analytics with Raft‑consistent replication, eliminating the need for separate ETL pipelines.

Performance comparison of RocksDB vs. Titan:

Write and query latency: RocksDB vs Titan
Write and query latency: RocksDB vs Titan

Future Plans

Because TiDB is MySQL‑compatible, existing applications can adopt it with minimal changes while benefiting from horizontal scalability to handle over a trillion records. The team will continue to contribute to the TiDB open‑source ecosystem and work with PingCAP to further strengthen the platform.

Code example

点击上方☝
码猿技术专栏
轻松关注!
及时获取有趣有料的技术文章
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performanceScalabilityTiDBHTAPNewSQLdatabase migration
Code Ape Tech Column
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.