How Zhihu Scales Its Read Service: Architecture, Performance, and TiDB Migration

This article explains how Zhihu built a highly available, high‑performance, and easily extensible read‑service for its homepage, detailing the system architecture, caching strategies, query performance requirements, and the migration from MySQL to TiDB with TiDB 3.0 enhancements.

Programmer DD

Business Scenario

Zhihu has grown from a Q&A platform to a massive knowledge content ecosystem with 30 million questions, over 130 million answers, and 220 million registered users. Efficiently delivering the most interesting content to users on the homepage is critical.

The read service records every piece of content a user has deeply read or skimmed, storing this data long‑term to filter already‑seen items from homepage recommendations and personalized push notifications.

Key characteristics of the service are: extremely high availability, because the homepage is a primary traffic channel; massive write volume, with peaks above 40 k records per second, ~3 billion new records per day, three-year retention, and about 1.3 trillion records stored to date; and stringent query latency, with a 90 ms end-to-end budget and P99/P999 latency of around 25 ms/50 ms while checking 12 million documents per second.

Architecture Design

The system was designed around three goals: high availability, high performance, and easy scalability.

2.1 High Availability

Fault detection and self‑healing mechanisms are built into every component, allowing automatic recovery without human intervention and isolating failures from the business side.

2.2 High Performance

Data is partitioned into slots, each with multiple cache replicas to increase availability and distribute read load. Bloom filters store each user's read set in a compact form, which cuts memory consumption and raises the cache hit rate. Write-through caching, data change subscriptions, and read-through designs further boost efficiency and reduce database pressure.
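As a rough illustration of why a Bloom filter suits this workload, the sketch below keeps a user's read set in a fixed-size bit array and uses it to drop already-seen candidates. It is a minimal sketch, not Zhihu's implementation: the class `BloomFilter`, the helper `filter_unread`, the bit-array size, and the hash count are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        # Sizes are illustrative assumptions, not values from the article.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def filter_unread(read_filter: BloomFilter, candidates: list) -> list:
    """Keep only candidates the filter has not seen (hypothetical helper)."""
    return [doc for doc in candidates if doc not in read_filter]


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("answer:1001")
    seen.add("answer:1002")
    print(filter_unread(seen, ["answer:1001", "answer:2001"]))
```

The trade-off fits a read filter well: a false positive only hides one unread item from a recommendation list, while a read item is never reported as unread.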

2.3 Easy Scalability

Stateless components (client API, proxy) can be scaled horizontally, while stateful components are kept weakly stateful or have their state externalized (e.g., to TiDB). Cache tagging isolates different business tenants, and multi-layer caching handles both spatial and temporal hot spots in the data.

2.4 Final Read‑Service Architecture

The top layer consists of stateless client APIs and proxies. The storage layer uses TiDB for durable state, with a layered Redis cache in front. Additional external components ensure cache consistency.

Key Components

3.1 Proxy

The proxy is stateless and routes requests to appropriate cache slots. If a slot’s replicas fail, the proxy can fall back to another slot, sacrificing performance but preserving availability.
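The routing and fallback behavior can be sketched as follows. This is a minimal sketch under stated assumptions: the `Proxy` class, the CRC32-based slot selection, and the model of a replica as a plain callable that raises `ConnectionError` when it is down are all hypothetical, since the article does not describe the actual routing function or failure signals.

```python
import zlib


class SlotUnavailable(Exception):
    """Raised when no replica in any slot can serve the request."""


class Proxy:
    def __init__(self, slots):
        # slots: list of slots; each slot is a list of replica callables
        # taking (user_id, doc_ids) and returning the already-read subset.
        self.slots = slots

    def slot_for(self, user_id: str) -> int:
        # Assumed routing rule: stable hash of the user id modulo slot count.
        return zlib.crc32(user_id.encode("utf-8")) % len(self.slots)

    def query(self, user_id: str, doc_ids):
        home = self.slot_for(user_id)
        # Try the home slot first, then every other slot as a fallback.
        order = [home] + [i for i in range(len(self.slots)) if i != home]
        for slot_index in order:
            for replica in self.slots[slot_index]:
                try:
                    return replica(user_id, doc_ids)
                except ConnectionError:
                    continue  # replica is down, try the next one
        raise SlotUnavailable(f"no replica available for user {user_id}")
```

A fallback slot is unlikely to hold the user's data in cache, so in a design like this the request would typically be served by reading through to the database, which is where the performance cost comes from.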

3.2 Cache

Cache design focuses on high utilization: using Bloom filters for dense storage, write‑through updates, and read‑through queries to minimize database hits. Cache nodes are hot‑started by transferring active state from peers, and multi‑layer caching reduces cross‑data‑center traffic.
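The write-through and read-through paths can be illustrated with a small in-memory model. It is only a sketch: `ReadStore` stands in for the durable database, `ReadCache` for a cache node, and all method names are hypothetical; the real service uses Redis and Bloom filters rather than Python sets.

```python
class ReadStore:
    """Stand-in for the durable store; counts loads to show cache effect."""

    def __init__(self) -> None:
        self.rows = {}  # user_id -> set of read doc ids
        self.loads = 0

    def insert(self, user_id: str, doc_id: str) -> None:
        self.rows.setdefault(user_id, set()).add(doc_id)

    def load(self, user_id: str) -> set:
        self.loads += 1
        return set(self.rows.get(user_id, ()))


class ReadCache:
    def __init__(self, store: ReadStore) -> None:
        self.store = store
        self.entries = {}  # user_id -> cached set of read doc ids

    def mark_read(self, user_id: str, doc_id: str) -> None:
        # Write-through: persist first, then update the cached entry in
        # place (if present) so it never lags behind the store.
        self.store.insert(user_id, doc_id)
        if user_id in self.entries:
            self.entries[user_id].add(doc_id)

    def has_read(self, user_id: str, doc_id: str) -> bool:
        # Read-through: on a miss, load once from the store and keep it.
        if user_id not in self.entries:
            self.entries[user_id] = self.store.load(user_id)
        return doc_id in self.entries[user_id]
```

With this shape, only the first query for a user touches the store; later writes and reads for that user are absorbed by the cache entry.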

3.3 Storage

Initially MySQL with sharding and MHA was used, but the massive data scale prompted migration to TiDB, which offers MySQL compatibility, high availability via Raft, and better scalability.

3.4 Performance Metrics

Current load reaches 40 k writes per second, 30 k independent queries per second, and 12 million document reads per second, with P99/P999 latency remaining stable at 25 ms/50 ms.

All About TiDB

4.1 MySQL to TiDB

Data migration leveraged TiDB DM for incremental binlog capture and TiDB Lightning for fast bulk import (≈4 days for ~1.1 trillion records). Post‑migration tuning addressed latency sensitivity, including query isolation, SQL hints, low‑precision TSO, and prepared‑statement reuse.

4.2 TiDB 3.0

TiDB 3.0 introduced gRPC batch messages, multi‑threaded Raft stores, Plan Management, TiFlash for analytical workloads, and the Titan storage engine for large records, dramatically improving write/query latency for both the read service and an anti‑fraud system.

Conclusion

Designing a sustainable system requires deep understanding of business characteristics and building a universally applicable architecture. Open‑source contributions and a cloud‑native mindset further enhance the system’s robustness, scalability, and performance.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
