How Zhihu Scaled from 2 Engineers to 100+ Servers: Backend Architecture Lessons
This article recounts Zhihu's evolution from a two‑person startup to a platform serving over 80 million monthly users, detailing the early Python/Tornado choices, high availability, custom logging, event‑driven design, rendering optimizations, and the transition to a service‑oriented architecture.
Zhihu is the third‑largest user‑generated content (UGC) community in China, after Baidu Tieba and Douban. It started with a handful of engineers and now runs on more than 100 servers, with more than 11 million registered users and monthly page views exceeding 2.2 billion.
Early Architecture Choices
In October 2010 the product was being built by two engineers; by December the team had grown to four. Python was chosen as the core language for its simplicity and strong community, and the Tornado framework was chosen for its asynchronous capabilities, lightweight nature, and support for long‑polling (comet), which the real‑time feed required.
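Long‑polling means the server holds a request open until new data arrives or a timeout expires, and the client then immediately polls again. The following is a minimal sketch of that idea using only Python's standard `asyncio` rather than Tornado itself; the `Feed` class and its method names are hypothetical and are not taken from Zhihu's code.

```python
import asyncio


class Feed:
    """Hypothetical in-memory feed that clients can long-poll."""

    def __init__(self):
        self._items = []
        self._changed = asyncio.Condition()

    async def publish(self, item):
        async with self._changed:
            self._items.append(item)
            self._changed.notify_all()  # wake every parked long-poll request

    async def poll(self, cursor, timeout):
        """Return (new_items, new_cursor), waiting up to `timeout` seconds."""
        async with self._changed:
            try:
                await asyncio.wait_for(
                    self._changed.wait_for(lambda: len(self._items) > cursor),
                    timeout,
                )
            except asyncio.TimeoutError:
                pass  # nothing new: reply with an empty batch so the client re-polls
            return self._items[cursor:], len(self._items)
```

A client keeps the returned cursor and passes it to the next `poll` call, so each request only receives items it has not seen yet.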
The first server was a 512 MB Linode VM, but rapid user growth and network latency forced a move to self‑hosted hardware. Early hardware suffered frequent crashes, prompting the implementation of web and database high‑availability with master‑slave replication and read‑write separation.
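Read‑write separation can be pictured as a thin routing layer: writes go to the master, reads are spread across replicas, and a replica can be promoted if the master is lost. The sketch below is a toy in‑memory model under those assumptions; `ReplicatedStore` is a hypothetical name, and real replication is asynchronous and can lag, which this example ignores.

```python
import itertools


class ReplicatedStore:
    """Toy key-value store: one master takes writes, replicas serve reads."""

    def __init__(self, replica_count):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        self.master[key] = value
        # Assumption: replicate inline. A real database ships changes asynchronously.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        if not self.replicas:
            return self.master.get(key)  # no replica left: fall back to the master
        replica = self.replicas[next(self._next_replica)]  # round-robin
        return replica.get(key)

    def promote(self, index):
        """Fail over: replica `index` becomes the new master."""
        self.master = self.replicas.pop(index)
        self._next_replica = itertools.cycle(range(len(self.replicas)))
```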
At this stage the architecture featured master‑slave databases, a third server for offline scripts, and upgraded internal network equipment that increased throughput twenty‑fold.
Redis became a critical component for queues, search, and caching; sharding and consistency mechanisms were added as single‑node storage became a bottleneck.
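The talk does not say which sharding scheme Zhihu used. One common option is consistent hashing, where adding a node moves only the keys that now belong to that node. The sketch below illustrates that technique as an assumption, not as Zhihu's implementation; `HashRing` and the node names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother key distribution."""

    def __init__(self, nodes, vnodes=64):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

With this layout, growing the ring from three nodes to four relocates only the keys that land on the new node, so most cached data stays where it was.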
Logging System
When Zhihu opened registration to the public in late 2011, dealing with spam accounts called for a robust logging solution. The team's requirements were distributed collection, centralized storage, real‑time processing, subscription support, and simplicity.
Existing open‑source options (Scribe, Kafka, Flume) were judged unsuitable, so Zhihu built its own system called Kids (Kids Is Data Stream). Kids follows Scribe's model: an agent on each server forwards logs to a central server, which supports both pull and push subscriptions.
Kids also powers a web tool, Kids Explorer, for real‑time log inspection and has been open‑sourced on GitHub.
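The agent‑to‑central‑server model with push and pull subscriptions can be sketched in a few lines. This is an in‑process illustration only: `LogServer`, `Agent`, and their methods are hypothetical names, not the Kids API, and the network hop between agent and server is replaced by a direct method call.

```python
from collections import defaultdict, deque


class LogServer:
    """Central hub: buffers recent lines per topic and notifies push subscribers."""

    def __init__(self, history=1000):
        self._buffers = defaultdict(lambda: deque(maxlen=history))
        self._subscribers = defaultdict(list)

    def receive(self, topic, line):
        self._buffers[topic].append(line)
        for callback in self._subscribers[topic]:  # push model
            callback(line)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def pull(self, topic, count):
        """Pull model: hand over up to `count` buffered lines, oldest first."""
        buf = self._buffers[topic]
        return [buf.popleft() for _ in range(min(count, len(buf)))]


class Agent:
    """Runs on each application server and forwards local log lines to the hub."""

    def __init__(self, host, server):
        self.host = host
        self.server = server

    def log(self, topic, message):
        self.server.receive(topic, f"{self.host} {message}")
```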
Event‑Driven Architecture
As features grew, maintaining procedural update logic became untenable. Zhihu introduced an event‑driven design, starting with a message queue that guarantees consistency and durability.
The Sink component persists incoming events locally before dispatching them. Failed machines can recover without loss. Sink hands tasks to the Miller framework, which enqueues them in Beanstalkd for parallel processing.
Initially the system received 10 events per second, which fanned out into 70 tasks (about 7 tasks per event). After scaling, it processes 100 events per second, generating 1,500 tasks (about 15 tasks per event).
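Sink's persist‑before‑dispatch behavior resembles a write‑ahead journal: an event is appended to local disk first and only then handed to downstream workers, so a restarted process can replay anything that was accepted. Below is a minimal sketch under that assumption. `DurableSink` is a hypothetical name, and the dispatch offset is kept in memory here, so after a crash every journaled event is replayed and handlers would need to be idempotent.

```python
import json
import os


class DurableSink:
    """Append events to a local journal, then dispatch them in order."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.dispatched = 0  # number of journal entries already handed off

    def accept(self, event):
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps(event) + "\n")
            journal.flush()
            os.fsync(journal.fileno())  # force to disk before acknowledging

    def dispatch_pending(self, handler):
        """Hand every not-yet-dispatched event to `handler`; return the count sent."""
        sent = 0
        with open(self.journal_path, encoding="utf-8") as journal:
            for index, line in enumerate(journal):
                if index < self.dispatched:
                    continue
                handler(json.loads(line))
                self.dispatched = index + 1
                sent += 1
        return sent
```

In the pipeline described here, the `handler` role would be played by the step that passes tasks to Miller and on to Beanstalkd.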
Page Rendering Optimization
By 2013 Zhihu served millions of daily page views, making rendering both CPU‑ and I/O‑intensive. The team modularized the front‑end into components and introduced a hierarchical data‑fetch strategy, reducing redundant requests.
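One way to read the component‑plus‑hierarchical‑fetch idea: each component declares the data it needs, the page collects those needs across the whole tree, and each distinct item is fetched exactly once before rendering. The sketch below shows that interpretation only; `Component`, `render_page`, and the data keys are hypothetical, and the talk does not describe the actual mechanism at this level.

```python
class Component:
    """A page fragment that declares which data keys it needs."""

    def __init__(self, name, needs=(), children=()):
        self.name = name
        self.needs = tuple(needs)
        self.children = tuple(children)

    def collect_needs(self):
        keys = set(self.needs)
        for child in self.children:
            keys |= child.collect_needs()
        return keys

    def render(self, data):
        own = ",".join(f"{key}={data[key]}" for key in self.needs)
        inner = "".join(child.render(data) for child in self.children)
        return f"<{self.name} {own}>{inner}</{self.name}>"


def render_page(root, fetch):
    # Fetch each distinct key once, even if several components ask for it.
    data = {key: fetch(key) for key in sorted(root.collect_needs())}
    return root.render(data)
```

Here three components may all need the current viewer, but `fetch("viewer")` runs a single time per page.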
A custom template rendering framework called ZhihuNode was built, cutting the load time of the question page from 500 ms to 150 ms and that of the feed page from 1 s to 600 ms.
Service‑Oriented Architecture (SOA)
Growing functionality led Zhihu to adopt SOA. The evolution of its RPC framework started with Wish (strict serialization, custom STP protocol), moved to Snow (JSON‑based, flexible but loosely defined), and finally to a pluggable framework supporting both JSON and Apache Avro with interchangeable transport layers.
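The value of a pluggable design is that the serialization format and the transport can be swapped without touching the calling code. The following sketch shows that separation with a JSON codec and an in‑process transport. All class names are hypothetical; an Avro codec would need a third‑party library and is omitted, though it would implement the same `encode`/`decode` pair.

```python
import json


class JsonCodec:
    """Serialization plug-in: any object with encode/decode can replace it."""

    def encode(self, obj):
        return json.dumps(obj).encode("utf-8")

    def decode(self, data):
        return json.loads(data.decode("utf-8"))


class LoopbackTransport:
    """Transport plug-in: passes request bytes straight to a server callable."""

    def __init__(self, server):
        self._server = server

    def send(self, payload):
        return self._server(payload)


class RpcServer:
    def __init__(self, codec):
        self._codec = codec
        self._methods = {}

    def expose(self, name, func):
        self._methods[name] = func

    def __call__(self, payload):
        request = self._codec.decode(payload)
        result = self._methods[request["method"]](*request["params"])
        return self._codec.encode({"result": result})


class RpcClient:
    def __init__(self, codec, transport):
        self._codec = codec
        self._transport = transport

    def call(self, method, *params):
        payload = self._codec.encode({"method": method, "params": list(params)})
        return self._codec.decode(self._transport.send(payload))["result"]
```

Replacing `LoopbackTransport` with a TCP or HTTP transport that exposes the same `send` method would leave `RpcClient.call` unchanged.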
A service registry enables discovery by name, and a tracing system built on Zipkin provides observability.
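Discovery by name boils down to a mapping from a service name to its live instances. The sketch below is an in‑memory illustration under that assumption; `ServiceRegistry`, the service name, and the addresses are hypothetical, and a production registry would also need health checks and replication.

```python
import random


class ServiceRegistry:
    """Maps a service name to the addresses of its registered instances."""

    def __init__(self):
        self._instances = {}

    def register(self, name, address):
        self._instances.setdefault(name, set()).add(address)

    def deregister(self, name, address):
        self._instances.get(name, set()).discard(address)

    def lookup(self, name):
        """Return one registered address for `name`, chosen at random."""
        instances = sorted(self._instances.get(name, ()))
        if not instances:
            raise LookupError(f"no live instance of {name!r}")
        return random.choice(instances)
```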
Services are organized into three layers—aggregation, content, and foundation—and classified as data, logic, or channel services. Data services handle specialized storage (e.g., images), logic services perform CPU‑heavy processing, and channel services act as routers (e.g., Sink).
The talk also mentioned new practices for Zhihu Columns built with AngularJS. A video of the presentation will be released on the site.
ITFLY8 Architecture Home