Evolution of Zhihu's Deployment System: From Physical Machines to Cloud‑Native Blue‑Green Deployments
This article details the evolution of Zhihu's deployment platform, covering its early physical‑machine system, the transition to container orchestration with Kubernetes, and the implementation of blue‑green, canary, and pre‑release strategies that enable fast, reliable continuous deployment.
Application deployment is a crucial part of software development, especially for internet companies that need fast iteration and low change cost.
Technical Background
Before describing the deployment system, this section introduces Zhihu's basic infrastructure and network topology.
Zhihu Network
Zhihu's network consists of three isolated parts: production environment, test environment, and office network.
Zhihu network diagram
Traffic Management
Zhihu uses Nginx + HAProxy to route traffic. Developers configure locations in Nginx, HAProxy performs load balancing, rate limiting, and circuit breaking, and forwards traffic to real servers.
Online traffic architecture
Continuous Integration
Zhihu employs Jenkins + Docker for CI; artifacts generated by CI are used by the deployment system.
Physical‑Machine Deployment
Initially, Zhihu deployed to physical machines with scripts, a process that was slow and risky. In 2015 the first deployment system, “nami”, was built on Fabric and Supervisor.
Physical‑machine deployment
App and Unit
Each GitLab repository corresponds to an application, but a single codebase may run multiple units (API service, scheduled tasks, Celery workers), each with its own start command and parameters.
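The app/unit relationship can be pictured as a small data model. The following is only a sketch: the class names, the repository path, and the start commands are hypothetical and are not taken from Zhihu's system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Unit:
    """One runnable piece of an app, with its own start command."""
    name: str
    command: str


@dataclass
class App:
    """One application, corresponding to one repository."""
    repo: str
    units: List[Unit] = field(default_factory=list)

    def unit(self, name: str) -> Unit:
        """Look up a unit by name, or raise KeyError if it does not exist."""
        for candidate in self.units:
            if candidate.name == name:
                return candidate
        raise KeyError(name)


# Hypothetical example: one codebase, three units.
app = App(
    repo="example/web",
    units=[
        Unit("api", "gunicorn example.wsgi"),
        Unit("cron", "python -m example.cron"),
        Unit("worker", "celery -A example worker"),
    ],
)
```

All three units are built from the same candidate version, so they differ only in how the process is started.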
Candidate Version
Every CI artifact represents an immutable candidate version, typically linked to a Merge Request.
Candidate version list
Deployment Stages
Deployments are split into stages: Build, Test, Office, Canary‑1, Canary‑2, Production. Each stage can be set to auto‑deploy, enabling continuous deployment.
Build/Deploy stages
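One way to read the auto-deploy setting is as a gate between consecutive stages. The sketch below assumes that a triggered stage always runs and that the pipeline continues only while the next stage is marked auto-deploy; the function name and the dictionary format are illustrative, not the platform's actual interface.

```python
# Stage names as listed in the article.
STAGES = ["Build", "Test", "Office", "Canary-1", "Canary-2", "Production"]


def advance(auto_deploy, start=0):
    """Return the stages that run after triggering STAGES[start].

    auto_deploy maps a stage name to True when that stage may start
    without a manual trigger (a hypothetical configuration format).
    """
    ran = []
    for stage in STAGES[start:]:
        # The triggered stage always runs; later ones need auto-deploy.
        if ran and not auto_deploy.get(stage, False):
            break
        ran.append(stage)
    return ran
```

With every stage set to auto-deploy, a single merge flows all the way to Production; with none set, each stage waits for a manual trigger.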
Service Registration and Discovery
Consul and HAProxy Template are used for service registration; a library “diplomat” pulls service lists from Consul for RPC.
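The client side of discovery can be sketched as a lookup followed by a choice among instances. The code below is not the diplomat API, which the article does not describe; the class, the round-robin policy, and the in-memory registry standing in for Consul are all assumptions made for illustration.

```python
import itertools


class ServiceDirectory:
    """Caches instance lists per service and hands them out round-robin."""

    def __init__(self, fetch):
        # fetch: callable taking a service name and returning a list of
        # "host:port" strings (a stand-in for a query to the registry).
        self._fetch = fetch
        self._cursors = {}

    def refresh(self, service):
        """Pull the current instance list for a service from the registry."""
        instances = self._fetch(service)
        self._cursors[service] = itertools.cycle(instances) if instances else None

    def pick(self, service):
        """Return the next instance address for an RPC call."""
        if service not in self._cursors:
            self.refresh(service)
        cursor = self._cursors[service]
        if cursor is None:
            raise LookupError(f"no instances registered for {service}")
        return next(cursor)


# Hypothetical registry contents; a real client would query Consul instead.
registry = {"member-api": ["10.0.0.1:8000", "10.0.0.2:8000"]}
directory = ServiceDirectory(lambda name: registry.get(name, []))
```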
Container Deployment
Legacy Container System “Bay”
In late 2015 Zhihu adopted containers on Mesos with the “Bay” system. Bay supported rolling updates, but deploying a large container group could take up to 18 minutes.
Bay container deployment
Feature Enhancements
Health checks, separation of online/offline services, and other improvements were added.
Pre‑Release and Canary Release
Office Pre‑Release
Traffic from the office network is split to a dedicated HAProxy, allowing early testing of merged code before public release.
Office traffic split
Canary Release
Two canary stages (1% and 20% of production traffic) are inserted before full production rollout. Automatic rollback is triggered if metrics deviate.
Canary monitoring
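The two canary stages and the rollback check can be sketched as follows. In the article the traffic split is handled by the load-balancing layer; the hash-based bucketing here is only a stand-in for it, and the error-rate metric and the tolerance value are assumptions, since the article does not say which metrics are compared.

```python
import hashlib

# Share of production traffic per canary stage, as given in the article.
CANARY_WEIGHTS = {"Canary-1": 0.01, "Canary-2": 0.20}


def routes_to_canary(request_id, stage):
    """Deterministically place a request in the canary group or not."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < CANARY_WEIGHTS[stage]


def should_rollback(baseline_error_rate, canary_error_rate, tolerance=0.005):
    """Roll back when the canary's error rate exceeds the baseline by more
    than the tolerance (a hypothetical threshold)."""
    return canary_error_rate - baseline_error_rate > tolerance
```

Hashing the request identifier keeps the split stable, so the same request key always lands in the same group while a stage is active.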
New Container Deployment
Bay was replaced by “NewBay” on Kubernetes, bringing faster deployments and blue‑green capabilities.
Blue‑Green Deployment
NewBay keeps the old and new container groups running side by side and switches traffic between them atomically via HAProxy, so a rollback completes within seconds.
Blue‑green deployment
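The reason a rollback is fast under this scheme is that the old group is never torn down at switch time, so rolling back is just another switch. A minimal in-process sketch of that idea, with a hypothetical router class standing in for the HAProxy backend switch:

```python
import threading


class BlueGreenRouter:
    """Holds an active and a standby backend group and swaps them atomically."""

    def __init__(self, active_backends):
        self._lock = threading.Lock()
        self._active = list(active_backends)
        self._standby = []

    def stage(self, new_backends):
        """Register the newly started group without sending it traffic."""
        with self._lock:
            self._standby = list(new_backends)

    def switch(self):
        """Swap active and standby. Calling it again rolls back."""
        with self._lock:
            if not self._standby:
                raise RuntimeError("no standby group to switch to")
            self._active, self._standby = self._standby, self._active

    def backends(self):
        """Return the group currently receiving traffic."""
        with self._lock:
            return list(self._active)
```

Because both groups stay up, the cost of a rollback is the cost of the swap itself rather than the cost of restarting containers.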
Pre‑Deployment
Containers for the production stage are started asynchronously during canary phases, reducing final rollout time to seconds.
Pre‑deployment reduces launch time
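The overlap between the canary phase and container startup can be sketched with a background task. The function names and sleep calls below are placeholders for the real, much slower operations, and the sketch assumes a single canary check rather than the two stages described earlier.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def start_containers(version, count):
    """Stand-in for starting the production container group."""
    time.sleep(0.05)
    return [f"{version}-{i}" for i in range(count)]


def run_canary(version):
    """Stand-in for the canary phase; returns True when metrics look healthy."""
    time.sleep(0.05)
    return True


def deploy(version, count):
    """Start production containers in the background while the canary runs."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(start_containers, version, count)
        canary_ok = run_canary(version)
        containers = pending.result()  # usually finished by this point
    # On canary failure a real system would tear the pre-started group down.
    return containers if canary_ok else []
```

By the time the canary phase passes, the production group is already running, so the final step reduces to switching traffic.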
Branch Deployment
Deployments can also be triggered for merge‑request branches, enabling developers and QA to test changes before merging to main.
Multiple MR deployments
Platformization – ZAE
All development processes are unified into the Zhihu App Engine (ZAE), providing a UI for deployment progress, logs, and operations.
ZAE developer platform
Conclusion
The Zhihu deployment system, now mature, demonstrates how a well‑designed deployment platform accelerates business iteration, improves reliability, and influences product release cadence.