
Evolution of Zhihu's Deployment System: From Physical Machines to Cloud‑Native Blue‑Green Deployments

This article details the evolution of Zhihu's deployment platform, covering its early physical‑machine system, the transition to container orchestration with Kubernetes, and the implementation of blue‑green, canary, and pre‑release strategies that enable fast, reliable continuous deployment.

Top Architect

Application deployment is a crucial part of software development, especially for internet companies that need fast iteration and low change cost. This article introduces the evolution of Zhihu's deployment platform from its early physical‑machine system to the current cloud‑native solution.

Technical Background

Before describing the deployment system itself, here is a brief overview of Zhihu's basic infrastructure and network topology.

Zhihu Network

Zhihu's network consists of three isolated parts: production environment, test environment, and office network.

Zhihu network diagram

Traffic Management

Zhihu uses Nginx + HAProxy to route traffic: developers configure locations in Nginx, while HAProxy performs load balancing, rate limiting, and circuit breaking before forwarding requests to the real servers.
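As a concrete illustration of this two-tier routing, an Nginx location of the kind described might forward a path prefix to an HAProxy frontend. All names and addresses below are hypothetical, not Zhihu's actual configuration:

```nginx
# Hypothetical: Nginx terminates the request and hands it to HAProxy,
# which does load balancing / rate limiting / circuit breaking.
upstream haproxy_frontend {
    server 10.0.0.10:8080;
    server 10.0.0.11:8080;
}

server {
    listen 80;
    location /api/ {
        proxy_pass http://haproxy_frontend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```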

Online traffic architecture

Continuous Integration

Zhihu employs Jenkins + Docker for CI; artifacts generated by CI are used by the deployment system.

Physical‑Machine Deployment

Initially Zhihu deployed on physical machines using scripts, which were slow and risky. In 2015 the first deployment system “nami” was built on Fabric and Supervisor.
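A system in the spirit of nami boils down to a fixed list of shell steps that Fabric executes on each host and Supervisor uses to restart the process. The sketch below only generates those steps; the paths, artifact URL shape, and unit names are hypothetical:

```python
# Minimal sketch of a Fabric/Supervisor-style deploy: fetch the CI
# artifact, unpack it, restart the unit. With Fabric 1.x each command
# would be executed remotely via run(cmd) over the hosts in env.hosts.

def deploy_commands(artifact_url, unit):
    """Return the shell steps run on each host to roll out one unit."""
    return [
        "curl -sfo /tmp/release.tar.gz %s" % artifact_url,  # fetch CI artifact
        "tar xzf /tmp/release.tar.gz -C /srv/%s" % unit,    # unpack release
        "supervisorctl restart %s" % unit,                  # restart under Supervisor
    ]
```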

Physical‑machine deployment

App and Unit

Each GitLab repository corresponds to an application, but a single codebase may run multiple units (API service, scheduled tasks, Celery workers), each with its own start command and parameters.
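One repository with several units is naturally expressed as a per-app config file. The shape below is an assumption for illustration (the key names and commands are invented), but it captures the idea of one codebase, many start commands:

```yaml
# Hypothetical unit definitions for one application repository.
units:
  api:                # HTTP service
    cmd: "gunicorn app.wsgi -w 8 -b 0.0.0.0:8000"
  cron:               # scheduled tasks
    cmd: "python manage.py run_scheduled_tasks"
  worker:             # async jobs
    cmd: "celery -A app worker -c 16"
```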

Candidate Version

Every CI artifact represents an immutable candidate version, typically linked to a Merge Request.

Candidate version list

Deployment Stages

Deployments are split into stages: Build, Test, Office, Canary‑1, Canary‑2, Production. Each stage can be set to auto‑deploy, enabling continuous deployment.
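The stage progression with per-stage auto-deploy flags can be sketched as a simple state machine. The stage names follow the article; the decision logic itself is an assumption:

```python
# Sketch: advance to the next stage automatically only if that stage is
# flagged for auto-deploy; otherwise wait for a human to promote.
STAGES = ["Build", "Test", "Office", "Canary-1", "Canary-2", "Production"]

def next_stage(current, auto_deploy):
    """Return the next stage to run automatically, or None to wait."""
    i = STAGES.index(current)
    if i + 1 < len(STAGES) and auto_deploy.get(STAGES[i + 1], False):
        return STAGES[i + 1]
    return None
```

Marking every stage auto-deploy yields full continuous deployment; leaving Production unflagged keeps a manual gate before the final rollout.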

Build/Deploy stages

Service Registration and Discovery

Service registration and discovery are built on Consul and HAProxy: HAProxy configuration is regenerated from templates as service instances change, and a client library, "diplomat", pulls service lists from Consul for RPC calls.
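The client-side half of this can be sketched against Consul's real HTTP API: `/v1/health/service/<name>` returns a list of instances with their health checks, and a library like diplomat would keep only the passing ones. The function below parses that response shape:

```python
# Sketch of what a discovery client does with a Consul health response:
# keep instances whose checks all pass, and emit host:port endpoints.

def healthy_endpoints(health_json):
    """Extract host:port pairs for passing instances."""
    endpoints = []
    for entry in health_json:
        if all(check["Status"] == "passing" for check in entry["Checks"]):
            svc = entry["Service"]
            endpoints.append("%s:%d" % (svc["Address"], svc["Port"]))
    return endpoints
```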

Container Deployment

Legacy Container System “Bay”

In late 2015, Zhihu moved to containers with the "Bay" system, built on Mesos. Bay supported rolling updates, but deploying a large container group could take up to 18 minutes.
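The slowness follows directly from the rolling-update model: with a fixed batch size, total time grows linearly in the number of containers. The numbers below are illustrative, not Bay's actual parameters:

```python
# Back-of-the-envelope for rolling-update duration: containers are
# replaced batch by batch, and each batch must start and pass checks
# before the next begins.

def rolling_update_minutes(containers, batch_size, minutes_per_batch):
    batches = -(-containers // batch_size)  # ceiling division
    return batches * minutes_per_batch
```

For example, 300 containers replaced 25 at a time, at 1.5 minutes per batch, already adds up to 18 minutes.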

Bay container deployment

Feature Enhancements

Health checks, separation of online/offline services, and other improvements were added.

Pre‑Release and Canary Release

Office Pre‑Release

Traffic from the office network is split to a dedicated HAProxy, allowing early testing of merged code before public release.
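Splitting on source network is a standard HAProxy ACL pattern; a frontend of the kind described might look like the following, with the office CIDR and backend names purely illustrative:

```haproxy
# Hypothetical: requests originating from the office network are routed
# to a pre-release backend; everyone else sees production.
frontend www
    bind *:80
    acl from_office src 192.168.0.0/16
    use_backend pre_release if from_office
    default_backend production

backend pre_release
    server staging1 10.1.0.10:8000 check

backend production
    server prod1 10.0.0.10:8000 check
```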

Office traffic split

Canary Release

Two canary stages (1% and 20% of production traffic) are inserted before full production rollout. Automatic rollback is triggered if metrics deviate.
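An automatic-rollback check at its simplest compares the canary group's metrics against the stable group's with some tolerance. The threshold logic below is an assumption for illustration, not Zhihu's actual rule:

```python
# Sketch: during a canary stage, roll back if the canary's error rate
# exceeds the stable baseline by more than a fixed tolerance. Real
# systems would also look at latency and compare over a time window.

def should_roll_back(canary_error_rate, baseline_error_rate, tolerance=0.005):
    """True if the canary errs noticeably more than the baseline."""
    return canary_error_rate > baseline_error_rate + tolerance
```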

Canary monitoring

New Container Deployment

Bay was replaced by “NewBay” on Kubernetes, bringing faster deployments and blue‑green capabilities.

Blue‑Green Deployment

NewBay keeps old and new container groups simultaneously and switches traffic atomically via HAProxy, achieving second‑level rollbacks.
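The key property is that both container groups exist simultaneously and "deploy" is a single pointer flip, so rollback is the same flip in reverse. The toy model below captures that shape; in NewBay the flip happens in HAProxy, not in application code:

```python
# Toy model of blue-green switching: two fully started groups, one
# atomic switch of the active pointer, rollback in O(1).

class BlueGreen:
    def __init__(self, blue, green):
        self.groups = {"blue": blue, "green": green}
        self.active = "blue"

    def switch(self):
        """Atomically move traffic to the idle group (also the rollback path)."""
        self.active = "green" if self.active == "blue" else "blue"
        return self.groups[self.active]
```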

Blue‑green deployment

Pre‑Deployment

Containers for the production stage are started asynchronously during canary phases, reducing final rollout time to seconds.
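The idea is that container startup overlaps with the canary wait instead of following it. A minimal sketch of such asynchronous pre-starting, with the container-start callable left abstract:

```python
# Sketch of pre-deployment: kick off production container startup in the
# background while canary stages run; the final promotion only has to
# await the futures and flip traffic.
from concurrent.futures import ThreadPoolExecutor

def predeploy(start_container, count):
    """Start `count` containers asynchronously; return futures to await later."""
    pool = ThreadPoolExecutor(max_workers=count)
    return [pool.submit(start_container, i) for i in range(count)]
```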

Pre‑deployment reduces launch time

Branch Deployment

Deployments can also be triggered for merge‑request branches, enabling developers and QA to test changes before merging to main.

Multiple MR deployments

Platformization – ZAE

All development processes are unified into the Zhihu App Engine (ZAE), providing a UI for deployment progress, logs, and operations.

ZAE developer platform

Conclusion

The Zhihu deployment system, now mature, demonstrates how a well‑designed deployment platform accelerates business iteration, improves reliability, and influences product release cadence.

Tags: Operations, Deployment, Kubernetes, Continuous Integration, Blue-Green, Canary
Written by Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
