
Evolution of Zhihu's Application Deployment System: From Physical Machines to Cloud‑Native Kubernetes

This article details the design and evolution of Zhihu's deployment platform, covering its early physical‑machine system, the transition to container orchestration with Mesos and Kubernetes, and advanced features such as blue‑green and canary releases, pre‑deployment, and branch deployments that enable rapid, reliable continuous delivery for large‑scale internet services.


Application deployment is a critical part of software development, especially for internet companies that need fast iteration and continuous delivery while minimizing change and error costs. This article introduces the evolution of Zhihu's deployment platform from its inception to its current state, offering practical insights.

Zhihu's deployment system, built by the Engineering Efficiency team, serves almost all business services with roughly 2,000 daily deployments. With blue‑green deployment enabled, most production releases complete in under 10 seconds (excluding canary verification).

Supports container and physical‑machine deployments, covering online services, offline services, scheduled tasks, and static files.

Provides office‑network pre‑release capability.

Offers canary verification with fault detection and automatic rollback.

Enables blue‑green deployment with second‑level switch‑over and rollback.

Allows deployment of Merge Request code for debugging.

Technical Background

Before describing the deployment system, here is a brief overview of Zhihu's infrastructure and network topology.

Zhihu Network Layout

The network is divided into three isolated parts:

Production network: external online servers, fully isolated for security.

Testing network: isolated from production; used for pre‑deployment testing.

Office network: internal staff network that can access both testing and production via jump hosts.

Traffic Management

Zhihu routes traffic through Nginx and HAProxy: developers configure location blocks in Nginx, which forwards requests to HAProxy; HAProxy maps traffic to the real servers and also handles load balancing, rate limiting, and circuit breaking.

Continuous Integration

Jenkins + Docker are used for CI; the CI process generates immutable artifacts that serve as the basis for deployments.

Physical‑Machine Deployment

Initially, Zhihu relied on physical‑machine deployments with custom scripts, which were slow, risky, and hard to roll back. Around 2015, the first deployment system named nami (inspired by the One Piece character) was created.

nami used Fabric to upload CI artifacts to physical machines, extract them, and manage the resulting processes with Supervisor.

Application (App) and Service (Unit)

Each GitLab repository corresponds to an application, but a single codebase may run multiple services (e.g., API, scheduled tasks, Celery workers). Users configure start commands, parameters, and environment variables for each Unit via the deployment UI.
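The App/Unit split can be sketched as a small data model. This is illustrative only: the field names and example values are assumptions, not Zhihu's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    """One service carved out of an application's codebase."""
    name: str                       # e.g. "api", "celery-worker"
    start_command: str              # command handed to the process manager
    env: dict = field(default_factory=dict)

@dataclass
class App:
    """One application, mapped 1:1 to a GitLab repository."""
    repo: str
    units: list = field(default_factory=list)

# A single codebase running two Units with different start commands.
app = App(repo="gitlab.example.com/zhihu/qa-service")
app.units.append(Unit(name="api", start_command="gunicorn app:wsgi", env={"PORT": "8000"}))
app.units.append(Unit(name="worker", start_command="celery -A app worker"))

print([u.name for u in app.units])  # → ['api', 'worker']
```

The key design point is that the repository is the unit of code ownership while the Unit is the unit of deployment, so each Unit can scale and restart independently.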

Candidate Version

Every deployment is based on a CI‑generated artifact, called a Candidate version. Typically, a Candidate corresponds to a Merge Request.

Deployment Stage

Deployments are split into multiple stages (e.g., Build, Test, Office, Canary 1, Canary 2, Production). Each stage can be set to auto‑deploy, enabling continuous deployment pipelines.
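The staged pipeline above can be modeled as an ordered list with per-stage auto-deploy flags. The stage names come from the article; the advance logic is a hedged assumption about how such a pipeline might behave.

```python
# Stages in pipeline order, as named in the article.
STAGES = ["Build", "Test", "Office", "Canary 1", "Canary 2", "Production"]

def next_stage(current, auto_deploy):
    """Return the stage to advance to automatically, or None if the
    pipeline should pause for a manual trigger (or is already done)."""
    idx = STAGES.index(current)
    if idx + 1 >= len(STAGES):
        return None                       # already in Production
    upcoming = STAGES[idx + 1]
    return upcoming if auto_deploy.get(upcoming, False) else None

# Auto-deploy through Test and Office, but require a human for Canary 1.
auto = {"Test": True, "Office": True, "Canary 1": False}
print(next_stage("Build", auto))   # → 'Test'
print(next_stage("Office", auto))  # → None (manual approval needed)
```

Marking every stage auto-deploy turns the same pipeline into full continuous deployment without changing its structure.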

Service Registration and Discovery

Before deploying to a physical machine, the host is removed from Consul; after deployment it is re‑registered. HAProxy configuration is updated via Consul‑Template, and a custom library diplomat pulls service lists from Consul for RPC and other use cases.
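The deregister-deploy-reregister dance can be sketched with a toy in-memory registry standing in for Consul; the function and class names here are hypothetical, and a real implementation would call the Consul HTTP API while Consul-Template regenerates the HAProxy config.

```python
class Registry:
    """Toy stand-in for Consul's healthy-service catalog."""
    def __init__(self):
        self.healthy = set()
    def deregister(self, host):
        self.healthy.discard(host)   # Consul-Template drops it from HAProxy
    def register(self, host):
        self.healthy.add(host)       # Consul-Template adds it back

def deploy_to_host(registry, host, deploy_artifact):
    registry.deregister(host)        # drain traffic before touching the host
    deploy_artifact(host)            # if this raises, the host stays drained
    registry.register(host)          # re-enter rotation only on success

reg = Registry()
reg.register("web-1")
deploy_to_host(reg, "web-1", lambda host: None)  # no-op deploy for the sketch
print("web-1" in reg.healthy)  # → True
```

Registering only after a successful deploy is what keeps a half-deployed host from ever receiving traffic.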

Container Deployment

Legacy Container System (Bay)

In late 2015, Zhihu adopted Mesos and built an initial container orchestration system called Bay, which supported rolling updates but could take up to 18 minutes for large groups.

Feature Enhancements

Health checks (/check_health) were added for HTTP/RPC services, and online/offline services were split to use rolling or full‑replace strategies.
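A /check_health endpoint of the kind described can be sketched as a minimal WSGI app; the dependency check here is a placeholder, and a real service would ping its database, caches, and downstream RPC targets.

```python
def check_health():
    """Return True when the service and its dependencies are usable.
    Placeholder: a real check would probe the DB, caches, etc."""
    return True

def app(environ, start_response):
    """Minimal WSGI app exposing the health-check route."""
    if environ["PATH_INFO"] == "/check_health":
        healthy = check_health()
        status = "200 OK" if healthy else "503 Service Unavailable"
        start_response(status, [("Content-Type", "text/plain")])
        return [b"ok" if healthy else b"unhealthy"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

The rolling updater can then treat a non-200 response as "do not move on to the next batch", which is what makes the rolling strategy safe for online services.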

Pre‑Release and Canary Release

Office Network Pre‑Release

Traffic from the office network is split at the Nginx layer to a dedicated HAProxy, allowing validation of changes before they reach external users.

Canary Release

Two canary stages (1% and 20% of production containers) were introduced between the office and production stages. Automated canary monitoring compares metrics against production; if anomalies are detected, the canary containers are destroyed and developers are notified. If no issues appear within six minutes, the production stage proceeds.
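The automated canary decision above can be sketched as a single comparison against the production baseline. The metric choice and tolerance are illustrative assumptions, not Zhihu's actual thresholds.

```python
def canary_verdict(canary_error_rate, prod_error_rate, tolerance=0.005):
    """Return 'rollback' if the canary is clearly worse than production,
    else 'proceed'. Tolerance absorbs normal metric noise."""
    if canary_error_rate > prod_error_rate + tolerance:
        return "rollback"    # destroy canary containers, notify developers
    return "proceed"         # after the soak window, continue to production

print(canary_verdict(0.020, 0.002))  # → 'rollback'
print(canary_verdict(0.003, 0.002))  # → 'proceed'
```

In practice this check would run repeatedly during the six-minute soak window, with any single "rollback" verdict aborting the release.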

New Container Deployment

To address Bay's speed and stability issues, the orchestration layer was migrated from Mesos to Kubernetes, producing a new system, NewBay, with faster deployments and higher reliability.

Blue‑Green Deployment

NewBay implements true blue‑green deployment: new and old container groups coexist, and HAProxy switches traffic atomically, allowing second‑level rollbacks.
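The blue-green mechanism can be modeled as two container groups and a single active pointer standing in for the HAProxy backend selection; this is a toy sketch, and the group names are invented for illustration.

```python
class BlueGreen:
    """Toy model: two container groups coexist and one pointer
    (standing in for the HAProxy config) decides which serves traffic."""
    def __init__(self, live_group, idle_group):
        self.groups = {"blue": live_group, "green": idle_group}
        self.active = "blue"
    def switch(self):
        # One atomic assignment: this is why switch-over and rollback
        # both complete in seconds.
        self.active = "green" if self.active == "blue" else "blue"
    def serving(self):
        return self.groups[self.active]

bg = BlueGreen(live_group="v41-containers", idle_group="v42-containers")
bg.switch()              # release: point traffic at the new group
print(bg.serving())      # → 'v42-containers'
bg.switch()              # rollback is the same second-level operation
print(bg.serving())      # → 'v41-containers'
```

Because the old group keeps running until the release is confirmed, rollback never waits on container scheduling.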

Pre‑Deployment

During the canary phase, full‑production containers are started asynchronously so that the final production switch only needs to redirect traffic, reducing total rollout time to seconds.

Branch Deployment

Deployments can also be triggered for Merge Requests, enabling developers to test changes in isolated containers before merging to the main branch.

Platformization of the Deployment System

The entire workflow is encapsulated in Zhihu App Engine (ZAE), a developer platform that provides UI for monitoring deployment progress, logs, and common operations.

Overall, Zhihu's deployment system has matured since 2015, playing a vital role in accelerating business iteration, reducing failures, and shaping the company's product release cadence.

Tags: CI/CD, operations, deployment, Kubernetes, continuous deployment, canary release, blue-green
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
