How Zhihu Scaled to 3,000 Jenkins Jobs with Docker‑Powered CI Pipelines
Zhihu built a Docker‑based Jenkins Pipeline system that now runs over 3,000 jobs daily, offering low entry cost, high customizability, language openness, fast stable builds, and a highly available, extensible cluster while reducing debugging effort and enforcing quality standards.
Background
Zhihu adopted Jenkins for its flexibility and extensive plugin ecosystem. Early on, each developer manually created a few jobs, but as the number of services grew into the thousands, the manual approach became unsustainable:
Developers needed to understand Jenkins configuration and trigger logic, raising creation and maintenance cost.
Physical‑machine builds caused version conflicts and inconsistent behavior after deployment.
Failed builds required developers to SSH into Jenkins slaves, creating permission‑control challenges.
These pain points motivated a system that simplifies application onboarding and automates build‑deploy workflows.
Full lifecycle
The workflow rests on two rules. First, any branch can be built, but only the master branch can be deployed. Second, every change to master must go through a Merge Request (MR), and each MR is built after a simulated merge with master, so the tests run against the post-merge code and avoid test contamination.
A commit follows these steps:
Developer pushes code to GitLab.
GitLab triggers a webhook to Zhihu App Engine (ZAE).
ZAE passes repository ID and application context to the build system Lavie, which handles MR and master‑branch events. Lavie reads a YAML configuration from the repository, generates a Jenkinsfile, builds a Docker image, runs the container, and executes the defined build and test steps.
On success, artifacts are uploaded to physical‑machine, container, or offline platforms, and Slack notifies the user.
The user selects a candidate version in ZAE for deployment.
Common steps such as code checkout, database preparation, test coverage, and artifact registration are handled centrally; applications only need to provide a convention‑based YAML file.
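The generation step described above (YAML configuration in, Jenkinsfile out) can be sketched roughly as follows. This is a minimal illustration and not Lavie's actual code: the function names, the stage names, and the use of the Jenkins Docker Pipeline plugin's docker.image(...).inside step are assumptions, and the config is passed in as an already parsed dictionary.

```python
from typing import Dict, List


def groovy_quote(text: str) -> str:
    """Quote a string as a Groovy single-quoted literal (no interpolation)."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def render_jenkinsfile(config: Dict, image_tag: str) -> str:
    """Render a scripted-pipeline Jenkinsfile from a parsed config.

    Hypothetical layout: 'build' is a list of commands and
    'test.unittest' is a list of commands, as in the YAML examples.
    """
    build_cmds: List[str] = config.get("build", [])
    test_cmds: List[str] = config.get("test", {}).get("unittest", [])

    def stage(name: str, cmds: List[str]) -> List[str]:
        # One pipeline stage whose steps run the application's commands.
        lines = [f"        stage({groovy_quote(name)}) {{"]
        lines += [f"            sh {groovy_quote(cmd)}" for cmd in cmds]
        lines.append("        }")
        return lines

    out = [
        "node {",
        "    stage('Checkout') {",
        "        checkout scm",  # common step handled centrally
        "    }",
        # Assumption: build and test steps run inside the application's container.
        f"    docker.image({groovy_quote(image_tag)}).inside {{",
    ]
    out += stage("Build", build_cmds)
    out += stage("Test", test_cmds)
    out += ["    }", "}"]
    return "\n".join(out)


if __name__ == "__main__":
    example = {
        "base_image": "python2/jessie",
        "build": ["buildout"],
        "test": {"unittest": ["bin/test --with-xunit"]},
    }
    # The image tag is a made-up placeholder.
    print(render_jenkinsfile(example, "registry.example/app:abc123"))
```

Keeping the rendering in one place is what lets applications ship only a short YAML file while the shared steps evolve centrally.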
Goals and challenges
Low entry cost & high customizability
The YAML file declares environment, dependencies, build, test, and post‑build actions. Simple configurations can specify base image, build commands, and test commands; more complex setups allow custom dependencies, MySQL versions, and artifact definitions.
base_image: python2/jessie
build:
  - buildout
test:
  unittest:
    - bin/test --cover-package=pin --with-xunit --with-coverage --cover-xml

A richer example adds Node, custom deps, multiple artifact targets, and cache directories.
base_image: py_node/jessie
deps:
  - libffi-dev
build:
  - buildout
  - cd admin && npm install && gulp
test:
  deps:
    - mysql:5.7
  unittest:
    - bin/test --cover-package=lived,liveweb --with-xunit --with-coverage
  coverage_test:
    report_fpath: coverage.xml
post_build:
  scripts:
    - /bin/bash scripts/release_sentry.sh
artifacts:
  targets:
    - docker
    - tarball
cache:
  directories:
    - admin/static/components
    - admin/node_modules

Language openness
All builds run in containers. Base language images (Python, Go, Java, Node, Rust, etc.) are prepared in advance; applications select the appropriate image and add system dependencies via deps. Dockerfiles are reviewed before use.
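As a rough illustration of how a base image plus a deps list could turn into a Dockerfile, consider the sketch below. It assumes Debian-based images (the jessie suffix suggests this) and apt-get for system packages. The function name is hypothetical, the base image string is used as written in the YAML, and the real system may build images quite differently.

```python
from typing import List


def render_dockerfile(base_image: str, deps: List[str]) -> str:
    """Build Dockerfile text from a base image and system dependencies.

    Assumption: the base image is Debian-based, so deps are apt packages.
    """
    lines = [f"FROM {base_image}"]
    if deps:
        # Sort for a stable layer definition regardless of declaration order.
        packages = " ".join(sorted(deps))
        lines.append(
            "RUN apt-get update"
            f" && apt-get install -y --no-install-recommends {packages}"
            " && rm -rf /var/lib/apt/lists/*"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile("py_node/jessie", ["libffi-dev"]))
```

Sorting the package list means two applications that declare the same deps in a different order produce identical Dockerfile text, which helps image layer reuse.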
Reducing unstable builds
Cache is stored in HDFS keyed by image and dependencies, decoupling it from specific Jenkins slaves. Common caches (e.g., node_modules, .ivy2) are pre‑populated, and applications can declare additional cache paths.
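One way to key a cache by image and dependencies, so that any slave can locate it, is to hash those inputs into a deterministic key. The sketch below is an assumption about how such a key might be derived. The function names and the path layout are hypothetical, and the source does not say which hashing scheme is used.

```python
import hashlib
import json
from typing import List


def cache_key(base_image: str, deps: List[str]) -> str:
    """Derive a stable key from the base image and declared dependencies."""
    # Sorting makes the key independent of the order deps are declared in.
    payload = json.dumps({"image": base_image, "deps": sorted(deps)}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_path(app_name: str, key: str) -> str:
    """Hypothetical location of the cache archive in shared storage."""
    return f"/ci-cache/{app_name}/{key}.tar.gz"


if __name__ == "__main__":
    key = cache_key("py_node/jessie", ["libffi-dev"])
    print(cache_path("example-app", key))
```

Because the key depends only on the configuration and not on the machine, a build scheduled on a different slave resolves to the same archive.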
Dependency stability
Internal mirrors for each language are maintained; Docker images embed these mirrors, and an HTTP proxy backs any external sources lacking internal mirrors.
Lower debugging cost
Developers can SSH directly into the failed container (via a custom docker‑ssh tool) without affecting other builds. Containers are retained for one day after failure for investigation.
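The one-day retention rule can be pictured as a periodic sweep that selects failed containers older than the retention window. The sketch below only decides which container IDs are due for removal. The data structure and function name are hypothetical, and the actual cleanup mechanism is not described in the source.

```python
from datetime import datetime, timedelta
from typing import Dict, List

RETENTION = timedelta(days=1)  # from the text: kept one day after failure


def expired_containers(failed_at: Dict[str, datetime], now: datetime) -> List[str]:
    """Return IDs of failed containers whose retention window has passed."""
    return sorted(cid for cid, ts in failed_at.items() if now - ts >= RETENTION)


if __name__ == "__main__":
    now = datetime(2024, 1, 3, 12, 0)
    failures = {
        "build-101": datetime(2024, 1, 2, 11, 0),  # older than one day
        "build-102": datetime(2024, 1, 3, 9, 0),   # still within the window
    }
    print(expired_containers(failures, now))
```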
Enforcing standards
The system mandates a test stage in every configuration. After tests, coverage reports are posted as comments on the MR, comparing current and master‑branch values. Critical applications have higher coverage thresholds, and the system suggests upgrades when more stable library versions appear.
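The two checks described here, a mandatory test stage and a coverage comparison against master, might look roughly like the following. This is a sketch under assumptions: treating test.unittest as the required key, the comment wording, and the threshold handling are all illustrative rather than taken from the actual system.

```python
from typing import Dict, List


def validate_config(config: Dict) -> List[str]:
    """Return a list of problems; an empty list means the config is accepted."""
    errors = []
    test_section = config.get("test") or {}
    # Assumption: a non-empty 'test.unittest' list counts as the test stage.
    if not test_section.get("unittest"):
        errors.append("missing required test stage: test.unittest")
    return errors


def coverage_comment(current: float, master: float, threshold: float) -> str:
    """Format an MR comment comparing branch coverage with master."""
    delta = current - master
    status = "OK" if current >= threshold else "BELOW THRESHOLD"
    return (
        f"Coverage: {current:.1f}% (master: {master:.1f}%, "
        f"change: {delta:+.1f}%), required: {threshold:.1f}% [{status}]"
    )


if __name__ == "__main__":
    print(validate_config({"build": ["buildout"]}))
    print(coverage_comment(81.5, 80.0, 85.0))
```

A critical application would simply be given a larger threshold value than an ordinary one.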
High availability & scalability
Job scheduling
Jenkins Master only schedules; actual execution happens on labeled Jenkins Nodes (e.g., mysql:5.6, common). Labels match application requirements, and the master dispatches jobs accordingly.
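Label matching can be illustrated with a small selection function: a node is eligible when its labels cover everything the job requires. Jenkins performs this matching itself, so the sketch below is only a model of the idea. The node names, the load figures, and the least-loaded tie-break are assumptions.

```python
from typing import Dict, Optional, Set, Tuple


def pick_node(nodes: Dict[str, Tuple[Set[str], int]], required: Set[str]) -> Optional[str]:
    """Pick the least-loaded node whose labels include all required labels.

    'nodes' maps a node name to (labels, number of running jobs).
    Returns None when no node matches, meaning the job stays queued.
    """
    candidates = [
        (load, name)
        for name, (labels, load) in nodes.items()
        if required <= labels  # subset test: every required label is present
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    cluster = {
        "node-a": ({"common"}, 2),
        "node-b": ({"common", "mysql:5.6"}, 5),
        "node-c": ({"common", "mysql:5.6"}, 1),
    }
    print(pick_node(cluster, {"mysql:5.6"}))
```

Under this model, a node with an empty label set can never be selected, since it cannot satisfy any requirement.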
High‑availability design
Each node runs on a physical machine hosting a Jenkins Slave, Docker daemon, and MySQL for test isolation. If a slave fails, its label is removed, preventing further scheduling. A dual‑master standby setup ensures the cluster remains operational if the primary master goes down.
Monitoring & alerting
Master availability and queue length.
Node online status and hit count.
Job execution time anomalies.
CPU, memory, and disk usage of cluster machines.
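Of the items above, job execution time anomalies lend themselves to a short example. A simple approach, assumed here rather than taken from the source, flags a run whose duration exceeds the historical mean by more than k standard deviations. The function name and the default k are illustrative.

```python
from statistics import mean, stdev
from typing import List


def is_duration_anomaly(history: List[float], latest: float, k: float = 3.0) -> bool:
    """Flag 'latest' if it exceeds mean + k * stdev of past durations (seconds)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = max(stdev(history), 1e-9)  # guard against a zero spread
    return latest > mu + k * sigma


if __name__ == "__main__":
    past = [100.0, 110.0, 90.0, 105.0, 95.0]
    print(is_duration_anomaly(past, 300.0))
    print(is_duration_anomaly(past, 105.0))
```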
Future plans
Dynamic scaling of Jenkins slaves based on cluster load.
Automatic failover for nodes and master, with task migration.
Extend MR build checks to include more quality gates such as automated API tests.
References: Jenkinsfile documentation https://jenkins.io/doc/book/pipeline/jenkinsfile/, Jenkins website https://jenkins.io/.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
