
How Zhihu Scaled to 3,000 Jenkins Jobs with Docker‑Powered CI Pipelines

Zhihu built a Docker‑based Jenkins Pipeline system that now runs over 3,000 jobs daily, offering low entry cost, high customizability, language openness, fast stable builds, and a highly available, extensible cluster while reducing debugging effort and enforcing quality standards.

dbaplus Community

Jan 31, 2019


Background

Zhihu adopted Jenkins for its flexibility and extensive plugin ecosystem. Early on, each developer manually created a few jobs, but as the number of services grew into the thousands, the manual approach became unsustainable:

Developers needed to understand Jenkins configuration and trigger logic, raising creation and maintenance cost.

Physical‑machine builds caused version conflicts and inconsistent behavior after deployment.

Failed builds required developers to SSH into Jenkins slaves, creating permission‑control challenges.

These pain points motivated a system that simplifies application onboarding and automates build‑deploy workflows.

Full lifecycle

The workflow follows two rules. First, any branch can be built, but only the master branch can be deployed. Second, every change to master must go through a Merge Request (MR), and the MR is built after a simulated merge with master so that tests run against the code as it would exist after merging, which avoids test contamination.

A commit follows these steps:

Developer pushes code to GitLab.

GitLab triggers a webhook to Zhihu App Engine (ZAE).

ZAE passes the repository ID and application context to the build system, Lavie, which handles MR and master‑branch events. Lavie reads a YAML configuration from the repository, generates a Jenkinsfile, builds a Docker image, runs the container, and executes the defined build and test steps.

On success, artifacts are uploaded to physical‑machine, container, or offline platforms, and Slack notifies the user.

The user selects a candidate version in ZAE for deployment.

Common steps such as code checkout, database preparation, test coverage, and artifact registration are handled centrally; applications only need to provide a convention‑based YAML file.
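The article does not show the Jenkinsfile that Lavie generates, so the following is only a minimal sketch of the idea: turn an already-parsed configuration into pipeline text. The function names `sh_step` and `render_jenkinsfile` are hypothetical, the config is passed as a plain dict to avoid assuming a particular YAML parser, and the generated pipeline uses `docker.image(...).inside` from the Docker Pipeline plugin as a stand-in for whatever Lavie actually emits.

```python
from typing import Dict, List, Tuple


def sh_step(command: str) -> str:
    """Wrap a shell command in a Jenkins Pipeline `sh` step."""
    # Escape backslashes and single quotes for a Groovy single-quoted string.
    escaped = command.replace("\\", "\\\\").replace("'", "\\'")
    return f"sh '{escaped}'"


def render_jenkinsfile(config: Dict) -> str:
    """Render a scripted-pipeline Jenkinsfile from an already-parsed config.

    Hypothetical helper: the real Lavie generator is not described in detail.
    """
    stages: List[Tuple[str, List[str]]] = [
        ("Build", config.get("build", [])),
        ("Test", config.get("test", {}).get("unittest", [])),
    ]
    lines = [
        "node {",
        "    stage('Checkout') {",
        "        checkout scm",
        "    }",
        # Run all remaining stages inside the application's base image.
        f"    docker.image('{config['base_image']}').inside {{",
    ]
    for name, commands in stages:
        lines.append(f"        stage('{name}') {{")
        lines.extend(f"            {sh_step(cmd)}" for cmd in commands)
        lines.append("        }")
    lines += ["    }", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "base_image": "python2/jessie",
        "build": ["buildout"],
        "test": {"unittest": ["bin/test --with-xunit --with-coverage"]},
    }
    print(render_jenkinsfile(example))
```

Keeping the shared steps (checkout, container setup) in the generator rather than in each repository is what lets applications get by with a short convention-based file.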

Goals and challenges

Low entry cost & high customizability

The YAML file declares environment, dependencies, build, test, and post‑build actions. Simple configurations can specify base image, build commands, and test commands; more complex setups allow custom dependencies, MySQL versions, and artifact definitions.

base_image: python2/jessie
build:
  - buildout
test:
  unittest:
    - bin/test --cover-package=pin --with-xunit --with-coverage --cover-xml

A richer example adds Node, custom deps, multiple artifact targets, and cache directories.

base_image: py_node/jessie
deps:
  - libffi-dev
build:
  - buildout
  - cd admin && npm install && gulp
test:
  deps:
    - mysql:5.7
  unittest:
    - bin/test --cover-package=lived,liveweb --with-xunit --with-coverage
coverage_test:
  report_fpath: coverage.xml
post_build:
  scripts:
    - /bin/bash scripts/release_sentry.sh
artifacts:
  targets:
    - docker
    - tarball
cache:
  directories:
    - admin/static/components
    - admin/node_modules

Language openness

All builds run in containers. Base language images (Python, Go, Java, Node, Rust, etc.) are prepared in advance; applications select the appropriate image and add system dependencies via deps. Dockerfiles are reviewed before use.

Reducing unstable builds

Cache is stored in HDFS keyed by image and dependencies, decoupling it from specific Jenkins slaves. Common caches (e.g., node_modules, .ivy2) are pre‑populated, and applications can declare additional cache paths.
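One way to key a cache by image and dependencies is to hash a canonical form of both, so that any Jenkins slave computes the same key for the same environment. The sketch below is an assumption about how such a key could be derived, not Zhihu's actual scheme; `cache_key`, `cache_path`, and the `/ci-cache` root are hypothetical names.

```python
import hashlib
import json
from typing import Iterable


def cache_key(base_image: str, deps: Iterable[str]) -> str:
    """Derive a stable key from the base image and system dependencies."""
    # Sort and de-duplicate deps so that declaration order does not matter.
    payload = json.dumps(
        {"image": base_image, "deps": sorted(set(deps))}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_path(base_image: str, deps: Iterable[str], directory: str,
               root: str = "/ci-cache") -> str:
    """Build a storage path for one cached directory (hypothetical layout)."""
    return f"{root}/{cache_key(base_image, deps)}/{directory.strip('/')}.tar.gz"


if __name__ == "__main__":
    print(cache_path("py_node/jessie", ["libffi-dev"], "admin/node_modules"))
```

Because the key depends only on the declared environment, a build scheduled on a different slave can still find and restore the same archive.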

Dependency stability

Internal mirrors for each language are maintained; Docker images embed these mirrors, and an HTTP proxy backs any external sources lacking internal mirrors.

Lower debugging cost

Developers can SSH directly into the failed container (via a custom docker‑ssh tool) without affecting other builds. Containers are retained for one day after failure for investigation.
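The one-day retention rule can be expressed as a small cleanup policy. This is a sketch under stated assumptions: the `BuildContainer` record and `containers_to_remove` function are hypothetical, and treating successful containers as removable immediately is an assumption, since the article only describes what happens after a failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Failed build containers are kept this long for debugging.
RETENTION = timedelta(days=1)


@dataclass
class BuildContainer:
    container_id: str
    succeeded: bool
    finished_at: datetime


def containers_to_remove(containers: List[BuildContainer],
                         now: datetime) -> List[str]:
    """Return IDs of containers that are safe to delete at time `now`."""
    removable = []
    for c in containers:
        if c.succeeded:
            # Assumption: nothing to debug, so remove right away.
            removable.append(c.container_id)
        elif now - c.finished_at >= RETENTION:
            # Failed container whose retention window has passed.
            removable.append(c.container_id)
    return removable
```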

Enforcing standards

The system mandates a test stage in every configuration. After tests, coverage reports are posted as comments on the MR, comparing current and master‑branch values. Critical applications have higher coverage thresholds, and the system suggests upgrades when more stable library versions appear.
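The two checks described above, a mandatory test stage and a coverage comparison against master, can be sketched as follows. The function names, the exact rule that `test.unittest` must be non-empty, and the comment wording are all assumptions for illustration; the article does not specify them.

```python
from typing import Dict, List


def validate_config(config: Dict) -> List[str]:
    """Return a list of problems; an empty list means the config is accepted."""
    errors = []
    if not config.get("base_image"):
        errors.append("base_image is required")
    # Assumption: the mandatory test stage means at least one unittest command.
    if not config.get("test", {}).get("unittest"):
        errors.append("a test stage with at least one unittest command is required")
    return errors


def coverage_comment(current: float, master: float, threshold: float) -> str:
    """Format an MR comment comparing branch coverage with master."""
    delta = current - master
    status = "meets" if current >= threshold else "is below"
    return (
        f"Coverage: {current:.1f}% (master: {master:.1f}%, "
        f"change: {delta:+.1f} points). "
        f"This {status} the required {threshold:.1f}%."
    )


if __name__ == "__main__":
    print(validate_config({"base_image": "python2/jessie", "build": ["buildout"]}))
    print(coverage_comment(82.5, 80.0, 80.0))
```

A per-application `threshold` is one simple way to give critical applications a stricter bar than the default.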

High availability & scalability

Job scheduling

The Jenkins master only schedules jobs; actual execution happens on labeled Jenkins nodes (e.g., mysql:5.6, common). Each job requests the labels that match its application's requirements, and the master dispatches it to a node carrying those labels.
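Label matching of this kind can be modeled as a subset check: a node is eligible when it carries every label the job requires. In the sketch below, deriving the required labels from the `test.deps` entries of the YAML file, and falling back to `common` when there are none, is an assumption; Jenkins itself performs the real matching, and `required_labels` and `eligible_nodes` are hypothetical helpers.

```python
from typing import Dict, FrozenSet, List, Set

DEFAULT_LABEL = "common"


def required_labels(config: Dict) -> FrozenSet[str]:
    """Derive the node labels a job needs from its test dependencies."""
    deps = config.get("test", {}).get("deps", [])
    # Assumption: jobs without special test deps run on "common" nodes.
    return frozenset(deps) if deps else frozenset({DEFAULT_LABEL})


def eligible_nodes(nodes: Dict[str, Set[str]],
                   required: FrozenSet[str]) -> List[str]:
    """Return the names of nodes that carry every required label."""
    return sorted(name for name, labels in nodes.items() if required <= labels)


if __name__ == "__main__":
    cluster = {
        "node-a": {"common", "mysql:5.6"},
        "node-b": {"common", "mysql:5.7"},
    }
    job = {"test": {"deps": ["mysql:5.7"]}}
    print(eligible_nodes(cluster, required_labels(job)))
```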

High‑availability design

Each node runs on a physical machine hosting a Jenkins Slave, Docker daemon, and MySQL for test isolation. If a slave fails, its label is removed, preventing further scheduling. A dual‑master standby setup ensures the cluster remains operational if the primary master goes down.
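The effect of removing a failed slave's labels, and of falling back to the standby master, can be illustrated with a small model. This is only a sketch of the behavior described above: `apply_health` and `active_master` are hypothetical names, and how health is actually detected is not covered by the article.

```python
from typing import Dict, Optional, Set


def apply_health(nodes: Dict[str, Set[str]],
                 healthy: Dict[str, bool]) -> Dict[str, Set[str]]:
    """Strip all labels from nodes that are not reported healthy.

    A node without labels can no longer match any job's requirements,
    so no new work is scheduled on it.
    """
    return {
        name: set(labels) if healthy.get(name, False) else set()
        for name, labels in nodes.items()
    }


def active_master(primary_up: bool, standby_up: bool) -> Optional[str]:
    """Pick which master should serve the cluster, preferring the primary."""
    if primary_up:
        return "primary"
    if standby_up:
        return "standby"
    return None
```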

Monitoring & alerting

Master availability and queue length.

Node online status and hit count.

Job execution time anomalies.

CPU, memory, and disk usage of cluster machines.

Future plans

Dynamic scaling of Jenkins slaves based on cluster load.

Automatic failover for nodes and master, with task migration.

Extend MR build checks to include more quality gates such as automated API tests.

References: Jenkinsfile documentation https://jenkins.io/doc/book/pipeline/jenkinsfile/, Jenkins website https://jenkins.io/.
