
How Meituan Scaled Its CI/CD Pipeline to 100K Daily Runs with 99.99% Success

This article details Meituan's three‑year journey building a self‑developed, distributed pipeline engine that now handles nearly 100,000 daily executions across dozens of services with over 99.99% reliability, covering the challenges faced, architectural decisions, scheduling and resource‑pool designs, and future cloud‑native plans.

dbaplus Community

Background

Continuous delivery has become a core practice for software teams. After three years of development, Meituan built a unified server‑side pipeline engine that processes close to 100,000 pipeline executions per day across many business lines (e.g., Meituan Store, Delivery, Platform) with a success rate above 99.99%.

Evolution of the Engine

Three development stages were identified:

2014‑2015 – Unified Jenkins cluster to provide SSO, repository integration, notifications, and dynamic agent scaling.

2016‑2018 – Split into multiple Jenkins clusters to alleviate single‑cluster bottlenecks, which introduced operational complexity and security concerns.

2019‑present – Designed a custom distributed pipeline engine (internal project name “Pipeline”) to remove single‑machine limits and eliminate duplicated tooling.

Key problems were low scheduling efficiency, resource contention, heterogeneous tool integration, and the need for extensible low‑impact enhancements.

Solution Overview

Separation of Scheduling Decision and Resource Allocation

The engine separates the Decision Service (calculates which jobs can run and records decisions) from the Worker (pulls jobs and assigns execution resources). Both modules are horizontally scalable, improving throughput and availability while allowing independent evolution.

Resource‑Pool Model

Execution environments are abstracted as resource pools. Three pool types are defined:

Pre‑provisioned public pool: Reserved for high‑frequency, latency‑sensitive jobs; size is adjusted based on historical usage.

On‑demand pool: Created in real time for jobs that cannot be satisfied by the public pool, improving overall utilization.

External platform pool: Managed by third‑party platforms that control pull frequency and throughput.

Jobs and pools are linked via tags, enabling flexible matching and priority handling.
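
The tag-based matching between jobs and pools could be sketched roughly as follows. The names `Pool` and `pick_pool` and the `free_slots` field are hypothetical, not taken from Meituan's engine; the only behavior borrowed from the text is that an on-demand pool is used when the public pool cannot satisfy the job.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    name: str
    kind: str            # "public", "on_demand", or "external"
    tags: frozenset      # tags this pool is willing to serve
    free_slots: int = 0  # idle executors; only meaningful for public pools here


def pick_pool(job_tag: str, pools: list) -> Optional[Pool]:
    """Prefer a public pool with spare capacity, else fall back to on-demand."""
    candidates = [p for p in pools if job_tag in p.tags]
    for pool in candidates:
        if pool.kind == "public" and pool.free_slots > 0:
            return pool
    for pool in candidates:
        if pool.kind == "on_demand":
            return pool
    return None  # no pool serves this tag


pools = [
    Pool("build-public", "public", frozenset({"build"}), free_slots=0),
    Pool("build-elastic", "on_demand", frozenset({"build", "scan"})),
]
```

With the public pool exhausted, `pick_pool("build", pools)` falls through to the on-demand pool; once `free_slots` is positive again, the public pool is chosen first.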

Three‑Layer Component Architecture

Business Logic Layer: Implements specific job behavior and adapts to diverse development scenarios.

System Interaction Layer: Exposes a uniform process interface (init(), run(), queryResult(), uploadArtifacts()) that shields components from engine internals.

Execution Resource Layer: Supports multiple delivery forms (container images, plugins, standalone services) to accommodate varied tool integrations.

This layering standardizes component interaction while allowing extensions such as job cancellation, manual approvals, and asynchronous result handling.
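
As an illustration of the System Interaction Layer, the uniform interface might look like the sketch below. The four method names follow the article (written in snake_case for Python); the `EchoComponent` class, the `execute` driver, the call order, and the status strings are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """Uniform lifecycle that hides engine internals from a component."""

    @abstractmethod
    def init(self, params: dict) -> None: ...

    @abstractmethod
    def run(self) -> None: ...

    @abstractmethod
    def query_result(self) -> str: ...

    @abstractmethod
    def upload_artifacts(self) -> list: ...


class EchoComponent(Component):
    """Hypothetical business-logic implementation used only for illustration."""

    def init(self, params: dict) -> None:
        self.message = params["message"]
        self.done = False

    def run(self) -> None:
        self.done = True

    def query_result(self) -> str:
        return "success" if self.done else "running"

    def upload_artifacts(self) -> list:
        return [f"{self.message}.txt"]


def execute(component: Component, params: dict) -> dict:
    """Assumed driver: call the lifecycle methods in a fixed order."""
    component.init(params)
    component.run()
    status = component.query_result()
    artifacts = component.upload_artifacts() if status == "success" else []
    return {"status": status, "artifacts": artifacts}
```

Because the driver depends only on the abstract interface, a new component can be added without touching the code that schedules or runs it.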

Overall Architecture

The engine consists of five core modules:

Trigger: Handles entry events (Git push, PR, API, schedule).

Task Center: Distributed storage that maintains pipeline and job states; provides APIs for other modules.

Decision Service: Evaluates pending jobs, applies run‑order, condition filtering, and priority rules, then updates the Task Center.

Worker: Long‑polls the Task Center, pulls scheduled jobs, executes them, and reports results.

Component SDK: Wraps component execution and synchronizes status with the engine.

The Task Center is the single source of truth; Decision Service and Workers operate independently, enabling high concurrency and fault tolerance.

Core Design Details

Job Scheduling Flow

A typical pipeline (checkout → parallel code‑scan & build → deploy) demonstrates the collaboration of modules. Jobs transition through states: unstart → pending → scheduled → running → completed/failed. Optimistic locking and periodic monitoring provide compensation for network or database anomalies.
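
A minimal sketch of the state transitions guarded by optimistic locking is shown below. The in-memory `JobStore` and its version counter are assumptions standing in for a database row updated with a conditional write (for example `UPDATE ... WHERE version = ?`); the state names come from the article.

```python
# Legal forward transitions, taken from the state list above.
ALLOWED = {
    "unstart": {"pending"},
    "pending": {"scheduled"},
    "scheduled": {"running"},
    "running": {"completed", "failed"},
}


class JobStore:
    """Hypothetical in-memory stand-in for the job table."""

    def __init__(self):
        self._rows = {}  # job_id -> (state, version)

    def add(self, job_id: str) -> None:
        self._rows[job_id] = ("unstart", 0)

    def get(self, job_id: str) -> tuple:
        return self._rows[job_id]

    def transition(self, job_id: str, new_state: str, expected_version: int) -> bool:
        """Apply the change only if nobody else has updated the row since it was read."""
        state, version = self._rows[job_id]
        if version != expected_version:
            return False  # stale read: another writer won the race
        if new_state not in ALLOWED.get(state, set()):
            return False  # illegal transition
        self._rows[job_id] = (new_state, version + 1)
        return True
```

Two decision instances that both read version 0 can both attempt `unstart → pending`, but only the first write succeeds; the second sees a version mismatch and backs off.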

Resource‑Pool Partitioning

Jobs are tagged along two dimensions, component and pipeline, and each tag maps to exactly one queue. Resource pools have a many‑to‑many relationship with tags, so one pool can serve multiple queues, which improves utilization. Tags also encode priority, ensuring critical pipelines receive resources first.

Queue Splitting

When a job is enqueued, its tags determine the target queue. Workers poll queues in a round‑robin fashion, respecting per‑request job limits. This avoids locking multiple queues simultaneously and reduces contention.

Component Interaction Flow

The component state machine and a set of standardized APIs guarantee consistent lifecycle management. New event types (e.g., cancellation, manual approval) are added to the job‑pull request without altering the core flow.

State Machine and Compensation

The state machine progresses via events (decision, pull, ACK, result report). To handle failures:

Missing decision events trigger periodic monitoring that re‑issues decisions.

Duplicate decisions are prevented by a pending state and database optimistic locks.

State‑change anomalies are mitigated by an eventual‑consistency approach: database updates precede queue insertion, and compensating jobs re‑enqueue if inconsistencies are detected.

Worker loss or timeout is handled by ACK timeouts that revert jobs to pending for re‑pull.
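
The ACK-timeout compensation in the last item can be illustrated with the sketch below. The 30-second timeout, the `Job` fields, and the rule that an ACK moves the job to running are assumptions for the example; the article only states that an ACK timeout reverts the job to pending so it can be pulled again.

```python
from dataclasses import dataclass
from typing import Optional

ACK_TIMEOUT_SECONDS = 30.0  # assumed value, not from the article


@dataclass
class Job:
    job_id: str
    state: str = "pending"
    pulled_at: Optional[float] = None
    acked: bool = False


def pull(job: Job, now: float) -> None:
    """A Worker pulls the job; the engine now waits for its ACK."""
    job.state = "scheduled"
    job.pulled_at = now
    job.acked = False


def ack(job: Job) -> None:
    """The Worker confirms it has started the job."""
    job.acked = True
    job.state = "running"


def requeue_unacked(jobs: list, now: float) -> list:
    """Periodic monitor: revert jobs whose ACK never arrived in time."""
    reverted = []
    for job in jobs:
        overdue = job.pulled_at is not None and now - job.pulled_at > ACK_TIMEOUT_SECONDS
        if job.state == "scheduled" and not job.acked and overdue:
            job.state = "pending"
            job.pulled_at = None
            reverted.append(job.job_id)
    return reverted
```

Passing `now` explicitly keeps the monitor deterministic; a real monitor would read the clock and would also need the optimistic-lock check so that a late ACK and the revert cannot both succeed.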

Decision Process

Decision Service selects runnable jobs from the unstart set using three sub‑steps:

Run‑order handling: Jobs are assigned a numeric runOrder; jobs with the same order can run in parallel, while higher orders wait for lower ones to finish.

Condition filtering: A chain of global and user‑defined conditions may skip or reuse previous results, reducing unnecessary executions.

Priority weighting: Besides timestamp fairness, pipelines receive weight values (e.g., release pipelines > test pipelines; manual > scheduled) to prioritize critical work.
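
The three sub-steps can be combined into one decision round, sketched below. The `Job` fields, the `skipped` state, the numeric weights, and the sort key (higher weight first, then older timestamp) are assumptions chosen to illustrate the idea, not the engine's actual rules.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    pipeline: str
    name: str
    run_order: int
    weight: int        # assumed: larger means more important
    created_at: float  # used for timestamp fairness among equal weights
    state: str = "unstart"


Condition = Callable[[Job], bool]


def decide(jobs: List[Job], conditions: List[Condition]) -> List[Job]:
    """One decision round: run order, then condition chain, then priority."""
    by_pipeline = {}
    for job in jobs:
        by_pipeline.setdefault(job.pipeline, []).append(job)

    picked = []
    for pipeline_jobs in by_pipeline.values():
        # 1. Run order: only the lowest order with unfinished work may proceed.
        open_jobs = [j for j in pipeline_jobs if j.state not in ("completed", "skipped")]
        if not open_jobs:
            continue
        current = min(j.run_order for j in open_jobs)
        for job in open_jobs:
            if job.run_order != current or job.state != "unstart":
                continue
            # 2. Condition filtering: every condition in the chain must pass.
            if all(cond(job) for cond in conditions):
                job.state = "pending"
                picked.append(job)
            else:
                job.state = "skipped"

    # 3. Priority weighting: heavier pipelines first, then first come first served.
    picked.sort(key=lambda j: (-j.weight, j.created_at))
    return picked
```

Jobs sharing the lowest open run order are released together, which is what lets code scan and build run in parallel after checkout.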

Resource‑Pool Model Details

Jobs are assigned to queues based on a two‑dimensional tag:

Component dimension: Groups jobs by resource requirements (e.g., SSD, dev environment) and creates dedicated public pools.

Pipeline dimension: Reflects business‑level isolation needs; some pipelines obtain exclusive pools, others share pools with weight‑based guarantees.

Tags map 1:1 to queues (simplifying operations) and many‑to‑many to pools (maximizing utilization). When a queue is back‑logged, the missing resource tag is quickly identified, and the impact scope is limited to the affected queues.
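
The two mappings can be pictured with a small routing table, sketched below. The `TagRouter` class, its method names, and the backlog threshold are hypothetical; the sketch only mirrors the stated relationships (one queue per tag, many-to-many between tags and pools).

```python
from collections import defaultdict, deque


class TagRouter:
    """Hypothetical bookkeeping for tag -> queue and pool <-> tag relations."""

    def __init__(self):
        self.queues = defaultdict(deque)   # tag -> its single queue (1:1)
        self.pool_tags = defaultdict(set)  # pool -> tags it serves (many-to-many)

    def bind(self, pool: str, tag: str) -> None:
        self.pool_tags[pool].add(tag)

    def enqueue(self, tag: str, job: str) -> None:
        self.queues[tag].append(job)

    def pools_for(self, tag: str) -> list:
        """All pools that may serve the queue belonging to this tag."""
        return sorted(p for p, tags in self.pool_tags.items() if tag in tags)

    def backlogged_tags(self, threshold: int) -> list:
        """Tags whose queue is longer than the threshold."""
        return sorted(t for t, q in self.queues.items() if len(q) > threshold)


router = TagRouter()
router.bind("ssd-pool", "ssd")
router.bind("shared-pool", "ssd")
router.bind("shared-pool", "dev-env")
for job in ("j1", "j2", "j3"):
    router.enqueue("ssd", job)
router.enqueue("dev-env", "j4")
```

Here `router.backlogged_tags(2)` points straight at the `ssd` tag, and `router.pools_for("ssd")` lists the pools whose capacity would need attention.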

Queue Pull Design

Workers request a batch of jobs; the engine iterates over the relevant queues in a round‑robin order until the batch size limit is reached or all queues are empty. Randomized tag ordering reduces lock contention across concurrent workers.
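
A batch pull along these lines might be sketched as follows. The function name `pull_batch`, the dictionary of in-memory queues, and the use of `random.shuffle` for the tag order are assumptions; a real engine would pull from persistent queues.

```python
import random
from collections import deque


def pull_batch(queues: dict, tags: list, batch_size: int, rng=None) -> list:
    """Take jobs one queue at a time until the batch is full or all queues are empty."""
    order = list(tags)
    (rng or random).shuffle(order)  # different workers start from different queues

    batch = []
    while len(batch) < batch_size:
        took_any = False
        for tag in order:
            queue = queues.get(tag)
            if queue:
                batch.append(queue.popleft())
                took_any = True
                if len(batch) == batch_size:
                    break
        if not took_any:
            break  # every relevant queue is empty
    return batch


queues = {"a": deque(["a1", "a2", "a3"]), "b": deque(["b1"])}
```

Taking one job per queue per pass keeps a long queue from starving a short one, and each step touches only a single queue, so no multi-queue lock is needed.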

Component Layering and Extensibility

Components implement the standard lifecycle methods defined in the System Interaction Layer. Extensions such as job cancellation, manual approval callbacks, or asynchronous result handling are expressed as additional event types in the job‑pull request, leaving the core flow untouched.

Adapter patterns provide default implementations for common scenarios (e.g., Shell scripts) while allowing dynamic injection of custom commands without coupling business logic to the engine.
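
A default Shell adapter of this kind could look roughly like the sketch below. The `ShellAdapter` class, the `command` parameter, and the status strings are assumptions for illustration; the example relies only on Python's standard `subprocess.run` and expects a POSIX shell to be available.

```python
import subprocess


class ShellAdapter:
    """Hypothetical default adapter: a component only supplies a command string."""

    default_command = "echo no command configured"

    def init(self, params: dict) -> None:
        # The custom command is injected at run time through job parameters.
        self.command = params.get("command", self.default_command)
        self.result = None

    def run(self) -> None:
        self.result = subprocess.run(
            self.command, shell=True, capture_output=True, text=True
        )

    def query_result(self) -> str:
        if self.result is None:
            return "unstart"
        return "success" if self.result.returncode == 0 else "failed"

    def output(self) -> str:
        return self.result.stdout.strip()


adapter = ShellAdapter()
adapter.init({"command": "echo hello"})
adapter.run()
```

The adapter maps the process exit code onto a job result, so a component author who only needs a script never has to implement the lifecycle methods directly.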

Future Directions

Explore serverless and other cloud‑native technologies to create lighter, more elastic resource‑management solutions, focusing on elasticity, startup acceleration, and environment isolation.

Provide a one‑stop development‑to‑operation platform for component creators, lowering development and operational costs and fostering a vibrant component ecosystem.

Figures: Pipeline concept; Pipeline architecture; Scheduling process; Job state machine; Decision state; ACK state; Decision process; Parallel decision; Resource pool model; Queue design; Component architecture; Component standard flow; Component extension; Adapter design.