How Meituan Scaled Its CI/CD Pipeline to 100K Daily Runs with 99.99% Success
This article details Meituan's three‑year journey building a self‑developed, distributed pipeline engine that now handles nearly 100,000 daily executions across dozens of services with over 99.99% reliability, covering the challenges faced, architectural decisions, scheduling and resource‑pool designs, and future cloud‑native plans.
Background
Continuous delivery has become a core practice for software teams. After three years of development, Meituan built a unified server‑side pipeline engine that processes close to 100,000 pipeline executions per day across many business lines (e.g., Meituan Store, Delivery, Platform) with a success rate above 99.99%.
Evolution of the Engine
Three development stages were identified:
2014‑2015 – Unified Jenkins cluster to provide SSO, repository integration, notifications, and dynamic agent scaling.
2016‑2018 – Split into multiple Jenkins clusters to alleviate single‑cluster bottlenecks, which introduced operational complexity and security concerns.
2019‑present – Designed a custom distributed pipeline engine (internal project name “Pipeline”) to remove single‑machine limits and eliminate duplicated tooling.
The key problems were low scheduling efficiency, resource contention, integration of heterogeneous tools, and the need to extend the engine without disrupting existing pipelines.
Solution Overview
Separation of Scheduling Decision and Resource Allocation
The engine separates the Decision Service (calculates which jobs can run and records decisions) from the Worker (pulls jobs and assigns execution resources). Both modules are horizontally scalable, improving throughput and availability while allowing independent evolution.
Resource‑Pool Model
Execution environments are abstracted as resource pools. Three pool types are defined:
Pre‑provisioned public pool: Reserved for high‑frequency, latency‑sensitive jobs; size is adjusted based on historical usage.
On‑demand pool: Created in real time for jobs that cannot be satisfied by the public pool, improving overall utilization.
External platform pool: Managed by third‑party platforms that control pull frequency and throughput.
Jobs and pools are linked via tags, enabling flexible matching and priority handling.
Three‑Layer Component Architecture
Business Logic Layer: Implements specific job behavior and adapts to diverse development scenarios.
System Interaction Layer: Exposes a uniform process interface (init(), run(), queryResult(), uploadArtifacts()) that shields components from engine internals.
Execution Resource Layer: Supports multiple delivery forms (container images, plugins, standalone services) to accommodate varied tool integrations.
This layering standardizes component interaction while allowing extensions such as job cancellation, manual approvals, and asynchronous result handling.
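As a rough illustration of the uniform process interface, the sketch below models the four lifecycle calls as a Python abstract base class. Only the four method names come from the text (rendered here in snake_case). The parameter and return types, the `execute` driver, the `EchoComponent` example, and the rule that artifacts are uploaded only on success are assumptions made for this example, not Meituan's actual SDK.

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """Hypothetical interface mirroring init(), run(), queryResult(), uploadArtifacts()."""

    @abstractmethod
    def init(self, params: dict) -> None:
        """Prepare the component with job parameters."""

    @abstractmethod
    def run(self) -> None:
        """Perform the component's work."""

    @abstractmethod
    def query_result(self) -> str:
        """Return "success" or "failed" (assumed status values)."""

    @abstractmethod
    def upload_artifacts(self) -> list:
        """Return the names of the artifacts that were uploaded."""


def execute(component: Component, params: dict) -> dict:
    """Drive one component through the lifecycle in a fixed order."""
    component.init(params)
    component.run()
    status = component.query_result()
    # Assumption: artifacts are only uploaded when the run succeeded.
    artifacts = component.upload_artifacts() if status == "success" else []
    return {"status": status, "artifacts": artifacts}


class EchoComponent(Component):
    """Toy business-logic implementation used only for illustration."""

    def init(self, params: dict) -> None:
        self.message = params.get("message", "")
        self.done = False

    def run(self) -> None:
        # The toy job "succeeds" only if it was given a message.
        self.done = bool(self.message)

    def query_result(self) -> str:
        return "success" if self.done else "failed"

    def upload_artifacts(self) -> list:
        return [f"{self.message}.log"]
```

Because the driver only touches the four lifecycle methods, a new component can be added without the driver knowing anything about its business logic, which is the isolation the layering aims for.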
Overall Architecture
The engine consists of five core modules:
Trigger: Handles entry events (Git push, PR, API, schedule).
Task Center: Distributed storage that maintains pipeline and job states; provides APIs for other modules.
Decision Service: Evaluates pending jobs, applies run‑order, condition filtering, and priority rules, then updates the Task Center.
Worker: Long‑polls the Task Center, pulls scheduled jobs, executes them, and reports results.
Component SDK: Wraps component execution and synchronizes status with the engine.
The Task Center is the single source of truth; Decision Service and Workers operate independently, enabling high concurrency and fault tolerance.
Core Design Details
Job Scheduling Flow
A typical pipeline (checkout → parallel code‑scan & build → deploy) demonstrates the collaboration of modules. Jobs transition through states: unstart → pending → scheduled → running → completed/failed. Optimistic locking and periodic monitoring provide compensation for network or database anomalies.
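The state sequence above can be sketched as a small transition table. The state names are taken from the text; the `JobState` enum, the `advance` helper, and the `IllegalTransition` exception are hypothetical names. The extra scheduled → pending edge is an assumption that models the ACK‑timeout revert the article describes under compensation.

```python
from enum import Enum


class JobState(Enum):
    UNSTART = "unstart"
    PENDING = "pending"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed moves between states; terminal states have no outgoing edges.
TRANSITIONS = {
    JobState.UNSTART: {JobState.PENDING},
    JobState.PENDING: {JobState.SCHEDULED},
    # Assumed: a job that is never acknowledged goes back to pending.
    JobState.SCHEDULED: {JobState.RUNNING, JobState.PENDING},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}


class IllegalTransition(Exception):
    """Raised when a requested state change is not in the table."""


def advance(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target
```

Keeping the legal moves in one table makes it easy to reject out‑of‑order updates, such as a late result report for a job that was already reverted.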
Resource‑Pool Partitioning
Jobs are labeled along component and pipeline dimensions, and each label maps to exactly one queue. Resource pools have many‑to‑many relationships with labels, so one pool can serve multiple queues, which improves utilization. Tags also encode priority, ensuring critical pipelines receive resources first.
Queue Splitting
When a job is enqueued, its tags determine the target queue. Workers poll queues in a round‑robin fashion, respecting per‑request job limits. This avoids locking multiple queues simultaneously and reduces contention.
Component Interaction Flow
The component state machine and a set of standardized APIs guarantee consistent lifecycle management. New event types (e.g., cancellation, manual approval) are added to the job‑pull request that Workers send to the engine, without altering the core flow.
State Machine and Compensation
The state machine progresses via events (decision, pull, ACK, result report). To handle failures:
Missing decision events trigger periodic monitoring that re‑issues decisions.
Duplicate decisions are prevented by a pending state and database optimistic locks.
State‑change anomalies are mitigated by an eventual‑consistency approach: the database update precedes queue insertion, and compensating jobs re‑enqueue the job if an inconsistency is detected.
Worker loss or timeout is handled by ACK timeouts that revert jobs to pending for re‑pull.
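The optimistic‑lock guard against duplicate decisions might look like the following sketch. It uses an in‑memory SQLite table with a `version` column, which is a common way to implement optimistic locking. The table layout, column names, and the `make_db` and `try_transition` helpers are assumptions for illustration; the article does not describe the actual schema or database.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway database holding one job in the unstart state."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job ("
        "id INTEGER PRIMARY KEY, state TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO job (id, state, version) VALUES (1, 'unstart', 0)")
    conn.commit()
    return conn


def try_transition(conn, job_id, expected_state, expected_version, new_state) -> bool:
    """Apply the change only if nobody else has modified the row since it was read."""
    cur = conn.execute(
        "UPDATE job SET state = ?, version = version + 1 "
        "WHERE id = ? AND state = ? AND version = ?",
        (new_state, job_id, expected_state, expected_version),
    )
    conn.commit()
    # rowcount is 1 for the winner and 0 for any caller holding stale data.
    return cur.rowcount == 1
```

If two Decision Service instances both read the job as (unstart, version 0), only the first update matches the WHERE clause; the second sees zero affected rows and can drop its decision instead of enqueueing the job twice.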
Decision Process
Decision Service selects runnable jobs from the unstart set using three sub‑steps:
Run‑order handling: Jobs are assigned a numeric runOrder; jobs with the same order can run in parallel, while higher orders wait for lower ones to finish.
Condition filtering: A chain of global and user‑defined conditions may skip a job or reuse a previous result, reducing unnecessary executions.
Priority weighting: Besides timestamp fairness, pipelines receive weight values (e.g., release pipelines > test pipelines; manual > scheduled) to prioritize critical work.
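The three sub‑steps can be sketched as one small function. This is a simplified model under stated assumptions: the `Job` fields, the `skipped` state, treating the weight as a per‑job number standing in for the pipeline weight, and the tie‑break by creation time are all illustrative choices rather than details confirmed by the article.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    run_order: int
    state: str = "unstart"
    weight: int = 0          # assumed: higher weight means more important
    created_at: float = 0.0  # assumed: earlier jobs win ties


def runnable_by_order(jobs):
    """Unstart jobs in the lowest runOrder that still has unfinished work."""
    unfinished = [j for j in jobs if j.state not in ("completed", "skipped")]
    if not unfinished:
        return []
    current = min(j.run_order for j in unfinished)
    return [j for j in unfinished if j.run_order == current and j.state == "unstart"]


def decide(jobs, conditions):
    """Run-order handling, then condition filtering, then priority ordering."""
    selected = []
    for job in runnable_by_order(jobs):
        if all(condition(job) for condition in conditions):
            job.state = "pending"   # decision recorded
            selected.append(job)
        else:
            job.state = "skipped"   # filtered out by a condition
    # Heavier jobs first; equal weights fall back to creation time.
    selected.sort(key=lambda j: (-j.weight, j.created_at))
    return selected
```

For a pipeline with a completed checkout at order 1, a scan and a build at order 2, and a deploy at order 3, the first call selects only the two order‑2 jobs, and deploy stays blocked until both have finished or been skipped.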
Resource‑Pool Model Details
Jobs are assigned to queues based on a two‑dimensional tag:
Component dimension: Groups jobs by resource requirements (e.g., SSD, dev environment) and creates dedicated public pools.
Pipeline dimension: Reflects business‑level isolation needs; some pipelines obtain exclusive pools, others share pools with weight‑based guarantees.
Tags map 1:1 to queues (simplifying operations) and many‑to‑many to pools (maximizing utilization). When a queue is back‑logged, the missing resource tag is quickly identified, and the impact scope is limited to the affected queues.
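A minimal sketch of these two mappings follows, assuming a tag is a (component, pipeline) pair. The `PoolRegistry` class, the queue naming scheme, the pool names, and the `backlog_report` helper are hypothetical and exist only to show how a backlogged queue points directly at the tag that lacks resources.

```python
from collections import defaultdict


class PoolRegistry:
    """Many-to-many mapping between tags and resource pools."""

    def __init__(self):
        self._pools_by_tag = defaultdict(set)

    def register(self, pool: str, tags) -> None:
        """Declare that a pool can serve every tag in `tags`."""
        for tag in tags:
            self._pools_by_tag[tag].add(pool)

    def pools_for(self, tag) -> list:
        """Pools able to serve a tag, sorted for stable output."""
        return sorted(self._pools_by_tag.get(tag, set()))


def queue_for(tag) -> str:
    """One tag maps to exactly one queue (naming scheme is an assumption)."""
    component, pipeline = tag
    return f"queue:{component}:{pipeline}"


def backlog_report(depths: dict, registry: PoolRegistry, threshold: int) -> dict:
    """For each queue deeper than the threshold, list the pools serving its tag."""
    return {
        tag: registry.pools_for(tag)
        for tag, depth in depths.items()
        if depth > threshold
    }
```

An empty pool list in the report means no pool serves that tag at all, while a non‑empty list suggests the existing pools are undersized; either way only the queues for that tag are affected.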
Queue Pull Design
Workers request a batch of jobs; the engine iterates over the relevant queues in a round‑robin order until the batch size limit is reached or all queues are empty. Randomized tag ordering reduces lock contention across concurrent workers.
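The pull loop described above can be sketched with in‑memory deques standing in for the real queues. The function name `pull_batch`, its parameters, and the use of a shuffled tag order are assumptions for illustration. A real engine would take a lock or run a transaction per queue, which this sketch omits.

```python
import random
from collections import deque


def pull_batch(queues: dict, tags, batch_size: int, rng=None) -> list:
    """Take up to batch_size jobs, one per queue per pass, in a shuffled tag order."""
    rng = rng or random.Random()
    order = list(tags)
    # Shuffling means concurrent workers tend to start on different queues.
    rng.shuffle(order)

    batch = []
    while len(batch) < batch_size:
        progressed = False
        for tag in order:
            queue = queues.get(tag)
            if queue:  # skip missing or empty queues
                batch.append(queue.popleft())
                progressed = True
                if len(batch) == batch_size:
                    break
        if not progressed:
            break  # every relevant queue is empty
    return batch
```

Only one queue is touched at a time, so a worker never needs to hold several queue locks at once, and taking one job per queue per pass keeps a single busy queue from starving the others.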
Component Layering and Extensibility
Components implement the standard lifecycle methods defined in the System Interaction Layer. Extensions such as job cancellation, manual approval callbacks, or asynchronous result handling are expressed as additional event types in the job‑pull request that Workers send to the engine, leaving the core flow untouched.
Adapter patterns provide default implementations for common scenarios (e.g., Shell scripts) while allowing dynamic injection of custom commands without coupling business logic to the engine.
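One way to picture such an adapter is a class that supplies default lifecycle behavior and accepts the command to run as an injected argument. The `CommandAdapter` name, its method set, the status strings, and the no‑op default command are assumptions for this sketch. The Python interpreter is used as the command here only to keep the example portable; a shell script would be injected the same way.

```python
import subprocess
import sys


class CommandAdapter:
    """Hypothetical adapter: default lifecycle behavior around an injected command."""

    def __init__(self, command=None):
        # Default is a harmless no-op executed with the current interpreter.
        self.command = list(command) if command else [sys.executable, "-c", "pass"]
        self.params = {}
        self.completed = None

    def init(self, params: dict) -> None:
        """Store job parameters; the default does nothing else."""
        self.params = dict(params)

    def run(self) -> None:
        """Run the injected command and keep its exit code and output."""
        self.completed = subprocess.run(
            self.command, capture_output=True, text=True
        )

    def query_result(self) -> str:
        """Map the exit code to an assumed status string."""
        if self.completed is None:
            return "not_run"
        return "success" if self.completed.returncode == 0 else "failed"
```

The engine‑facing methods stay the same whatever command is injected, so business logic can change the command without touching the lifecycle code.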
Future Directions
Explore serverless and other cloud‑native technologies to create lighter, more elastic resource‑management solutions, focusing on elasticity, startup acceleration, and environment isolation.
Provide a one‑stop development‑to‑operation platform for component creators, lowering development and operational costs and fostering a vibrant component ecosystem.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
