How Meituan Scaled Its CI/CD Pipeline Engine to 100k Daily Jobs with 99.99% Success
This article details Meituan's three‑year journey building a self‑developed pipeline engine that now handles nearly 100,000 daily executions with over 99.99% reliability, covering background, challenges, architectural decisions, core scheduling and resource‑pool designs, component layering, and future cloud‑native plans.
Background
Continuous delivery emerged in 2006 and has become essential for improving development efficiency. Traditional pipelines rely on tools like Jenkins or GitLab CI, which Meituan initially adopted to quickly support business needs. As usage grew, limitations such as inconsistent standards, high maintenance costs, and scaling bottlenecks became evident.
Meituan's pipeline infrastructure evolved through three stages: (1) 2014‑2015, a unified Jenkins cluster built to solve common problems; (2) 2016‑2018, a split into multiple Jenkins clusters to relieve single‑cluster bottlenecks, which introduced operational complexity and security risks; (3) 2019‑present, a custom distributed pipeline engine (named Pipeline) built to eliminate single‑machine limits and consolidate infrastructure across all business lines.
Problem & Approach
Business Overview
A pipeline is a directed acyclic graph that processes code through stages such as build, test, and deployment. Components encapsulate reusable tool actions, while jobs represent individual component executions. Resources provide the execution environment for jobs.
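The relationship between pipeline, component, and job can be made concrete with a small sketch. This is only an illustration of the DAG idea, not Meituan's data model: the `Job` class, the status names, and the component names are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    component: str                       # reusable component this job executes
    depends_on: list[str] = field(default_factory=list)
    status: str = "waiting"              # assumed states: waiting, succeeded, failed

def ready_jobs(jobs: dict[str, Job]) -> list[str]:
    """Return names of waiting jobs whose upstream jobs have all succeeded."""
    return sorted(
        name for name, job in jobs.items()
        if job.status == "waiting"
        and all(jobs[dep].status == "succeeded" for dep in job.depends_on)
    )

# A four-job pipeline: build fans out to two checks, which fan in to deploy.
pipeline = {
    "build": Job("build", component="maven-build"),
    "unit-test": Job("unit-test", component="junit", depends_on=["build"]),
    "scan": Job("scan", component="code-scan", depends_on=["build"]),
    "deploy": Job("deploy", component="deploy", depends_on=["unit-test", "scan"]),
}
```

Calling `ready_jobs(pipeline)` initially yields only the build job; once it is marked succeeded, the two downstream checks become ready together and can run in parallel.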
Main Challenges
Scheduling efficiency bottleneck : Short‑lived jobs (seconds to minutes) are sensitive to scheduling latency; existing monolithic schedulers (Jenkins, GitLab CI, Tekton) serialize dispatch, causing queue buildup during peak loads.
Resource allocation : Jobs far outnumber available resources; statically pre‑provisioned resources reduce startup latency but risk under‑utilization when idle and starvation at peak.
Tool heterogeneity : Diverse tools require a plugin‑style architecture that hides implementation differences from pipeline authors.
Solution Overview
Separate scheduling decisions from resource allocation, allowing independent horizontal scaling of each module.
Introduce a resource‑pool model with label‑based matching, enabling flexible sharing and isolation of resources.
Design a layered component architecture (business, system‑interaction, execution‑resource) to standardize interfaces while supporting varied tool integrations.
Overall Architecture
The engine consists of five core modules:
Trigger : Handles various sources (PR, push, API, schedule).
Task Center : Stores pipeline and job state, provides APIs for execution, cancellation, and retries.
Decision Engine : Determines which pending jobs can be scheduled and updates their state.
Worker : Pulls scheduled jobs, allocates execution resources, and reports results.
Component SDK : Wraps component logic, managing initialization, execution, result upload, and status updates.
Core Design Points
Job Scheduling Design
Scheduling is split into two phases: decision and resource allocation. The decision module computes the set of pending jobs that are ready for execution and marks them scheduled. Workers then pull scheduled jobs from label‑matched queues and allocate resources for them. This decoupling lets each phase scale horizontally and lets the scheduling logic evolve independently.
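A minimal in‑memory sketch of the two phases may help. The `TaskCenter` class, the `decide` and `pull` functions, and the status names `pending`, `scheduled`, and `running` are assumptions made for illustration; the real engine persists this state and runs each phase as a separate service.

```python
from collections import defaultdict, deque

class TaskCenter:
    """In-memory stand-in for the job store (hypothetical, for illustration)."""
    def __init__(self):
        self.status = {}                  # job id -> "pending" | "scheduled" | "running"
        self.label = {}                   # job id -> queue label
        self.queues = defaultdict(deque)  # label -> FIFO of scheduled job ids

    def submit(self, job_id, label):
        self.status[job_id] = "pending"
        self.label[job_id] = label

def decide(center, is_ready):
    """Phase 1: mark ready jobs as scheduled and place them on their label queue."""
    for job_id, status in list(center.status.items()):
        if status == "pending" and is_ready(job_id):
            center.status[job_id] = "scheduled"
            center.queues[center.label[job_id]].append(job_id)

def pull(center, worker_labels):
    """Phase 2: a worker takes the next scheduled job from a queue it serves."""
    for label in worker_labels:
        if center.queues[label]:
            job_id = center.queues[label].popleft()
            center.status[job_id] = "running"
            return job_id
    return None

center = TaskCenter()
center.submit("job-a", "default")
center.submit("job-b", "release")
decide(center, is_ready=lambda job_id: job_id == "job-a")
```

Because `decide` never touches resources and `pull` never evaluates readiness, either side can be replicated without changing the other, which is the point of the split.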
Key mechanisms include:
Label‑based queues to isolate high‑priority pipelines.
Optimistic‑lock updates to avoid duplicate decisions.
Compensation tasks that re‑enqueue jobs when failures occur.
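The optimistic‑lock and compensation mechanisms can be sketched with a version column and a conditional `UPDATE`. This uses Python's built‑in `sqlite3` purely as a stand‑in; the table layout, the status names, and the `try_schedule` and `compensate` helpers are assumptions, not the engine's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT, version INTEGER)")
conn.execute("INSERT INTO jobs VALUES ('job-1', 'pending', 0)")
conn.commit()

def try_schedule(conn, job_id, seen_version):
    """Optimistic lock: the update succeeds only if nobody changed the row
    since this decision instance read it at seen_version."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'scheduled', version = version + 1 "
        "WHERE id = ? AND status = 'pending' AND version = ?",
        (job_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1

def compensate(conn, job_id):
    """Compensation: return a scheduled job whose dispatch failed to 'pending'
    so that a later decision round picks it up again."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'pending', version = version + 1 "
        "WHERE id = ? AND status = 'scheduled'",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1

# Two decision instances read version 0 and race; only the first update wins.
first = try_schedule(conn, "job-1", 0)
second = try_schedule(conn, "job-1", 0)
```

The losing instance simply sees zero affected rows and moves on, so no lock is held while decisions are computed.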
Resource‑Pool Design
Multiple queues are linked to resource pools via many‑to‑many label relationships. Pools can be:
Pre‑provisioned public resources for high‑frequency, latency‑sensitive jobs.
On‑demand resources that scale out when pool capacity is insufficient.
External platform resources managed by third‑party services.
Labels are two‑dimensional (component × pipeline), which simplifies mapping jobs to pools while still supporting isolation for critical business scenarios.
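One way to picture two‑dimensional matching is a pool that declares which component and which pipeline it serves, with a wildcard for "any". The sketch below is an assumption‑laden illustration: the `Pool` class, the `*` wildcard, the pool names, and the rule of preferring the most specific pool are not taken from the source.

```python
from dataclasses import dataclass

ANY = "*"  # assumed wildcard meaning "matches every value in this dimension"

@dataclass(frozen=True)
class Pool:
    name: str
    component: str   # component label, or ANY
    pipeline: str    # pipeline label, or ANY

def matching_pools(pools, component, pipeline):
    """Return pools whose two labels both accept the job, most specific first."""
    hits = [
        p for p in pools
        if p.component in (ANY, component) and p.pipeline in (ANY, pipeline)
    ]
    # Fewer wildcards means a more specific (more isolated) pool.
    return sorted(hits, key=lambda p: (p.component == ANY) + (p.pipeline == ANY))

pools = [
    Pool("public", ANY, ANY),                      # shared by everyone
    Pool("build-shared", "build", ANY),            # any pipeline's build jobs
    Pool("payments-build", "build", "payments"),   # isolated for one pipeline
]
```

A build job from the payments pipeline matches all three pools and would try its dedicated pool first, while a test job from another pipeline falls through to the shared public pool.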
Component Layered Design
The component model is divided into three layers:
Business Layer : Adapters for diverse component needs without leaking differences upward.
System Interaction Layer : Uniform API contracts (init, run, queryResult, uploadArtifacts) that hide internal details.
Execution Resource Layer : Supports various execution forms (container images, plugins, standalone services) to accommodate tool heterogeneity.
Standardized interaction follows a template‑method pattern: the engine wraps the mandatory methods in a fixed lifecycle, so component developers only implement business logic.
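The template pattern described above might look roughly like the following. The method names mirror the contract listed earlier (init, run, queryResult, uploadArtifacts), but the `Component` base class, the `execute` wrapper, the result dictionary, and `EchoComponent` are hypothetical and are not the actual Component SDK.

```python
from abc import ABC, abstractmethod

class Component(ABC):
    """Base class a hypothetical SDK could provide; execute() is the template method."""

    def execute(self, params):
        # The lifecycle order is fixed here; subclasses only fill in the steps.
        try:
            self.init(params)
            self.run()
            result = self.query_result()
            self.upload_artifacts(result)
        except Exception as exc:
            return {"status": "failed", "error": str(exc)}
        return {"status": "succeeded", "result": result}

    def init(self, params):
        self.params = params

    @abstractmethod
    def run(self): ...

    @abstractmethod
    def query_result(self): ...

    def upload_artifacts(self, result):
        pass  # optional step; the default is a no-op

class EchoComponent(Component):
    """Toy component: only business logic, no lifecycle or error handling."""
    def run(self):
        self.output = self.params["message"].upper()

    def query_result(self):
        return self.output

outcome = EchoComponent().execute({"message": "build ok"})
```

Status reporting and error handling live in one place, so a failing component (for example, one called without the `message` parameter) is reported as failed without any handling code in the component itself.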
Future Plans
Leverage serverless and other cloud‑native technologies to create lighter, more efficient resource management with fine‑grained elasticity, fast startup, and isolation.
Provide an end‑to‑end platform for component developers, covering development, deployment, and operation to lower entry barriers and foster a vibrant component ecosystem.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
