How China Mobile Zhejiang Built a Private Cloud with MESOS – Key Lessons
This article details China Mobile Zhejiang's journey from early virtualization to a full private‑cloud platform built on MESOS, covering why MESOS was chosen, the evolution of their cloud stages, DCOS implementation, automatic scaling, service discovery, and the operational benefits achieved.
Why Use MESOS
After years of traditional network, middleware, and host operations, Zhejiang Mobile faced massive deployment and high‑availability challenges across more than a hundred systems, prompting a shift toward cloud‑driven IT architecture.
1. Cloud Computing Drives Enterprise IT Evolution
Initial virtualization (IaaS) in 2012 did not solve the core problems; a unified private‑cloud platform was envisioned in 2014‑2015 to break the siloed application architecture.
2. Typical Cloud Platforms
Public‑cloud leaders such as AWS, and private clouds such as Google's Borg‑managed infrastructure, illustrate the benefits of shared resource pools and centralized scheduling.
Zhejiang Mobile Cloud‑Adoption Stages
1. Cloud‑Adoption Timeline
Stage 1: Minicomputer‑based IOE (IBM, Oracle, EMC) architecture with siloed systems.
Stage 2: 2009 – introduction of x86 servers, standardizing hardware and reducing deployment cycles.
Stage 3: Around 2013 – adoption of VMware and later KVM for virtualization.
Stage 4: Distributed application refactoring to enable horizontal scaling.
Despite virtualization, application deployment cycles remained long, and cross‑datacenter migration required extensive networking.
2. Problems Encountered
Key issues included static deployments, elasticity limited to the VM level, lack of automated packaging, multi‑environment deployment overhead, slow scaling during traffic spikes, and low resource utilization.
3. Data‑Center Operating System Concept
Inspired by Google’s Borg, the third‑generation PaaS (data‑center OS) aims to provide datacenter‑level elasticity, automated fault recovery, fine‑grained resource allocation, high utilization, and rapid deployment.
4. Typical Solutions
MESOS (an open‑source cluster manager that originated at UC Berkeley and is often compared to Borg) and YARN serve as resource schedulers; Docker handles process management; Kubernetes, Swarm, and Cloud Foundry offer alternative PaaS platforms.
China Mobile Zhejiang DCOS Practice
1. DCOS Development Timeline
Mar 2014 – started exploring Docker.
Aug 2014 – Docker pilot.
Nov 2014 – migrated core CRM cluster to containers; Docker entered production.
Aug 2015 – proposed data‑center OS, built a validation network using MESOS + Marathon + Docker.
Nov 2015 – DCOS validation network launched; supported the “Double‑11” sales event.
Dec 2015 – CRM application went live on the MESOS platform.
2. MESOS‑Based DCOS Implementation
Resource Scheduling
MESOS uses two‑level scheduling: the slave nodes report their available resources to the master, which offers those resources to frameworks such as Marathon; each framework then decides which offers to accept.
Task Scheduling
Long‑running services are scheduled via Marathon, which launches containers on the resources it has accepted.
Application Packaging
Docker is used to package applications, enabling consistent deployment across the cluster.
Service Discovery & Registration
Containers are registered with HAProxy through Marathon events, allowing dynamic load‑balancing.
DCOS Architecture
The cluster consists of MESOS masters, Marathon, HAProxy, and compute nodes (physical or virtual), deployed across two data centers with hardware load balancers for external traffic.
3. Operational Experience
Automatic Elastic Scaling
An auto‑scaling module monitors business concurrency and other health metrics to adjust container counts in real time.
Marathon‑Etcd Integration for Service Discovery
Etcd is integrated with Marathon events to achieve immediate service registration without per‑host agents.
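The Marathon‑Etcd integration described above can be pictured with a small sketch. It is an illustration under stated assumptions, not the team's actual code: a plain dict stands in for etcd, the `/services/...` key layout and the function names are hypothetical, and the event fields (`eventType`, `appId`, `taskId`, `taskStatus`, `host`, `ports`) follow Marathon's `status_update_event` as commonly documented, so they should be checked against the Marathon version in use.

```python
import json

# Task states after which an instance should stop receiving traffic.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def registry_key(app_id, task_id):
    """Build a hypothetical registry key: /services/<app path>/<task id>."""
    return "/services/" + app_id.strip("/") + "/" + task_id


def apply_event(registry, event):
    """Update the registry (a dict standing in for etcd) from one Marathon event."""
    if event.get("eventType") != "status_update_event":
        return registry  # other event types are ignored in this sketch
    key = registry_key(event["appId"], event["taskId"])
    if event["taskStatus"] == "TASK_RUNNING":
        # Register every host:port pair the task exposes.
        endpoints = [f'{event["host"]}:{port}' for port in event["ports"]]
        registry[key] = json.dumps(endpoints)
    elif event["taskStatus"] in TERMINAL_STATES:
        registry.pop(key, None)  # deregister; tolerate tasks never seen before
    return registry


def backends(registry, app_id):
    """List the endpoints registered for one application, sorted for stable output."""
    prefix = "/services/" + app_id.strip("/") + "/"
    found = []
    for key, value in registry.items():
        if key.startswith(prefix):
            found.extend(json.loads(value))
    return sorted(found)
```

Because registration is driven by events from the scheduler rather than by a process on each host, a load balancer or client only needs to read the registry to learn the current backends; in a real deployment the dict operations would be replaced by writes and deletes against etcd.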
Data‑Center Switching
Applications can be migrated wholesale between data centers for upgrades or disaster recovery, minimizing service disruption.
Benefits of MESOS
Resource Utilization: All applications share a common compute pool, improving overall usage.
Cross‑Data‑Center Scheduling: MESOS operates over a routed Layer 3 network, eliminating the need for additional networking layers.
Elastic Scaling: Scaling time is reduced to the application start‑up duration, often under a minute for lightweight services.
High Availability & Disaster Recovery: The platform isolates application failures, and container‑based deployment minimizes coupling with underlying infrastructure.
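To make the elastic‑scaling behaviour concrete, here is a minimal sketch of the kind of decision an auto‑scaling module could make from observed business concurrency. The article does not describe the real module's logic, so everything here is an assumption: the function names, the per‑instance capacity, the headroom percentage, and the instance limits are all hypothetical. The only external detail relied on is that Marathon's REST API accepts a `PUT` on `/v2/apps/{appId}` with an `instances` field to change the instance count.

```python
def desired_instances(concurrent_requests, per_instance_capacity,
                      min_instances=2, max_instances=50, headroom_percent=20):
    """Return how many containers are needed for the observed concurrency.

    All thresholds are illustrative assumptions, not values from the article.
    """
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    # Integer ceiling division keeps the arithmetic exact.
    numerator = concurrent_requests * (100 + headroom_percent)
    denominator = 100 * per_instance_capacity
    needed = -(-numerator // denominator)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, needed))


def scale_request(app_id, current, target):
    """Describe the Marathon call that would apply the change, or None if unchanged."""
    if target == current:
        return None
    return {
        "method": "PUT",
        "path": "/v2/apps/" + app_id.strip("/"),
        "body": {"instances": target},
    }
```

For example, with a hypothetical capacity of 100 concurrent requests per container, `desired_instances(1010, 100)` asks for 13 containers, and `scale_request("/crm/web", 12, 13)` describes the request that would raise the count. Since new containers only need to start the application, the time to absorb a spike is roughly the application's own start‑up time.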
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
