
How China Mobile Zhejiang Built a Private Cloud with MESOS – Key Lessons

This article details China Mobile Zhejiang's journey from early virtualization to a full private‑cloud platform built on MESOS, covering why MESOS was chosen, the evolution of their cloud stages, DCOS implementation, automatic scaling, service discovery, and the operational benefits achieved.

Efficient Ops

Why Use MESOS

After years of traditional network, middleware, and host operations, Zhejiang Mobile faced massive deployment and high‑availability challenges across more than a hundred systems, prompting a shift toward cloud‑driven IT architecture.

1. Cloud Computing Drives Enterprise IT Evolution

Initial virtualization (IaaS) in 2012 did not solve the core problems; a unified private‑cloud platform was envisioned in 2014‑2015 to break the siloed application architecture.

2. Typical Cloud Platforms

Public clouds such as AWS, and private clouds such as Google’s Borg‑based infrastructure, illustrate the benefits of shared resource pools and centralized scheduling.

Zhejiang Mobile Cloud‑Adoption Stages

1. Cloud‑Adoption Timeline

Stage 1: Minicomputer‑based (IOE: IBM, Oracle, EMC) architecture with siloed systems.

Stage 2: 2009 – introduction of x86 servers, standardizing hardware and reducing deployment cycles.

Stage 3: Around 2013 – adoption of VMware and later KVM for virtualization.

Stage 4: Distributed application refactoring to enable horizontal scaling.

Despite virtualization, application deployment cycles remained long, and cross‑datacenter migration required extensive networking.

2. Problems Encountered

Key issues included static deployments, elasticity limited to the VM level, lack of automated packaging, multi‑environment deployment overhead, slow scaling during traffic spikes, and low resource utilization.

3. Data‑Center Operating System Concept

Inspired by Google’s Borg, the third‑generation PaaS (data‑center OS) aims to provide datacenter‑level elasticity, automated fault recovery, fine‑grained resource allocation, high utilization, and rapid deployment.

4. Typical Solutions

MESOS (an Apache project that began at UC Berkeley and fills a role similar to Borg) and YARN serve as resource schedulers; Docker handles process management; Kubernetes, Swarm, and Cloud Foundry offer alternative PaaS platforms.

China Mobile Zhejiang DCOS Practice

1. DCOS Development Timeline

Mar 2014 – started exploring Docker.

Aug 2014 – Docker pilot.

Nov 2014 – migrated the core CRM cluster to containers; Docker entered production.

Aug 2015 – proposed the data‑center OS and built a validation network using MESOS + Marathon + Docker.

Nov 2015 – DCOS validation network launched; supported the “Double‑11” sales event.

Dec 2015 – CRM application went live on the MESOS platform.

2. MESOS‑Based DCOS Implementation

Resource Scheduling

MESOS uses two‑level scheduling: the slave nodes report their available resources to the master, which offers those resources to frameworks such as Marathon; each framework then decides which offers to accept and which tasks to run on them.

Task Scheduling

Long‑running services are scheduled via Marathon, which launches containers on the resources it has been allocated.

Application Packaging

Docker is used to package applications, enabling consistent deployment across the cluster.

Service Discovery & Registration

Containers are registered with HAProxy in response to Marathon events, so the load‑balancer backends follow container starts and stops dynamically.

DCOS Architecture

The cluster consists of MESOS masters, Marathon, HAProxy, and compute nodes (physical or virtual), deployed across two data centers with hardware load balancers for external traffic.

3. Operational Experience

Automatic Elastic Scaling

An auto‑scaling module monitors business concurrency and other health metrics and adjusts container counts in real time.

Marathon‑Etcd Integration for Service Discovery

Etcd is integrated with Marathon events to achieve immediate service registration without per‑host agents.
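The Marathon event‑driven registration described above can be sketched roughly as follows. The article does not show the team's implementation, so this is only an illustration under stated assumptions: the event fields (`eventType`, `taskStatus`, `appId`, `taskId`, `host`, `ports`) follow the shape of Marathon's `status_update_event` as commonly documented, a plain dictionary stands in for etcd, and the function names `apply_event` and `backends` are hypothetical.

```python
import json

# Task states after which a container must leave the backend pool.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def apply_event(registry, raw_event):
    """Update the registry from one Marathon event (JSON string).

    `registry` is a dict of {app_id: {task_id: "host:port"}} standing in
    for etcd; a real integration would write keys to etcd instead.
    """
    event = json.loads(raw_event)
    if event.get("eventType") != "status_update_event":
        return registry  # ignore deployment and other event types

    tasks = registry.setdefault(event["appId"], {})
    status = event["taskStatus"]
    if status == "TASK_RUNNING" and event.get("ports"):
        # Register the first allocated port as the service endpoint.
        tasks[event["taskId"]] = "{}:{}".format(event["host"], event["ports"][0])
    elif status in TERMINAL_STATES:
        # Deregister as soon as the task is reported gone.
        tasks.pop(event["taskId"], None)
    return registry


def backends(registry, app_id):
    """Return the sorted host:port endpoints registered for an app."""
    return sorted(registry.get(app_id, {}).values())
```

Because registration is driven by events from Marathon itself, nothing has to run on each host to announce containers; a load‑balancer template could be regenerated from `backends(...)` whenever the registry changes.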
Data‑Center Switching

Applications can be migrated wholesale between data centers for upgrades or disaster recovery, minimizing service disruption.

Benefits of MESOS

Resource Utilization: All applications share a common compute pool, improving overall usage.

Cross‑Data‑Center Scheduling: MESOS operates over an ordinary Layer 3 (routed) network between data centers, eliminating the need for additional networking layers.

Elastic Scaling: Scaling time is reduced to the application start‑up duration, often under a minute for lightweight services.

High Availability & Disaster Recovery: The platform isolates application failures, and container‑based deployment minimizes coupling with the underlying infrastructure.
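The elastic scaling described in this article, where container counts follow business concurrency, can be illustrated with a minimal sketch. The article does not publish the scaling rule, so everything here is an assumption: the per‑container capacity, the instance bounds, and the names `desired_instances` and `scale_request` are hypothetical. The request shape reflects Marathon's REST API as generally documented (updating `instances` via `PUT /v2/apps/{appId}`); the sketch only builds the request and does not send it.

```python
import math


def desired_instances(concurrent_requests, per_container_capacity,
                      min_instances=2, max_instances=50):
    """Return how many containers are needed for the observed concurrency.

    The result is clamped to [min_instances, max_instances] so a metric
    glitch can neither scale the app to zero nor exhaust the pool.
    """
    if per_container_capacity <= 0:
        raise ValueError("per_container_capacity must be positive")
    needed = math.ceil(concurrent_requests / per_container_capacity)
    return max(min_instances, min(max_instances, needed))


def scale_request(app_id, current_instances, concurrent_requests,
                  per_container_capacity):
    """Build the scaling call, or return None when no change is needed."""
    target = desired_instances(concurrent_requests, per_container_capacity)
    if target == current_instances:
        return None
    return {
        "method": "PUT",
        "path": "/v2/apps/" + app_id.strip("/"),
        "body": {"instances": target},
    }
```

For example, `scale_request("/crm/web", 4, 900, 100)` asks for 9 instances, while the same call with 9 instances already running returns `None`. A production module would presumably also smooth the metric and add a cool‑down between changes to avoid oscillation.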
