
Inside UCloud’s Compute Factory: Scaling VMs and Containers with Mesos

UCloud’s Compute Factory enables rapid provisioning of massive VM resources for compute‑intensive services by leveraging a Mesos‑based resource management platform that unifies multi‑region data centers, supports both VMs and containers, and addresses challenges in scheduling, networking, storage, and operational reliability.

UCloud Tech

Abstract

To meet the demands of rendering, genome sequencing and other compute‑intensive services, UCloud launched the “Compute Factory” product, allowing users to quickly create large numbers of virtual machines. The product is powered by a Mesos‑based resource management system, whose architecture, usage, solutions, and challenges are described below.

Business Requirements

The platform must satisfy two main requirements:

Support both virtual machines and containers. Containers are popular, but some workloads require strict security isolation or run on Windows, which VMs provide.

Integrate resources across multiple regions and data centers, aggregating idle partner resources while minimizing operational costs.

In short, a unified platform is needed to encapsulate compute resources from many data centers and expose them as VMs, containers, or other forms.

Why Choose Mesos

Mesos is an Apache open‑source distributed resource manager that acts as the kernel of a distributed system. It abstracts a data center into a pool of resources (CPU, memory, storage, GPU, etc.) and offers high scalability with a simple, modular design. By implementing custom Frameworks and Executors, UCloud can package resources as VMs, containers, and more. Although Mesos is widely used for container orchestration, using it to manage VMs in production is novel, and this article shares UCloud’s experience.

Mesos Overview

Mesos follows a Master‑Agent architecture. The Master performs global scheduling and exposes APIs, while Agents run on each node, executing tasks via Executors and reporting status.

Mesos provides a two‑level scheduling model:

The Master schedules resources among Frameworks.

Each Framework implements its own internal scheduling for specific workloads.

Architecture Design

The overall architecture consists of:

Multiple Mesos clusters per IDC.

Cluster Server interacting with Mesos Master and Frameworks, handling internal scheduling and task dispatch.

Frameworks such as VM Scheduler (for VMs) and Marathon (for Docker containers).

VM Framework’s Executor built on libvirt to create, delete, start, stop, and snapshot VMs.

API Server that aggregates status, performs cross‑cluster scheduling, and exposes APIs to the UCloud console.

API Gateway providing external API access.

HTTP‑Based Communication

All internal communication uses HTTP. Mesos components communicate via libprocess, which implements an actor model where each actor listens for HTTP requests. Business components such as API Server and Cluster Server expose RESTful APIs. HTTP is chosen for its simplicity, reliability, and ease of debugging.

VM Scheduler

The VM Scheduler Framework receives resource offers from the Mesos Master. When a VM task arrives, the Cluster Server forwards the task description to the VM Scheduler, which matches it against an available offer and creates a Task that the Master dispatches to the chosen Agent. Two task types are supported:

Create/delete a VM with specified image, network, and storage.

Operate a VM (power on/off, reboot, snapshot) via Framework messages to the VM Executor.

VM Executor

The Executor is a custom component that manages the VM lifecycle. It receives Tasks, generates libvirt configuration files, and invokes libvirt to perform creation, deletion, power operations, and snapshotting. It also reports detailed VM states (e.g., booting, shutting down) back to the Scheduler through heartbeat messages, which are then propagated to the API Server and persisted in the database.
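As a minimal sketch of the configuration step, the function below builds a libvirt domain XML document from a task description. The function name, its parameters, and the choice of a KVM guest with a qcow2 disk and a bridged virtio NIC are assumptions made for illustration; the actual Executor's template is not described in this article.

```python
import xml.etree.ElementTree as ET


def build_domain_xml(name, vcpus, memory_mib, image_path, bridge):
    """Return a libvirt domain definition as an XML string.

    Hypothetical helper: the field choices are illustrative, not UCloud's.
    """
    domain = ET.Element("domain", {"type": "kvm"})
    ET.SubElement(domain, "name").text = name
    ET.SubElement(domain, "memory", {"unit": "MiB"}).text = str(memory_mib)
    ET.SubElement(domain, "vcpu").text = str(vcpus)

    os_el = ET.SubElement(domain, "os")
    ET.SubElement(os_el, "type", {"arch": "x86_64"}).text = "hvm"

    devices = ET.SubElement(domain, "devices")

    # Boot disk backed by an image file.
    disk = ET.SubElement(devices, "disk", {"type": "file", "device": "disk"})
    ET.SubElement(disk, "driver", {"name": "qemu", "type": "qcow2"})
    ET.SubElement(disk, "source", {"file": image_path})
    ET.SubElement(disk, "target", {"dev": "vda", "bus": "virtio"})

    # Network interface attached to a host bridge (one bridge per subnet is assumed).
    nic = ET.SubElement(devices, "interface", {"type": "bridge"})
    ET.SubElement(nic, "source", {"bridge": bridge})
    ET.SubElement(nic, "model", {"type": "virtio"})

    return ET.tostring(domain, encoding="unicode")


if __name__ == "__main__":
    print(build_domain_xml("vm-demo", 2, 4096, "/images/base.qcow2", "br0"))
```

With the libvirt Python bindings, a string like this could then be handed to a connection's `defineXML` method to register the VM; that step is left out here so the sketch runs without a hypervisor.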

VM Scheduling

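The Mesos Master allocates resources among Frameworks using the Dominant Resource Fairness (DRF) algorithm; each Framework then decides which Tasks to launch on the resources it is offered. The process involves: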

Agents report their available resources to the Master.

The Master offers resources to Frameworks.

Frameworks decide how many Tasks to launch.

The Master instructs Agents to start the Tasks.
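To make the DRF idea concrete, here is a simplified sketch, not the Mesos allocator itself. It repeatedly gives one task's worth of resources to the framework with the lowest dominant share, where a framework's dominant share is its largest fractional use of any single resource. The `Framework` class, the fixed per-task demand, and the function names are assumptions for illustration; real Mesos sends offers and lets Frameworks accept or decline them.

```python
from dataclasses import dataclass, field


@dataclass
class Framework:
    name: str
    demand: dict                      # resources needed per task
    allocated: dict = field(default_factory=dict)
    tasks: int = 0


def dominant_share(fw, total):
    """Largest fraction of any one resource that this framework holds."""
    return max(fw.allocated.get(r, 0) / total[r] for r in total)


def drf_allocate(frameworks, total):
    """Hand out tasks one at a time to the framework with the lowest dominant share."""
    used = {r: 0 for r in total}
    # Ignore frameworks that ask for nothing, so the loop always terminates.
    active = [f for f in frameworks if any(f.demand.get(r, 0) > 0 for r in total)]
    while active:
        fw = min(active, key=lambda f: dominant_share(f, total))
        fits = all(used[r] + fw.demand.get(r, 0) <= total[r] for r in total)
        if not fits:
            active.remove(fw)         # this framework cannot place another task
            continue
        for r in total:
            amount = fw.demand.get(r, 0)
            used[r] += amount
            fw.allocated[r] = fw.allocated.get(r, 0) + amount
        fw.tasks += 1
    return {f.name: f.tasks for f in frameworks}


if __name__ == "__main__":
    total = {"cpus": 9, "mem": 18}
    a = Framework("A", {"cpus": 1, "mem": 4})   # memory-heavy tasks
    b = Framework("B", {"cpus": 3, "mem": 1})   # CPU-heavy tasks
    print(drf_allocate([a, b], total))
```

In this example framework A ends with three tasks and B with two, so both hold a dominant share of two thirds: A of the memory, B of the CPUs.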

For VM provisioning, scheduling occurs in two stages: selecting a suitable cluster based on resource needs, and then performing intra‑cluster scheduling to allocate VMs according to a resource plan.
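The two stages can be sketched as follows, under simplifying assumptions: every VM in a request has the same size, each cluster is a mapping from node name to free resources, stage one picks the first cluster with enough aggregate room, and stage two fills nodes first-fit. All names and the first-fit policy are hypothetical; the article does not describe the actual selection criteria.

```python
def vms_that_fit(node_free, vm_need):
    """How many VMs of size vm_need fit into one node's free resources."""
    return min(node_free.get(r, 0) // amount for r, amount in vm_need.items())


def select_cluster(clusters, vm_need, count):
    """Stage 1: return the first cluster that can host all requested VMs."""
    for name, nodes in clusters.items():
        capacity = sum(vms_that_fit(free, vm_need) for free in nodes.values())
        if capacity >= count:
            return name
    return None


def place_vms(nodes, vm_need, count):
    """Stage 2: build a resource plan {node: number of VMs} using first-fit."""
    plan = {}
    remaining = count
    for node, free in nodes.items():
        n = min(vms_that_fit(free, vm_need), remaining)
        if n > 0:
            plan[node] = n
            remaining -= n
        if remaining == 0:
            break
    if remaining:
        raise ValueError("cluster cannot host the requested number of VMs")
    return plan


if __name__ == "__main__":
    clusters = {
        "idc-a": {"n1": {"cpus": 4, "mem": 8}},
        "idc-b": {"n1": {"cpus": 8, "mem": 32}, "n2": {"cpus": 16, "mem": 64}},
    }
    need = {"cpus": 4, "mem": 8}
    chosen = select_cluster(clusters, need, 5)
    print(chosen, place_vms(clusters[chosen], need, 5))
```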

Resource Identification

Mesos distinguishes between resources and attributes. Resources (CPU, memory, disk, GPU, ports) are quantities that Tasks consume, while attributes are key‑value tags that describe other characteristics of a node, such as CPU model, SSD size, rack location, or external IP availability. Frameworks compare each Resource Offer against a task's requirements, checking both resources and attributes, to make placement decisions.
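A minimal sketch of this matching step, assuming offers and requirements are plain dictionaries: an offer qualifies when it has at least the requested amount of every resource and carries exactly the requested value for every attribute. The dictionary layout and the attribute names (`cpu_model`, `rack`, `external_ip`) are hypothetical and do not reflect the Mesos wire format.

```python
def offer_matches(offer, requirement):
    """True if the offer satisfies both the resource and attribute constraints."""
    enough_resources = all(
        offer["resources"].get(name, 0) >= amount
        for name, amount in requirement["resources"].items()
    )
    attributes_ok = all(
        offer["attributes"].get(key) == value
        for key, value in requirement.get("attributes", {}).items()
    )
    return enough_resources and attributes_ok


def first_matching_offer(offers, requirement):
    """Return the first offer that fits, or None so the caller can wait for more offers."""
    for offer in offers:
        if offer_matches(offer, requirement):
            return offer
    return None


if __name__ == "__main__":
    offers = [
        {"agent": "agent-1",
         "resources": {"cpus": 8, "mem": 16384},
         "attributes": {"rack": "r1", "external_ip": "false"}},
        {"agent": "agent-2",
         "resources": {"cpus": 16, "mem": 65536, "gpus": 2},
         "attributes": {"rack": "r2", "external_ip": "true", "cpu_model": "x"}},
    ]
    task = {"resources": {"cpus": 4, "mem": 8192, "gpus": 1},
            "attributes": {"external_ip": "true"}}
    print(first_matching_offer(offers, task)["agent"])
```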

Image, Storage, and Network Management

Base images are stored in a GlusterFS‑backed distributed storage system. Users can create custom images, and shared storage is provided via GlusterFS for workloads that require it. Networking is isolated per user through multiple subnets; each VM must specify the subnet it belongs to.

Other Issues

During operation, several problems were encountered:

Marathon leader election failures under high I/O load, traced to a Mesos driver bug; resolved by adding defensive code that restarts Marathon when ZooKeeper becomes unstable.

go‑marathon library limitations, including lack of multi‑node support and timeout handling; addressed by forking the library, using a separate HTTP client for SSE, and adding explicit timeouts.
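The multi-node and timeout fixes follow a common client pattern, sketched here in Python instead of the Go used by go‑marathon. The sketch tries each Marathon endpoint in turn with an explicit per-request timeout and moves on when one fails. The function names and the injectable `fetch` parameter are assumptions for illustration and are not part of any Marathon client library.

```python
import urllib.request


def _http_get(url, timeout):
    """Default fetcher: a plain GET with an explicit timeout in seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def get_with_failover(endpoints, path, timeout=5.0, fetch=_http_get):
    """Try each endpoint in order and return the first successful response body."""
    last_error = None
    for base in endpoints:
        url = base.rstrip("/") + path
        try:
            return fetch(url, timeout)
        except OSError as exc:
            # URLError and socket timeouts are OSError subclasses; try the next node.
            last_error = exc
    raise ConnectionError(f"all {len(endpoints)} endpoints failed") from last_error
```

A long-lived SSE event stream would use its own client without a short read timeout, which is one reason to keep the two HTTP clients separate.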

Conclusion

Mesos is widely used within UCloud for products such as “Compute Factory” and UDocker, as well as the internal VM management platform. Continuous practice has deepened UCloud’s understanding and control of Mesos, enabling reliable, scalable cloud resource management.
