
Mesos Architecture and Its Deployment at Qunar: Framework Unification and Operational Strategies

This article explains the Mesos distributed system kernel, its master‑slave architecture, fine‑grained resource scheduling, and how Qunar leverages Mesos and Marathon for log processing, Spark, Alluxio, and multi‑tenant services while addressing framework unification, HA, service discovery, and operational challenges.


Mesos is described as a distributed system kernel that follows the same design principles as the Linux kernel but operates at a higher abstraction level, providing resource management and task scheduling for applications such as Hadoop, Spark, Kafka, and Elasticsearch across all servers in a data center.

Originally launched in 2009 as a Berkeley research project and later adopted by Twitter and Airbnb, Mesos consists of a Master that registers slaves and framework schedulers and allocates resources, and Slaves that execute tasks on behalf of frameworks.

The resource allocation workflow is illustrated step by step:

1. a slave reports its free resources to the master;
2. the master offers those resources to a framework;
3. the framework's scheduler replies with the specific CPU and memory slices it wants;
4. the master dispatches the resulting tasks to the slave's executor.
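This two-level offer cycle can be sketched as a toy simulation; all class and method names below are illustrative, not the real Mesos API:

```python
from dataclasses import dataclass, field

@dataclass
class Offer:
    slave_id: str
    cpus: float
    mem: int  # MB

@dataclass
class Master:
    slaves: dict = field(default_factory=dict)   # slave_id -> free resources
    launched: list = field(default_factory=list)

    def report(self, slave_id, cpus, mem):
        # Step 1: a slave reports its free resources to the master.
        self.slaves[slave_id] = (cpus, mem)

    def offer_to(self, scheduler):
        # Step 2: the master offers those resources to a framework scheduler.
        for slave_id, (cpus, mem) in self.slaves.items():
            tasks = scheduler.resource_offer(Offer(slave_id, cpus, mem))
            # Step 4: the master dispatches accepted tasks to the slave's executor.
            for name, t_cpus, t_mem in tasks:
                self.launched.append((slave_id, name, t_cpus, t_mem))

class Scheduler:
    def resource_offer(self, offer):
        # Step 3: the scheduler answers an offer with the specific CPU and
        # memory slices it wants, one slice per task.
        tasks = []
        if offer.cpus >= 1 and offer.mem >= 512:
            tasks.append(("task-1", 1.0, 512))
        if offer.cpus >= 2 and offer.mem >= 1024:
            tasks.append(("task-2", 1.0, 512))
        return tasks

master = Master()
master.report("slave-1", cpus=4.0, mem=2048)
master.offer_to(Scheduler())
print(master.launched)
```

The point of the split is that the master only brokers resources; the per-framework scheduling policy lives entirely in the framework, which is what lets Hadoop, Spark, and Marathon share one cluster.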

Mesos enables fine‑grained resource distribution, in contrast to coarse‑grained allocation, and Marathon is highlighted as a Mesos framework that runs long‑lived services, exposes a REST API, and integrates with HAProxy for service discovery and load balancing.
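As a sketch of what driving Marathon's REST API looks like: `/v2/apps` is Marathon's standard app-creation endpoint, but the Marathon address, app id, and command below are placeholders.

```python
import json
from urllib import request

MARATHON = "http://marathon.example.com:8080"  # hypothetical address

def app_definition(app_id, cmd, cpus, mem, instances):
    """Build a minimal Marathon app definition."""
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": cpus,        # CPU shares per instance
        "mem": mem,          # MB per instance
        "instances": instances,
    }

def create_app(app):
    # POST the definition; Marathon schedules it on Mesos and restarts
    # instances that die, which is what "long-lived service" means here.
    req = request.Request(
        MARATHON + "/v2/apps",
        data=json.dumps(app).encode(),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)

app = app_definition("/logs/indexer", "python3 indexer.py",
                     cpus=0.5, mem=256, instances=3)
# create_app(app)  # uncomment against a real Marathon endpoint
print(json.dumps(app))
```

Service discovery then works by having a bridge script regenerate the HAProxy config from Marathon's task list, so clients reach instances through a stable proxy port rather than tracking Mesos task placements themselves.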

Qunar’s experience is detailed: Mesos has been used since version 0.22 for data‑analysis workloads, with Marathon versions 0.8–0.11 in production and a recommendation to upgrade to 1.1, since earlier releases carry a persistent‑volume bug. Spark, Alluxio, etcd, and HDFS run on Mesos, with specific considerations for persistent storage and SSD‑aware scheduling.
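For the stateful services mentioned (etcd, HDFS), the relevant Marathon 1.x feature is local persistent volumes. A sketch of such an app definition, with field names following Marathon's app schema but a made-up app id and sizes:

```python
import json

app = {
    "id": "/storage/etcd-node",
    "cmd": "./etcd --data-dir data",
    "cpus": 1.0,
    "mem": 1024,
    "instances": 1,
    "container": {
        "type": "MESOS",
        "volumes": [
            {
                "containerPath": "data",        # path visible to the task
                "mode": "RW",
                "persistent": {"size": 10240},  # MB; survives task restarts
            }
        ],
    },
    # Keep a lost task pinned to its volume instead of relaunching elsewhere,
    # which is what makes the volume useful for stateful services.
    "residency": {"taskLostBehavior": "WAIT_FOREVER"},
}

print(json.dumps(app, indent=2))
```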

The article discusses two main operational questions: whether all frameworks can be unified under Marathon and whether framework nesting is worthwhile, exploring the trade‑offs of custom frameworks versus Marathon’s built‑in monitoring, HA, and API support.

Challenges such as abnormal task recovery, message control, service discovery, lack of monitoring in custom frameworks, and multi‑tenant resource allocation are examined, leading to the proposal of a “Root Framework” that centralizes framework management, improves HA, simplifies service discovery, and reduces operational overhead.

Finally, the piece outlines additional optimizations like fail‑over timeouts, dynamic resource reservations, and hierarchical framework deployment to support large‑scale, multi‑tenant environments.
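The dynamic-reservation idea can be sketched against the Mesos master's `/master/reserve` endpoint (available since Mesos 0.25); the master address, agent id, role, and principal here are placeholders, and the resource JSON follows Mesos's documented reservation format.

```python
import json
from urllib import parse, request

MASTER = "http://mesos-master.example.com:5050"  # hypothetical address

def reservation(role, principal, cpus, mem):
    """Resources to reserve for one role, in the master's JSON format."""
    def scalar(name, value):
        return {
            "name": name,
            "type": "SCALAR",
            "scalar": {"value": value},
            "role": role,
            "reservation": {"principal": principal},
        }
    return [scalar("cpus", cpus), scalar("mem", mem)]

def reserve(slave_id, resources):
    # The endpoint takes form-encoded slaveId and resources fields.
    body = parse.urlencode({
        "slaveId": slave_id,
        "resources": json.dumps(resources),
    }).encode()
    return request.urlopen(MASTER + "/master/reserve", data=body)

res = reservation("spark", "ops", cpus=8, mem=16384)
# reserve("agent-1234", res)  # uncomment against a real master
print(json.dumps(res))
```

Reserving resources per role this way is one mechanism for the multi-tenant allocation the article discusses: each tenant's framework registers under a role and is guaranteed its reserved slice even when other frameworks are hungry.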

Tags: big data, operations, framework, resource scheduling, cluster management, Mesos, Marathon
Written by High Availability Architecture

Official account for High Availability Architecture.
