Mesos Architecture and Its Practical Use at Qunar: Framework Unification and Operational Insights

This article explains the Mesos distributed system kernel, its resource‑allocation workflow, and how Qunar engineers applied and evolved Mesos, Marathon, and custom frameworks to achieve fine‑grained scheduling, high availability, service discovery, and multi‑tenant management in a large‑scale production environment.

Qunar Tech Salon

Mesos is a distributed systems kernel: it follows the same design principles as the Linux kernel but works at a higher level of abstraction. It gives applications such as Hadoop, Spark, Kafka, and Elasticsearch APIs for managing resources and scheduling tasks across an entire data center.

Mesos began in 2009 as a research project at UC Berkeley. Twitter later adopted it, and it now runs in production there and at companies such as Airbnb.

The architecture has three parts. The Master registers slaves and framework schedulers and performs resource allocation. Each Slave receives tasks from the Master and runs them. Frameworks (also called applications) supply the tasks that run on the slaves.

Resource allocation proceeds in four steps. First, a slave reports its free resources to the Master. Second, the Master offers those resources to a framework. Third, the framework's scheduler replies with the tasks it wants to run and the slice of the offer each task needs. Fourth, the Master forwards the tasks to the slave, which launches them.
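The offer cycle can be sketched in a few lines of plain Python. This is a toy model rather than the Mesos scheduler API: the `Task`, `Slave`, `Framework`, and `Master` classes and the `offer_cycle` method are hypothetical names chosen for illustration, and the model handles a single slave with CPU and memory only.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cpus: float
    mem: int  # megabytes


@dataclass
class Slave:
    name: str
    cpus: float
    mem: int
    running: list = field(default_factory=list)


class Framework:
    """Hypothetical framework scheduler with a queue of pending tasks."""

    def __init__(self, name, pending):
        self.name = name
        self.pending = list(pending)

    def resource_offer(self, cpus, mem):
        """Step 3: accept the pending tasks that fit inside the offer."""
        accepted = []
        for task in list(self.pending):
            if task.cpus <= cpus and task.mem <= mem:
                accepted.append(task)
                self.pending.remove(task)
                cpus -= task.cpus
                mem -= task.mem
        return accepted


class Master:
    """Toy master that offers one slave's free resources to each framework in turn."""

    def __init__(self, frameworks):
        self.frameworks = frameworks

    def offer_cycle(self, slave):
        # Step 1 is implicit: the slave object carries its free resources.
        for fw in self.frameworks:
            tasks = fw.resource_offer(slave.cpus, slave.mem)  # steps 2 and 3
            for task in tasks:  # step 4: launch on the slave
                slave.cpus -= task.cpus
                slave.mem -= task.mem
                slave.running.append((fw.name, task.name))
        return slave.cpus, slave.mem


slave = Slave("s1", cpus=4, mem=4096)
spark = Framework("spark", [Task("t1", 2, 1024), Task("t2", 1, 1024)])
marathon = Framework("marathon", [Task("web", 1, 512), Task("big", 2, 512)])
free = Master([spark, marathon]).offer_cycle(slave)
print(slave.running, free)
```

Because whatever the first framework leaves unused is offered to the next one, tasks from both frameworks end up on the same slave, and the task that does not fit simply stays pending until a later offer.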

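Because resources are offered in slices rather than as whole machines, Mesos allocates at a fine grain: tasks from different frameworks can share a single slave. A coarse-grained scheme, by contrast, dedicates entire machines to one framework at a time.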

Marathon is a Mesos framework for long-running services (e.g., web applications). It acts as a distributed init system and provides a REST API, along with HAProxy-based service discovery and load balancing.
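As a rough illustration of the REST API, the sketch below builds (but does not send) a request that would create an application through Marathon's `/v2/apps` endpoint. The host name `marathon.example.com`, the app id, and the command are placeholders invented for this example, and the helper function names are hypothetical.

```python
import json
import urllib.request

# Hypothetical Marathon address; replace with a real endpoint before use.
MARATHON_URL = "http://marathon.example.com:8080"


def build_app(app_id, cmd, cpus, mem, instances):
    """Return a minimal Marathon app definition as a plain dict."""
    return {"id": app_id, "cmd": cmd, "cpus": cpus, "mem": mem, "instances": instances}


def build_create_request(app, base_url=MARATHON_URL):
    """Build the POST request for /v2/apps without sending it."""
    body = json.dumps(app).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v2/apps",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


app = build_app("/demo/web", "python3 -m http.server $PORT0", cpus=0.5, mem=128, instances=2)
request = build_create_request(app)
print(request.get_method(), request.full_url)
# To actually submit it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the definition easy to inspect or version before it reaches a live cluster.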

Qunar has used Mesos for real-time log processing and for running Spark, Alluxio, and Elasticsearch. Along the way the team upgraded Marathon (0.8 → 0.11 → 1.1) to work around persistence bugs.

Key operational challenges identified were:

Framework unification: Whether all applications could be managed by Marathon.

Framework nesting: The cost‑benefit of nesting frameworks.

Custom frameworks provide precise scheduling but require extra development for monitoring, HA, and state persistence. Marathon offers richer monitoring and API‑driven automation, making it attractive for standardization.

To simplify operations, Qunar introduced a "Root Framework" (a dedicated Marathon instance) that manages all other frameworks, providing automatic failover, centralized service discovery, and reduced operational overhead.
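One way to picture the Root Framework is as a Marathon instance whose "apps" are themselves framework schedulers. The sketch below is an assumption about what such definitions might look like, not Qunar's actual configuration: the `/frameworks` group path, the start commands, and the resource figures are all hypothetical. The health check block uses Marathon's standard health check fields so that an unhealthy scheduler would be restarted.

```python
def framework_app(name, cmd, cpus, mem, health_path="/ping"):
    """Describe one framework scheduler as a Marathon app (hypothetical values)."""
    return {
        "id": f"/frameworks/{name}",
        "cmd": cmd,
        "cpus": cpus,
        "mem": mem,
        "instances": 1,  # one scheduler; Marathon replaces it if it dies
        "ports": [0],  # ask Marathon for one dynamically assigned port
        "healthChecks": [
            {
                "protocol": "HTTP",
                "path": health_path,
                "portIndex": 0,
                "gracePeriodSeconds": 300,
                "intervalSeconds": 30,
                "timeoutSeconds": 10,
                "maxConsecutiveFailures": 3,
            }
        ],
    }


def root_framework_group(frameworks):
    """Collect the definitions a root Marathon would own into one group."""
    return {"id": "/frameworks", "apps": [framework_app(**fw) for fw in frameworks]}


group = root_framework_group(
    [
        # Both commands are placeholders, not real launcher scripts.
        {"name": "marathon-web", "cmd": "./start-child-marathon.sh", "cpus": 1, "mem": 1024},
        {"name": "log-framework", "cmd": "./start-log-framework.sh", "cpus": 0.5, "mem": 512},
    ]
)
print([app["id"] for app in group["apps"]])
```

Treating each scheduler as an ordinary app is what lets the root instance supply failover and discovery for free: the child frameworks inherit whatever Marathon already does for any service.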

Service discovery works at two layers. At layer 4, events from Marathon's event bus feed etcd, and confd regenerates the HAProxy configuration. At layer 7, Qunar uses Bamboo with HAProxy, and OpenResty with Lua.
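The layer-4 path can be approximated in plain Python. In this sketch an in-memory dict stands in for etcd and a render function stands in for the confd template; both are simplifications, and the function names are hypothetical. The event shape is modeled on Marathon's `status_update_event` (fields `appId`, `taskId`, `taskStatus`, `host`, `ports`), which should be checked against the Marathon version in use.

```python
def apply_event(table, event):
    """Update the app -> tasks table from one event-bus message."""
    if event.get("eventType") != "status_update_event":
        return table  # ignore deployment and other event types
    tasks = table.setdefault(event["appId"], {})
    if event["taskStatus"] == "TASK_RUNNING":
        tasks[event["taskId"]] = {"host": event["host"], "ports": event["ports"]}
    else:
        # Any other status (failed, killed, lost, ...) means the task is not serving.
        tasks.pop(event["taskId"], None)
    return table


def render_haproxy_listen(app_id, service_port, tasks):
    """Render one TCP-mode (layer-4) HAProxy listen section for an app."""
    name = app_id.strip("/").replace("/", "_")
    lines = [
        f"listen {name}",
        f"    bind *:{service_port}",
        "    mode tcp",
        "    balance roundrobin",
    ]
    ordered = sorted(tasks, key=lambda t: (t["host"], t["ports"][0]))
    for index, task in enumerate(ordered):
        lines.append(f"    server {name}-{index} {task['host']}:{task['ports'][0]} check")
    return "\n".join(lines) + "\n"


table = {}
events = [
    {"eventType": "status_update_event", "appId": "/web", "taskId": "web.1",
     "taskStatus": "TASK_RUNNING", "host": "10.0.0.5", "ports": [31001]},
    {"eventType": "status_update_event", "appId": "/web", "taskId": "web.2",
     "taskStatus": "TASK_RUNNING", "host": "10.0.0.6", "ports": [31002]},
    {"eventType": "status_update_event", "appId": "/web", "taskId": "web.1",
     "taskStatus": "TASK_FAILED", "host": "10.0.0.5", "ports": [31001]},
]
for event in events:
    apply_event(table, event)
print(render_haproxy_listen("/web", 10000, table["/web"].values()))
```

In a real deployment the rendered section would be written to the HAProxy configuration and followed by a reload, which is the role confd plays in the pipeline described above.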

Multi‑tenant support, dynamic resource reservation, and hierarchical framework nesting were also addressed, enabling scalable, resilient deployment of thousands of applications.

Overall, the article shares practical lessons on deploying Mesos at scale, unifying frameworks under Marathon, and improving reliability and manageability in a large‑scale production environment.

Tags: Distributed Systems, operations, framework, resource allocation, cluster management, Mesos, Marathon
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
