Designing a Private Job Cloud on Mesos: Lessons from Dangdang’s Elastic‑Job‑Cloud
This article details Dangdang’s experience building a private job‑cloud platform using Mesos and a custom framework, covering technology selection, key features, integration challenges with Elastic‑Job, resource scheduling, and practical recommendations for Mesos‑based Java workloads.
Background and Motivation
In 2016, Dangdang’s architecture team needed to migrate its business platforms to a private cloud with low risk and controllable effort. The existing distributed scheduling framework, Elastic‑Job, was widely used but relied on IP‑based sharding and lacked cloud‑native features, so it could not run in a private cloud without heavy networking work.
Technology Stack Selection
Key requirements for the new platform were:
Full compatibility with Elastic‑Job’s API.
Support for both long‑running (resident) and short‑lived (ephemeral) jobs.
Fine‑grained resource control and automated deployment.
Kubernetes’s multi‑scheduler feature was immature at the time, so the team adopted the two‑level scheduling model of Apache Mesos, which could satisfy these requirements more easily. The resulting platform, Elastic‑Job‑Cloud, reuses Elastic‑Job’s API while introducing a custom Mesos Framework.
Architecture Overview
Elastic‑Job‑Cloud consists of two main components:
Scheduler: a Mesos Framework that receives resource offers and dispatches jobs.
Customized Executor: runs job JARs and reuses a Spring container to avoid repeated initialization.
Unlike Elastic‑Job‑Lite, which uses ZooKeeper for coordination, Elastic‑Job‑Cloud no longer uses ZooKeeper as a registration center. Instead, it relies on the Mesos Framework statusUpdate API for high availability, re‑sharding, and failover. ZooKeeper is retained only for persisting job metadata and queues.
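How status updates can drive failover is sketched below. This is a minimal, self-contained illustration and not the Elastic‑Job‑Cloud implementation: the class `StatusUpdateFailover` and its methods are hypothetical, the enum only mirrors a few Mesos task state names, and the choice of which states trigger a re-queue is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: reacts to task states without calling the Mesos API.
class StatusUpdateFailover {
    // Subset of Mesos task state names, redeclared locally for illustration.
    enum State { TASK_RUNNING, TASK_FINISHED, TASK_KILLED, TASK_FAILED, TASK_LOST, TASK_ERROR }

    private final Set<String> running = new HashSet<>();
    private final Deque<String> failoverQueue = new ArrayDeque<>();

    // Would be invoked from the framework's statusUpdate callback.
    void onStatusUpdate(String taskId, State state) {
        switch (state) {
            case TASK_RUNNING:
                running.add(taskId);
                break;
            case TASK_FINISHED:
            case TASK_KILLED:
                running.remove(taskId);
                break;
            case TASK_FAILED:
            case TASK_LOST:
            case TASK_ERROR:
                // Assumption: abnormal termination means the shard is rescheduled.
                running.remove(taskId);
                if (!failoverQueue.contains(taskId)) {
                    failoverQueue.add(taskId);
                }
                break;
            default:
                break;
        }
    }

    int runningCount() { return running.size(); }

    // Returns the next task to reschedule, or null if none is waiting.
    String nextFailover() { return failoverQueue.poll(); }

    public static void main(String[] args) {
        StatusUpdateFailover handler = new StatusUpdateFailover();
        handler.onStatusUpdate("demoJob@-@0", State.TASK_RUNNING);
        handler.onStatusUpdate("demoJob@-@0", State.TASK_LOST);
        System.out.println("reschedule: " + handler.nextFailover());
    }
}
```

Because the scheduler learns about every state change from Mesos, no job instance has to register itself in a coordination service for the scheduler to detect that it died.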
Sharding and Resource Allocation
Sharding logic is centralized: the Scheduler converts each shard into a Mesos TaskInfo object, which eliminates IP‑based conflicts. Resource matching is delegated to the open‑source Fenzo library, which provides sophisticated Mesos resource‑matching strategies. Be aware of a known memory‑leak issue in Fenzo when tasks are not properly released: each allocated task must be explicitly removed using its TaskID.
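The two ideas in this section, one task per shard and explicit release by TaskID, can be sketched as follows. This is a self-contained illustration that does not call Mesos or Fenzo. The class `ShardTaskTracker`, its methods, and the `jobName@-@shardIndex` ID format are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: models shard-to-task conversion and release bookkeeping only.
class ShardTaskTracker {
    // taskId -> hostname the task was assigned to
    private final Map<String, String> assigned = new HashMap<>();

    // Produces one task ID per shard; the ID format is an assumption.
    static List<String> toTaskIds(String jobName, int shardCount) {
        List<String> ids = new ArrayList<>();
        for (int shard = 0; shard < shardCount; shard++) {
            ids.add(jobName + "@-@" + shard);
        }
        return ids;
    }

    void assign(String taskId, String hostname) {
        assigned.put(taskId, hostname);
    }

    // Must be called for every task that reaches a terminal state;
    // otherwise entries accumulate, which is the leak pattern described above.
    boolean release(String taskId) {
        return assigned.remove(taskId) != null;
    }

    int trackedCount() {
        return assigned.size();
    }

    public static void main(String[] args) {
        ShardTaskTracker tracker = new ShardTaskTracker();
        for (String id : toTaskIds("demoJob", 3)) {
            tracker.assign(id, "agent-1");
        }
        tracker.release("demoJob@-@0");
        System.out.println("still tracked: " + tracker.trackedCount());
    }
}
```

Since shards are identified by task ID rather than by the IP address of the machine running them, two shards of the same job can land on one host without conflict.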
Scheduling Model
The platform maintains two queues:
Offer queue: stores Mesos resource offers.
Job queue: stores pending jobs.
A job is dispatched only when a suitable offer is available. Because offers are collected asynchronously from job submission, offers can accumulate in the queue when demand is low; in exchange, offer collection and job dispatch stay decoupled.
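The interaction between the two queues can be sketched as a simple first-fit pass. This is an illustration only: in the platform itself the matching is delegated to Fenzo, and the names `TwoQueueMatcher`, `Offer`, and `Job` are hypothetical. The sketch also ignores offer expiry and declining unused offers, which a real framework has to handle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of matching a job queue against an offer queue.
class TwoQueueMatcher {
    static final class Offer {
        final String host;
        double cpus;
        double memMb;
        Offer(String host, double cpus, double memMb) {
            this.host = host; this.cpus = cpus; this.memMb = memMb;
        }
    }

    static final class Job {
        final String name;
        final double cpus;
        final double memMb;
        Job(String name, double cpus, double memMb) {
            this.name = name; this.cpus = cpus; this.memMb = memMb;
        }
    }

    private final Deque<Offer> offers = new ArrayDeque<>();
    private final Deque<Job> jobs = new ArrayDeque<>();

    void addOffer(Offer offer) { offers.add(offer); }
    void addJob(Job job) { jobs.add(job); }

    // For each pending job in FIFO order, take the first offer with enough
    // remaining CPU and memory. Jobs that fit nowhere stay queued.
    List<String> matchOnce() {
        List<String> launched = new ArrayList<>();
        Iterator<Job> pending = jobs.iterator();
        while (pending.hasNext()) {
            Job job = pending.next();
            for (Offer offer : offers) {
                if (offer.cpus >= job.cpus && offer.memMb >= job.memMb) {
                    offer.cpus -= job.cpus;
                    offer.memMb -= job.memMb;
                    launched.add(job.name + "->" + offer.host);
                    pending.remove();
                    break;
                }
            }
        }
        return launched;
    }

    int pendingJobs() { return jobs.size(); }
    int heldOffers() { return offers.size(); }

    public static void main(String[] args) {
        TwoQueueMatcher matcher = new TwoQueueMatcher();
        matcher.addOffer(new Offer("agent-1", 2.0, 1024.0));
        matcher.addJob(new Job("jobA", 1.0, 512.0));
        System.out.println(matcher.matchOnce());
    }
}
```

With no pending jobs, `matchOnce()` launches nothing and the offers simply stay in their queue, which is the accumulation effect described above.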
Executor Reuse Strategy
Executor instances are created per JAR rather than per job. All jobs that share the same JAR (even with different names or cron expressions) reuse a single Executor, dramatically reducing memory and CPU overhead. The Spring context is instantiated once at the first job launch and reused for subsequent executions.
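A per-JAR executor cache along these lines can be sketched as below. This is a minimal illustration under stated assumptions, not the platform's code: `ExecutorRegistry` and `JarExecutor` are hypothetical names, and a plain `Object` stands in for the Spring context.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one executor per JAR URL, shared by all jobs in that JAR.
class ExecutorRegistry {
    static final class JarExecutor {
        final String jarUrl;
        private Object context;   // stand-in for a Spring application context
        private int contextInits; // counts how often the context was built

        JarExecutor(String jarUrl) { this.jarUrl = jarUrl; }

        // Builds the context on the first job launch and reuses it afterwards.
        synchronized String run(String jobName) {
            if (context == null) {
                context = new Object();
                contextInits++;
            }
            return jobName + " ran in executor for " + jarUrl;
        }

        synchronized int contextInits() { return contextInits; }
    }

    private final Map<String, JarExecutor> byJar = new ConcurrentHashMap<>();
    private final AtomicInteger created = new AtomicInteger();

    // The job name does not affect the lookup: the JAR URL is the only key.
    JarExecutor executorFor(String jobName, String jarUrl) {
        return byJar.computeIfAbsent(jarUrl, url -> {
            created.incrementAndGet();
            return new JarExecutor(url);
        });
    }

    int createdCount() { return created.get(); }

    public static void main(String[] args) {
        ExecutorRegistry registry = new ExecutorRegistry();
        registry.executorFor("hourlyJob", "http://repo.example/app.jar").run("hourlyJob");
        registry.executorFor("dailyJob", "http://repo.example/app.jar").run("dailyJob");
        System.out.println("executors created: " + registry.createdCount());
    }
}
```

Keying on the JAR rather than the job means the cost of starting a JVM and building the context is paid once per deployed artifact, however many jobs it contains.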
Challenges and Future Work
Mesos’s default registration does not support etcd; adding etcd as an optional configuration store is planned.
A one‑stop resource allocation model similar to Fenzo would simplify framework development.
Improved JVM‑friendly support in Mesos is needed for better Java service integration.
Future enhancements include migrating job metadata and queue state to etcd, supporting dynamic versus static Executor modes, and extending job types to include Shell and Dataflow jobs.
Open‑Source Release
The source code for both Elastic‑Job‑Lite and Elastic‑Job‑Cloud is available at https://github.com/dangdangdotcom/elastic-job. The repository contains separate branches for the Lite and Cloud implementations and is actively maintained.
Operational Metrics
Initial development involved 2 core engineers; the full platform later involved 8 contributors.
Security relies on the company’s SSO integration; no additional hardening was added.
The job cloud runs approximately 4,000 job instances on a small cluster (≈10 machines, including Mesos masters, executors, and ZooKeeper nodes).
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
