Project Eru: Scaling a Custom Docker Orchestration Platform to 10k Nodes
Project Eru, a homegrown Docker‑based orchestration system developed at Mango TV, replaces earlier PaaS attempts with a stateless, scalable core‑and‑agent architecture. It leverages Redis clusters, MacVLAN networking, and fine‑grained CPU allocation to scale rapidly and automatically across thousands of containers.
Background
The discussion originated from a weekly "Operations Talk" group where experts share experiences about large‑scale infrastructure. The speaker, formerly of Douban App Engine, describes how difficulties with Python runtime isolation and dependency conflicts led to an interest in Docker.
Early Docker Experiments
Initial attempts involved modifying CPython and sys.path, which proved costly. The team instead split runtime dependencies into separate packages to minimize contamination. A diagram (shown below) illustrates the early dependency‑splitting approach.
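The dependency‑splitting idea can be sketched in a few lines: give each application its own package directory and put it at the front of `sys.path`, so apps resolve their own dependency versions before touching the shared interpreter packages. The directory layout below is hypothetical, not the team's actual convention.

```python
import sys

def isolate_app_path(app_name, base="/srv/apps"):
    """Prepend an app-specific package directory to sys.path.

    Hypothetical layout: each app keeps its dependencies under its
    own site-packages directory, so one app's pinned versions never
    contaminate another's. Idempotent: a path is added only once.
    """
    private = f"{base}/{app_name}/site-packages"
    if private not in sys.path:
        sys.path.insert(0, private)
    return private
```

In practice virtualenvs achieve the same isolation; the point of the sketch is only that per‑app paths avoid patching CPython itself.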
From NBE to Project Eru
After a first‑generation PaaS called NBE (Nebulium Engine) that used Docker for isolation, the team recognized limitations in resource control and scaling. In late 2014 they revisited concepts from Borg and Omega, launching the second‑generation platform—Project Eru—designed as a service‑orchestration and scheduling system rather than a traditional PaaS.
Eru can run both offline and online services, allocate CPU in fine‑grained increments (e.g., 0.1, 0.01 cores), and use Redis as a message bus to monitor container states.
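Fractional core counts like 0.1 or 0.01 map naturally onto the kernel's CFS bandwidth knobs, which Docker exposes as `--cpu-period` and `--cpu-quota`. The sketch below shows that arithmetic; whether Eru uses exactly this mapping is an assumption.

```python
def cpu_fraction_to_quota(cores, period_us=100_000):
    """Translate a fractional core count (e.g. 0.1) into a CFS
    cpu-quota in microseconds for a given cpu-period.

    This mirrors Docker's --cpu-quota/--cpu-period semantics: a
    quota of period_us * cores lets the cgroup run that fraction
    of one CPU per scheduling period.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return int(cores * period_us)
```

With the default 100 ms period, 0.1 cores becomes a quota of 10,000 µs and 0.01 cores becomes 1,000 µs.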
Core and Agent Architecture
Eru consists of two loosely coupled components:
Agent: runs on each host, reports container status, and performs low‑level operations (e.g., veth management) via a private Redis Cluster.
Core: a stateless logical core that controls Docker daemons across hosts and interacts with Agents.
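The Agent's reporting role can be sketched as a small payload builder plus a Redis publish. The field names and channel are illustrative; the article only says agents report container status over a private Redis Cluster.

```python
import json
import time

def build_status_report(hostname, container_id, state):
    """Assemble the status payload an agent might publish.

    Field names ("host", "container", "state", "ts") are invented
    for illustration; sort_keys keeps the wire format stable.
    """
    return json.dumps({
        "host": hostname,
        "container": container_id,
        "state": state,
        "ts": int(time.time()),
    }, sort_keys=True)

# Publishing side (requires a reachable Redis; shown for shape only):
# import redis
# r = redis.Redis(host="core-bus.internal")
# r.publish("eru:status", build_status_report("node-17", "abc123", "running"))
```

Because the Core is stateless, any Core instance can consume these reports, which is what lets the control plane scale horizontally.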
Networking Choice: MacVLAN
After evaluating tunnel‑based solutions (Weave, OVS) and routing‑based solutions (Calico, MacVLAN), the team selected MacVLAN for its performance, simplicity, and ability to apply layer‑2 QoS and security policies.
Storage Strategy
The platform primarily uses devicemapper for container storage, with a smaller portion using OverlayFS. Tests showed OverlayFS offers better performance for small files, though its atomicity guarantees differ from devicemapper's.
Resource Allocation and Scaling
CPU is the primary scheduling dimension. Each container receives a “fragment” core (e.g., 0.1 CPU) and a share of a full core, allowing elastic usage. Memory is allocated proportionally to host capacity (e.g., 0.5 CPU and ~1 GB per Redis container). Scaling decisions are delegated to business teams via monitoring data stored in InfluxDB (later migrated to Open‑Falcon) and custom APIs.
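Fragment‑based accounting can be illustrated with a greedy placement sketch: track each host's free CPU in tenths of a core and place containers wherever enough fragments remain. The real Eru scheduler is not shown in the article; this is only a minimal model of 0.1‑core bookkeeping.

```python
def place_containers(hosts, need_fragments, count):
    """Greedy placement sketch.

    hosts: dict mapping host name -> free CPU fragments (tenths
    of a core). Places `count` containers, each needing
    `need_fragments`, and returns {host: containers placed}.
    Raises if the pool cannot satisfy the request.
    """
    placement = {}
    for name in sorted(hosts):
        while count > 0 and hosts[name] >= need_fragments:
            hosts[name] -= need_fragments
            placement[name] = placement.get(name, 0) + 1
            count -= 1
    if count:
        raise RuntimeError("not enough CPU fragments in the pool")
    return placement
```

A production scheduler would also weigh memory, spread replicas across hosts, and avoid mutating its input, but the fragment arithmetic is the same.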
Under a "who monitors, who decides" principle, the team that watches a service's metrics also decides when to scale it: the platform exposes APIs for dynamic scaling without imposing rigid policies.
Service Discovery and Security
Containers within the same logical subnet are reachable via an internal DNS built on Dnscache and Skydns. Firewall rules are applied at layer‑2, ensuring that only containers in the same subnet can communicate, providing a simple security model.
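The same‑subnet reachability rule can be approximated with the standard `ipaddress` module: two container IPs may talk only if they fall in the same network. The /24 prefix below is an assumption; the article does not state the subnet sizes used.

```python
import ipaddress

def same_subnet(ip_a, ip_b, prefix=24):
    """Return True if two container IPs share a subnet.

    Approximates the article's layer-2 policy (only containers in
    the same logical subnet can communicate). The prefix length is
    illustrative, not taken from the deployment.
    """
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix}").network
    return net_a == net_b
```

In the real deployment this check is enforced by layer‑2 firewall rules rather than application code, which is what keeps the security model simple.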
Redis clusters are exposed through Eru’s broadcasting mechanism; scaling actions trigger API calls that automatically add or remove instances, achieving near‑millisecond response times.
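A scaling trigger of this kind reduces to a small decision function: compare demand against per‑instance capacity and return how many instances to add or remove. The capacity model and minimum‑instance floor below are invented for illustration; the article only says scaling actions call Eru's APIs.

```python
import math

def scale_decision(current, qps_per_instance, target_qps, min_instances=2):
    """Return the instance delta (+N to add, -N to remove).

    Hypothetical model: each Redis instance absorbs a fixed QPS,
    and the cluster never shrinks below min_instances.
    """
    wanted = max(min_instances, math.ceil(target_qps / qps_per_instance))
    return wanted - current
```

The returned delta would then drive calls into Eru's deploy/remove APIs, with the broadcasting mechanism announcing the new instance set to clients.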
Performance Highlights
In tests with 10,000 hosts, a full scheduling decision completes in about one second. The system also supports a "Public Server" mode that monitors macro‑level host resources without binding specific CPU or memory, useful for CI pipelines and image builds.
Conclusion
Project Eru demonstrates how a custom, stateless, Docker‑centric orchestration platform can achieve large‑scale, fine‑grained resource management while remaining flexible enough for diverse business needs. All source code is publicly available on GitHub.