Building Scalable Development Environments with Docker, Mesos, and Kubernetes: Lessons Learned
This article details a year‑long journey of designing, deploying, and operating container‑based development environments using Docker, Apache Mesos, and Kubernetes, covering the challenges of version consistency, rapid environment switching, resource isolation, and the practical solutions and lessons gathered from real‑world production use.
Background
At the beginning of the year, colleagues from the ticketing business line requested a Docker environment to accelerate development and testing, prompting the OpsDev team to explore container solutions.
For a system with dozens of rapidly iterating modules, establishing a stable development and self‑test environment proved difficult due to the need for VMs, profiles, Jenkins jobs, deployments, and service dependencies.
Operationally, maintaining such environments was cumbersome: configuration files required manual edits when switching environments, multiple versioned environments were costly to maintain, cross‑team coordination was needed for integration, and ensuring version consistency across modules was a major pain point.
Key Problems Identified
Version consistency across code, configuration, and database schema.
Fast switching among multiple environments.
Service dependency handling to enable new developers to deploy complex stacks easily.
Simplified maintenance, e.g., automatic inclusion of new projects.
Low learning curve to let developers focus on business logic.
Environment isolation, ideally one complete environment per developer.
Temporary Solution with Docker‑Compose
The business line built a temporary environment using docker‑compose, but it required manual version management and Nginx forwarding, exposing further issues such as resource limits on physical hosts, port conflicts when scaling, continuous integration of databases, and fixed container IPs.
Seeking a Sustainable Solution
After reviewing existing container orchestration platforms, the team focused on Apache Mesos and Google Kubernetes. Kubernetes’ pod and service concepts matched business needs, while Mesos offered flexible resource management. Both were tested in parallel.
Pilot Project: ELK‑Based Log Platform
The team chose an ELK‑based logging platform as a pilot for Mesos + Docker. Logstash and Kibana were containerized: Kibana is stateless, and Logstash handles SIGTERM gracefully. Elasticsearch was kept outside the Mesos cluster for persistence.
Marathon and Chronos scheduled Logstash, Kibana, and related monitoring containers.
Data ingestion used rsyslog for system logs, Flume for business logs, and Heka/Fluentd for container logs, all funneling into a Kafka cluster before being processed by Logstash and stored in Elasticsearch. The platform grew to handle 60 billion log entries (≈6 TB) per day.
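To put the daily volume in perspective, the quoted figures can be turned into per-entry and per-second averages. The sketch below is plain arithmetic on the two numbers above; it assumes "6 TB" means decimal terabytes and that traffic is spread evenly over the day, which real log traffic rarely is.

```python
# Back-of-the-envelope averages derived from the figures quoted in the article.
# Assumption: "6 TB" is decimal (10**12 bytes) and load is uniform over 24 hours.
ENTRIES_PER_DAY = 60_000_000_000
BYTES_PER_DAY = 6 * 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def daily_averages(entries_per_day: int, bytes_per_day: int) -> dict:
    """Return the average entry size and the sustained per-second rates."""
    return {
        "bytes_per_entry": bytes_per_day / entries_per_day,
        "entries_per_second": entries_per_day / SECONDS_PER_DAY,
        "megabytes_per_second": bytes_per_day / SECONDS_PER_DAY / 10**6,
    }


if __name__ == "__main__":
    for name, value in daily_averages(ENTRIES_PER_DAY, BYTES_PER_DAY).items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the averages come out at roughly 100 bytes per entry and just under 700,000 entries per second, so peak-hour rates would be higher still.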
Problems and Experience Summary
1. Daemon OOM
Docker 1.6’s attach interface leaked memory, causing the daemon to OOM when stdout produced many logs.
fatal error: runtime: out of memory
runtime stack:
runtime.SysMap(0xc2c9760000, 0x7f310000, 0x7f453c96b000, 0x13624f8)
/usr/local/go/src/runtime/mem_linux.c:149 +0x98
runtime.MHeap_SysAlloc(0x1367be0, 0x7f310000, 0x43b8f2)
/usr/local/go/src/runtime/malloc.c:284 +0x124
runtime.MHeap_Alloc(0x1367be0, 0x3f986, 0x10100000000, 0x0)
/usr/local/go/src/runtime/mheap.c:240 +0x66
...
The fix involved using runsv to restart the daemon, adjusting oom_adj to -15, and ultimately upgrading Docker.
2. Heka DockerEventInput Socket Leak
The go‑dockerclient used by Heka had a bug that left sockets open after Heka exited, leading to file‑descriptor leaks.
time="2015-09-30T15:25:00.254779538+08:00" level=error msg="attach: stdout: write unix @: broken pipe"
time="2015-09-30T15:25:00.254883039+08:00" level=error msg="attach: stdout: write unix @: broken pipe"
time="2015-09-30T15:25:00.256959458+08:00" level=error msg="attach: stdout: write unix @: broken pipe"
Reference: https://github.com/fsouza/go-dockerclient/issues/202.
3. Pre‑warming New Slaves
To avoid slow first‑time image pulls, the team pre‑pulled common images via Salt/Ansible scripts and noted Marathon’s lack of auto‑scaling for monitoring containers.
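A pre-warming step of this kind could look like the sketch below, which a configuration-management tool might run on each new slave. The registry host and image names are hypothetical placeholders, not the team's actual values; only the `docker pull` command itself is standard. Prefixing every image with the private registry also makes explicit which registry is being pulled from.

```python
import subprocess
from typing import List

# Hypothetical values for illustration; substitute the real registry and images.
REGISTRY = "registry.example.internal:5000"
COMMON_IMAGES = ["base/os:latest", "logstash:latest", "kibana:latest"]


def build_pull_commands(registry: str, images: List[str]) -> List[List[str]]:
    """Return one `docker pull` argv per image, prefixed with the private registry."""
    return [["docker", "pull", f"{registry}/{image}"] for image in images]


def prewarm(registry: str, images: List[str], dry_run: bool = True) -> List[str]:
    """Pull every image and return the references that failed.

    In dry-run mode the commands are only printed, so nothing can fail.
    """
    failed = []
    for argv in build_pull_commands(registry, images):
        if dry_run:
            print(" ".join(argv))
            continue
        if subprocess.run(argv).returncode != 0:
            failed.append(argv[-1])
    return failed


if __name__ == "__main__":
    prewarm(REGISTRY, COMMON_IMAGES, dry_run=True)
```

Running with `dry_run=False` on a freshly provisioned slave would populate its local image cache before the node starts accepting tasks.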
4. Distribution‑Induced Daemon Crash
After upgrading to Docker 1.7.1, a mis‑configured Marathon registry caused the daemon to pull the official image instead of the private one, leading to crashes.
5. Mesos Resource Preemption
Mesos 0.23.0 introduced resource preemption (still not recommended for production). The team allocated static CPU quotas per role (e.g., logstash 32 CPU, ops 4 CPU) to simulate over‑commitment.
MESOS_resources="cpus(logstash):32;"
MESOS_resources="${MESOS_resources}cpus(common):4;"
MESOS_resources="${MESOS_resources}cpus(kibana):4;"
MESOS_resources="${MESOS_resources}cpus(ops):4;"
MESOS_resources="${MESOS_resources}cpus(spark):16;"
MESOS_resources="${MESOS_resources}cpus(storm):16;"
MESOS_resources="${MESOS_resources}cpus(rebuild):32;"
MESOS_resources="${MESOS_resources}cpus(mysos):16;"
MESOS_resources="${MESOS_resources}cpus(others):16;"
MESOS_resources="${MESOS_resources}cpus(universe):1;"
MESOS_resources="${MESOS_resources}cpus(test):8;"
MESOS_resources="${MESOS_resources}mem(*):126976;ports(*):[8000-32000]"
Scheduling was based on time windows (e.g., low‑traffic night hours for Spark jobs).
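The per-role reservations above can be totalled to see how far the static quotas exceed what a host physically offers. The following sketch parses the same resource string and sums the CPU reservations; the `physical_cores` argument is an assumed input, since the article does not state the core count of the slaves.

```python
import re
from collections import defaultdict

# The resource string assembled by the MESOS_resources lines in the article.
RESOURCES = (
    "cpus(logstash):32;cpus(common):4;cpus(kibana):4;cpus(ops):4;"
    "cpus(spark):16;cpus(storm):16;cpus(rebuild):32;cpus(mysos):16;"
    "cpus(others):16;cpus(universe):1;cpus(test):8;"
    "mem(*):126976;ports(*):[8000-32000]"
)

CPU_ITEM = re.compile(r"cpus\(([^)]+)\):(\d+(?:\.\d+)?)")


def cpus_by_role(resources: str) -> dict:
    """Sum the CPU reservation for each role; non-CPU items are ignored."""
    totals = defaultdict(float)
    for item in resources.split(";"):
        match = CPU_ITEM.fullmatch(item.strip())
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


def overcommit_ratio(resources: str, physical_cores: int) -> float:
    """Reserved CPUs divided by physical cores (assumed input, not from the article)."""
    return sum(cpus_by_role(resources).values()) / physical_cores


if __name__ == "__main__":
    per_role = cpus_by_role(RESOURCES)
    print("reserved CPUs:", sum(per_role.values()))
    print("ratio on a hypothetical 64-core host:", overcommit_ratio(RESOURCES, 64))
```

The reservations add up to 149 CPUs, so any host with fewer cores than that is over-committed by design, which only works when the roles do not peak at the same time.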
6. Version Upgrade Strategy
Upgrading Mesos and Docker required a whitelist‑based rolling upgrade: remove a node from the whitelist, stop its containers, stop Docker and Mesos daemons, upgrade, restart, and re‑add the node.
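The upgrade sequence can be sketched as a loop that takes exactly one node out of the whitelist at a time. This is an illustrative model, not the team's tooling: `run_step` is a hypothetical callback that would execute each step on the node (for example over SSH or Salt), and the whitelist is modelled as an in-memory set rather than the file the master actually reads.

```python
from typing import Callable, List, Set

# Steps performed while a node is out of the whitelist, in the order described above.
UPGRADE_STEPS = (
    "stop containers",
    "stop docker and mesos daemons",
    "upgrade packages",
    "restart daemons",
)


def rolling_upgrade(nodes: List[str], run_step: Callable[[str, str], None]) -> Set[str]:
    """Upgrade nodes one by one and return the final whitelist.

    `run_step(node, step)` is a hypothetical hook that performs the real work.
    """
    whitelist = set(nodes)
    for node in nodes:
        # Removing the node first is meant to keep new tasks from landing on it.
        whitelist.discard(node)
        run_step(node, "remove from whitelist")
        for step in UPGRADE_STEPS:
            run_step(node, step)
        # Only re-add the node once its daemons are running again.
        whitelist.add(node)
        run_step(node, "re-add to whitelist")
    return whitelist


if __name__ == "__main__":
    rolling_upgrade(["slave-1", "slave-2"], lambda node, step: print(f"{node}: {step}"))
```

Because each node is re-added before the next one is removed, cluster capacity never drops by more than one node during the upgrade.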
Rapid Development‑Environment Rebuild
Based on the logging platform experience, three major evolution stages were implemented:
OpenStack + nova‑docker + VLAN – containers acted like VMs with independent IPs.
Mesos + Marathon + Docker (--net=host) + random ports – shifted to service‑centric deployment.
Mesos + Marathon + Docker + Calico – provided fixed IPs and seamless intra‑cluster communication.
Stage 1: Containers as Virtual Machines
OpenStack’s nova‑docker supplied most VM‑like features; containers ran salt‑minion and sshd for debugging.
Stage 2: Service‑Centric Deployment
Service trees were refined, dependencies stored in QAECI, and deployments became parallel/serial based on topological sorting. Code and config were fetched at container start, ports were randomized and injected via environment variables, and logs were routed to stdout/stderr with Heka tagging.
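One way to derive a parallel/serial deployment order from a dependency graph is to group services into "waves" using a level-by-level topological sort: everything in a wave can start in parallel, and waves run one after another. The sketch below illustrates that idea only; the service names are made up, and how QAECI actually stores or resolves dependencies is not described in the article.

```python
from typing import Dict, List, Set


def deployment_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group services into waves; `deps` maps a service to the services it needs.

    Services within one wave have no unmet dependencies and can start in parallel.
    """
    remaining = {svc: set(needed) for svc, needed in deps.items()}
    # Services that appear only as dependencies still have to be deployed.
    for needed in deps.values():
        for dep in needed:
            remaining.setdefault(dep, set())

    waves = []
    while remaining:
        ready = sorted(svc for svc, needed in remaining.items() if not needed)
        if not ready:
            raise ValueError("dependency cycle among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        for svc in ready:
            del remaining[svc]
        for needed in remaining.values():
            needed.difference_update(ready)
    return waves


if __name__ == "__main__":
    # Hypothetical service tree for illustration.
    example = {"web": {"api"}, "api": {"db", "cache"}, "db": set(), "cache": set()}
    print(deployment_waves(example))
```

For the example graph, `db` and `cache` start together, then `api`, then `web`; a cycle raises an error instead of deploying in an undefined order.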
Stage 3: Fixed IP with Calico
Calico was introduced to assign static IPs and integrate with Mesos via the slave’s --modules and --isolation flags.
./bin/mesos-slave.sh --master=master_ip:port --namespaces='network' \
--modules=file://path/to/slave_gssapi.json \
--isolation="com_mesosphere_mesos_MetaswitchNetworkIsolator" \
--executor_environment_variables='{"DOCKER_HOST": "localhost:2377"}'
{
  "libraries": [
    {
      "file": "/path/to/libmetaswitch_network_isolator.so",
      "modules": [
        {
          "name": "com_mesosphere_mesos_MetaswitchNetworkIsolator",
          "parameters": [
            { "key": "initialization_command", "value": "python /path/to/initialization_script.py arg1 arg2" },
            { "key": "cleanup_command", "value": "python /path/to/cleanup_script.py arg1 arg2" }
          ]
        }
      ]
    }
  ]
}
Environment variables CALICO_IP and CALICO_PROFILE were added to Marathon tasks, and container names/IPs were registered in the internal DNSDB for company‑wide access.
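A Marathon app definition carrying these two variables might be built as in the sketch below. The app id, image, IP address, and profile name are hypothetical placeholders; only the CALICO_IP and CALICO_PROFILE variable names come from the article, and the remaining fields follow Marathon's usual app JSON layout.

```python
import json


def marathon_app(app_id: str, image: str, calico_ip: str, calico_profile: str,
                 cpus: float = 0.5, mem: int = 512) -> dict:
    """Build a Marathon app definition that requests a fixed Calico IP."""
    return {
        "id": app_id,
        "cpus": cpus,
        "mem": mem,
        # A fixed IP can belong to only one container, so a single instance is assumed.
        "instances": 1,
        "container": {"type": "DOCKER", "docker": {"image": image}},
        "env": {"CALICO_IP": calico_ip, "CALICO_PROFILE": calico_profile},
    }


if __name__ == "__main__":
    # All values below are illustrative placeholders.
    app = marathon_app(
        app_id="/dev/alice/order-service",
        image="registry.example.internal:5000/order-service:latest",
        calico_ip="192.0.2.10",
        calico_profile="dev-alice",
    )
    print(json.dumps(app, indent=2))
```

The resulting JSON is what would be submitted to Marathon; the same name and IP pair is what would then be registered in DNS.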
Summary
After nearly a year of production use, the team accumulated many lessons: Mesos proved stable and scalable, but its scheduling strategies are still relatively simple and rely on framework‑level logic. Future work includes exploring Swarm on Mesos to combine Docker’s native clustering with Mesos’ resource management.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
