Building Scalable Development Environments with Docker, Mesos, and Kubernetes: Lessons Learned

This article details a year‑long journey of designing, deploying, and operating container‑based development environments using Docker, Apache Mesos, and Kubernetes, covering the challenges of version consistency, rapid environment switching, resource isolation, and the practical solutions and lessons gathered from real‑world production use.

Qunar Tech Salon

Dec 14, 2015

Background

At the beginning of the year, colleagues from the ticketing business line requested a Docker environment to accelerate development and testing, prompting the OpsDev team to explore container solutions.

For a system with dozens of rapidly iterating modules, establishing a stable development and self‑test environment proved difficult due to the need for VMs, profiles, Jenkins jobs, deployments, and service dependencies.

Operationally, maintaining such environments was cumbersome: configuration files required manual edits when switching environments, multiple versioned environments were costly to maintain, cross‑team coordination was needed for integration, and ensuring version consistency across modules was a major pain point.

Key Problems Identified

Version consistency across code, configuration, and database schema.

Fast switching among multiple environments.

Service dependency handling to enable new developers to deploy complex stacks easily.

Simplified maintenance, e.g., automatic inclusion of new projects.

Low learning curve to let developers focus on business logic.

Environment isolation, ideally one complete environment per developer.

Temporary Solution with Docker‑Compose

The business line built a temporary environment using docker‑compose, but it required manual version management and Nginx forwarding, exposing further issues such as resource limits on physical hosts, port conflicts when scaling, continuous integration of databases, and fixed container IPs.

Seeking a Sustainable Solution

After reviewing existing container orchestration platforms, the team focused on Apache Mesos and Google Kubernetes. Kubernetes’ pod and service concepts matched business needs, while Mesos offered flexible resource management. Both were tested in parallel.

Pilot Project: ELK‑Based Log Platform

The team chose an ELK-based logging platform as a pilot for Mesos + Docker. Logstash and Kibana were containerized: Kibana is stateless, and Logstash shuts down gracefully on SIGTERM, so both can be rescheduled freely. Elasticsearch was kept outside the Mesos cluster because it holds persistent data.

Marathon and Chronos scheduled Logstash, Kibana, and related monitoring containers.

Data ingestion used rsyslog for system logs, Flume for business logs, and Heka/Fluentd for container logs, all funneling into a Kafka cluster before being processed by Logstash and stored in Elasticsearch. The platform grew to handle 60 billion log entries (≈6 TB) per day.
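As a rough sanity check on those volume figures, the average entry size and ingest rate can be derived from the two numbers quoted above. This is back-of-the-envelope arithmetic only: it assumes decimal terabytes and a load spread evenly over the day, neither of which the article states.

```python
# Back-of-the-envelope figures for the log platform's daily volume.
ENTRIES_PER_DAY = 60_000_000_000   # 60 billion entries per day (from the article)
BYTES_PER_DAY = 6 * 10**12         # about 6 TB per day, assuming decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60


def average_entry_size(entries: int, total_bytes: int) -> float:
    """Mean size of one log entry in bytes."""
    return total_bytes / entries


def average_rate(amount: int, seconds: int = SECONDS_PER_DAY) -> float:
    """Mean amount per second, assuming an even load across the period."""
    return amount / seconds


if __name__ == "__main__":
    print(f"bytes per entry:    {average_entry_size(ENTRIES_PER_DAY, BYTES_PER_DAY):.0f}")
    print(f"entries per second: {average_rate(ENTRIES_PER_DAY):,.0f}")
    print(f"MB per second:      {average_rate(BYTES_PER_DAY) / 1e6:.1f}")
```

Real traffic is bursty, so peak rates would be higher than these averages.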

Problems and Experience Summary

1. Daemon OOM

Docker 1.6’s attach interface leaked memory, causing the daemon to OOM when stdout produced many logs.

fatal error: runtime: out of memory

runtime stack:
runtime.SysMap(0xc2c9760000, 0x7f310000, 0x7f453c96b000, 0x13624f8)
        /usr/local/go/src/runtime/mem_linux.c:149 +0x98
runtime.MHeap_SysAlloc(0x1367be0, 0x7f310000, 0x43b8f2)
        /usr/local/go/src/runtime/malloc.c:284 +0x124
runtime.MHeap_Alloc(0x1367be0, 0x3f986, 0x10100000000, 0x0)
        /usr/local/go/src/runtime/mheap.c:240 +0x66
...

The fix involved using runsv to restart the daemon, adjusting oom_adj to -15, and ultimately upgrading Docker.

2. Heka DockerEventInput Socket Leak

The go‑dockerclient used by Heka had a bug that left sockets open after Heka exited, leading to file‑descriptor leaks.

time="2015-09-30T15:25:00.254779538+08:00" level=error msg="attach: stdout: write unix @: broken pipe"
time="2015-09-30T15:25:00.254883039+08:00" level=error msg="attach: stdout: write unix @: broken pipe"
time="2015-09-30T15:25:00.256959458+08:00" level=error msg="attach: stdout: write unix @: broken pipe"

Reference: https://github.com/fsouza/go-dockerclient/issues/202.

3. Pre‑warming New Slaves

To avoid slow first-time image pulls on new slaves, the team pre-pulled common images via Salt/Ansible scripts. They also noted that Marathon did not automatically scale the monitoring containers out to newly added slaves.

4. Distribution‑Induced Daemon Crash

After upgrading to Docker 1.7.1, a mis‑configured Marathon registry caused the daemon to pull the official image instead of the private one, leading to crashes.

5. Mesos Resource Preemption

Mesos 0.23.0 introduced resource preemption, which was still not recommended for production at the time. The team instead allocated static CPU quotas per role (e.g., 32 CPUs for logstash, 4 CPUs for ops) to simulate over-commitment.

MESOS_resources="cpus(logstash):32;"
MESOS_resources="${MESOS_resources}cpus(common):4;"
MESOS_resources="${MESOS_resources}cpus(kibana):4;"
MESOS_resources="${MESOS_resources}cpus(ops):4;"
MESOS_resources="${MESOS_resources}cpus(spark):16;"
MESOS_resources="${MESOS_resources}cpus(storm):16;"
MESOS_resources="${MESOS_resources}cpus(rebuild):32;"
MESOS_resources="${MESOS_resources}cpus(mysos):16;"
MESOS_resources="${MESOS_resources}cpus(others):16;"
MESOS_resources="${MESOS_resources}cpus(universe):1;"
MESOS_resources="${MESOS_resources}cpus(test):8;"
MESOS_resources="${MESOS_resources}mem(*):126976;ports(*):[8000-32000]"

Scheduling was based on time windows (e.g., low‑traffic night hours for Spark jobs).
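To see how much CPU the static per-role quotas add up to, the resource string can be parsed and summed. The sketch below is illustrative only: the parser handles just the `name(role):value` form that appears in the snippet above, and `scalar_by_role` is a hypothetical helper, not part of Mesos.

```python
import re
from collections import defaultdict
from typing import Dict

# The value assembled by the shell snippet above, joined into one string.
MESOS_RESOURCES = (
    "cpus(logstash):32;cpus(common):4;cpus(kibana):4;cpus(ops):4;"
    "cpus(spark):16;cpus(storm):16;cpus(rebuild):32;cpus(mysos):16;"
    "cpus(others):16;cpus(universe):1;cpus(test):8;"
    "mem(*):126976;ports(*):[8000-32000]"
)

# Matches one item such as "cpus(logstash):32" or "ports(*):[8000-32000]".
_ITEM = re.compile(r"^(?P<name>\w+)\((?P<role>[^)]+)\):(?P<value>.+)$")


def scalar_by_role(resources: str, name: str) -> Dict[str, float]:
    """Sum a scalar resource (e.g. "cpus" or "mem") per role."""
    totals: Dict[str, float] = defaultdict(float)
    for item in (part.strip() for part in resources.split(";")):
        if not item:
            continue  # tolerate a trailing semicolon
        match = _ITEM.match(item)
        if match is None:
            raise ValueError(f"unrecognized resource item: {item!r}")
        if match["name"] == name:
            totals[match["role"]] += float(match["value"])
    return dict(totals)


if __name__ == "__main__":
    cpus = scalar_by_role(MESOS_RESOURCES, "cpus")
    for role, count in sorted(cpus.items(), key=lambda kv: -kv[1]):
        print(f"{role:10s} {count:g}")
    print("total CPUs promised across roles:", sum(cpus.values()))
```

The total can then be compared with the host's physical core count, which the article does not give.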

6. Version Upgrade Strategy

Upgrading Mesos and Docker required a whitelist‑based rolling upgrade: remove a node from the whitelist, stop its containers, stop Docker and Mesos daemons, upgrade, restart, and re‑add the node.
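The upgrade procedure can be sketched as a loop that takes one node at a time out of the whitelist. This is a minimal outline under stated assumptions: the step names, their exact order within "stop" and "restart", and the `run_step` callback are hypothetical, and in practice each step would invoke Salt, Ansible, or similar tooling.

```python
from typing import Callable, Iterable, List, Set

# Per-node steps in the order the text lists them. Splitting the daemon
# stop/start into separate steps, and their relative order, is an assumption.
NODE_STEPS = (
    "stop_containers",
    "stop_mesos_slave",
    "stop_docker",
    "upgrade_packages",
    "start_docker",
    "start_mesos_slave",
)


def rolling_upgrade(
    nodes: Iterable[str],
    whitelist: Set[str],
    run_step: Callable[[str, str], None],
) -> List[str]:
    """Upgrade nodes one at a time; return the nodes that completed."""
    upgraded: List[str] = []
    for node in nodes:
        whitelist.discard(node)      # stop new tasks from landing on this node
        for step in NODE_STEPS:
            run_step(node, step)     # an exception leaves the node out of the whitelist
        whitelist.add(node)          # re-admit the node only after every step succeeded
        upgraded.append(node)
    return upgraded


if __name__ == "__main__":
    active = {"slave-1", "slave-2"}  # hypothetical node names
    done = rolling_upgrade(
        sorted(active), active, lambda node, step: print(f"{node}: {step}")
    )
    print("upgraded:", done)
```

Because a failing step raises before the node is re-added, a broken node stays out of scheduling until someone inspects it.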

Rapid Development‑Environment Rebuild

Based on the logging platform experience, three major evolution stages were implemented:

OpenStack + nova‑docker + VLAN – containers acted like VMs with independent IPs.

Mesos + Marathon + Docker (--net=host) + random ports – shifted to service-centric deployment.

Mesos + Marathon + Docker + Calico – provided fixed IPs and seamless intra‑cluster communication.

Stage 1: Containers as Virtual Machines

OpenStack’s nova‑docker supplied most VM‑like features; containers ran salt‑minion and sshd for debugging.

Stage 2: Service‑Centric Deployment

Service trees were refined, dependencies stored in QAECI, and deployments became parallel/serial based on topological sorting. Code and config were fetched at container start, ports were randomized and injected via environment variables, and logs were routed to stdout/stderr with Heka tagging.
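The parallel/serial ordering derived from topological sorting can be illustrated with Python's standard `graphlib` module (Python 3.9+). This is a sketch of the general technique, not the team's QAECI implementation; the service names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List


def deployment_batches(dependencies: Dict[str, Iterable[str]]) -> List[List[str]]:
    """Group services into batches that can be deployed one after another.

    `dependencies` maps each service to the services it depends on.
    Services within one batch have no unmet dependencies and can start
    in parallel; the batches themselves must run serially.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                         # raises graphlib.CycleError on circular dependencies
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # everything whose dependencies are already deployed
        batches.append(ready)
        sorter.done(*ready)                  # mark the batch as deployed to unlock its dependents
    return batches


if __name__ == "__main__":
    # Hypothetical service tree: web needs api, api needs db and cache.
    services = {
        "web": {"api"},
        "api": {"db", "cache"},
        "db": set(),
        "cache": set(),
    }
    for number, batch in enumerate(deployment_batches(services), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

Here `db` and `cache` start together, followed by `api`, then `web`.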

Stage 3: Fixed IP with Calico

Calico was introduced to assign static IPs and integrate with Mesos via the slave's --modules and --isolation flags.

./bin/mesos-slave.sh --master=master_ip:port --namespaces='network' \
    --modules=file://path/to/slave_gssapi.json \
    --isolation="com_mesosphere_mesos_MetaswitchNetworkIsolator" \
    --executor_environment_variables='{"DOCKER_HOST": "localhost:2377"}'

{
  "libraries": [
    {
      "file": "/path/to/libmetaswitch_network_isolator.so",
      "modules": [
        {
          "name": "com_mesosphere_mesos_MetaswitchNetworkIsolator",
          "parameters": [
            {
              "key": "initialization_command",
              "value": "python /path/to/initialization_script.py arg1 arg2"
            },
            {
              "key": "cleanup_command",
              "value": "python /path/to/cleanup_script.py arg1 arg2"
            }
          ]
        }
      ]
    }
  ]
}

Environment variables CALICO_IP and CALICO_PROFILE were added to Marathon tasks, and container names/IPs were registered in the internal DNSDB for company‑wide access.
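A Marathon task carrying these two variables might look roughly like the following. This is a sketch under stated assumptions: the field layout follows Marathon's general app-definition JSON, while the app ID, image, IP address, profile name, and resource sizes are all hypothetical placeholders.

```python
import json
from typing import Any, Dict


def marathon_app(app_id: str, image: str, calico_ip: str, calico_profile: str) -> Dict[str, Any]:
    """Build a Marathon app definition that requests a fixed Calico IP."""
    return {
        "id": app_id,
        "instances": 1,     # a fixed IP can belong to only one running instance
        "cpus": 1,          # placeholder sizing
        "mem": 1024,        # placeholder sizing, in MB
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
        "env": {
            "CALICO_IP": calico_ip,            # address the container should receive
            "CALICO_PROFILE": calico_profile,  # Calico profile governing connectivity
        },
    }


if __name__ == "__main__":
    app = marathon_app(
        app_id="/dev/alice/web",                       # hypothetical
        image="registry.example.com/web:latest",       # hypothetical
        calico_ip="192.168.10.21",                     # hypothetical
        calico_profile="dev-alice",                    # hypothetical
    )
    print(json.dumps(app, indent=2))
```

The resulting JSON is the kind of payload that would be submitted to Marathon when creating the app.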

Summary

After nearly a year of production use, the team accumulated many lessons: Mesos proved stable and scalable, but its scheduling strategies are still relatively simple and rely on framework‑level logic. Future work includes exploring Swarm on Mesos to combine Docker’s native clustering with Mesos’ resource management.
