How JD.com Scaled MySQL with Docker: From Early Trials to 70% Production Deployment
This article recounts JD.com's journey of Dockerizing MySQL: the evolution of its container platform, the reasons for adopting containers, the preparation work, the challenges that surfaced in large‑scale clusters, and the solutions that now let over 70% of its MySQL instances run reliably in Docker containers.
JD.com Docker Technology Evolution
JD.com began building a virtualization platform in 2013 using OpenStack + KVM. Performance (TP99 > 40 ms) was insufficient for core services. In September 2014, senior architect Liu Haifeng introduced Docker. After minimal development, TP99 fell to around 40 ms, prompting a shift to containerization.
Leveraging OpenStack expertise, JD.com created the first‑generation container engine JDOS 1.0 (JD DataCenter OS), which combined OpenStack orchestration with Docker runtime. By the 2015‑2016 “618” promotion, the platform achieved 100 % containerization of application workloads and scaled to roughly 150,000 Docker nodes, making it one of the world’s largest Docker clusters.
Why Dockerize MySQL at JD.com
MySQL usage grew rapidly from 2011, becoming the primary database for transaction systems by 2015. Docker offered four concrete benefits:
Rapid provisioning: A MySQL instance can be created in about one minute, dozens of times faster than installing an OS on a physical machine.
Dynamic scaling: A container's CPU and memory can be expanded online without a reboot, though disk capacity still requires careful planning.
Higher resource utilization: Containers isolate workloads at the process level, reducing CPU, memory and I/O contention compared with multiple MySQL instances sharing a single host.
Cost reduction: Better server and rack utilization lowers hardware and operational expenses.
JD.com’s mature Docker ecosystem—built through multiple large‑scale sales events (618, Double 11)—provided the reliability needed for production MySQL workloads.
Preparation Work Before Dockerizing MySQL
Docker Management UI
A web‑based portal was developed to let DBAs create, pause, start, and online‑scale MySQL containers with a few clicks, reducing manual effort.
Container Allocation Algorithm
The scheduler ensures high availability by preventing master‑slave pairs from being placed on the same host. Selection criteria include:
Host health status (alive, reachable).
Available resources (CPU, memory, disk).
Weight calculation based on resource usage.
Deduplication logic that excludes a host already chosen for a previous container in the same cluster.
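The selection criteria above can be sketched roughly as follows. This is a minimal illustration, not JD.com's scheduler: the `Host` and `Request` types, the `weight` formula (average fraction of resources left free after placement) and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Host:
    name: str
    alive: bool          # health check result (alive and reachable)
    cpu_free: int        # free CPU cores
    mem_free_gb: int
    disk_free_gb: int
    cpu_total: int
    mem_total_gb: int
    disk_total_gb: int


@dataclass
class Request:
    cpu: int
    mem_gb: int
    disk_gb: int


def fits(host: Host, req: Request) -> bool:
    """True if the host has enough free CPU, memory and disk."""
    return (host.cpu_free >= req.cpu
            and host.mem_free_gb >= req.mem_gb
            and host.disk_free_gb >= req.disk_gb)


def weight(host: Host, req: Request) -> float:
    """Hypothetical weight: average free fraction after placement (higher is better)."""
    return ((host.cpu_free - req.cpu) / host.cpu_total
            + (host.mem_free_gb - req.mem_gb) / host.mem_total_gb
            + (host.disk_free_gb - req.disk_gb) / host.disk_total_gb) / 3


def pick_host(hosts: List[Host], req: Request, used: Set[str]) -> Optional[Host]:
    """Filter by health, resources and deduplication, then take the best weight."""
    candidates = [h for h in hosts
                  if h.alive and h.name not in used and fits(h, req)]
    return max(candidates, key=lambda h: weight(h, req)) if candidates else None


def place_cluster(hosts: List[Host], req: Request, replicas: int) -> List[str]:
    """Place every container of one MySQL cluster on a distinct host."""
    used: Set[str] = set()
    placement: List[str] = []
    for _ in range(replicas):
        host = pick_host(hosts, req, used)
        if host is None:
            raise RuntimeError("not enough eligible hosts for this cluster")
        # Reserve the resources so later placements see the updated state.
        host.cpu_free -= req.cpu
        host.mem_free_gb -= req.mem_gb
        host.disk_free_gb -= req.disk_gb
        used.add(host.name)
        placement.append(host.name)
    return placement
```

Tracking the `used` set per cluster is what keeps a master and its slave off the same host; a scheduler that only ranked by weight could place both on the single emptiest machine.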
Template and I/O‑Aware Scheduling
MySQL container templates (e.g., 8C/12G/500G, 12C/24G/500G) are defined in advance. The scheduler prefers hosts with low current I/O load for containers that request high I/O performance, thereby reducing cross‑container interference.
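A minimal sketch of predefined templates plus an I/O‑aware host ordering is shown below. The `Template` type, the `io_util` mapping (current I/O utilization per host, from 0.0 to 1.0) and `rank_hosts` are assumptions for illustration; the article does not describe how JD.com measures or compares I/O load.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Template:
    cpu: int       # cores
    mem_gb: int
    disk_gb: int


# The two templates named in the text, defined ahead of time.
TEMPLATES: Dict[str, Template] = {
    "8C/12G/500G": Template(cpu=8, mem_gb=12, disk_gb=500),
    "12C/24G/500G": Template(cpu=12, mem_gb=24, disk_gb=500),
}


def rank_hosts(io_util: Dict[str, float], high_io: bool) -> List[str]:
    """Order candidate hosts for a new container.

    For a container that requests high I/O performance, the quietest
    hosts come first. Other containers keep the incoming order.
    """
    if high_io:
        return sorted(io_util, key=lambda host: io_util[host])
    return list(io_util)
```

For example, `rank_hosts({"a": 0.7, "b": 0.1, "c": 0.4}, high_io=True)` puts host `b` first, so a high‑I/O container lands where it is least likely to compete with its neighbours.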
Integration with DB Management Platform
APIs were added to the existing database management system to support batch provisioning, decommissioning, and queries such as:
Given a host IP, list all MySQL containers running on it.
Given a container IP, retrieve its host and related metadata.
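The two lookups can be illustrated with a small in‑memory index. This is only a sketch of the query shape: the `ContainerIndex` class, its method names and the record fields are hypothetical, and the real platform presumably backs these queries with its own database.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class ContainerIndex:
    """Hypothetical two-way index between hosts and MySQL containers."""

    def __init__(self) -> None:
        self._by_host: Dict[str, List[dict]] = defaultdict(list)
        self._by_container: Dict[str, dict] = {}

    def register(self, host_ip: str, container_ip: str, role: str) -> None:
        """Record a container and the host it runs on."""
        record = {"host_ip": host_ip, "container_ip": container_ip, "role": role}
        self._by_host[host_ip].append(record)
        self._by_container[container_ip] = record

    def containers_on_host(self, host_ip: str) -> List[dict]:
        """Given a host IP, list all MySQL containers running on it."""
        return list(self._by_host.get(host_ip, []))

    def host_of(self, container_ip: str) -> Optional[dict]:
        """Given a container IP, return its host and related metadata."""
        return self._by_container.get(container_ip)
```

Keeping both directions indexed makes each query a single dictionary lookup, which matters when a DBA needs to know quickly which instances are affected by a failing host.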
Monitoring Adjustments
JD.com uses Zabbix for MySQL metrics. Because Docker containers do not expose host‑level load via standard OS commands, a custom agent runs on each host, aggregates load data, stores it in Redis, and lets Zabbix pull the values from Redis.
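The shape of that pipeline can be sketched as below. To keep the example self‑contained, a plain dictionary stands in for Redis, and the key format, function names and staleness threshold are assumptions rather than details from JD.com's agent.

```python
import json
import os
import time
from typing import Dict, Optional


def collect_host_load() -> Dict[str, float]:
    """Read the host's load averages (runs on the host, not in a container)."""
    one, five, fifteen = os.getloadavg()
    return {"load1": one, "load5": five, "load15": fifteen}


def publish(store: dict, host: str, sample: Dict[str, float],
            now: Optional[float] = None) -> None:
    """Write the latest sample under a per-host key, with a timestamp."""
    payload = dict(sample, ts=time.time() if now is None else now)
    store[f"hostload:{host}"] = json.dumps(payload)


def read_metric(store: dict, host: str, metric: str,
                max_age: float = 120.0,
                now: Optional[float] = None) -> Optional[float]:
    """Value a Zabbix check would pull; None if missing or stale."""
    raw = store.get(f"hostload:{host}")
    if raw is None:
        return None
    payload = json.loads(raw)
    current = time.time() if now is None else now
    if current - payload["ts"] > max_age:
        return None  # the agent has stopped reporting; do not serve old data
    return payload.get(metric)
```

On a host, the agent loop would call `publish(store, hostname, collect_host_load())` at a fixed interval. Storing a timestamp with each sample lets the reader tell a quiet host apart from a dead agent.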
Problems Encountered and Solutions
OpenStack scaling limits: At >10,000 physical nodes, message loss and agent hangs occurred. JD.com built a custom Python RPC framework named brood to replace MQ, and used the internal JIMDB cache for DB operations, eliminating the bottleneck.
Kernel‑level bugs: Issues such as MAC table overflow, slab memory contention, and UDP packet loss surfaced at large scale. An internal Linux‑kernel team created a JD‑specific kernel branch with patches to address these problems.
Zabbix agent reliability: Agents sometimes failed to restart after a host reboot. An rc.local entry was added to ensure the Zabbix agent is started automatically.
Disk I/O interference: High‑I/O containers could degrade performance of co‑located containers. The scheduler now isolates I/O‑intensive workloads onto separate hosts.
Metric discrepancy between Docker and physical hosts: Load values collected inside containers differ from host values. The monitoring pipeline was adjusted to fetch host‑level metrics via the custom agent and store them centrally.
Current Deployment and Outlook
As of the latest release, over 70 % of JD.com’s MySQL instances run inside Docker containers, supporting multiple major sales events with stable performance. The fleet comprises roughly 150,000 containers, ranking among the world’s largest Docker deployments.
Future work includes:
Further automation of online scaling (reducing manual trigger steps).
Continued kernel optimization and maintenance of the JD‑specific branch.
Extending containerization to remaining critical workloads as Docker technology matures.
Key Technical Q&A (Condensed)
When to adopt MySQL Docker: Start with low‑impact services once the Docker platform is proven stable; high‑throughput or latency‑sensitive databases may remain on bare metal until Docker maturity improves.
Cost estimation: Calculate based on CPU cores, memory, and disk size per template (e.g., 8C/12G/500G, 12C/24G/500G, 12C/48G/1000G, 16C/48G/1000G).
I/O limits: CPU and memory can be strictly isolated; I/O shares the underlying hardware and may still cause interference, mitigated by host‑level scheduling.
Template modification: Templates can be upgraded if the host has sufficient free resources (e.g., increasing disk from 500 GB to 1 TB).
Data archiving: Active data is migrated to a historical MySQL cluster; older data can be offloaded to HBase or Hadoop.
Backup strategy: Use mysqldump or xtrabackup inside the container, store backups locally, then upload to external storage and schedule periodic snapshots.
OLAP suitability: Small analytical workloads can run on MySQL replicas; large‑scale analytics should use dedicated big‑data platforms (Hadoop, etc.).
Middleware vs. application‑level sharding: Some services use middleware; others implement sharding directly in the application.
Auto‑scaling: Online scaling currently requires a manual trigger; fully automatic scaling is limited by host resource availability.
Volume sharing: Database containers use host‑local storage; shared Docker volumes are avoided to prevent data consistency issues.
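The cost‑estimation and template‑modification answers above can be made concrete with a short sketch. The unit prices are placeholders invented for the example (the article gives no pricing), and `parse_template`, `template_cost` and `can_upgrade` are hypothetical helper names.

```python
from typing import Dict, Tuple

# Placeholder unit prices in arbitrary cost units; NOT figures from the article.
UNIT_PRICE: Dict[str, int] = {"cpu": 100, "mem_gb": 20, "disk_gb": 1}


def parse_template(spec: str) -> Tuple[int, int, int]:
    """Parse a spec such as '8C/12G/500G' into (cores, memory GB, disk GB)."""
    cpu, mem, disk = spec.split("/")
    return int(cpu.rstrip("C")), int(mem.rstrip("G")), int(disk.rstrip("G"))


def template_cost(spec: str, prices: Dict[str, int] = UNIT_PRICE) -> int:
    """Cost of one container: each resource size times its unit price."""
    cpu, mem, disk = parse_template(spec)
    return cpu * prices["cpu"] + mem * prices["mem_gb"] + disk * prices["disk_gb"]


def can_upgrade(current: str, target: str,
                host_free: Tuple[int, int, int]) -> bool:
    """True if the host's free (cores, memory GB, disk GB) cover the increase."""
    deltas = [t - c for c, t in zip(parse_template(current), parse_template(target))]
    return all(delta <= free for delta, free in zip(deltas, host_free))
```

With these placeholder prices, `template_cost("8C/12G/500G")` evaluates to 8 × 100 + 12 × 20 + 500 × 1 = 1540 units. An upgrade check compares only the difference between the two templates against what the host still has free, since the container already holds its current allocation.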
dbaplus Community
