How JD Built and Scaled Its First‑Generation Container Engine (JDOS) from Zero to 150k Containers
This article chronicles JD.com's journey from early OpenStack experiments in 2013 to the development, large‑scale deployment, and evolution of its first‑generation container engine (JDOS), highlighting architectural choices, performance challenges, operational practices, and the open‑source projects that emerged from the experience.
Background and Early Exploration
In early 2013 JD.com formed a small team to evaluate virtualization. Within six months a 14‑person group mastered OpenStack and built a private cloud. Latency‑critical services required sub‑40 ms response; VM‑based deployments could only achieve ~60 ms, while bare‑metal met the target.
Adoption of Docker and Creation of JDOS 1.0
In autumn 2014 Docker became viable. Preliminary tests showed 99th‑percentile latency could approach the 40 ms goal, prompting a hybrid design that combined OpenStack (Icehouse) with Docker 1.3 and Open vSwitch 2.1.3. This first‑generation container engine, JDOS 1.0 (JD DataCenter OS), reduced application provisioning from a week to seconds.
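A 99th-percentile target such as the 40 ms goal above is checked against a set of measured latencies rather than an average. The following is a minimal sketch using the nearest-rank method. The function names and the sample data are illustrative assumptions, not part of JD's tooling.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_latency_goal(samples_ms, goal_ms=40.0, pct=99):
    """True if the pct-th percentile latency is within goal_ms."""
    return percentile(samples_ms, pct) <= goal_ms


if __name__ == "__main__":
    # Hypothetical measurements: 98 fast requests and 2 slow outliers.
    measured = [30.0] * 98 + [60.0] * 2
    print(percentile(measured, 99), meets_latency_goal(measured))
```

With 100 samples, two slow requests are enough to push the 99th percentile past the goal even though the mean stays low, which is why tail latency rather than average latency is the relevant measure here.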
Scale‑up and Production Impact
From 2015 to the 2016 "618" shopping festival all production workloads were migrated to containers, increasing application density and physical‑machine utilization by roughly threefold.
By 2016 the platform managed >150 000 containers, supporting major sales events with high reliability.
Technical Challenges and Engineering Solutions
OpenStack cluster size limits: Message loss, state hangs, DB overload, and agent failures appeared when a single cluster managed >4 000 nodes. JD replaced the MQ‑based RPC with a custom Python RPC framework (brood) and routed all database operations through the in‑house JIMDB cache, eliminating direct DB updates from agents.
Operational scalability: A Chef‑based auto‑deployment system enabled adding up to 4 000 physical nodes per day per engineer.
Reliability: All platform components were designed stateless; daily health‑check suites verified OS, OpenStack services, kernel logs, and container runtime. Machine‑learning models predicted hardware failures (e.g., NIC CRC errors, ILO status) and triggered automated alerts and site‑level remediation.
Performance bottlenecks: Issues such as MAC‑table overflow, UDP large‑packet loss, and long slab‑allocator lock times were traced to kernel behavior. JD formed an internal Linux‑kernel team that tuned memory reclamation, optimized OVS flow handling, and upstreamed patches.
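The article does not say how database operations were routed through the cache. One common way a cache tier relieves database load is to absorb repeated agent reports and write only the changed state in batches. The sketch below illustrates that pattern under this assumption. `StateCache` and `write_batch` are hypothetical names, and none of this reflects the actual JIMDB or brood interfaces.

```python
class StateCache:
    """In-memory stand-in for a cache tier between node agents and the DB.

    Agents call report(); only a separate flusher ever touches the database,
    and it writes one batch per flush instead of one row per report.
    """

    def __init__(self):
        self._latest = {}    # node_id -> most recently reported state
        self._dirty = set()  # node_ids changed since the last flush

    def report(self, node_id, state):
        """Record an agent's state; unchanged reports cost no DB work."""
        if self._latest.get(node_id) != state:
            self._latest[node_id] = state
            self._dirty.add(node_id)

    def get(self, node_id):
        """Serve reads from the cache rather than from the database."""
        return self._latest.get(node_id)

    def flush(self, write_batch):
        """Pass all changed entries to write_batch in a single call.

        Returns the number of entries written.
        """
        if not self._dirty:
            return 0
        batch = {node_id: self._latest[node_id] for node_id in sorted(self._dirty)}
        write_batch(batch)
        self._dirty.clear()
        return len(batch)


if __name__ == "__main__":
    db_writes = []  # each element represents one database round trip
    cache = StateCache()
    for state in ("up", "up", "down"):
        cache.report("node-a", state)
    cache.report("node-b", "up")
    cache.flush(db_writes.append)
    print(db_writes)
```

In this example four agent reports result in a single database write, so the number of database round trips no longer grows with the number of agents reporting.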
Evolution to the Next‑Generation Engine
Building on JDOS 1.0 experience, JD launched a new platform based on Kubernetes, Docker, and OVS. The design adds full CI/CD pipelines, unified monitoring, logging, and application‑level scheduling, moving from pure IaaS to a platform that orchestrates workloads directly.
Key Open‑Source Contributions
Hades: High‑performance DNS built on DPDK for accelerated UDP processing.
Cane: Kubernetes networking project consolidating JD’s high‑throughput network experience.
JNX: Customized Nginx branch optimized for container‑cluster traffic balancing and anti‑scraping.
JLK: JD‑maintained Linux‑kernel branch with large‑scale container optimizations (e.g., slab‑allocator, MAC table handling).
MDC: Unified monitoring platform for cloud‑native environments, focusing on alarm aggregation and root‑cause analysis.
Spider: East‑west, lossless networking solution for distributed container systems.
Operational Metrics and Lessons Learned
Container provisioning time reduced to seconds; application density increased threefold, yielding ~3× higher physical‑machine utilization.
By 2016 the platform supported >150 000 containers; during 2016 "618" and 2017 "Double‑11" events the system sustained peak loads without service degradation.
Stateless component design and the daily health‑check suites (referred to as X‑ray checks) ensured that failures of individual services did not impact running containers.
Kernel‑level performance work proved essential; most stability issues traced back to Linux kernel subsystems.
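A daily check suite of the kind described above can be pictured as a set of independent probes whose results are collected into one report, so that a single failing or crashing probe does not hide the others. The sketch below is a minimal illustration. The check names and probe functions are hypothetical placeholders, not JD's actual checks.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


def run_suite(checks):
    """Run every check and collect results.

    `checks` maps a check name to a zero-argument callable returning
    (ok, detail). A check that raises is recorded as a failure instead
    of aborting the rest of the suite.
    """
    results = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:
            ok, detail = False, f"check raised {exc!r}"
        results.append(CheckResult(name, bool(ok), detail))
    return results


def failed(results):
    """Names of checks that did not pass, in execution order."""
    return [r.name for r in results if not r.ok]


if __name__ == "__main__":
    def kernel_log_probe():
        # Placeholder probe that simulates an unreadable log source.
        raise RuntimeError("log source unavailable")

    suite = {
        "os": lambda: (True, "ok"),
        "kernel_log": kernel_log_probe,
        "container_runtime": lambda: (False, "daemon not responding"),
    }
    print(failed(run_suite(suite)))
```

Keeping each probe free of shared state means the suite can be rerun at any time and on any node, which fits the stateless design the article describes.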
The JD container journey demonstrates that deep integration of OpenStack, Docker, and later Kubernetes—combined with custom RPC frameworks, large‑scale automation, kernel tuning, and extensive monitoring—can deliver massive, reliable container deployments in a high‑traffic e‑commerce environment.
dbaplus Community
