How JD Built and Scaled Its First‑Generation Container Engine (JDOS) from Zero to 150k Containers

This article chronicles JD.com's journey from early OpenStack experiments in 2013 to the development, large‑scale deployment, and evolution of its first‑generation container engine (JDOS), highlighting architectural choices, performance challenges, operational practices, and the open‑source projects that emerged from the experience.

dbaplus Community

Background and Early Exploration

In early 2013 JD.com formed a small team to evaluate virtualization. Within six months a 14‑person group mastered OpenStack and built a private cloud. Latency‑critical services required sub‑40 ms response; VM‑based deployments could only achieve ~60 ms, while bare‑metal met the target.

Adoption of Docker and Creation of JDOS 1.0

In autumn 2014 Docker became viable. Preliminary tests showed 99th‑percentile latency could approach the 40 ms goal, prompting a hybrid design that combined OpenStack (Icehouse) with Docker 1.3 and Open vSwitch 2.1.3. This first‑generation container engine, JDOS 1.0 (JD DataCenter OS), reduced application provisioning from a week to seconds.
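The 40 ms goal is stated at the 99th percentile, so a handful of slow requests matters more than the average. As a rough illustration only, the sketch below computes a nearest-rank percentile over latency samples and compares it with a target. The function names, the nearest-rank method, and the sample values are assumptions made for this example, not JD's measurement tooling.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_target(samples, target_ms=40.0, pct=99):
    """True when the chosen percentile is strictly below the target."""
    return percentile(samples, pct) < target_ms


# Invented sample data: 99 fast requests and one slow outlier.
samples = [30.0] * 99 + [60.0]
print(percentile(samples, 99))
print(meets_target(samples))
```

With 100 samples, a single outlier does not move the 99th percentile, but a second one does, which is why tail latency is more sensitive to virtualization overhead than the mean.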

Scale‑up and Production Impact

From 2015 to the 2016 "618" shopping festival all production workloads were migrated to containers, increasing application density and physical‑machine utilization by roughly threefold.

By 2016 the platform managed >150 000 containers, supporting major sales events with high reliability.

Technical Challenges and Engineering Solutions

OpenStack cluster size limits: Message loss, state hangs, DB overload, and agent failures appeared when a single cluster managed >4 000 nodes. JD replaced the MQ‑based RPC with a custom Python RPC framework (brood) and routed all database operations through the in‑house JIMDB cache, eliminating direct DB updates from agents.
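One way to picture why keeping agents away from the database helps: if every agent writes its state into a cache and a single flusher persists only the latest value per node, the database sees one batched write per interval instead of one write per report. The sketch below illustrates that idea with in-memory stand-ins. `StateCache`, `FakeDatabase`, and `flush` are hypothetical names invented here; this is not the JIMDB or brood API, and the source does not describe how JD implemented the path.

```python
class StateCache:
    """In-memory stand-in for a cache tier (hypothetical, not JIMDB)."""

    def __init__(self):
        self._latest = {}    # node_id -> most recent reported state
        self._dirty = set()  # node_ids changed since the last flush

    def report(self, node_id, state):
        """Called by agents; never touches the database."""
        self._latest[node_id] = state
        self._dirty.add(node_id)

    def get(self, node_id):
        return self._latest.get(node_id)

    def drain_dirty(self):
        """Return the latest state of every changed node and reset the set."""
        batch = {node_id: self._latest[node_id] for node_id in self._dirty}
        self._dirty.clear()
        return batch


class FakeDatabase:
    """Counts write batches so the coalescing effect is visible."""

    def __init__(self):
        self.rows = {}
        self.write_batches = 0

    def bulk_upsert(self, batch):
        if batch:
            self.rows.update(batch)
            self.write_batches += 1


def flush(cache, db):
    """Persist all pending changes in one batch; return the rows written."""
    batch = cache.drain_dirty()
    db.bulk_upsert(batch)
    return len(batch)


cache, db = StateCache(), FakeDatabase()
for state in ("building", "networking", "active"):
    cache.report("node-1", state)  # three reports, one eventual row
cache.report("node-2", "active")
print(flush(cache, db), db.write_batches, db.rows)
```

In this toy model the trade-off is durability: state reported since the last flush lives only in the cache, so a real design would need to decide how much loss is tolerable.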

Operational scalability: A Chef‑based auto‑deployment system let a single engineer add up to 4 000 physical nodes per day.

Reliability: All platform components were designed to be stateless; daily health‑check suites verified the OS, OpenStack services, kernel logs, and the container runtime. Machine‑learning models predicted hardware failures (e.g., NIC CRC errors, iLO status) and triggered automated alerts and site‑level remediation.
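A daily health-check suite of this kind can be sketched as a list of small check functions run against a snapshot of one host. Everything below is an assumption made for illustration: the snapshot layout, the check names, the log markers, and the sample values are invented, and a fixed CRC-error threshold stands in for the machine-learning models, which the source does not describe.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


def check_services(snapshot):
    """Fail if any service in the snapshot is reported as down."""
    services = snapshot.get("services", {})
    down = sorted(name for name, up in services.items() if not up)
    return CheckResult("services", not down, ", ".join(down))


def check_kernel_log(snapshot, markers=("error", "call trace")):
    """Fail if any kernel log line contains one of the markers."""
    bad = [line for line in snapshot.get("kernel_log", [])
           if any(marker in line.lower() for marker in markers)]
    return CheckResult("kernel_log", not bad, "; ".join(bad))


def check_nic_crc(snapshot, max_crc_errors=0):
    """Fail if any NIC exceeds the allowed CRC error count."""
    counts = snapshot.get("nic_crc_errors", {})
    bad = sorted(nic for nic, n in counts.items() if n > max_crc_errors)
    return CheckResult("nic_crc", not bad, ", ".join(bad))


DEFAULT_CHECKS = [check_services, check_kernel_log, check_nic_crc]


def run_suite(snapshot, checks=DEFAULT_CHECKS):
    """Run every check; a crashing check is recorded as a failure."""
    results = []
    for check in checks:
        try:
            results.append(check(snapshot))
        except Exception as exc:
            results.append(
                CheckResult(check.__name__, False, f"check crashed: {exc}"))
    return results


def failed_checks(results):
    return [result.name for result in results if not result.ok]


# Invented snapshot of a single host.
snapshot = {
    "services": {"nova-compute": True, "ovs-vswitchd": False},
    "kernel_log": ["eth0: link up"],
    "nic_crc_errors": {"eth0": 0, "eth1": 12},
}
print(failed_checks(run_suite(snapshot)))
```

Catching exceptions inside the runner keeps one broken check from hiding the results of the others, which matters when the suite runs unattended across thousands of hosts.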

Performance bottlenecks: Issues such as MAC‑table overflow, UDP large‑packet loss, and long slab‑allocator lock times were traced to kernel behavior. JD formed an internal Linux‑kernel team that tuned memory reclamation, optimized OVS flow handling, and upstreamed patches.

Evolution to the Next‑Generation Engine

Building on JDOS 1.0 experience, JD launched a new platform based on Kubernetes, Docker, and OVS. The design adds full CI/CD pipelines, unified monitoring, logging, and application‑level scheduling, moving from pure IaaS to a platform that orchestrates workloads directly.

Key Open‑Source Contributions

Hades: High‑performance DNS built on DPDK for accelerated UDP processing.

Cane: Kubernetes networking project consolidating JD’s high‑throughput network experience.

JNX: Customized Nginx branch optimized for container‑cluster traffic balancing and anti‑scraping.

JLK: JD‑maintained Linux‑kernel branch with large‑scale container optimizations (e.g., slab‑allocator, MAC table handling).

MDC: Unified monitoring platform for cloud‑native environments, focusing on alarm aggregation and root‑cause analysis.

Spider: East‑west, lossless networking solution for distributed container systems.

Operational Metrics and Lessons Learned

Container provisioning time reduced to seconds; application density increased threefold, yielding ~3× higher physical‑machine utilization.

By 2016 the platform supported >150 000 containers; during 2016 "618" and 2017 "Double‑11" events the system sustained peak loads without service degradation.

Stateless component design and the daily "X‑ray" health checks ensured that failures of individual services did not impact running containers.

Kernel‑level performance work proved essential; most stability issues traced back to Linux kernel subsystems.

The JD container journey demonstrates that deep integration of OpenStack, Docker, and later Kubernetes, combined with custom RPC frameworks, large‑scale automation, kernel tuning, and extensive monitoring, can deliver massive, reliable container deployments in a high‑traffic e‑commerce environment.
