
JD's Migration from OpenStack to Kubernetes: Lessons and Architecture of JDOS 2.0

Since the end of 2016, JD has been transitioning its infrastructure from OpenStack to Kubernetes. About 20% of the migration is complete, with full conversion targeted for Q2. This article shares JD's detailed experiences, architectural evolution, operational practices, and future directions for large-scale container platforms.


Background Introduction

At the end of 2016 JD launched JDOS 2.0, a next‑generation container engine, and began migrating from OpenStack to Kubernetes. To date 20% of the workload has been moved, with a target of full migration by Q2. The Kubernetes solution offers a simpler architecture than OpenStack, and JD shares the lessons learned for the industry.

Cluster Construction History

Physical Machine Era (2004‑2014)

Before 2014 applications were deployed directly on physical servers, taking about a week from resource request to allocation. Co‑location of many Tomcat instances on a single machine caused resource waste and inflexible scheduling. JD built tools for compilation, packaging, automated deployment, log collection, and monitoring to improve efficiency.

Containerization Era (2014‑2016)

In Q3 2014 the chief architect led the team to redesign the cluster, selecting Docker as the primary technology. Extensive stress and stability testing led to custom patches for Device Mapper crashes, kernel issues, and added features such as external disk throttling, capacity management, and image layer merging.

For cluster management JD adopted an OpenStack + nova‑docker architecture, creating the first‑generation container engine JDOS 1.0 (JD DataCenter OS). JDOS 1.0 achieved infrastructure containerization and unified container‑based application deployment.

Operationally, resource provisioning shrank from about a week to minutes, and container isolation tripled deployment density and physical server utilization, delivering significant cost savings.

Multi‑IDC deployment with a global API enabled cross‑IDC workload placement. By November 2016 JDOS 1.0 was running over 150,000 containers, supporting major sales events.

Although JDOS 1.0 still relied on VM‑based management (IaaS) and legacy deployment tools, it laid a solid foundation for the subsequent generation platform.

Next‑Generation Application Container Engine (JDOS 2.0)

Pain Points of JDOS 1.0

JDOS 1.0 solved containerization but retained many shortcomings: legacy build and deployment tools conflicted with the container "run‑out‑of‑the‑box" model, causing slow startup; inconsistencies between online and offline environments prevented true "build once, run anywhere"; heavy container images still required auxiliary tools, limiting flexible scaling and high availability; and scheduling was based only on simple resource availability, capping performance and utilization.

Platform Architecture

When container counts grew from thousands to tens of thousands, JD started developing JDOS 2.0 around Kubernetes, integrating storage and networking from JDOS 1.0 and providing a full CI/CD pipeline from source code to image to deployment, along with logging, monitoring, troubleshooting, terminal, and orchestration capabilities.

JDOS 2.0 defines two levels: a system (mapped to a Kubernetes namespace) containing multiple applications, and an application consisting of a set of container instances providing the same service. This model supports versioning, domain resolution, load balancing, and configuration management.
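The two-level model can be sketched in terms of standard Kubernetes objects. The following is an illustrative mapping, not JD's actual code: a system becomes a namespace, and an application becomes a Deployment-style spec (for the identical container instances) plus a Service (for load balancing). All names and images here are hypothetical.

```python
# Hypothetical sketch of the JDOS system/application model mapped onto
# Kubernetes objects. A "system" maps 1:1 to a namespace; an "application"
# is a set of identical container instances behind one service.

def system_to_namespace(system_name):
    """A JDOS system becomes a Kubernetes namespace."""
    return {"apiVersion": "v1", "kind": "Namespace",
            "metadata": {"name": system_name}}

def application_to_manifests(system_name, app_name, image, replicas):
    """An application becomes a Deployment (replicas of one container)
    and a Service selecting those replicas for load balancing."""
    labels = {"system": system_name, "app": app_name}
    deployment = {
        "apiVersion": "apps/v1", "kind": "Deployment",
        "metadata": {"name": app_name, "namespace": system_name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": app_name, "image": image}]},
            },
        },
    }
    service = {
        "apiVersion": "v1", "kind": "Service",
        "metadata": {"name": app_name, "namespace": system_name},
        "spec": {"selector": labels, "ports": [{"port": 80}]},
    }
    return deployment, service

dep, svc = application_to_manifests("order-system", "order-api",
                                    "harbor.example/order-api:v1", 3)
```

Versioning and configuration management would then hang off the same labels, since every object belonging to an application carries an identical label set.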

Core JDOS 2.0 components such as GitLab, Jenkins, Harbor, Logstash, Elasticsearch, and Prometheus are also containerized and run on the Kubernetes platform.

One‑Stop Solution for Developers

JDOS 2.0 implements image‑centric CI/CD. The workflow is:

Developer pushes code to the source repository.

Jenkins master triggers a build job.

Jenkins master creates a Jenkins slave pod on Kubernetes.

Slave pulls source code and performs compilation and packaging.

Artifacts and Dockerfile are sent to a build node.

Image is built on the node.

Image is pushed to Harbor.

Images are deployed or updated in the target environments.
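The steps above can be sketched as a linear pipeline. This is a hypothetical illustration (the stage names and registry URL are invented, not JD's Jenkins jobs): each stage takes a context dict, records its output, and hands it to the next stage.

```python
# Illustrative sketch of the image-centric CI/CD flow described above.
# Stage names, artifact names, and the registry host are hypothetical.

def pull_source(ctx):
    ctx["source"] = f"{ctx['repo']}/src"               # slave pod checks out code
    return ctx

def compile_and_package(ctx):
    ctx["artifact"] = "app.jar"                        # build on the slave pod
    return ctx

def build_image(ctx):
    ctx["image"] = f"harbor.example/{ctx['repo']}:v1"  # build node + Dockerfile
    return ctx

def push_to_registry(ctx):
    ctx["pushed"] = True                               # push image to Harbor
    return ctx

def deploy(ctx):
    ctx["deployed"] = True                             # roll out to target envs
    return ctx

def run_pipeline(repo):
    """Run every stage in order, threading the context through."""
    ctx = {"repo": repo}
    for stage in (pull_source, compile_and_package, build_image,
                  push_to_registry, deploy):
        ctx = stage(ctx)
    return ctx

result = run_pipeline("order-api")
```

The key property is that the image produced in the middle stages is the single artifact that flows to every downstream environment.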

In JDOS 1.0 images contained only OS and runtime; deployment still relied on external tools. In JDOS 2.0 the full application stack is baked into the image, achieving true "run‑out‑of‑the‑box".

Network and External Service Load Balancing

JDOS 2.0 inherits JDOS 1.0's OpenStack‑Neutron VLAN mode, giving each pod a dedicated port and IP. Using the CNI standard, JD developed the "Cane" project to integrate Kubelet with Neutron.

Cane also creates and manages Neutron LBaaS resources for Kubernetes LoadBalancer services, and provides internal DNS via the open‑source Hades component (https://github.com/ipdcode/hades).
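A CNI plugin like Cane sits between the kubelet and the network backend: on an ADD it asks the backend (here, Neutron) for a port and returns the resulting IP in a CNI-style result. The sketch below is not the Cane source; the Neutron call is stubbed out and the addresses are invented.

```python
# Hypothetical sketch of the CNI ADD path a Kubelet/Neutron bridge such
# as Cane would implement. The Neutron port allocation is a stub.

def cni_add(netconf, allocate_neutron_port):
    """Handle a CNI ADD: obtain a port/IP from Neutron (stubbed) and
    return a CNI-style result dict for the kubelet."""
    port = allocate_neutron_port(netconf["name"])  # e.g. a VLAN port + IP
    return {
        "cniVersion": netconf.get("cniVersion", "0.3.1"),
        "ips": [{"address": port["ip"], "gateway": port["gateway"]}],
    }

def fake_neutron(network_name):
    """Stand-in for the real Neutron API call that creates a port."""
    return {"ip": "10.0.0.5/24", "gateway": "10.0.0.1"}

result = cni_add({"cniVersion": "0.3.1", "name": "vlan-100"}, fake_neutron)
```

Because each pod gets its own Neutron port, the pod's IP is routable on the VLAN without an overlay, which is what makes the LBaaS integration straightforward.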

Flexible Scheduling

JDOS 2.0 supports diverse workloads—big data, web services, deep learning—by applying different resource limits and Kubernetes labels. This enables richer scheduling policies and mixed IDC deployments of online and offline tasks, improving overall resource utilization by roughly 30%.
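The label-based scheduling idea can be sketched as a filter-then-score step, in the spirit of the Kubernetes scheduler. This is a hypothetical toy, not JDOS's scheduler: pods declare required node labels (e.g. an online vs. offline pool), and among feasible nodes the one with the most free CPU wins.

```python
# Hypothetical label-based scheduling sketch: filter nodes by required
# labels and resource fit, then score by free CPU. Not JD's scheduler.

def schedule(pod, nodes):
    """Return the name of the best node for the pod, or None if no
    node satisfies its labels and CPU request."""
    required = pod.get("nodeSelector", {})
    feasible = [
        n for n in nodes
        if all(n["labels"].get(k) == v for k, v in required.items())
        and n["free_cpu"] >= pod["cpu"]
    ]
    if not feasible:
        return None
    # Score: prefer the node with the most free CPU (spreads load).
    return max(feasible, key=lambda n: n["free_cpu"])["name"]

nodes = [
    {"name": "n1", "labels": {"pool": "online"},  "free_cpu": 4},
    {"name": "n2", "labels": {"pool": "offline"}, "free_cpu": 16},
]
pod = {"cpu": 2, "nodeSelector": {"pool": "offline"}}
```

Mixing online and offline pools on shared hardware then reduces to choosing labels and limits per workload class rather than running separate clusters.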

Promotion and Outlook

With JDOS 1.0’s stable operation as a foundation, users trust containers, but platform‑level containers differ from infrastructure‑level ones. IPs may change on failure, so service discovery must rely on DNS, load balancers, or self‑registration. JD built an intelligent domain resolution service and a DPDK‑based high‑performance load balancer to work with Kubernetes.
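The reason DNS-based discovery tolerates changing pod IPs is that clients resolve a stable name on every connection instead of caching an address. The toy resolver below stands in for a service such as Hades (names and IPs are invented):

```python
# Hypothetical sketch: a stable service name fronting a rotating set of
# backend pod IPs, as a DNS-based discovery layer (e.g. Hades) provides.

import itertools

class RoundRobinResolver:
    """Maps a stable service name to a rotating set of backend IPs,
    so callers never pin themselves to one pod's address."""
    def __init__(self, records):
        self._iters = {name: itertools.cycle(ips)
                       for name, ips in records.items()}

    def resolve(self, name):
        return next(self._iters[name])

resolver = RoundRobinResolver({"order-api.jd.local": ["10.0.0.5", "10.0.0.6"]})
first = resolver.resolve("order-api.jd.local")
second = resolver.resolve("order-api.jd.local")
```

When a pod is rescheduled and its IP changes, only the resolver's record set is updated; every client keeps using the same name.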

Increasing big‑data and AI workloads have been migrated into JDOS 2.0, using isolated zones but unified management, and machine‑learning‑driven resource optimization.

Future plans include richer scheduling algorithms and energy‑saving techniques to improve ROI and build a low‑energy, high‑performance green data center.

Retrospect and Summary

Compared with OpenStack, Kubernetes offers a simpler architecture with fewer components, clear functions, and a declarative API inspired by Google's Borg. Its flexible design, labeling, and built-in replica control enable rapid scaling and high availability. JDOS 2.0 now serves about 20% of applications, runs two clusters with roughly 20,000 containers, and continues to expand.
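The built-in replica control mentioned above follows the reconcile-loop pattern: a controller repeatedly compares desired state with observed state and issues create/delete actions until they match. A minimal sketch of that idea (pod names here are illustrative):

```python
# Sketch of Kubernetes-style declarative replica control: converge the
# observed set of pods toward the desired count. Names are illustrative.

def reconcile(desired_replicas, running):
    """Return the actions needed to move the running pod set
    toward the desired replica count."""
    if len(running) < desired_replicas:
        # Scale up: create the missing replicas.
        return [("create",)] * (desired_replicas - len(running))
    if len(running) > desired_replicas:
        # Scale down: delete the excess replicas.
        excess = running[desired_replicas:]
        return [("delete", name) for name in excess]
    return []  # already converged

scale_up = reconcile(3, ["pod-a"])
scale_down = reconcile(1, ["pod-a", "pod-b", "pod-c"])
```

Because the loop runs continuously, a crashed pod is simply observed as missing on the next pass and recreated, which is what makes rapid scaling and self-healing fall out of one mechanism.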

JD thanks the Kubernetes community and open‑source contributors; JD has joined CNCF and ranks in the top‑30.

Author Introduction

Baoyong Cheng, Technical Director of JD's Platform Department, led the development of JDOS 1.0 and JDOS 2.0, providing a unified compute platform for all JD businesses. His current focus is JDOS 2.0 R&D and a first-generation software-defined data center.

Easter Egg Moment

In 2017, JD invited readers to explore emerging technologies such as big‑data frameworks, machine‑learning architectures, low‑latency systems, blockchain, and fintech best practices at the InfoQ ArchSummit in Shenzhen (discounted tickets available).
