
How JD Built a Scalable Elastic Cloud Platform: Architecture, Challenges, and Lessons

This article details JD.com's Elastic Cloud 1.0 platform—its massive container deployment, four‑principle philosophy, architectural design, operational challenges, performance optimizations, and the roadmap toward Elastic Cloud 2.0—offering practical insights for large‑scale cloud engineering.

Efficient Ops

Preface

JD describes its elastic cloud with four words: "more, faster, better, cheaper". Each word maps to a property of the platform.

More – All JD business runs on Elastic Cloud 1.0, with over 150,000 containers.

Faster – A small ops team (2-3 people) maintains thousands of services and responds to incidents quickly.

Better – Since 2014 the platform has been stable with no major outages.

Cheaper – Deployment is automated via a simple request, saving time and manpower.

1. Elastic Cloud V1.0 Online Operation

Work on the private compute platform started before Double-11 2014. The team chose containers over VMs, scaled the fleet to 150k containers, and is now planning Elastic Cloud 2.0 with new features.

2. Challenges

Challenge 1: Choice

Containers were selected for three reasons: compatibility with existing physical-machine workloads, better performance than VMs, and security isolation that is acceptable for a private cloud.

Challenge 2: Forward Compatibility

Migration was gradual: containers and physical machines ran side by side during the transition.

Challenge 3: Core System Selection

The single‑product page service was used as the first core workload to prove stability before wider rollout.

Challenge 4: Container Performance & Stability

Because containers share host resources, a single failure can affect every container on the host. Usage guidelines mitigate this risk.

Challenge 5: Scale

JD now runs more than 150,000 containers across all business lines.

Challenge 6: Operations Cost

Only three engineers maintain the entire fleet, thanks to robust tooling.

3. Elastic Cloud 1.0 Architecture

The architecture has been stable since 2014, with no major upgrades beyond bug fixes. Core components include JFS for storage, OVS (Open vSwitch) for networking, optional DPDK for high-throughput traffic, and a VLAN-based network model with two NICs (eth0 for control traffic, eth1 for container traffic).

3.1 Architecture Diagram

3.2 Use Cases

The platform lets business teams request, launch, migrate, and destroy containers. It supports resource specifications, IP preservation, and automated monitoring.
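The lifecycle above can be illustrated with a minimal Python sketch. Every name here (`ContainerSpec`, `Container`, `Platform`, and their fields) is hypothetical and not JD's actual API. The sketch only shows the idea that a migration changes the host while the container keeps its IP.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ContainerSpec:
    """Hypothetical resource specification attached to a request."""
    cpus: int
    memory_gb: int


@dataclass
class Container:
    name: str
    spec: ContainerSpec
    host: str
    ip: str


@dataclass
class Platform:
    """Toy in-memory stand-in for the request/launch/migrate/destroy flow."""
    containers: Dict[str, Container] = field(default_factory=dict)

    def launch(self, name: str, spec: ContainerSpec, host: str, ip: str) -> Container:
        if name in self.containers:
            raise ValueError(f"container {name!r} already exists")
        container = Container(name, spec, host, ip)
        self.containers[name] = container
        return container

    def migrate(self, name: str, new_host: str) -> Container:
        # Only the host changes; the IP is preserved across the move.
        container = self.containers[name]
        container.host = new_host
        return container

    def destroy(self, name: str) -> None:
        del self.containers[name]
```

A caller would launch a container with a spec, migrate it by name, and still find it at the same IP afterwards.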

4. Full‑Throttle 618 Campaign

During peak traffic, operators scaled manually and reallocated resources within 30 seconds. Critical services were scaled vertically (more CPU and memory), and pre-campaign stress tests ran at 20× normal load.
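To give a feel for what a 20× stress target implies for capacity, here is a small arithmetic sketch in Python. The function name and the traffic figures in the usage note are illustrative assumptions, not numbers from JD.

```python
import math


def containers_needed(baseline_qps: float, load_multiplier: float,
                      qps_per_container: float) -> int:
    """Return how many containers cover baseline_qps * load_multiplier."""
    if qps_per_container <= 0:
        raise ValueError("qps_per_container must be positive")
    target_qps = baseline_qps * load_multiplier
    # Round up: a fractional container still requires a whole one.
    return math.ceil(target_qps / qps_per_container)
```

For example, a hypothetical service with a baseline of 1,000 requests per second, tested at 20× load, with each container sustaining 500 requests per second, would need `containers_needed(1000, 20, 500)` containers, which is 40.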

5. Container Performance Improvements

5.1 Hardware Issues

A BIOS power-saving bug caused uneven CPU usage; setting the BIOS to maximum-performance mode resolved it.

5.2 10 Gbps NIC

The team optimized interrupt handling and the distribution of interrupts across CPUs to match physical NIC throughput.
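The source does not say how interrupts were distributed. One common approach on Linux is to pin each NIC queue's IRQ to its own CPU by writing a hexadecimal CPU bitmask to `/proc/irq/<irq>/smp_affinity`. The sketch below only computes such masks round-robin; the function name and the IRQ numbers in the usage note are assumptions for illustration.

```python
from typing import Dict, List


def irq_affinity_masks(irqs: List[int], cpus: List[int]) -> Dict[int, str]:
    """Assign each IRQ to one CPU round-robin and return hex bitmasks.

    In the mask, bit N set means CPU N may handle the interrupt.
    """
    if not cpus:
        raise ValueError("need at least one CPU")
    masks = {}
    for i, irq in enumerate(irqs):
        cpu = cpus[i % len(cpus)]
        if not 0 <= cpu < 32:
            # Larger CPU ids need comma-separated 32-bit groups,
            # which this sketch deliberately does not produce.
            raise ValueError("this sketch only handles CPU ids 0..31")
        masks[irq] = format(1 << cpu, "x")
    return masks
```

With five hypothetical IRQs and four CPUs, `irq_affinity_masks([40, 41, 42, 43, 44], [0, 1, 2, 3])` maps the first four IRQs to CPUs 0 through 3 and wraps the fifth back to CPU 0.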

5.3 Cgroup CPU

The team isolated CPUs with cpuset. Attempts with the FIFO and RR real-time scheduling policies failed, and CPU shares can let containers interfere with one another, so cpuset was preferred.
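As a rough illustration of cpuset-style isolation, the Python sketch below hands each container a disjoint block of cores and formats it in the list syntax that the cgroup `cpuset.cpus` file accepts (for example `0-3,8`). The allocation policy and function names are assumptions; the source does not describe how JD assigns cores.

```python
from typing import Dict, List


def format_cpu_list(cpus: List[int]) -> str:
    """Format CPU ids in cpuset list syntax, e.g. '0-3,8'."""
    if not cpus:
        return ""
    ordered = sorted(set(cpus))
    ranges = []
    start = prev = ordered[0]
    for cpu in ordered[1:]:
        if cpu == prev + 1:
            prev = cpu
            continue
        ranges.append((start, prev))
        start = prev = cpu
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def assign_cpusets(requests: Dict[str, int], total_cpus: int) -> Dict[str, str]:
    """Give each container a disjoint block of cores, in request order."""
    assignments = {}
    next_cpu = 0
    for name, count in requests.items():
        if count <= 0:
            raise ValueError(f"{name!r} must request at least one CPU")
        if next_cpu + count > total_cpus:
            raise ValueError("not enough free CPUs on this host")
        block = list(range(next_cpu, next_cpu + count))
        assignments[name] = format_cpu_list(block)
        next_cpu += count
    return assignments
```

Because no core appears in two assignments, one container's CPU load cannot take cycles from another's cores, which is the property that CPU shares do not guarantee.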

5.4 Cgroup Memory

Cache reclamation and slab memory pressure can stall the whole host; the team tuned reclamation thresholds to avoid these stalls.
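The source does not name the thresholds that were tuned. As a hedged illustration of the measurement side only, the Python sketch below parses `/proc/meminfo`-style text and reports how much of total memory sits in page cache plus reclaimable slab (the `Cached`, `SReclaimable`, and `MemTotal` fields). The function names are assumptions.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            # The first token is the number; a trailing 'kB' unit is ignored.
            values[name.strip()] = int(parts[0])
    return values


def reclaimable_fraction(meminfo: Dict[str, int]) -> float:
    """Fraction of total memory held in page cache plus reclaimable slab."""
    reclaimable = meminfo["Cached"] + meminfo["SReclaimable"]
    return reclaimable / meminfo["MemTotal"]
```

A monitoring job could feed this the contents of `/proc/meminfo` and alert when the fraction grows large enough that a burst of reclamation becomes likely.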

5.5 Cgroup Isolation

High thread counts caused large stack overhead. Cgroups on early CentOS 6.6 did not isolate users; later versions fixed this.

5.6 Monitoring

Monitoring collects CPU, memory, swap, disk I/O, and network metrics, along with container-level statistics.
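Container-level CPU statistics are typically derived from cumulative counters rather than read directly. The sketch below assumes two samples of a cumulative CPU-time counter in nanoseconds (the unit used by the cgroup v1 `cpuacct.usage` file) and turns them into an average utilization. How JD's monitoring actually computes this is not stated in the source.

```python
def cpu_utilization(usage_ns_start: int, usage_ns_end: int,
                    interval_s: float, cpus: int) -> float:
    """Return average CPU utilization over a sampling interval.

    A value of 1.0 means all `cpus` cores were busy for the whole interval.
    """
    if interval_s <= 0 or cpus <= 0:
        raise ValueError("interval_s and cpus must be positive")
    used_s = (usage_ns_end - usage_ns_start) / 1e9
    return used_s / (interval_s * cpus)
```

For instance, a container limited to 2 cores that consumed 5 seconds of CPU time over a 10-second interval yields a utilization of 0.25.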

5.7 Large Cluster

6. Future Directions

Elastic Cloud 2.0 will push further toward a service-oriented platform; some production machines already run the new version.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
