How JD Built a Scalable Elastic Cloud Platform: Architecture, Challenges, and Lessons
This article details JD.com's Elastic Cloud 1.0 platform—its massive container deployment, four‑principle philosophy, architectural design, operational challenges, performance optimizations, and the roadmap toward Elastic Cloud 2.0—offering practical insights for large‑scale cloud engineering.
Preface
The four words "more, faster, better, cheaper" summarize what Elastic Cloud means for JD:
More – All JD business runs on Elastic Cloud 1.0, with over 150,000 containers.
Faster – A small ops team (2‑3 people) maintains thousands of services and responds to incidents quickly.
Better – Since 2014 the platform has been stable with no major outages.
Cheaper – Deployment is automated via a simple request, saving time and manpower.
1. Elastic Cloud V1.0 Online Operation
Work on the private compute platform began before Double‑11 2014. The team chose containers over VMs, scaled the platform to 150,000 containers, and is now planning Elastic Cloud 2.0 with new features.
2. Challenges
Challenge 1: Choice
Containers were selected for three reasons: compatibility with existing physical‑machine workloads, performance closer to bare metal than VMs offer, and security isolation that is acceptable for a private cloud.
Challenge 2: Forward Compatibility
Migration was gradual: containers and physical machines ran side by side during the transition.
Challenge 3: Core System Selection
The single‑product page service was used as the first core workload to prove stability before wider rollout.
Challenge 4: Container Performance & Stability
Because containers share host resources, a single fault on a host can affect every container on it; the team mitigated this with usage guidelines.
Challenge 5: Scale
JD now runs more than 150,000 containers across all business lines.
Challenge 6: Operations Cost
Only three engineers maintain the entire fleet, thanks to robust tooling.
3. Elastic Cloud 1.0 Architecture
The architecture has been stable since 2014, with bug fixes but no major upgrades. Core components include JFS for storage, OVS for networking, optional DPDK for high‑throughput traffic, and a VLAN‑based network model in which each host has two NICs (eth0 for control traffic, eth1 for container traffic).
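The two‑NIC, VLAN‑based layout can be illustrated with a minimal sketch. It only builds the ovs-vsctl command lines that would attach eth1 and a container port to an OVS bridge; it does not run them. The bridge name, port name, and VLAN ID are hypothetical, and the source does not describe JD's actual OVS configuration.

```python
from typing import List


def ovs_commands(bridge: str, uplink: str, container_port: str,
                 vlan_id: int) -> List[List[str]]:
    """Build ovs-vsctl commands for a VLAN-tagged container port.

    All names passed in are illustrative assumptions, not JD's real setup.
    """
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be in 1..4094")
    return [
        # Create the bridge if it does not exist yet.
        ["ovs-vsctl", "--may-exist", "add-br", bridge],
        # Attach the container-traffic NIC (eth1 in the article) as uplink.
        ["ovs-vsctl", "--may-exist", "add-port", bridge, uplink],
        # Attach the container's port as an access port on the given VLAN.
        ["ovs-vsctl", "--may-exist", "add-port", bridge, container_port,
         f"tag={vlan_id}"],
    ]


if __name__ == "__main__":
    for cmd in ovs_commands("br0", "eth1", "veth-c1", 100):
        print(" ".join(cmd))
```

Keeping eth0 out of the bridge entirely is what separates control traffic from container traffic in this sketch.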
3.1 Architecture Diagram
3.2 Use Cases
Business teams use the platform to request, launch, migrate, and destroy containers. It supports resource specifications, IP preservation, and automated monitoring.
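The container lifecycle described here can be sketched as a small state model. Every class, field, and method name below is hypothetical; the source does not document the platform's API. The sketch also assumes that IP preservation means a container keeps its address when it moves to another host.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    REQUESTED = "requested"
    RUNNING = "running"
    DESTROYED = "destroyed"


@dataclass
class Container:
    """Hypothetical record of one container and its resource specification."""
    name: str
    cpus: int
    memory_gb: int
    ip: str
    host: str = ""
    state: State = State.REQUESTED

    def launch(self, host: str) -> None:
        if self.state is not State.REQUESTED:
            raise RuntimeError("only a requested container can be launched")
        self.host = host
        self.state = State.RUNNING

    def migrate(self, new_host: str) -> None:
        if self.state is not State.RUNNING:
            raise RuntimeError("only a running container can be migrated")
        # The IP is deliberately left untouched: only the host changes.
        self.host = new_host

    def destroy(self) -> None:
        self.state = State.DESTROYED


if __name__ == "__main__":
    c = Container(name="web-1", cpus=4, memory_gb=8, ip="10.0.0.5")
    c.launch("host-a")
    c.migrate("host-b")
    print(c.host, c.ip, c.state.value)
```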
4. Full‑Throttle 618 Campaign
During peak traffic, operators scaled services manually and reallocated resources within 30 seconds. Critical services were scaled vertically by adding CPU and memory. Before the campaign, the platform was stress‑tested at 20× normal load.
5. Container Performance Improvements
5.1 Hardware Issues
A BIOS power‑saving bug caused uneven CPU usage; it was resolved by setting the BIOS to maximum‑performance mode.
5.2 10 Gbps NIC
Interrupt handling and its distribution across CPUs were optimized so that throughput matched that of the physical NIC.
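One common way to distribute NIC interrupts on Linux is to write a CPU bitmask to /proc/irq/<irq>/smp_affinity for each queue's IRQ. The sketch below only computes those masks; the round‑robin policy, the IRQ numbers, and the CPU list are assumptions for illustration, since the source does not say how JD assigned interrupts.

```python
from typing import Dict, Iterable, List


def cpu_mask(cpus: Iterable[int]) -> str:
    """Return a hex CPU bitmask as comma-separated 32-bit groups,
    the format accepted by /proc/irq/<irq>/smp_affinity."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    digits = format(mask, "x")
    width = -(-len(digits) // 8) * 8  # round up to a multiple of 8 digits
    padded = digits.zfill(width)
    return ",".join(padded[i:i + 8] for i in range(0, width, 8))


def spread_irqs(irqs: List[int], cpus: List[int]) -> Dict[int, str]:
    """Assign each IRQ to one CPU in round-robin order (assumed policy)."""
    if not cpus:
        raise ValueError("need at least one CPU")
    return {irq: cpu_mask([cpus[i % len(cpus)]])
            for i, irq in enumerate(irqs)}


if __name__ == "__main__":
    # Hypothetical IRQ numbers for three NIC queues, spread over CPUs 2 and 3.
    for irq, mask in spread_irqs([40, 41, 42], [2, 3]).items():
        print(f"/proc/irq/{irq}/smp_affinity <- {mask}")
```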
5.3 Cgroup CPU
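The team settled on cpuset isolation. Attempts with the FIFO and RR scheduling policies failed, and CPU shares still allow containers to interfere with one another, so pinning containers to dedicated cores with cpuset was preferred.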
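A minimal sketch of cpuset‑style allocation follows. It hands each container a disjoint range of cores and formats the result the way the cgroup v1 cpuset.cpus file expects (for example "2-5"). The function name, the container names, and the idea of reserving the first cores for the host are assumptions, not details from the source.

```python
from typing import Dict


def allocate_cpusets(total_cpus: int, requests: Dict[str, int],
                     reserved: int = 2) -> Dict[str, str]:
    """Give each container a disjoint, contiguous range of cores.

    `reserved` cores at the start are kept for the host (an assumption).
    Returns a mapping of container name to a cpuset.cpus string.
    """
    next_cpu = reserved
    result: Dict[str, str] = {}
    for name, count in requests.items():
        if count < 1:
            raise ValueError(f"{name}: must request at least one core")
        end = next_cpu + count - 1
        if end >= total_cpus:
            raise ValueError(f"{name}: not enough free cores")
        result[name] = str(next_cpu) if count == 1 else f"{next_cpu}-{end}"
        next_cpu = end + 1
    return result


if __name__ == "__main__":
    # On a 16-core host, cores 0-1 stay with the host in this sketch.
    print(allocate_cpusets(16, {"web": 4, "cache": 1}))
```

Because the ranges never overlap, a busy container cannot take CPU time from a neighbor, which is the interference that shares alone do not prevent.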
5.4 Cgroup Memory
Cache reclamation and slab memory pressure can stall the whole host; tuned reclamation thresholds to avoid stalls.
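The source does not name the thresholds that were tuned, so the sketch below only shows how the pressure itself might be observed: it parses /proc/meminfo‑style text and reports what fraction of host memory is held by page cache and slab. The function names and the sample figures are illustrative assumptions.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style lines into {field: value}, values as printed
    (kB for most fields)."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def cache_and_slab_fraction(meminfo: Dict[str, int]) -> float:
    """Fraction of total memory held by page cache plus slab."""
    total = meminfo["MemTotal"]
    held = meminfo.get("Cached", 0) + meminfo.get("Slab", 0)
    return held / total


if __name__ == "__main__":
    with open("/proc/meminfo") as f:  # Linux only
        info = parse_meminfo(f.read())
    print(f"cache + slab: {cache_and_slab_fraction(info):.1%} of RAM")
```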
5.5 Cgroup Isolation
High thread counts caused large stack overhead; early CentOS 6.6 cgroup did not isolate users, later versions fixed it.
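The stack overhead of a high thread count is simple arithmetic: the number of threads multiplied by the per‑thread stack size. The sketch assumes 8 MiB per thread, a common Linux default stack limit; the source gives neither JD's thread counts nor its stack size, and the figure computed is reserved address space, not necessarily resident memory.

```python
def stack_reservation_gib(threads: int, stack_mib: float = 8.0) -> float:
    """Address space reserved for thread stacks, in GiB.

    stack_mib defaults to 8 MiB, an assumed per-thread stack size.
    """
    if threads < 0 or stack_mib < 0:
        raise ValueError("threads and stack_mib must be non-negative")
    return threads * stack_mib / 1024


if __name__ == "__main__":
    # Hypothetical example: 10,000 threads at 8 MiB each.
    print(f"{stack_reservation_gib(10_000):.3f} GiB")
```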
5.6 Monitoring
Collects CPU, memory, swap, disk I/O, network metrics, and container‑level statistics.
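As one illustration of a container‑level statistic, CPU utilization can be derived from two samples of cumulative CPU time, such as the nanosecond counter in the cgroup v1 cpuacct.usage file. This is a sketch under that assumption; the source does not describe how the platform's monitoring actually computes its metrics.

```python
def cpu_percent(usage_start_ns: int, usage_end_ns: int,
                interval_s: float, num_cpus: int = 1) -> float:
    """CPU utilization over an interval, as a percentage of num_cpus cores.

    The usage arguments are cumulative CPU time in nanoseconds, sampled at
    the start and end of the interval.
    """
    if interval_s <= 0 or num_cpus < 1:
        raise ValueError("interval_s must be > 0 and num_cpus >= 1")
    used_s = (usage_end_ns - usage_start_ns) / 1e9
    return 100.0 * used_s / (interval_s * num_cpus)


if __name__ == "__main__":
    # A container that burned 2 s of CPU time in 1 s of wall time on 4 cores.
    print(cpu_percent(0, 2_000_000_000, 1.0, num_cpus=4))
```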
5.7 Large Cluster
6. Future Directions
Elastic Cloud 2.0 will push further toward service orientation; some production machines already run the new version.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
