Scaling Mogujie's Private Cloud for 11.11: Architecture, Stability & Ops Insights
This article details how Mogujie's private cloud platform, built on OpenStack, Docker, and KVM, was engineered and optimized to handle the massive traffic of the 11.11 shopping festival, covering architectural choices, stability measures, monitoring, disaster recovery, performance tuning, and integration with existing operations systems.
For Mogujie, the annual 11.11 shopping festival is the biggest test of system stability, disaster recovery, and rapid fault handling. The private cloud platform took about a year to build and has been validated through three major promotions. This article describes it from the perspectives of architecture, technology selection, and application.
Technical Architecture
The platform provides internal business teams with a foundational IaaS/PaaS service built on Docker‑based CaaS and KVM‑based IaaS. OpenStack is used to manage both containers and virtual machines, while Docker offers lightweight, fast‑starting, standardized packaging and image‑based gray‑release capabilities. KVM handles workloads requiring stronger isolation and security.
Stability Measures
Key stability improvements include upgrading the kernel to version 2.6.32‑504 to fix network namespace crashes, disabling device‑mapper discard to avoid random kernel crashes, and prohibiting disk over‑provisioning that could render filesystems read‑only.
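As a minimal sketch of how the kernel requirement above could be enforced on a host, the following check parses a RHEL-style release string and compares it with 2.6.32-504. Treating that version as a minimum floor, and the function names `parse_release` and `kernel_is_patched`, are assumptions made for illustration; the source does not describe how the upgrade was verified.

```python
import re

# Release named in the text; using it as a minimum floor is an assumption.
MIN_RELEASE = (2, 6, 32, 504)


def parse_release(release: str) -> tuple:
    """Turn a string such as '2.6.32-504.el6.x86_64' into (2, 6, 32, 504)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def kernel_is_patched(release: str, minimum: tuple = MIN_RELEASE) -> bool:
    """Return True when the release is at or above the minimum build."""
    return parse_release(release) >= minimum
```

On a live host the input would typically come from `platform.release()`, and a fleet-wide check could flag every machine for which `kernel_is_patched` returns False.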
Monitoring Enhancements
A custom container‑level monitoring tool calculates load per container for fine‑grained QPS throttling and replaces host‑wide commands (top, free, iostat, uptime) with container‑aware equivalents. Host monitoring adds multi‑dimensional thresholds for process health, kernel logs, PID counts, network connections, and OOM alerts.
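The source does not say how the per-container load is computed. One plausible approach, sketched here under that caveat, mirrors the exponentially damped average the Linux kernel uses for host load, applied to a count of runnable tasks sampled per container. The class `ContainerLoad`, the function `should_throttle`, the 5-second sampling interval, and the CPU-quota threshold are all hypothetical.

```python
import math

SAMPLE_INTERVAL_S = 5.0  # assumed sampling period
WINDOW_S = 60.0          # assumed averaging window (1 minute)


class ContainerLoad:
    """Exponentially damped average of runnable tasks for one container."""

    def __init__(self, interval_s: float = SAMPLE_INTERVAL_S,
                 window_s: float = WINDOW_S) -> None:
        self._decay = math.exp(-interval_s / window_s)
        self.value = 0.0

    def update(self, runnable_tasks: int) -> float:
        """Fold one sample of the container's runnable-task count into the average."""
        self.value = (self.value * self._decay
                      + runnable_tasks * (1.0 - self._decay))
        return self.value


def should_throttle(load: float, cpu_quota: float, threshold: float = 1.0) -> bool:
    """Hypothetical policy: throttle QPS when load per allotted CPU exceeds the threshold."""
    return load / cpu_quota > threshold
```

Normalising the load by the container's CPU allotment rather than by the host's core count is the design point: a container limited to two CPUs with a load of 3 is overloaded even if the host as a whole is idle.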
Disaster Recovery and Emergency Handling
Disaster recovery covers two cases. For offline data recovery, dmsetup create is used to re-create a Docker container's device-mapper device so that it can be mounted temporarily and its data read. In addition, containers can be cold-migrated across physical hosts through a one-click management interface.
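The dmsetup step can be sketched as follows. The code only builds the command lines and does not run them. It assumes the container's storage is a device-mapper thin volume whose device id and size in bytes are available as JSON (the field names `device_id` and `size` follow the layout of Docker's devicemapper metadata files as far as I know, but treat them as an assumption). The pool path, device name, and mount point are placeholders.

```python
import json

SECTOR_BYTES = 512  # device-mapper tables are expressed in 512-byte sectors


def thin_table(metadata_json: str, pool_device: str) -> str:
    """Build a table line for the dm 'thin' target: start, length, 'thin', pool, device id."""
    meta = json.loads(metadata_json)
    sectors = int(meta["size"]) // SECTOR_BYTES
    return f"0 {sectors} thin {pool_device} {int(meta['device_id'])}"


def recovery_commands(name: str, metadata_json: str,
                      pool_device: str, mount_point: str) -> list:
    """Return the commands to activate the device and mount it read-only."""
    table = thin_table(metadata_json, pool_device)
    return [
        ["dmsetup", "create", name, "--table", table],
        ["mount", "-o", "ro", f"/dev/mapper/{name}", mount_point],
    ]
```

Returning argument lists rather than shell strings lets an operator review the commands first and then pass each one to `subprocess.run` unchanged. Once the data has been copied out, the temporary device would be unmounted and removed with `dmsetup remove`.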
Integration with Existing Operations Systems
The Docker cluster plugs into the existing operations tooling, so containers are managed through the same unified workflow, and a new container can be created within seven seconds.
Performance Optimizations
System‑level Docker optimizations involve tuning kernel parameters such as vm.dirty_expire_centisecs, vm.dirty_writeback_centisecs, and vm.extra_free_kbytes, and deploying Facebook’s flashcache to use SSD as a cache, dramatically improving I/O performance. Image pull times were reduced by flattening layer hierarchies, cutting size from 1.051 GB (13 layers, 2 min 13 s) to 674.4 MB (1 layer, 26 s).
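The image figures above are easier to interpret with a little arithmetic. The sketch below works only from the numbers quoted in the text and assumes decimal units (1.051 GB = 1051 MB); the function name `pull_summary` is illustrative.

```python
def pull_summary(size_mb: float, seconds: float) -> dict:
    """Summarise one image pull as size, duration, and effective throughput."""
    return {"size_mb": size_mb, "seconds": seconds, "mb_per_s": size_mb / seconds}


# Figures from the text; 1.051 GB is taken as 1051 MB (decimal units assumed).
before = pull_summary(1051.0, 2 * 60 + 13)  # 13 layers, 2 min 13 s
after = pull_summary(674.4, 26.0)           # 1 layer, 26 s

size_reduction = 1.0 - after["size_mb"] / before["size_mb"]
time_reduction = 1.0 - after["seconds"] / before["seconds"]
throughput_gain = after["mb_per_s"] / before["mb_per_s"]

print(f"size reduced by {size_reduction:.1%}")
print(f"pull time reduced by {time_reduction:.1%}")
print(f"effective throughput improved {throughput_gain:.2f}x")
```

Under these assumptions the image shrinks by roughly 36% while the pull time drops by roughly 80%. The smaller size alone therefore does not explain the speed-up, which suggests that per-layer overhead accounted for much of the original pull time.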
Conclusion
The 11.11 event served as a comprehensive test of Mogujie's private cloud. While the platform has proven stable, open challenges remain: container isolation, elastic scheduling, and the future adoption of technologies such as Kubernetes, Mesos, CRIU, and runC to support hot migration and daemon upgrades.
21CTO
