Evolution of Ele.me's Operations Infrastructure: From 1.0 to 2.0 – Standardization, Automation, and Data‑Driven Management
The article recounts Ele.me's rapid growth and the resulting operational challenges, describing how the company progressed from ad‑hoc 1.0 practices to a standardized, automated 2.0 infrastructure built on ZStack private cloud, fine‑grained operations, and data‑driven management to improve quality, efficiency, and cost.
Introduction – Xu Wei, senior operations manager at Ele.me, introduces his background and explains that the company experienced explosive growth from 2014 onward, which created massive infrastructure challenges.
1.0 Era (2014‑2015) – Rapid business expansion outpaced the infrastructure: IP allocation was not standardized, attacks were frequent, bandwidth became a bottleneck, monitoring was missing, and single points of failure were common. Basic services were chaotic, with ad‑hoc networking, manual server provisioning, and inconsistent logging.
What We Did
We focused on three pillars: standardization (hardware, network, OS, software installation, logging paths, deployment methods), process‑driven workflows, and building a platform to enable automation.
2.1 Standardization – Defined uniform server models (compute, storage, memory, high‑I/O) and required users to choose from these models; standardized procurement, rack placement, OS installation, and network configuration; modularized resources for predictable scaling.
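One way to picture "choose from these models" is a fixed catalog that rejects anything off the list. The sketch below is illustrative only: the four category names come from the text, while the `ServerModel` fields, the hardware figures, and the `resolve_model` helper are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerModel:
    name: str
    cpu_cores: int
    memory_gb: int
    disk_tb: float


# Hypothetical catalog: the category names are from the article,
# the hardware specs are invented for illustration.
CATALOG = {
    "compute": ServerModel("compute", 32, 128, 1.0),
    "storage": ServerModel("storage", 16, 64, 48.0),
    "memory": ServerModel("memory", 32, 512, 1.0),
    "high-io": ServerModel("high-io", 32, 128, 4.0),
}


def resolve_model(requested: str) -> ServerModel:
    """Return the standard model for a request, rejecting anything off-catalog."""
    key = requested.strip().lower()
    if key not in CATALOG:
        raise ValueError(
            f"unknown server model {requested!r}; choose one of {sorted(CATALOG)}"
        )
    return CATALOG[key]
```

Keeping the catalog small is the point: procurement, racking, and OS installation only have to handle four shapes of machine.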
2.2 Process + Automation – Implemented a workflow engine that turns resource lifecycle steps (e.g., physical/virtual server requests, recycling) into automated actions, reducing manual effort and errors.
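The idea of turning lifecycle steps into automated actions can be sketched as an ordered list of step functions that a small engine runs in sequence. This is a minimal sketch, not Ele.me's engine: the `Workflow` class, the step names (`approve`, `allocate_ip`, `install_os`), and the placeholder IP address are all hypothetical.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict[str, str]], None]


class Workflow:
    def __init__(self, name: str, steps: List[Step]):
        self.name = name
        self.steps = steps

    def run(self, request: Dict[str, str]) -> List[str]:
        """Run steps in order; stop at the first failure and record it."""
        log: List[str] = []
        for step in self.steps:
            try:
                step(request)
                log.append(f"{step.__name__}: ok")
            except Exception as exc:
                log.append(f"{step.__name__}: failed ({exc})")
                break
        return log


def approve(request: Dict[str, str]) -> None:
    # Hypothetical rule: every request needs an owning team.
    if not request.get("owner"):
        raise ValueError("missing owner")


def allocate_ip(request: Dict[str, str]) -> None:
    request["ip"] = "10.0.0.10"  # placeholder; a real step would call an IP allocator


def install_os(request: Dict[str, str]) -> None:
    request["os"] = "installed"  # placeholder for an automated install job


vm_request = Workflow("vm_request", [approve, allocate_ip, install_os])
```

Calling `vm_request.run({"owner": "team-a"})` returns one log entry per completed step, while a request without an owner stops at the `approve` step, which is where a manual process would typically have let the error through.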
2.3 Automation + Platform – Achieved large‑scale automated installation of physical servers (up to 2,500 servers per day), automatic onboarding of network devices, a unified resource management platform, a distributed file system for backups, and centralized ELK logging.
2.4 Private Cloud (ZStack) – Chose ZStack over OpenStack and CloudStack for its simplicity and stateless API design; deployed it to manage over 6,000 VMs, handling capacity planning and VM placement.
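VM placement, in its simplest form, means picking a host with enough free capacity and reserving that capacity. The sketch below uses a generic "most free memory first" rule as an illustration; it does not reproduce ZStack's allocator, and the `Host` fields and `place_vm` function are assumptions for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    free_cpu: int
    free_mem_gb: int


def place_vm(hosts: List[Host], cpu: int, mem_gb: int) -> Optional[Host]:
    """Pick the fitting host with the most free memory and reserve capacity on it.

    Returns None when no host can fit the VM, which is the signal
    that capacity planning needs to add hardware.
    """
    candidates = [h for h in hosts if h.free_cpu >= cpu and h.free_mem_gb >= mem_gb]
    if not candidates:
        return None
    chosen = max(candidates, key=lambda h: (h.free_mem_gb, h.free_cpu))
    chosen.free_cpu -= cpu
    chosen.free_mem_gb -= mem_gb
    return chosen
```

Spreading VMs onto the emptiest host keeps load even; a packing rule that prefers the fullest fitting host would instead free up whole machines, so the choice depends on whether balance or consolidation matters more.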
2.0 Era (From 2016) – Shifted focus to measurable SLAs, quantifiable efficiency, and data‑driven operation, pursued through fine‑grained operations and data‑driven management.
4.1 Fine‑grained Operations – Continuous network upgrades, server performance baselines, automated delivery quality checks, automatic repair of hardware faults, traffic analysis, automated reboots, and fixes for power‑saving and bonding bugs.
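A delivery quality check against a performance baseline can be as simple as comparing measured benchmark numbers with the expected values for that server model. The following is a minimal sketch under assumptions: the baseline figures, the metric names, and the 10% tolerance are invented for illustration and are not taken from the talk.

```python
from typing import Dict, List

# Hypothetical baselines per server model; the numbers are illustrative only.
BASELINES: Dict[str, Dict[str, float]] = {
    "high-io": {"read_iops": 80000, "write_iops": 40000},
    "storage": {"read_iops": 2000, "write_iops": 1500},
}


def check_delivery(
    model: str, measured: Dict[str, float], tolerance: float = 0.10
) -> List[str]:
    """Return the metrics that fall more than `tolerance` below the baseline."""
    failures: List[str] = []
    for metric, expected in BASELINES[model].items():
        floor = expected * (1 - tolerance)
        value = measured.get(metric, 0.0)  # a missing measurement counts as a failure
        if value < floor:
            failures.append(f"{metric}: {value:.0f} < {floor:.0f}")
    return failures
```

A newly delivered server would only be handed over when `check_delivery` returns an empty list; anything else points to a hardware or configuration problem before the machine carries traffic.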
4.2 Data‑driven Operation – Collected asset inventory, network traffic, server utilization, SLA metrics, cost accounting, and supplier quality scores; used dashboards to drive decisions and optimize resources.
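As one example of letting collected data drive a decision, utilization samples can be reduced to a list of servers that are candidates for reclamation. This is a sketch under assumptions: the `underutilized` function, the input shape (host name mapped to CPU percentage samples), and the 10% threshold are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def underutilized(samples: Dict[str, List[float]], threshold: float = 10.0) -> List[str]:
    """Return hosts whose average CPU utilization (percent) is below `threshold`.

    Hosts with no samples are skipped rather than flagged, since missing
    data is a monitoring problem, not evidence of idleness.
    """
    return sorted(
        host for host, cpu in samples.items() if cpu and mean(cpu) < threshold
    )
```

Feeding the result into a dashboard or a recycling workflow is what closes the loop between measurement and cost optimization.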
Q&A Highlights
Server type selection is guided by the four standard models (compute, storage, memory, high‑I/O) and checked during architecture reviews.
Log aggregation uses a custom Flume‑based platform with selective filtering and SDK‑driven traceability.
Performance testing relies on I/O benchmarks per server model.
Development and operations teams experiment with separation, integration, and hybrid models to improve collaboration.
Conclusion – Emphasizes simplicity, standardization, automation, and data‑driven decision making as the core principles for sustainable, high‑scale operations in fast‑growing internet companies.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.