Scaling Ele.me’s Infrastructure: Operations, Automation, and Private Cloud Insights

This article recounts Ele.me's rapid growth from 2014 onward, detailing the challenges of network and server management, the evolution of their operations through standardization, process automation, and platform building, and how private cloud solutions like ZStack enabled fine‑grained, data‑driven infrastructure management.

Efficient Ops

Nov 5, 2017

1.0 Era

From 2014 to 2015 Ele.me grew explosively. Servers were added rapidly without long‑term architectural planning, which left substantial technical debt and a series of "pain points".

Network Pain

No standardization: chaotic IP assignments, multiple IPs per server, inconsistent bonding.

Frequent attacks causing outages.

Low bandwidth headroom; switches and NICs quickly saturated.

Missing monitoring; incidents only discovered via user complaints.

Single points of failure across services and hardware.

Unstable link quality.

Server Pain

Server delivery lagged behind demand, which peaked at more than 3,700 servers in a single week.

Lack of asset management standards, leading to high maintenance costs.

Manual assembly resulting in inconsistent quality.

Basic Service Gaps

Inconsistent monitoring (e.g., missing disk or IOPS metrics).

Ad‑hoc load balancing with isolated Nginx instances.

Decentralized file storage causing log retention and troubleshooting difficulties.

What We Did

We focused on three core actions: standardization, process automation, and platform construction.

2.1 Standardization

We established systematic standards for hardware, networking, OS, software installation, log paths, deployment methods, and monitoring, enabling code‑driven automation.

Example: instead of arbitrary machine requests, we offer predefined models (compute‑optimized, storage‑optimized, memory‑optimized, high‑I/O) and guide users to select the appropriate one.
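A predefined catalog like this can be expressed directly in code, which is what makes the later automation possible. The sketch below is illustrative only: the four model names come from the text, but the hardware specs, the thresholds, and the `recommend_model` helper are assumptions made for the example, not Ele.me's actual catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerModel:
    name: str
    cpu_cores: int
    memory_gb: int
    disk_tb: float
    ssd: bool


# Hypothetical specs; only the four category names come from the article.
CATALOG = {
    "compute": ServerModel("compute-optimized", 32, 64, 1.0, False),
    "storage": ServerModel("storage-optimized", 16, 64, 48.0, False),
    "memory": ServerModel("memory-optimized", 24, 256, 1.0, False),
    "high_io": ServerModel("high-io", 24, 128, 2.0, True),
}


def recommend_model(needs_ssd: bool, memory_gb: int, disk_tb: float) -> ServerModel:
    """Map a coarse workload profile to one catalog model.

    The thresholds are placeholders; a real catalog would tune them.
    """
    if needs_ssd:
        return CATALOG["high_io"]
    if disk_tb > 10:
        return CATALOG["storage"]
    if memory_gb > 128:
        return CATALOG["memory"]
    return CATALOG["compute"]


print(recommend_model(needs_ssd=False, memory_gb=200, disk_tb=1.0).name)
```

Because requesters can only pick from a closed set of models, every downstream step (purchasing, installation, monitoring) can assume one of a handful of known hardware shapes.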

2.2 Process + Automation

We built a workflow engine to automate resource lifecycle steps such as physical/virtual server provisioning, cloud service requests, and decommissioning, turning user inputs into automated backend actions.
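The article does not describe the engine's internals, so the following is only a minimal sketch of the general idea: a request is a dictionary of user inputs, and a workflow is an ordered list of steps that each act on it. All step names (`validate_request`, `allocate_host`, `install_os`, `register_asset`) and the hostname format are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Step = Callable[[Context], None]

KNOWN_MODELS = {"compute", "storage", "memory", "high_io"}


@dataclass
class Workflow:
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, request: Context) -> Context:
        """Run steps in order; stop at the first failure and record where."""
        ctx: Context = {**request, "completed": [], "status": "running"}
        for step in self.steps:
            try:
                step(ctx)
            except Exception as exc:
                ctx["status"] = f"failed at {step.__name__}: {exc}"
                return ctx
            ctx["completed"].append(step.__name__)
        ctx["status"] = "done"
        return ctx


# Hypothetical steps; real ones would call inventory, PXE, and CMDB systems.
def validate_request(ctx: Context) -> None:
    if ctx.get("model") not in KNOWN_MODELS:
        raise ValueError(f"unknown model {ctx.get('model')!r}")


def allocate_host(ctx: Context) -> None:
    ctx["hostname"] = f"{ctx['model']}-{ctx['requester']}-01"


def install_os(ctx: Context) -> None:
    ctx["os"] = "installed"


def register_asset(ctx: Context) -> None:
    ctx["asset_registered"] = True


provision = Workflow(
    "provision_server",
    [validate_request, allocate_host, install_os, register_asset],
)
result = provision.run({"requester": "team-a", "model": "compute"})
print(result["status"], result["completed"])
```

Recording which steps completed makes a failed request resumable or reversible, which matters once the same engine also drives decommissioning.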

2.3 Automation + Platform

Automated physical server installation (up to 2,500 servers per day).

Automated network device onboarding.

Unified resource management platform.

Distributed file system for DB backups and image processing.

Centralized log platform (ELK), so engineers no longer need to log in to individual servers to read logs.

2.4 Private Cloud (ZStack)

We evaluated OpenStack, CloudStack, and ZStack, and selected ZStack for its simplicity, stateless services, and API‑centric design. The deployment eventually grew to manage more than 6,000 VMs.

2.0 Era

Starting in 2016 we shifted to data‑driven operations, emphasizing measurable SLA, efficiency, and cost metrics.

4.1 Fine‑Grained Operations

Continuous network architecture upgrades.

Server performance baselines and delivery quality checks.

Automated hardware fault reporting.

Network traffic analysis.

Automated server reboot.

Bug fixes (e.g., power‑saving mode, bonding).
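A performance baseline turns "delivery quality" into a pass/fail check that can run before a server is handed over. The sketch below assumes a benchmark has already produced a few numbers per server; the metric names, baseline values, and 10% tolerance are all invented for illustration.

```python
from typing import Dict, Optional

# Hypothetical baseline for one server model; real values would come from
# benchmarking known-good machines of that model.
BASELINE = {"cpu_score": 1000.0, "disk_iops": 50000.0, "net_gbps": 9.4}
TOLERANCE = 0.10  # accept results up to 10% below baseline


def delivery_check(measured: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Return the metrics that are missing or too far below baseline.

    An empty result means the server passes the delivery check.
    """
    failures: Dict[str, Optional[float]] = {}
    for metric, expected in BASELINE.items():
        value = measured.get(metric)
        if value is None or value < expected * (1 - TOLERANCE):
            failures[metric] = value
    return failures


healthy = {"cpu_score": 980.0, "disk_iops": 51000.0, "net_gbps": 9.4}
degraded = {"cpu_score": 850.0, "disk_iops": 51000.0}  # network test missing

print(delivery_check(healthy))
print(delivery_check(degraded))
```

A check of this kind is also one way to surface configuration problems such as an unintended power‑saving mode or a broken bond, since both tend to show up as below‑baseline CPU or network results.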

4.2 Data‑Driven Operation

We visualized asset distribution, network traffic, server utilization, SLA compliance, cost accounting, and supplier quality, turning raw metrics into actionable reports.
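As one small example of turning raw metrics into a report, SLA compliance can be derived from recorded downtime. This is a sketch under assumptions: the service names, downtime figures, 30‑day window, and 99.9% target are made up, and the article does not say how Ele.me computed its SLA numbers.

```python
from typing import Dict, List, Tuple

MINUTES_PER_WINDOW = 30 * 24 * 60  # 30-day reporting window


def availability_pct(downtime_minutes: float) -> float:
    """Availability over the window, as a percentage."""
    return 100.0 * (MINUTES_PER_WINDOW - downtime_minutes) / MINUTES_PER_WINDOW


def sla_report(
    downtime_by_service: Dict[str, float], target_pct: float
) -> List[Tuple[str, float, bool]]:
    """One row per service: (name, availability %, meets target?)."""
    rows = []
    for name, downtime in sorted(downtime_by_service.items()):
        pct = availability_pct(downtime)
        rows.append((name, round(pct, 3), pct >= target_pct))
    return rows


# Hypothetical downtime (minutes) collected from incident records.
report = sla_report({"api-gateway": 20.0, "file-store": 100.0}, target_pct=99.9)
for name, pct, ok in report:
    print(f"{name:12s} {pct:7.3f}%  {'OK' if ok else 'BELOW TARGET'}")
```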

Summary

Taken together, the journey is a case study in resource lifecycle management: keep things simple, standardize first, automate what has been standardized, and back operational decisions with data.

This article is reposted from the WeChat public account "msup".

Q&A

Q: Do you guide business units on server requirements? A: Yes, we classify servers into compute, storage, memory, and high‑I/O types and conduct architecture reviews for new services.

Q: Is the log platform built in‑house? A: We use a centralized Flume‑based system with custom scripts for log collection, filtering, and routing.

Q: How do you handle high‑volume log ingestion? A: Flume handles up to ~20k events/sec; for higher loads we distribute Flume agents and monitor back‑pressure.

Q: How do development and operations collaborate? A: We experimented with separate, siloed, and integrated models; currently we have mixed teams where some members handle both development and ops.
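The log‑ingestion answer above implies a simple sizing calculation. The sketch below takes the ~20k events/sec per‑agent figure from that answer; the 70% headroom factor and the peak rates are assumptions chosen for illustration, and the code does not use any Flume API.

```python
PER_AGENT_EPS = 20_000  # approximate per-agent ceiling quoted in the Q&A
HEADROOM_PCT = 70       # assumed: plan to run agents at 70% of that ceiling


def agents_needed(peak_events_per_sec: int) -> int:
    """Number of agents required to absorb a peak rate with headroom."""
    usable_eps = PER_AGENT_EPS * HEADROOM_PCT // 100  # 14,000 events/sec
    # Integer ceiling division; always provision at least one agent.
    return max(1, -(-peak_events_per_sec // usable_eps))


for peak in (5_000, 50_000, 100_000):
    print(peak, agents_needed(peak))
```

Planning below the quoted ceiling leaves room for bursts, so that back‑pressure monitoring is a safety net and not the normal operating mode.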
