How Ele.me Scaled Operations: Key Lessons from Incident Management and Capacity Planning

This article describes the challenges of Ele.me's rapid expansion and its three‑stage technical operations journey: fine‑grained division, stability maintenance, and efficiency gains. It covers real incidents, monitoring upgrades, capacity testing, and practical insights for running a reliable large‑scale delivery platform.

dbaplus Community

Oct 16, 2017

Technical Operations Experience

Stage 1 – Fine‑grained Division

The team split a monolithic codebase into independent modules by sharding databases vertically first (horizontally only when needed) and by systematically decoupling code. Horizontal teams (e.g., big‑data) and vertical business teams were created so that development could proceed in parallel.

Key incidents and mitigations:

Timeout cascade: A slow backend service increased RPC latency, causing a front‑end avalanche. Adding a circuit breaker allowed the front end to fail fast and recover automatically once the backend stabilized.

Redis overload: Network jitter produced an explosion of Redis connections, inflating response time from ~1 ms to 300 ms and triggering a cascading failure. Engineers spent three days reproducing the issue, then built a monitoring tool that collects /proc metrics every 10 seconds, enabling fault localization within three minutes.
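The fail‑fast behaviour described for the timeout cascade can be illustrated with a minimal circuit‑breaker sketch. This is not Ele.me's implementation: the class name, the thresholds, and the injectable clock are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the backend."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of queueing behind a slow backend.
                raise CircuitOpenError("backend unhealthy, failing fast")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each RPC, for example `breaker.call(fetch_menu, shop_id)`, where `fetch_menu` is a hypothetical function, and treat `CircuitOpenError` as a signal to serve a degraded response.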

Additional practices: monitoring was divided into Metric, Log, Trace, and Infrastructure layers; a NOC team handled on‑call alerts; service governance, SOA, release processes, and degradation mechanisms were integrated.

Stage 2 – Stability Maintenance (Capacity Focus)

Rapid growth made capacity the primary stability risk. The team instituted regular online full‑chain load tests, mobilising hundreds of engineers for a month‑long campaign that identified and fixed ~200 hidden bottlenecks.

Flash‑sale incident:

During a 517 flash‑sale, traffic spiked to 50× normal, saturating front‑end Nginx and causing network congestion.

Root causes: insufficient flash‑sale experience, lack of traffic‑shaping, and missing priority‑based rate limiting.

Remediation: built a dedicated protection system with tiered rate limiting, client‑side caching, lane isolation, cloud clusters, and competitive caching.
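One way to picture priority‑based (tiered) rate limiting is a token bucket in which low‑priority traffic must leave a reserve of tokens for high‑priority traffic. The sketch below is an assumption‑laden illustration, not the protection system described above: the tier names `order` and `browse`, the reserve sizes, and the class itself are hypothetical.

```python
import time


class TieredRateLimiter:
    """Hypothetical token bucket where each tier must leave a reserve untouched."""

    def __init__(self, rate, capacity, reserve_by_tier, clock=time.monotonic):
        self.rate = rate                        # tokens added per second
        self.capacity = capacity                # maximum burst size
        self.reserve_by_tier = reserve_by_tier  # tier -> tokens that tier may not consume
        self.clock = clock                      # injectable so tests need not sleep
        self.tokens = float(capacity)
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def allow(self, tier):
        """Admit the request only if the tier's reserve stays intact afterwards."""
        self._refill()
        reserve = self.reserve_by_tier[tier]
        if self.tokens - 1 >= reserve:
            self.tokens -= 1
            return True
        return False


# Assumed tiers: order placement may drain the bucket, browsing must leave half of it.
limiter = TieredRateLimiter(rate=100, capacity=200,
                            reserve_by_tier={"order": 0, "browse": 100})
```

Under a traffic spike, `limiter.allow("browse")` starts returning `False` while `limiter.allow("order")` is still admitted, so the critical path degrades last.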

Stage 3 – Efficiency Gains

Focus shifted to tooling, resource optimisation, and architectural refactoring.

BeeBird delivery failures: Aggressive message retries filled the RabbitMQ (RMQ) backlog and exhausted UDP handles, and circuit breakers were misused. Solution: stricter retry policies and tighter component governance.

MySQL slow queries: The slow‑query count dropped from 2‑3 per week to near zero after components were turned into services, rate limiting was applied, and degradation paths were added.

RMQ reliability: Issues such as partition recovery failures, queue blockage, and excessive connection recreation were mitigated by maintaining a cold‑standby RMQ cluster and enforcing connection reuse.
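The source does not specify what the "stricter retry policies" adopted after the delivery failures looked like. A common pattern, sketched here purely as an assumption, is to cap the number of attempts and back off exponentially with jitter so that retries cannot flood a queue. The function name and default values are hypothetical.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call func() up to max_attempts times, sleeping a jittered, capped delay between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                # Give up: the caller can dead-letter the message or degrade.
                raise
            # Exponential backoff capped at max_delay, scaled by random jitter in [0, 1).
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(cap * rng())
```

Bounding both the attempt count and the delay keeps a failing consumer from republishing the same message indefinitely, which is the failure mode the incident describes.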

Governance emphasized deep component expertise, knowledge transfer, and embedding best‑practice rules into resource‑request and architecture‑review processes.

Operational Insights

Incidents are largely preventable through correct usage patterns, capacity forecasting, and gradual roll‑outs. Recommended practices:

Conduct systematic “big sweeps” of historical incidents to extract reusable preventive procedures.

Maintain unified, high‑visibility monitoring to avoid fragmented information that slows fault isolation.

Automate detection and diagnostics to reduce mean‑time‑to‑detect (MTTD) and mean‑time‑to‑recover (MTTR).

Run regular blind‑drill or chaos‑engineering exercises to keep engineers familiar with critical paths and failure modes.
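To track whether automation actually reduces MTTD and MTTR, both can be computed from incident timestamps. The sketch below assumes each incident records `started`, `detected`, and `recovered` times and measures both metrics from the start of the incident. Definitions vary between teams, and the sample data is invented for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across all incidents."""
    return mean([(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents])


# Invented sample incidents, not taken from the article.
incidents = [
    {"started": datetime(2017, 1, 1, 10, 0),
     "detected": datetime(2017, 1, 1, 10, 5),
     "recovered": datetime(2017, 1, 1, 10, 35)},
    {"started": datetime(2017, 1, 2, 12, 0),
     "detected": datetime(2017, 1, 2, 12, 15),
     "recovered": datetime(2017, 1, 2, 12, 45)},
]

mttd = mean_minutes(incidents, "started", "detected")   # mean time to detect
mttr = mean_minutes(incidents, "started", "recovered")  # mean time to recover
print(f"MTTD = {mttd:.0f} min, MTTR = {mttr:.0f} min")  # MTTD = 10 min, MTTR = 40 min
```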

