How Ele.me Scaled Operations: Key Lessons from Incident Management and Capacity Planning

This article details Ele.me's rapid expansion challenges and shares a three‑stage technical operations journey—fine‑grained division, stability maintenance, and efficiency gains—highlighting real incidents, monitoring upgrades, capacity testing, and practical insights for reliable large‑scale delivery platforms.

dbaplus Community
Technical Operations Experience

Stage 1 – Fine‑grained Division

The team split the monolithic codebase into independent modules through systematic code decoupling and database sharding, applying vertical sharding first and horizontal sharding only when needed. Horizontal teams (e.g., big‑data) and vertical business teams were created so that development could proceed in parallel.

Key incidents and mitigations:

Timeout cascade: A slow backend service increased RPC latency, which triggered an avalanche on the front end. Adding a circuit breaker let the front end fail fast and recover automatically once the backend stabilized.

Redis overload: Network jitter produced an explosion of Redis connections, inflating response time from ~1 ms to 300 ms and triggering a cascading failure. Engineers spent three days reproducing the issue, then built a monitoring tool that collects /proc metrics every 10 seconds, which made it possible to localize faults within three minutes.
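The circuit-breaker mitigation from the timeout-cascade incident above can be illustrated with a minimal sketch. This is not Ele.me's implementation: the class name CircuitBreaker, the parameters failure_threshold and reset_timeout, and the CircuitOpenError exception are all assumptions made for this example. It shows the two behaviours the text describes: failing fast while the backend is unhealthy, and recovering automatically once a trial call succeeds.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the backend."""


class CircuitBreaker:
    # Hypothetical defaults; real thresholds depend on the service.
    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a slow backend.
                raise CircuitOpenError("backend marked unhealthy; failing fast")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each RPC as `breaker.call(fetch_order, order_id)` (where `fetch_order` is a placeholder) and treat `CircuitOpenError` as a signal to return a degraded response immediately.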

Additional practices: monitoring was divided into Metric, Log, Trace, and Infrastructure layers; a NOC team handled on‑call alerts; service governance, SOA, release processes, and degradation mechanisms were integrated.

Stage 2 – Stability Maintenance (Capacity Focus)

Rapid growth made capacity the primary stability risk. The team instituted regular online full‑chain load tests, mobilising hundreds of engineers for a month‑long campaign that identified and fixed ~200 hidden bottlenecks.

Flash‑sale (秒杀) incident:

During a 517 flash‑sale, traffic spiked to 50× normal, saturating front‑end Nginx and causing network congestion.

Root causes: insufficient flash‑sale experience, lack of traffic‑shaping, and missing priority‑based rate limiting.

Remediation: built a dedicated protection system with tiered rate limiting, client‑side caching, lane isolation, cloud clusters, and competitive caching.
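To make the idea of tiered, priority-based rate limiting concrete, here is a small sketch built on a token bucket. The article does not describe how Ele.me's protection system works internally, so everything here is an assumption: the class TieredRateLimiter, the reserve_by_tier mapping, and the tier names "order" and "browse". The sketch admits a low-priority request only if enough tokens would remain in reserve for higher-priority traffic.

```python
import time


class TieredRateLimiter:
    """Token bucket that holds back part of its capacity for higher-priority tiers."""

    def __init__(self, rate, capacity, reserve_by_tier, clock=time.monotonic):
        self.rate = rate                        # tokens refilled per second
        self.capacity = capacity                # maximum burst size
        self.reserve_by_tier = reserve_by_tier  # tier -> tokens that must remain after admission
        self.clock = clock                      # injectable so tests need not sleep
        self.tokens = float(capacity)
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def allow(self, tier):
        """Return True and consume one token if this tier may proceed."""
        self._refill()
        reserve = self.reserve_by_tier[tier]
        if self.tokens - 1 >= reserve:
            self.tokens -= 1
            return True
        return False  # shed this request; the caller can degrade or reject


# Hypothetical tiers: ordering traffic may drain the bucket completely,
# browsing traffic must leave half of the capacity untouched.
limiter = TieredRateLimiter(rate=100, capacity=200,
                            reserve_by_tier={"order": 0, "browse": 100})
```

Under a spike, `limiter.allow("browse")` starts returning False while `limiter.allow("order")` still succeeds, which is the property that plain, single-tier rate limiting lacks.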

Stage 3 – Efficiency Gains

Focus shifted to tooling, resource optimisation, and architectural refactoring.

BeeBird delivery failures: Aggressive message retries built up a RabbitMQ (RMQ) backlog and exhausted UDP handles, and circuit breakers were misused. Solution: stricter retry policies and tighter component governance.

MySQL slow queries: The slow‑query count dropped from 2‑3 per week to near zero after components were turned into services, rate limiting was applied, and degradation paths were added.

RMQ reliability: Issues such as partition recovery failures, queue blockage, and excessive connection recreation were mitigated by maintaining a cold‑standby RMQ cluster and enforcing connection reuse.
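The "stricter retry policies" mentioned for the delivery failures above are not spelled out in the article. One common way to keep retries from flooding a queue is a bounded attempt count combined with exponential backoff and jitter. The following is a sketch under that assumption; the function name retry_with_backoff and its parameters are hypothetical, and the technique is a general one rather than the specific policy Ele.me adopted.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call func until it succeeds or max_attempts is reached, then re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up; unbounded retries are what build a backlog
            # Upper bound doubles each attempt, limited by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # "Full jitter": wait a random fraction of the bound so that
            # many clients do not retry at the same instant.
            sleep(cap * rng())
```

The sleep and rng parameters are injectable only so the behaviour can be checked without real waiting; production callers would leave them at their defaults.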

Governance emphasized deep component expertise, knowledge transfer, and embedding best‑practice rules into resource‑request and architecture‑review processes.

Operational Insights

Incidents are largely preventable through correct usage patterns, capacity forecasting, and gradual roll‑outs. Recommended practices:

Conduct systematic “big sweeps” of historical incidents to extract reusable preventive procedures.

Maintain unified, high‑visibility monitoring to avoid fragmented information that slows fault isolation.

Automate detection and diagnostics to reduce mean‑time‑to‑detect (MTTD) and mean‑time‑to‑recover (MTTR).

Run regular blind‑drill or chaos‑engineering exercises to keep engineers familiar with critical paths and failure modes.
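As a small worked example of the MTTD and MTTR metrics named above, the sketch below computes both from a list of incident records. The record fields (started, detected, recovered) and the two sample incidents are invented for illustration and do not come from Ele.me. Definitions also vary between teams: here MTTR is measured from the start of the incident, which is an assumption; some teams measure it from detection instead.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average duration in minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Made-up sample data, for illustration only.
incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 4),
     "recovered": datetime(2024, 1, 1, 10, 30)},
    {"started": datetime(2024, 1, 2, 14, 0),
     "detected": datetime(2024, 1, 2, 14, 2),
     "recovered": datetime(2024, 1, 2, 14, 20)},
]

mttd = mean_minutes(incidents, "started", "detected")   # (4 + 2) / 2 = 3.0 minutes
mttr = mean_minutes(incidents, "started", "recovered")  # (30 + 20) / 2 = 25.0 minutes
```

Tracking these two numbers over time gives a simple way to check whether automated detection and diagnostics are actually shortening incidents.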

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.