From Monolith to Scalable Order System: Lessons from Ele.me’s 4‑Year Evolution
Over four years at Ele.me, the author chronicles the transformation of the order platform from a single‑machine Python monolith to a distributed, sharded, message‑driven architecture, detailing challenges in scaling, performance testing, fault injection, Redis usage, and service separation that shaped today’s robust backend.
Why This Article Was Written
The author spent four years in Ele.me’s transaction team, documenting stories and lessons that are rarely known outside the company. The goal is to preserve the evolution of the order system, its pain points, and the reasoning behind each architectural change.
1. The "Taichu" Era (2012‑2014)
Ele.me’s early order system, called Zeus, was a single‑machine Python monolith that bundled order, user, and restaurant modules together and communicated via the Thrift protocol. Supporting services such as Walle (PHP) and the PC front‑end were also tightly coupled.
2. Formation of the Order Team
In late 2014 the author joined as an intern, later becoming a full‑time engineer. By May 2015 the Order team was officially created with just two members. The first month was spent reading code, drawing diagrams, and documenting the entire order lifecycle.
3. Zeus Decoupling (2015‑2016)
To enable independent development, Zeus was split into several services:
zeus.eos → Order Service
zeus.eus → User Service
zeus.ers → Merchant Service
zeus.eps → Marketing Service
zeus.sms → SMS Service
The decoupling proceeded in phases: code‑base separation, proxy layer, full code migration, configuration isolation, and finally physical deployment separation.
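The proxy‑layer phase can be pictured with a small sketch. It is an illustration under stated assumptions, not the team's actual code: the class name `OrderServiceProxy`, the `migrated_methods` set, and the two stand‑in backends are all hypothetical. The idea is that callers keep a single entry point while individual methods move from the legacy module to the extracted service one at a time.

```python
class OrderServiceProxy:
    """Hypothetical proxy: route each call to the legacy module or the new service."""

    def __init__(self, legacy, remote, migrated_methods):
        self._legacy = legacy                   # in-process code still living in the monolith
        self._remote = remote                   # client for the extracted service (assumed)
        self._migrated = set(migrated_methods)  # names of methods already cut over

    def __getattr__(self, name):
        # Only reached for attributes not set in __init__, i.e. the business methods.
        target = self._remote if name in self._migrated else self._legacy
        return getattr(target, name)


class LegacyOrders:
    def get_order(self, order_id):
        return {"id": order_id, "source": "legacy"}

    def cancel_order(self, order_id):
        return {"id": order_id, "source": "legacy", "cancelled": True}


class RemoteOrders:
    def get_order(self, order_id):
        return {"id": order_id, "source": "remote"}


proxy = OrderServiceProxy(LegacyOrders(), RemoteOrders(), migrated_methods={"get_order"})
```

Here `proxy.get_order(...)` is answered by the new service while `proxy.cancel_order(...)` still runs in the monolith, so a rollback amounts to removing a name from the set.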
4. Sharding and Database Refactoring
By mid‑2015 the order database could no longer keep up with more than a million orders a day. A two‑dimensional sharding strategy (120 shards per dimension) was introduced so that queries could be routed by user ID, merchant ID, or order ID. The migration involved:
Creating a new order number scheme
Dual‑write to old and new tables
SQL compatibility fixes
Gradual cut‑over to the sharded schema
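The routing and dual‑write steps above can be sketched as follows. This is a minimal illustration, not Ele.me's actual scheme: only the figure of 120 shards per dimension comes from the text. The modulo rule, the order‑number layout, and every function name here are assumptions made for the example.

```python
from typing import Callable, Tuple

SHARDS_PER_DIMENSION = 120  # from the article; everything else is illustrative


def user_shard(user_id: int) -> int:
    """Shard index in the user dimension (assumed: plain modulo)."""
    return user_id % SHARDS_PER_DIMENSION


def merchant_shard(merchant_id: int) -> int:
    """Shard index in the merchant dimension (assumed: plain modulo)."""
    return merchant_id % SHARDS_PER_DIMENSION


def make_order_id(sequence: int, user_id: int, merchant_id: int) -> int:
    """Embed both shard indices in the order number (assumed layout),
    so a lookup by order ID alone can still be routed."""
    return (sequence * 1000 + user_shard(user_id)) * 1000 + merchant_shard(merchant_id)


def shards_from_order_id(order_id: int) -> Tuple[int, int]:
    """Recover (user shard, merchant shard) from an order number."""
    return (order_id // 1000) % 1000, order_id % 1000


def dual_write(order: dict,
               write_old: Callable[[dict], None],
               write_new: Callable[[dict], None],
               on_failure: Callable[[dict, Exception], None]) -> None:
    """Write to the old table first (still the source of truth), then to the
    sharded table; a failure on the new side is recorded, not raised."""
    write_old(order)
    try:
        write_new(order)
    except Exception as exc:
        on_failure(order, exc)
```

Keeping the old table authoritative during dual‑write means a bug in the new path costs a reconciliation job rather than a lost order, which is what makes a gradual cut‑over possible.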
5. Message Broadcasting
To further decouple services, a RabbitMQ‑based order event broadcast was built. After evaluating several MQ solutions, RabbitMQ was chosen for operational familiarity. The system used a three‑broker cluster, added per‑service fault‑tolerance, and introduced a feature flag to control message emission.
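A rough sketch of flag‑controlled, failure‑tolerant event emission is shown below. It deliberately avoids any real RabbitMQ client API: `publish` is a hypothetical callable standing in for whatever client the team used, and the flag name, routing‑key format, and message shape are assumptions for illustration.

```python
import json
import logging
from typing import Callable, Dict

logger = logging.getLogger("order_events")


class OrderEventBroadcaster:
    """Hypothetical broadcaster: emits order events unless the flag is off."""

    FLAG = "order_event_broadcast_enabled"  # assumed flag name

    def __init__(self, publish: Callable[[str, bytes], None], flags: Dict[str, bool]):
        self._publish = publish  # stand-in for a real MQ client's publish call
        self._flags = flags      # stand-in for a config or feature-flag service

    def emit(self, event_type: str, order_id: int, payload: dict) -> bool:
        """Return True if the event was handed to the broker, False otherwise."""
        if not self._flags.get(self.FLAG, False):
            return False  # emission switched off
        body = json.dumps(
            {"type": event_type, "order_id": order_id, "payload": payload}
        ).encode("utf-8")
        try:
            self._publish(f"order.{event_type}", body)
        except Exception:
            # A broker problem must not fail the order flow itself.
            logger.exception("failed to broadcast %s for order %s", event_type, order_id)
            return False
        return True
```

The important property is that the order write path never raises because of the broker: the flag gives operators a kill switch, and the `except` branch turns a broker outage into a logged, recoverable gap.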
6. Testing, Performance, and Chaos Engineering
A dedicated testing team was formed in early 2016. Automation used RobotFramework for integration tests, Jenkins for scheduling, and a Django UI for test management (named WeBot). Performance testing employed Locust, revealing bottlenecks in high‑concurrency scenarios. A chaos‑engineering tool called Kennel (inspired by Netflix’s Chaos Monkey) allowed controlled failure injection, uncovering issues such as service registration gaps, load‑balancer time‑outs, and missing hard‑timeouts in the Python SOA framework.
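To make the "missing hard‑timeout" finding concrete, here is a small sketch of injecting latency into a call and bounding it with a caller‑side deadline. It does not reproduce Kennel or the team's SOA framework: `inject_delay` and `call_with_hard_timeout` are hypothetical helpers built on the standard library only.

```python
import functools
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def inject_delay(delay_s: float):
    """Fault-injection decorator (illustrative): add latency before the real call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(delay_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_hard_timeout(func, timeout_s: float, *args, **kwargs):
    """Run func in a worker thread and stop waiting after timeout_s seconds.

    The worker thread is not killed; only the caller is released, so the
    callee should also enforce its own socket or database timeouts.
    """
    future = _pool.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only effective if the task has not started yet
        raise TimeoutError(f"call exceeded hard timeout of {timeout_s}s") from None
```

Without a deadline of this kind, a single slow dependency can pin every worker of the calling service, which is the sort of failure that controlled injection is meant to surface before production traffic does.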
7. Redis Usage and Cache Refactoring
Redis was heavily used for caching, distributed locks, and order ID generation. Early cache‑key designs (interface‑level caching with sliding timestamps) caused massive cache churn and memory exhaustion. After migration to a new Redis proxy (Corvus) and redesigning the cache hierarchy, hit rates improved dramatically and memory usage stabilized.
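The churn problem can be illustrated without Redis at all. The sketch below assumes one plausible reading of "interface‑level caching with sliding timestamps": the cache key is built from a method name plus its arguments, one of which is a timestamp that moves with every request. The key formats and function names are made up for the example.

```python
def interface_level_key(method: str, user_id: int, since_ts: int) -> str:
    """Key derived from the call signature; since_ts changes on every request."""
    return f"{method}:{user_id}:{since_ts}"


def entity_level_key(order_id: int) -> str:
    """Key derived from the entity itself; stable across requests."""
    return f"order:{order_id}"


def simulate(requests: int, order_ids):
    """Count distinct keys written when one user polls 'orders in the last hour'
    once per second."""
    interface_keys = set()
    entity_keys = set()
    now = 1_700_000_000  # arbitrary starting timestamp
    for i in range(requests):
        since_ts = now + i - 3600  # the window slides by one second per request
        interface_keys.add(interface_level_key("recent_orders", 7, since_ts))
        entity_keys.update(entity_level_key(o) for o in order_ids)
    return len(interface_keys), len(entity_keys)
```

With 60 polls over the same three orders, `simulate(60, [101, 102, 103])` yields 60 distinct interface‑level keys, none of which is ever read twice, against 3 entity‑level keys that are hit on every request.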
8. Virtual Goods and Innovation (Breakfast, Membership Cards)
In 2015‑2016 the team built a lightweight virtual‑goods order system for breakfast and membership cards. The data model split the order into a core table (buyer, seller, status) and an extension table (items, promotions). This allowed rapid development (2‑3 days) and low‑latency processing for low‑volume business lines.
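The core/extension split might look roughly like this. The field names follow the article (buyer, seller, status; items, promotions), but the types, the status values, and the `create_virtual_order` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OrderCore:
    """Row in the core table: fields every business line shares."""
    order_id: int
    buyer_id: int
    seller_id: int
    status: str = "created"  # assumed initial status


@dataclass
class OrderExtension:
    """Row in the extension table: business-specific detail keyed by order_id."""
    order_id: int
    items: List[dict] = field(default_factory=list)
    promotions: List[dict] = field(default_factory=list)


def create_virtual_order(order_id: int, buyer_id: int, seller_id: int,
                         items: List[dict]) -> Tuple[OrderCore, OrderExtension]:
    """Build both rows for a new order (hypothetical helper)."""
    core = OrderCore(order_id=order_id, buyer_id=buyer_id, seller_id=seller_id)
    ext = OrderExtension(order_id=order_id, items=list(items))
    return core, ext
```

Because status transitions touch only the narrow core row, a new business line such as membership cards needs little more than its own extension payload, which is consistent with the short development time the article reports.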
9. Service and Business Governance
Key governance actions included:
Separating forward (high‑throughput) and reverse (low‑throughput) flows into distinct services.
Introducing a new order‑completion state to simplify settlement.
Splitting reverse‑flow logic into a dedicated team.
Replacing the heavy ToB/ToD messaging layer with a thin RPC bridge (osc.blink) to reduce latency.
10. Lessons Learned
Throughout the journey the team emphasized:
Clear ownership and permission revocation.
Regular cleanup of dead code and configuration.
Continuous testing (unit, integration, performance, chaos).
Keeping designs simple and easy to understand.
Focusing on the problem, not the person.
These practices turned a chaotic, rapidly growing monolith into a stable, scalable backend that could support millions of daily orders.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
