From Monolith to Scalable Order System: Lessons from Ele.me’s 4‑Year Evolution

This article chronicles the author’s four years at Ele.me, during which the order platform evolved from a single‑machine Python monolith into a distributed, sharded, message‑driven architecture. It covers the challenges in scaling, performance testing, fault injection, Redis usage, and service separation that shaped today’s backend.

21CTO

Why This Article Was Written

The author spent four years in Ele.me’s transaction team, documenting stories and lessons that are rarely known outside the company. The goal is to preserve the evolution of the order system, its pain points, and the reasoning behind each architectural change.

1. The "Taichu" Era (2012‑2014)

Ele.me’s early order system, called Zeus, was a single‑machine Python monolith that bundled order, user, and restaurant modules together and communicated via the Thrift protocol. Supporting services such as Walle (PHP) and the PC front‑end were also tightly coupled.

Zeus architecture

2. Formation of the Order Team

In late 2014 the author joined as an intern, later becoming a full‑time engineer. By May 2015 the Order team was officially created with just two members. The first month was spent reading code, drawing diagrams, and documenting the entire order lifecycle.

3. Zeus Decoupling (2015‑2016)

To enable independent development, Zeus was split into several services:

zeus.eos → Order Service

zeus.eus → User Service

zeus.ers → Merchant Service

zeus.eps → Marketing Service

zeus.sms → SMS Service

The decoupling proceeded in phases: code‑base separation, proxy layer, full code migration, configuration isolation, and finally physical deployment separation.

4. Sharding and Database Refactoring

By mid‑2015 the order database could no longer keep up with more than a million orders a day. A two‑dimensional sharding strategy (120 shards per dimension) was introduced, with queries routed by user ID, merchant ID, or order ID. The migration involved:

Creating a new order number scheme

Dual‑write to old and new tables

SQL compatibility fixes

Gradual cut‑over to the sharded schema
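
The routing and dual‑write steps above can be sketched roughly as follows. This is a minimal illustration, not Ele.me’s actual scheme: only the figure of 120 shards per dimension comes from the article. The modulo routing, the order‑number layout, and every function name here are assumptions made for the example.

```python
from typing import Dict, List, Tuple

SHARDS_PER_DIMENSION = 120  # shard count per dimension, as stated in the article


def user_shard(user_id: int) -> int:
    """Shard index in the user dimension (modulo routing is an assumption)."""
    return user_id % SHARDS_PER_DIMENSION


def merchant_shard(merchant_id: int) -> int:
    """Shard index in the merchant dimension (modulo routing is an assumption)."""
    return merchant_id % SHARDS_PER_DIMENSION


def make_order_id(sequence: int, user_id: int, merchant_id: int) -> int:
    """Hypothetical layout: <sequence><user shard, 3 digits><merchant shard, 3 digits>."""
    return (sequence * 1000 + user_shard(user_id)) * 1000 + merchant_shard(merchant_id)


def shards_from_order_id(order_id: int) -> Tuple[int, int]:
    """Recover (user shard, merchant shard) from the order number alone."""
    return (order_id // 1000) % 1000, order_id % 1000


def dual_write(order: dict,
               legacy_table: List[dict],
               user_tables: Dict[int, List[dict]],
               merchant_tables: Dict[int, List[dict]]) -> None:
    """Write to the old table first, then to both sharded dimensions."""
    legacy_table.append(order)
    u, m = shards_from_order_id(order["id"])
    user_tables.setdefault(u, []).append(order)
    merchant_tables.setdefault(m, []).append(order)
```

In a layout like this one, the shard numbers travel inside the order number, so a lookup by order ID can be routed without a separate mapping table. That is one plausible reason a new order number scheme would come first in the migration.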

Sharding architecture

5. Message Broadcasting

To further decouple services, a RabbitMQ‑based order event broadcast was built. After evaluating several MQ solutions, RabbitMQ was chosen for operational familiarity. The system used a three‑broker cluster, added per‑service fault‑tolerance, and introduced a feature flag to control message emission.
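
A feature‑flagged emitter of this kind might look like the sketch below. It is deliberately broker‑agnostic: the publish callable stands in for whatever RabbitMQ client the team used, and the class name, routing‑key format, and the choice to swallow publish errors are all assumptions for illustration.

```python
import json
from typing import Callable


class OrderEventBroadcaster:
    """Emits order events through an injected publish function (hypothetical design)."""

    def __init__(self, publish: Callable[[str, bytes], None], enabled: bool = True):
        self._publish = publish   # e.g. a thin wrapper around a broker client
        self.enabled = enabled    # the feature flag that controls emission

    def emit(self, event_type: str, order_id: int) -> bool:
        """Return True if the event was handed to the broker, False otherwise."""
        if not self.enabled:
            return False
        body = json.dumps({"type": event_type, "order_id": order_id}).encode("utf-8")
        try:
            self._publish(f"order.{event_type}", body)
        except Exception:
            # Assumption: a broker failure should not fail the order flow itself.
            return False
        return True
```

With the flag off, `emit` becomes a no‑op, which lets operators stop message emission without a redeploy.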

Message broadcasting

6. Testing, Performance, and Chaos Engineering

A dedicated testing team was formed in early 2016. Automation used Robot Framework for integration tests, Jenkins for scheduling, and a Django UI for test management (named WeBot). Performance testing employed Locust, revealing bottlenecks in high‑concurrency scenarios. A chaos‑engineering tool called Kennel (inspired by Netflix’s Chaos Monkey) allowed controlled failure injection, uncovering issues such as service registration gaps, load‑balancer time‑outs, and missing hard‑timeouts in the Python SOA framework.
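
To make the "missing hard‑timeouts" finding concrete, here is a minimal sketch of a hard timeout around a downstream call, using only the Python standard library. It does not reproduce the team’s SOA framework; the helper name and the stand‑in dependency are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_hard_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread and stop waiting after timeout_s seconds."""
    future = _pool.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller is released; the worker thread itself may keep running,
        # because Python threads cannot be forcibly killed.
        future.cancel()
        raise TimeoutError(f"call exceeded {timeout_s}s")


def slow_dependency(delay_s: float) -> str:
    """Stand-in for a downstream service that responds slowly (hypothetical)."""
    time.sleep(delay_s)
    return "ok"
```

A fault‑injection run that slows a dependency down would hang a caller that has no such bound, whereas `call_with_hard_timeout(slow_dependency, 0.05, 0.3)` raises `TimeoutError` and frees the caller.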

7. Redis Usage and Cache Refactoring

Redis was heavily used for caching, distributed locks, and order ID generation. Early cache‑key designs (interface‑level caching with sliding timestamps) caused massive cache churn and memory exhaustion. After migration to a new Redis proxy (Corvus) and redesigning the cache hierarchy, hit rates improved dramatically and memory usage stabilized.
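
The cache‑churn problem can be illustrated with a small simulation. It assumes that "sliding timestamps" means a changing timestamp argument ended up inside the cache key, which is one plausible reading of the summary. The key formats and function names are invented for the example, and a plain Python set stands in for Redis.

```python
from typing import Iterable, Tuple


def interface_level_key(method: str, user_id: int, since_ts: int) -> str:
    """Key built from the full call signature, including a moving timestamp."""
    return f"{method}:{user_id}:{since_ts}"


def entity_level_key(order_id: int) -> str:
    """Key built from the identity of the cached entity only."""
    return f"order:{order_id}"


def simulate(keys: Iterable[str]) -> Tuple[int, int]:
    """Return (cache hits, number of entries stored) for a sequence of lookups."""
    cache = set()
    hits = 0
    for key in keys:
        if key in cache:
            hits += 1
        else:
            cache.add(key)  # a miss stores a new entry and consumes memory
    return hits, len(cache)


# Five calls one second apart: every key is new, so nothing is ever reused.
sliding = [interface_level_key("query_orders", 1, 1000 + i) for i in range(5)]
# Five reads of the same order: one entry, reused four times.
stable = [entity_level_key(9) for _ in range(5)]
```

Here `simulate(sliding)` gives zero hits and five stored entries, while `simulate(stable)` gives four hits and a single entry, which mirrors the low hit rate and memory growth described above.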

Redis metrics

8. Virtual Goods and Innovation (Breakfast, Membership Cards)

In 2015‑2016 the team built a lightweight virtual‑goods order system for breakfast and membership cards. The data model split the order into a core table (buyer, seller, status) and an extension table (items, promotions). This allowed rapid development (2‑3 days) and low‑latency processing for low‑volume business lines.
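
The core/extension split might be modeled along these lines. The field groupings (buyer, seller, status versus items, promotions) follow the article. The class names, field types, and the helper function are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CoreOrder:
    """Small, stable core record shared by every business line."""
    order_id: int
    buyer_id: int
    seller_id: int
    status: str = "created"


@dataclass
class OrderExtension:
    """Business-specific details kept apart from the core record."""
    order_id: int
    items: List[dict] = field(default_factory=list)
    promotions: List[dict] = field(default_factory=list)


def create_virtual_order(order_id: int, buyer_id: int, seller_id: int,
                         items: List[dict],
                         promotions: Optional[List[dict]] = None
                         ) -> Tuple[CoreOrder, OrderExtension]:
    """Build both records for one order; they are linked by order_id."""
    core = CoreOrder(order_id, buyer_id, seller_id)
    ext = OrderExtension(order_id, list(items), list(promotions or []))
    return core, ext
```

Under this kind of split, a new business line such as membership cards only needs to decide what goes into the extension record, while status handling on the core record stays unchanged.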

9. Service and Business Governance

Key governance actions included:

Separating forward (high‑throughput) and reverse (low‑throughput) flows into distinct services.

Introducing a new order‑completion state to simplify settlement.

Splitting reverse‑flow logic into a dedicated team.

Replacing the heavy ToB/ToD messaging layer with a thin RPC bridge (osc.blink) to reduce latency.

10. Lessons Learned

Throughout the journey the team emphasized:

Clear ownership and permission revocation.

Regular cleanup of dead code and configuration.

Continuous testing (unit, integration, performance, chaos).

Keeping designs simple and easy to understand.

Focusing on the problem, not the person.

These practices turned a chaotic, rapidly growing monolith into a stable, scalable backend that could support millions of daily orders.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
