Evolution of Ele.me's Order System: Architecture, Scaling, and Lessons Learned

This article recounts the four‑year journey of Ele.me's order platform, detailing the transition from a monolithic Zeus system to service‑oriented components, the challenges of sharding, message broadcasting, performance testing, Redis optimization, and the cultural practices that shaped a resilient backend architecture.

Full-Stack Internet Architecture

The author, a senior architect at Ele.me, explains why they wrote this piece: to preserve four years of undocumented transaction domain knowledge, to help newcomers understand the evolution beyond buzzwords like distributed systems and high traffic, and to share hard‑earned lessons, including both successes and mistakes.

In the "Ancient" era, Ele.me's core services (order, user, restaurant) lived in a single Python codebase called Zeus, which communicated via Thrift and was backed by a PHP system named Walle. The first order‑related commit appeared in September 2012, marking the birth of the ElemeOrderService (EOS).

The order team was formed in 2015 and spent its first month reading code, mapping the business, and drawing diagrams. Early efforts focused on decoupling Zeus into independent services (eos, eus, ers, eps, sms, etc.), moving step by step from shared repositories through proxy layers and script migrations to independent Git histories.

To overcome database bottlenecks, a two‑dimensional sharding strategy with 120 shards per dimension was introduced, accompanied by new order‑ID generation, dual‑write migration, and extensive SQL refactoring. The migration was executed with minimal downtime, dramatically improving throughput.
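
The routing and dual-write ideas can be illustrated with a minimal sketch. It assumes the two dimensions are user and restaurant and that a plain modulo picks the shard; the table names, the `OrderRoute` type, and the in-memory "tables" are hypothetical stand-ins, not Ele.me's actual implementation.

```python
from dataclasses import dataclass

SHARDS_PER_DIMENSION = 120  # shard count per dimension, as stated in the text


@dataclass(frozen=True)
class OrderRoute:
    user_shard: str        # copy keyed by user (assumed dimension)
    restaurant_shard: str  # copy keyed by restaurant (assumed dimension)


def shard_index(key: int) -> int:
    """Map a numeric key to one of the 120 shards with a plain modulo."""
    return key % SHARDS_PER_DIMENSION


def route_order(user_id: int, restaurant_id: int) -> OrderRoute:
    """Return the two shard tables an order row would be written to."""
    return OrderRoute(
        user_shard=f"order_by_user_{shard_index(user_id):03d}",
        restaurant_shard=f"order_by_restaurant_{shard_index(restaurant_id):03d}",
    )


def dual_write(order: dict, legacy_table: list, shards: dict) -> OrderRoute:
    """During migration, write to the legacy table and to both new shards.

    Lists and dicts stand in for real database tables here.
    """
    route = route_order(order["user_id"], order["restaurant_id"])
    legacy_table.append(order)
    shards.setdefault(route.user_shard, []).append(order)
    shards.setdefault(route.restaurant_shard, []).append(order)
    return route
```

Storing each order once per dimension lets queries by user and queries by restaurant each hit a single shard, at the cost of writing every order twice.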

Message broadcasting was adopted using RabbitMQ (later replaced by a custom solution) to decouple services, with strict guidelines on event‑driven exposure, idempotent consumers, and queue depth limits. Fault‑injection experiments (Kennel) revealed hidden reliability issues and guided resilience improvements.
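
An idempotent consumer, as the guidelines require, can be sketched as follows. The event fields (`event_id`, `order_id`, `status`) and the in-memory deduplication set are assumptions for illustration; a production consumer would typically keep the processed-ID record in durable storage.

```python
class IdempotentConsumer:
    """Applies each order event at most once, keyed by its event ID."""

    def __init__(self) -> None:
        self._seen: set = set()        # IDs of events already applied
        self.order_status: dict = {}   # order_id -> latest applied status

    def handle(self, event: dict) -> bool:
        """Apply the event; return False if it was a redelivered duplicate."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # duplicate delivery: skip without side effects
        self.order_status[event["order_id"]] = event["status"]
        self._seen.add(event_id)
        return True


consumer = IdempotentConsumer()
paid = {"event_id": "evt-1", "order_id": 1001, "status": "paid"}
first = consumer.handle(paid)   # applied
second = consumer.handle(paid)  # broker redelivery, ignored
```

Because message brokers may deliver the same message more than once, making the handler safe to repeat removes the need for exactly-once delivery guarantees.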

Testing infrastructure evolved from manual checks to automated integration tests using RobotFramework, Jenkins, and a Django‑based test management UI (WeBot). Performance testing shifted from JMeter to Locust, enabling realistic load scenarios and early detection of concurrency bugs.
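
The way concurrent load exposes concurrency bugs can be shown with a small standard-library sketch. This is not Locust: it only mimics the idea of many simulated users calling the same code at once, and `OrderIdAllocator` and `run_load` are hypothetical names introduced for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class OrderIdAllocator:
    """Hands out sequential IDs; the lock keeps concurrent callers safe."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next = 0

    def allocate(self) -> int:
        with self._lock:  # without this lock, duplicate IDs become possible
            self._next += 1
            return self._next


def run_load(target, users: int, requests_per_user: int) -> list:
    """Call `target` from `users` threads, `requests_per_user` times each."""

    def one_user(_user_index: int) -> list:
        return [target() for _request in range(requests_per_user)]

    with ThreadPoolExecutor(max_workers=users) as pool:
        batches = list(pool.map(one_user, range(users)))
    return [result for batch in batches for result in batch]


results = run_load(OrderIdAllocator().allocate, users=8, requests_per_user=50)
# A load test asserts an invariant that a race would break: IDs stay unique.
assert len(set(results)) == len(results)
```

The useful part of such a test is the invariant check at the end; throughput numbers alone would not reveal a race that produces duplicate IDs.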

Redis usage was refined by separating cache layers, eliminating ineffective interface‑level caching, and migrating to a custom proxy (corvus) that provided richer metrics and better memory management.
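
One way to judge whether a cache layer is effective is to measure its hit ratio. The sketch below uses a plain dict in place of Redis; the `MeasuredCache` class and its method names are hypothetical and are not part of corvus or any Redis client.

```python
class MeasuredCache:
    """Cache-aside wrapper that counts hits and misses for one cache layer."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._store: dict = {}  # stand-in for a Redis keyspace
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        """Return the cached value, calling `loader(key)` only on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = loader(key)
        self._store[key] = value
        return value

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def load_order(order_id: int) -> dict:
    # Hypothetical database read.
    return {"id": order_id, "status": "paid"}


cache = MeasuredCache("order-entity")
for order_id in (1, 1, 1, 2):
    cache.get_or_load(order_id, load_order)
ratio = cache.hit_ratio()  # 2 hits out of 4 lookups
```

A layer whose hit ratio stays low under real traffic costs memory and adds invalidation work without saving database reads, which is the kind of evidence that would justify removing it.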

Additional innovations included a virtual‑goods order system, a delivery membership card platform, and the gradual separation of forward and reverse order flows, each driven by business needs and scalability concerns.

The article concludes with a set of practical takeaways: clear ownership, regular cleanup, rigorous testing (automation, performance, chaos), continuous learning, and the principle of keeping systems simple and easy to maintain.

Written by

Full-Stack Internet Architecture
