From Monolith to Microservices: Eleme’s Order System Evolution and Key Lessons

This article traces the four‑year transformation of Eleme’s order platform, starting from the early Python‑based Zeus monolith and moving through team formation, service decoupling, sharding, message broadcasting, test automation, Redis and cache redesign, virtual product handling, and governance. It highlights the practical challenges, decisions, and hard‑won engineering insights along the way.

dbaplus Community

1. The Early Era ("Taigu")

In 2012 Eleme ran Zeus, a Python monolith deployed on a single machine. It bundled core modules such as orders, users, and restaurants, and communicated via the Thrift protocol. The PC front‑end and the merchant portal (NaposPC) were separate applications, while auxiliary functions lived in a PHP system named Walle.

The first order‑related commit ("add eos service for zeus") was made by Yu Lixin on 2012‑09‑01, introducing the ElemeOrderService (EOS) which later became the canonical term for the forward‑facing order service.

Zeus was later refactored into Zeus2, though the exact timing is unclear.

2. Sprouting the Order Group

In late 2014 the author joined Eleme as an intern, soon helped migrate an older BD system to Walis, and then led the migration of Walis from a single application to a distributed architecture.

By May 2015 a dedicated Order Group was formed with just two members. The first month was spent reading code, mapping business logic, and producing a comprehensive diagram of the order lifecycle.

2.1 Zeus Decoupling

Starting around June 2015 the team began splitting the monolithic Zeus into several services:

zeus.eos → Order Service

zeus.eus → User Service

zeus.ers → Merchant Service

zeus.eps → Marketing Service (new)

zeus.sms → SMS Service

The decoupling proceeded in stages:

July – shared code repository, each module could be started independently on specific machines.

August – Proxy stage: add a proxy on the old service to forward traffic to the new service, controlled by a service‑registry switch.

Aug‑Sep – Complete script and module refactoring.

September – Use git filter‑branch to extract each module’s history into its own repository while still deploying as a mixed codebase.

September – Migrate configuration from SaltStack to a service‑registry based configuration system.

Following March – Physical deployment isolation (second‑phase).
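The proxy stage can be illustrated with a minimal Python sketch. It assumes a switch that can be flipped at runtime; SwitchRegistry, make_proxy, the handler functions, and the switch name are all hypothetical stand-ins, since the article does not describe the actual registry API.

```python
from typing import Any, Callable, Dict


class SwitchRegistry:
    """In-memory stand-in for a service-registry switch store (hypothetical)."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown switches default to off, so traffic stays on the old path.
        return self._flags.get(name, False)


def make_proxy(
    switch_name: str,
    registry: SwitchRegistry,
    old_handler: Callable[..., Any],
    new_handler: Callable[..., Any],
) -> Callable[..., Any]:
    """Wrap the old handler so calls are forwarded when the switch is on."""

    def handler(*args: Any, **kwargs: Any) -> Any:
        if registry.is_enabled(switch_name):
            return new_handler(*args, **kwargs)
        return old_handler(*args, **kwargs)

    return handler


def legacy_get_order(order_id: int) -> Dict[str, Any]:
    # Placeholder for the code path still living inside the monolith.
    return {"order_id": order_id, "served_by": "legacy"}


def new_get_order(order_id: int) -> Dict[str, Any]:
    # Placeholder for an RPC call to the extracted order service.
    return {"order_id": order_id, "served_by": "new"}


registry = SwitchRegistry()
get_order = make_proxy(
    "eos.get_order.forward", registry, legacy_get_order, new_get_order
)
```

With the switch off, get_order(1) is served by the legacy path; calling registry.set("eos.get_order.forward", True) moves traffic to the new service without a redeploy, and setting it back to False acts as a rollback.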

The effort also produced a lightweight Python SOA framework, zeus_core, which was extracted ahead of the business services.

2.2 Sharding (Database Partitioning)

In late 2015 the team began a two‑dimensional sharding strategy (120 shards per dimension) routing by user ID, merchant ID, or order ID. The motivations were:

Inability of the 1‑master‑5‑slave MySQL cluster to handle peak concurrency.

High DDL cost – schema changes required hours of downtime and CEO approval.

Key steps included defining a new order‑ID generation rule, dual‑writes during migration, rewriting incompatible SQL, and finally switching reads and writes to the new shards.
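To make the routing idea concrete, here is a small Python sketch. The article only says that a new order‑ID generation rule was defined; the layout below (shard numbers stored in the trailing six decimal digits) and the modulo routing are assumptions chosen for illustration, not Eleme's actual rule. Only the figure of 120 shards per dimension comes from the text.

```python
from typing import Tuple

SHARDS_PER_DIMENSION = 120  # figure taken from the article


def user_shard(user_id: int) -> int:
    """Route by user ID (assumed: simple modulo)."""
    return user_id % SHARDS_PER_DIMENSION


def merchant_shard(merchant_id: int) -> int:
    """Route by merchant ID (assumed: simple modulo)."""
    return merchant_id % SHARDS_PER_DIMENSION


def make_order_id(sequence: int, user_id: int, merchant_id: int) -> int:
    """Build an order ID that carries both shard numbers.

    Hypothetical layout: <sequence><3-digit user shard><3-digit merchant shard>.
    """
    return (
        sequence * 1_000_000
        + user_shard(user_id) * 1_000
        + merchant_shard(merchant_id)
    )


def shards_from_order_id(order_id: int) -> Tuple[int, int]:
    """Recover (user shard, merchant shard) from the order ID alone."""
    suffix = order_id % 1_000_000
    return suffix // 1_000, suffix % 1_000


order_id = make_order_id(sequence=42, user_id=1001, merchant_id=2500)
```

Embedding the shard numbers in the ID means a lookup by order ID can find the right shard in either dimension without consulting a mapping table, which is one common reason to change the ID rule before sharding.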

2.3 Message Broadcasting

To further decouple the system, the team introduced an order‑event broadcast built on RabbitMQ (after evaluating NSQ, RocketMQ, Kafka, ActiveMQ). A three‑broker cluster was dedicated to order events, with client‑side fault‑tolerance for connection timeouts and retries. Early production incidents revealed a bug where HAProxy closed RabbitMQ connections, causing severe request timeouts; fixing HAProxy’s timeout settings resolved the issue.
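The client‑side fault tolerance mentioned above might look roughly like the following sketch. It retries on connection errors and timeouts with exponential backoff; the function name, retry count, and delays are illustrative assumptions, and publish stands for whatever callable actually sends the message to the broker.

```python
import time
from typing import Any, Callable, Tuple, Type


def publish_with_retry(
    publish: Callable[[Any], None],
    payload: Any,
    attempts: int = 3,
    base_delay_s: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call publish(payload), retrying transient failures.

    Returns the attempt number that succeeded; re-raises the last error
    once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            publish(payload)
            return attempt
        except retry_on:
            if attempt == attempts:
                raise
            # Back off 0.1s, 0.2s, 0.4s, ... before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Passing sleep as a parameter keeps the helper easy to test without real delays. Because a retried publish can deliver the same event twice, this approach depends on consumers being idempotent.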

Design principles for the broadcast:

Expose order state as events, not direct status.

Broadcast only events; consumers fetch detailed data via APIs.

Consumers must be idempotent and stateless; ordering is handled via Redis if needed.

Follow topic and queue naming conventions, cap queue depth at 10k, and monitor the performance impact.
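The first three principles can be sketched in Python as follows. The event carries only identifiers, the consumer fetches details through an API callable, and duplicate deliveries are skipped. OrderEvent, its fields, and IdempotentConsumer are hypothetical names; the in‑memory set stands in for a shared store (the article mentions Redis for ordering), which is what would let the consumer process itself remain stateless.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class OrderEvent:
    event_id: str    # unique per event (assumed field)
    order_id: int
    event_type: str  # an event such as "order_paid", not a raw status value


class IdempotentConsumer:
    def __init__(self, fetch_order: Callable[[int], Dict]) -> None:
        self._fetch_order = fetch_order
        # Stand-in for an external deduplication store.
        self._seen: Set[str] = set()
        self.handled: List[Tuple[str, Dict]] = []

    def handle(self, event: OrderEvent) -> bool:
        """Process an event once; return False for a duplicate delivery."""
        if event.event_id in self._seen:
            return False
        # The event is only a notification: details come from the order API.
        order = self._fetch_order(event.order_id)
        self.handled.append((event.event_type, order))
        # Mark as seen only after successful handling, so a failure
        # above leaves the event eligible for redelivery.
        self._seen.add(event.event_id)
        return True
```

Keeping the payload small means the event schema rarely changes, and consumers always read the current order data rather than a stale copy embedded in the message.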

3. Exploration Phase (OSC – Order Service Center)

From mid‑2015 to early 2016 the team explored a Java‑based rewrite called OSC (Order Service Center). The goal was a clean, minimal order snapshot that avoided heavy coupling. The actual language shift to Java, however, only materialized in 2019.

Key activities:

Built an automated integration test platform using RobotFramework, Jenkins, and a simple Django UI (later abandoned due to UI limitations).

Introduced Docker for test‑environment isolation (pre‑containerization era).

Established a layered testing approach: business libraries, validation components, integration suites, and regression suites.

Although the test platform eventually failed due to insufficient development skill and UI issues, the automated regression pipeline remained valuable for later order‑service refactors.

4. Performance Testing

The team adopted Locust for load testing because it integrated well with the Python SOA stack. Early incidents (e.g., a high‑QPS query that overloaded a slave DB) highlighted the need for systematic performance validation before releases.

5. Redis and Cache Improvements

By mid‑2016 Redis became a bottleneck. The team replaced Twemproxy/Codis with an in‑house proxy Corvus that exposed rich metrics. They discovered that interface‑level caching with per‑second TTL caused near‑100% cache misses during peak polling, leading to memory spikes. The solution was to remove the problematic caches, split table‑level and interface‑level caches into separate clusters, and increase cluster memory capacity.
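One possible reading of the cache‑miss problem is that each client polled its own key less often than the entry's TTL, so every poll found the entry already expired. The sketch below simulates that scenario for a single key; the 5‑second polling interval and the 60‑second comparison TTL are assumptions for illustration, not figures from the article.

```python
def simulate_hit_rate(poll_interval_s: float, ttl_s: float, polls: int) -> float:
    """Hit rate for one key that a single client polls at a fixed interval.

    A miss repopulates the cache entry, which then lives for ttl_s seconds.
    """
    expires_at = float("-inf")  # nothing cached before the first poll
    hits = 0
    for i in range(polls):
        now = i * poll_interval_s
        if now < expires_at:
            hits += 1
        else:
            # Miss: load from the database and write the entry again.
            expires_at = now + ttl_s
    return hits / polls


short_ttl = simulate_hit_rate(poll_interval_s=5, ttl_s=1, polls=100)
long_ttl = simulate_hit_rate(poll_interval_s=5, ttl_s=60, polls=100)
```

Under these assumptions the 1‑second TTL never produces a hit, so the cache adds a write per request and extra memory without offloading the database, whereas the longer TTL serves most polls from cache.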

6. Messaging Enhancements

After several MQ‑related outages, the team refined the RabbitMQ topology: they eliminated dead‑letter queues that created single‑point‑of‑failure nodes, added automatic queue declaration in service configs, and later replaced RabbitMQ with a Go‑based solution (MaxQ) to avoid the earlier pitfalls.

7. Virtual Product & Innovation

In late 2015/early 2016 the team built a lightweight virtual‑product order system for breakfast and delivery membership cards. The design used two tables: a core order table (buyer ID, status, business type, amount) and an extension table (item list, marketing info, phone). This allowed rapid (2‑3 day) integration for new platform‑wide products.
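The two‑table design can be sketched with Python's built‑in sqlite3 module. The columns follow the fields listed above, but the table names, column names, types, and the JSON encoding of the extension fields are assumptions; the article does not show the actual schema.

```python
import json
import sqlite3
from typing import Any, Dict, List


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE core_order (
            order_id      INTEGER PRIMARY KEY,
            buyer_id      INTEGER NOT NULL,
            status        TEXT    NOT NULL,
            business_type TEXT    NOT NULL,
            amount_cents  INTEGER NOT NULL
        );
        CREATE TABLE order_extension (
            order_id       INTEGER PRIMARY KEY REFERENCES core_order(order_id),
            items          TEXT NOT NULL,  -- JSON list of items
            marketing_info TEXT NOT NULL,  -- JSON object
            phone          TEXT NOT NULL
        );
        """
    )


def create_order(
    conn: sqlite3.Connection,
    buyer_id: int,
    business_type: str,
    amount_cents: int,
    items: List[Dict[str, Any]],
    marketing_info: Dict[str, Any],
    phone: str,
) -> int:
    """Insert the core row and its extension row in one transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO core_order (buyer_id, status, business_type, amount_cents)"
            " VALUES (?, 'created', ?, ?)",
            (buyer_id, business_type, amount_cents),
        )
        order_id = cur.lastrowid
        conn.execute(
            "INSERT INTO order_extension (order_id, items, marketing_info, phone)"
            " VALUES (?, ?, ?, ?)",
            (order_id, json.dumps(items), json.dumps(marketing_info), phone),
        )
    return order_id
```

In this sketch a new product type only needs a new business_type value and its own shape of extension data, so the core table stays unchanged, which is consistent with the short integration time the article reports.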

8. Service & Business Governance

Key governance actions included:

Splitting the order group into forward (order) and reverse (refund) services, each with distinct performance and complexity requirements.

Introducing a dedicated order‑completion state, then delegating refund logic to a separate team.

Refactoring ToC/ToB/ToD logistics flows: merging ToB and ToD into a single service (osc.blink), simplifying RPC calls, and reducing cross‑datacenter latency.

9. Lessons & Conclusions

By the end of 2016 the order system comprised multiple decoupled services, sharded databases, a robust messaging layer, and a disciplined testing and performance culture. The author emphasizes several habits:

Clear ownership and permission management for code, releases, and credentials.

Regular cleanup of dead code, stale configurations, and noisy logs.

Commitment to automated testing, performance testing, and chaos engineering.

Continuous learning, knowledge sharing, and keeping solutions simple.

The evolution was driven primarily by business needs and proactive engineering, with many improvements occurring after incidents, underscoring the importance of resilience and observability.

Figure: Zeus architecture diagram
Figure: Sharding physical layout
Figure: Sharding update flow
Figure: Message broadcast topology
Figure: Testing platform screenshot
Figure: Performance test results
Figure: ToB/ToD integration diagram