Comprehensive System Refactoring Case Study: From Monolithic to Distributed Architecture
This article details a complete system refactoring project, describing the original monolithic bottlenecks, the design of a distributed architecture, database model reconstruction, phased migration, testing, and rollout strategies to achieve scalable, reliable backend services.
01 Background
The original system was built as an all‑in‑one monolith. Rapid business growth caused a sharp increase in user traffic, leading to performance degradation and numerous stability issues. The initial architecture diagram is shown below.
02 Pain Points
The main problems encountered were:
Severe module coupling prevented rapid scaling.
Mixed database tables (e.g., payment and product orders sharing a single table) caused ambiguous status handling.
Complex SQL with many cross‑table joins resulted in slow queries and frequent DB alerts.
Lack of service and domain boundaries led to tightly coupled interfaces and single‑point failures.
Slow API responses, poor stability, and frequent data loss or corruption.
Oversized product releases and scattered business logic slowed iteration.
High customer‑complaint rate and difficult root‑cause analysis exhausted the development team.
Two possible approaches were considered:
Continue iterating on the existing system, which would require more manpower to maintain stability.
Perform a complete system rewrite, accepting short‑term impact on feature delivery.
Given the long‑term product roadmap and the current system’s bottlenecks, a full rewrite was chosen.
The project lead was assigned responsibility for the refactor, subject to three key constraints:
Maintain the current pace of business‑requirement delivery, possibly by adding staff.
Design the new system to handle projected traffic and data growth for the next three years.
Ensure a seamless switch‑over without data loss or service disruption.
03 Solution
Refactoring a live, high‑growth system is akin to changing an aircraft’s engine mid‑flight; it requires thorough planning and safeguards.
The technical principles defined were:
Adopt a distributed architecture, splitting all modules into independently deployable services.
Fully redesign the database schema to support future expansion.
Consolidate business logic into domain‑specific services with unified APIs.
Implement dual‑write between old and new databases to guarantee data integrity.
Run old and new systems in parallel, using grey‑scale traffic controls until the legacy system can be retired.
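As an illustrative sketch of the grey‑scale principle (the function name and percentage scheme are assumptions, not the project's actual mechanism), traffic control can be as simple as hashing a stable user ID into a bucket and comparing it against a configurable rollout percentage:

```python
import hashlib

def in_grey_scale(user_id: str, rollout_percent: int) -> bool:
    """Deterministically map a user to a 0-99 bucket; users in buckets
    below the rollout percentage are routed to the new system."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# The same user always lands in the same bucket, so routing is sticky:
# raising rollout_percent gradually shifts more traffic to the new path
# without bouncing individual users between old and new systems.
```

Stickiness matters here: a user flapping between systems mid‑session would see inconsistent state, so the bucket must depend only on the user ID, not on randomness per request.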
04 Implementation
Requirement and Interface Analysis
With the high‑level goals set, the team began detailed implementation. The order module was identified as the core bottleneck, so the refactor was staged, starting with order‑related functionality.
Business requirements and product specifications were reviewed, and real‑time product simulations were used to map the order workflow.
Interface‑level analysis was performed to capture all upstream calls to the order module and downstream dependencies, ensuring complete coverage.
This analysis informed the new database model design.
Data Model Considerations
Key decisions for the new order database included:
Sharding the order table by user‑ID into 64 tables, each handling up to 50 million rows, to accommodate projected growth; additional query dimensions (time, region) are served via Elasticsearch indexes.
Replacing auto‑increment primary keys with a distributed Snowflake‑style ID generator.
Eliminating cross‑table joins by moving aggregation logic to the service layer.
Implementing a dual‑write mechanism with a configurable switch to synchronize data between old and new models.
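Two of the decisions above can be sketched in a few lines (the shard count matches the design; the table‑name format, epoch, and bit layout are illustrative assumptions): routing a user's orders to one of 64 shard tables, and generating Snowflake‑style IDs from a timestamp, worker ID, and per‑millisecond sequence:

```python
import time
import threading

SHARD_COUNT = 64  # shard count from the design above

def order_table_for(user_id: int) -> str:
    """Route all of a user's orders to the same physical table."""
    return f"t_order_{user_id % SHARD_COUNT:02d}"

class SnowflakeId:
    """Snowflake-style 64-bit ID: 41-bit timestamp | 10-bit worker | 12-bit sequence."""

    def __init__(self, worker_id: int, epoch_ms: int = 1577836800000):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self.last_ms = -1
        self.sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:  # sequence exhausted, spin to next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.epoch_ms) << 22) | (self.worker_id << 12) | self.sequence
```

Sharding by user ID keeps one user's order history on a single table, which suits the dominant "my orders" query; that is why other dimensions (time, region) are delegated to Elasticsearch rather than forcing cross‑shard scans.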
Architecture Design
The resulting architecture consists of the following layers:
API Gateway (NGINX): Redirects legacy app requests to the new API services.
API Service Layer: Handles authentication, encryption, and routes calls to appropriate business services.
Business Logic Layer: Provides aggregated order services (detail, list, etc.) and composes data from multiple domains.
Domain Service Layer: Performs CRUD operations on its own tables.
Order Database: Separate from the original monolithic database, potentially further split into domain‑specific schemas.
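A hedged sketch of how the business‑logic layer relates to the domain layer (service and field names are invented for illustration): instead of a cross‑table SQL join, the aggregation service composes calls to several independent domain services, each of which only touches its own tables:

```python
def get_order_detail(order_id: str, order_svc, payment_svc, product_svc) -> dict:
    """Aggregate an order detail view from independent domain services.
    Each domain service reads only its own tables; the 'join' happens here."""
    order = order_svc.get_order(order_id)
    payment = payment_svc.get_payment(order_id)
    items = product_svc.get_items(order["product_ids"])
    return {
        "order": order,
        "payment": payment,
        "items": items,
    }
```

Moving the join into the service layer trades one complex query for several simple ones, which is what makes the per‑domain databases independently scalable.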
Two migration phases were planned.
Phase 1 – Traffic Switch
During this phase, NGINX redirects old order APIs to the new services, and a feature flag in the monolithic app toggles calls to the new business logic. Domain services keep read/write on the legacy DB and write‑only on the new order DB.
The diagram below illustrates the Phase 1 architecture:
Key points:
A switch in the API service layer controls whether traffic goes to the new services or the legacy direct‑DB path.
Domain services (order‑domain1‑service, order‑domain2‑service, etc.) have read/write flags for both the legacy and new databases.
Successful verification of this phase confirms that the new service chain works correctly and can be rolled back instantly via the switches.
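The switch mechanics above can be sketched as follows (the flag names and in‑memory stores are assumptions for illustration): each domain service consults per‑database read/write flags, so Phase 1 is simply "legacy read+write, new write‑only", and a rollback is nothing more than a flag change:

```python
from dataclasses import dataclass, field

@dataclass
class DbFlags:
    legacy_read: bool = True
    legacy_write: bool = True
    new_read: bool = False
    new_write: bool = True  # Phase 1 default: dual-write, read from legacy

@dataclass
class OrderDomainService:
    legacy_db: dict = field(default_factory=dict)
    new_db: dict = field(default_factory=dict)
    flags: DbFlags = field(default_factory=DbFlags)

    def save(self, order_id: str, order: dict) -> None:
        # Dual-write: the legacy DB remains the source of truth,
        # while the new DB shadows every write for later validation.
        if self.flags.legacy_write:
            self.legacy_db[order_id] = order
        if self.flags.new_write:
            self.new_db[order_id] = order

    def load(self, order_id: str) -> dict:
        # Reads follow the flags, so rollback is just flipping them back.
        if self.flags.new_read:
            return self.new_db[order_id]
        return self.legacy_db[order_id]
```

Flipping `new_read` on and `legacy_write` off turns this same service into its Phase 2 configuration without a redeploy.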
Phase 2 – Data Validation
After Phase 1, the focus shifts to the data layer. The legacy DB write flag is disabled, while the new order DB read/write flags are enabled, allowing full traffic to flow through the new schema.
During this phase, extensive functional testing, data comparison, and monitoring ensure that the new model stores and serves data accurately.
If any issue arises, the switches can revert traffic back to the legacy path.
Following successful validation, remaining cleanup tasks include removing dual‑write code, feature flags, and legacy logic.
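The data comparison in this phase can be sketched as a reconciliation job (the row shape and function name are invented for illustration) that diffs the legacy and new stores by primary key and reports anything missing or mismatched for investigation:

```python
def compare_orders(legacy_rows: dict, new_rows: dict) -> dict:
    """Return IDs missing from either store and IDs whose payloads differ."""
    legacy_ids, new_ids = set(legacy_rows), set(new_rows)
    return {
        "missing_in_new": sorted(legacy_ids - new_ids),
        "missing_in_legacy": sorted(new_ids - legacy_ids),
        "mismatched": sorted(
            oid for oid in legacy_ids & new_ids
            if legacy_rows[oid] != new_rows[oid]
        ),
    }
```

An empty report across a full reconciliation run is the signal that the dual‑write code and feature flags are safe to remove.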
Project Execution
A detailed project plan was created, defining milestones, resource allocation, and risk mitigation strategies. Stakeholder alignment was achieved through clear communication of the refactor’s benefits and potential impacts.
Development tasks covered API changes, data migration scripts, message‑queue compatibility, and cache synchronization.
Comprehensive testing comprised automated API tests, manual scenario coverage, traffic replay, and staged grey‑scale releases in pre‑production.
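Traffic replay, for instance, can be sketched as re‑issuing recorded requests against both implementations and diffing the responses (the request/handler shapes here are assumptions, not the project's actual tooling):

```python
def replay(requests, legacy_handler, new_handler):
    """Send each recorded request to both implementations and collect diffs.
    An empty result means the new system matched legacy behavior exactly."""
    diffs = []
    for req in requests:
        old_resp = legacy_handler(req)
        new_resp = new_handler(req)
        if old_resp != new_resp:
            diffs.append({"request": req, "legacy": old_resp, "new": new_resp})
    return diffs
```

Replaying real production traffic catches edge cases that hand‑written scenarios miss, which is why it complements rather than replaces the automated API tests.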
Release Process
The release plan enumerated each step, estimated duration, assigned owners, and defined rollback procedures. Monitoring dashboards tracked service health, latency, and error rates throughout the rollout.
Gradual traffic shifting validated the new system’s correctness, and after full cut‑over, the legacy monolith was decommissioned.
05 Summary
Key takeaways from the refactor:
Identify and prioritize the most critical system pain points.
Define clear goals, constraints, and success criteria.
Choose feasible technical solutions and validate their viability.
Map all requirements, scenarios, and upstream/downstream dependencies.
Design a comprehensive, well‑documented architecture.
Develop a detailed project schedule with resource commitments.
Execute end‑to‑end testing and verification.
Prepare an exhaustive release plan.
Incorporate grey‑scale validation to mitigate risk.
System refactoring is demanding but provides a valuable opportunity to strengthen engineering capabilities and address hidden bottlenecks.
Readers are encouraged to share their own experiences and insights.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.