Practical Case Study of System Storage Expansion, Upgrade, and Optimization
This article presents a detailed technical case study on expanding and optimizing a system's storage capacity, covering business background, current architecture, implementation plans, technology selection, data synchronization strategies, phased rollout steps, results, and remaining challenges, with concrete metrics and diagrams.
1. Business Background
The system processes and integrates internal data and provides initialization (write) and query services to external systems.
System Network Architecture
Impact of deployment architecture on traffic rollout – the internal management system launch does not affect read operations of other systems.
Distributed cache can be scaled independently, unrelated to storage and query upgrades.
During system expansion, external systems remain unchanged; only the internal management system is upgraded.
During internal system verification, read services continue to be provided, reducing launch impact.
2. Overall Implementation Plan
Goal Setting
Target a ten‑fold increase in data volume, with a maximum of 900 million records to cover the next five years: 670 million records expected after full rollout, plus 25 % redundancy, gives 837.5 million, rounded up to 900 million.
Timeline: plan definition early August, rollout and verification 17‑22 Aug, full data migration starting 24 Aug.
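The capacity target can be reproduced with a short calculation. This is a minimal sketch, taking 670 million as the projected row count (the figure that yields 837.5 million with 25 % headroom); the function name and the 100‑million rounding unit are assumptions made for illustration, not something the article specifies.

```python
import math


def capacity_target(projected_rows: int, redundancy: float, round_to: int) -> int:
    """Add headroom to a projected row count, then round up to a planning unit."""
    with_headroom = projected_rows * (1 + redundancy)
    return math.ceil(with_headroom / round_to) * round_to


projected = 670_000_000             # rows expected after full rollout
headroom = 0.25                     # 25 % redundancy
raw = projected * (1 + headroom)    # intermediate figure before rounding
target = capacity_target(projected, headroom, round_to=100_000_000)
```

Here `raw` is the 837.5 million intermediate figure and `target` is the 900 million design ceiling.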
Current System Status
Resource Usage
Deployment: MySQL 1 master + 4 slaves (Room A: 1 master, 3 slaves; Room B: read‑only slave).
Doris: 32 cores, 63 nodes, 3 replicas.
Docker containers: 62 total (Web 25, Worker 31, MQ 6).
DB max connections: 100 per container.
No read‑write separation; most operations require immediate consistency.
Background tasks can tolerate master‑slave lag.
External service interfaces are not affected; short‑term delay is acceptable.
Team has limited Elasticsearch experience.
Database Usage
Current tables exceed 50 million rows, some reaching 60 million, hitting MySQL capacity limits; the goal is to support up to 900 million rows.
3. Technical Solution Selection
System characteristics: high‑concurrency writes on single tables, complex reads.
Storage Selection
Distributed DB expanded from single‑shard to multi‑shard to handle massive data and simple queries.
Introduce Elasticsearch for complex (full‑text) queries and global sorting.
Retain Redis with required scaling.
Retain Doris with increased capacity.
Complex queries arise from multi‑table joins on tens‑of‑millions‑row tables, causing performance degradation.
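Moving from a single shard to multiple shards means every simple query has to be routed by a sharding key. The sketch below shows one common approach, hash‑modulo routing. It is an assumption for illustration only: the article does not say which routing rule the distributed DB uses, and the shard count and naming scheme here are hypothetical.

```python
import zlib

SHARD_COUNT = 8  # hypothetical; the article does not state the shard count


def shard_for(sharding_key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a sharding key to a shard index with a hash that is stable across processes."""
    # zlib.crc32 is used instead of the built-in hash(), which is salted per process.
    return zlib.crc32(sharding_key.encode("utf-8")) % shard_count


def route(sharding_key: str, shard_count: int = SHARD_COUNT) -> str:
    """Return a hypothetical shard name for the given key."""
    return f"shard_{shard_for(sharding_key, shard_count):02d}"
```

Queries that carry the sharding key touch exactly one shard. Queries that do not carry it would have to fan out to all shards, which is one reason complex and globally sorted queries are delegated to Elasticsearch.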
Data Synchronization Options
A. Near‑real‑time sync: use internal DRC platform to sync distributed DB to Elasticsearch (simple, no code, but may have consistency risk for write‑then‑read scenarios).
B. Dual‑write strong consistency: write to both distributed DB and Elasticsearch (ensures consistency but higher development cost).
Recommendation: start with A, validate, then consider B if needed.
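The trade‑off between options A and B comes down to when a write becomes visible to search. The sketch below models both with in‑memory stores. All class names are hypothetical, the queue merely stands in for a sync platform such as DRC, and a real dual‑write would also need failure handling that is omitted here.

```python
from collections import deque


class InMemoryStore:
    """Stand-in for either the distributed DB or the search index."""

    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class NearRealTimeSync:
    """Option A: write to the DB only; a separate sync step copies rows to the index later."""

    def __init__(self, db, index):
        self.db, self.index = db, index
        self.pending = deque()

    def write(self, key, value):
        self.db.put(key, value)
        self.pending.append(key)  # the index does not see this row yet

    def drain(self):
        """Simulate the sync pipeline catching up."""
        while self.pending:
            key = self.pending.popleft()
            self.index.put(key, self.db.get(key))


class DualWrite:
    """Option B: the application writes to both stores before returning."""

    def __init__(self, db, index):
        self.db, self.index = db, index

    def write(self, key, value):
        self.db.put(key, value)
        self.index.put(key, value)
```

With option A, a query against the index issued right after `write` can miss the row until `drain` has run, which is the write‑then‑read risk. Option B closes that window at the cost of extra application code.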
Challenges & Solutions
Results of multi‑table join queries cannot be synced directly via DRC; they need a custom sync‑module JAR or code‑based sync.
Elasticsearch index size and duplicate records increase query complexity.
Workflow tables require redundant fields for Elasticsearch tokenization; add reviewer fields separated by spaces and leverage ES tokenization for efficient queries.
Solution cost includes adding fields to DB tables and developing a historical data refresh tool.
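The redundant reviewer field can be illustrated without a running cluster. The sketch below assumes the field is split on whitespace (Elasticsearch ships a `whitespace` analyzer that behaves this way) and that reviewer identifiers contain no spaces. The function names are hypothetical.

```python
from typing import List, Set


def build_reviewer_field(reviewers: List[str]) -> str:
    """Join reviewer identifiers into one space-separated redundant field."""
    return " ".join(r.strip() for r in reviewers if r.strip())


def whitespace_tokens(text: str) -> Set[str]:
    """Approximate whitespace tokenization: one token per space-separated word."""
    return set(text.split())


def matches_reviewer(field_value: str, reviewer: str) -> bool:
    """True when the reviewer appears as a whole token, not as a substring."""
    return reviewer in whitespace_tokens(field_value)
```

Because matching is done per token, a lookup for one reviewer does not need a join against the workflow table and does not match on partial identifiers. The historical data refresh tool would populate the same field for existing rows.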
4. Phased Development & Rollout Steps
Business table schema changes (add sharding field, ES redundant fields) – 10 Aug.
Distributed DB sharding and ES initialization; configure DRC for full and incremental sync from single DB to sharded DB and from sharded DB to ES; verify data consistency.
Read traffic migration using AOP interceptors and DUCC configuration; gradually shift reads to new cluster.
Write traffic migration: notify stakeholders, display static upgrade page, disable writes on old DB, ensure full sync, switch to read‑write account, restart workers and MQ.
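The gradual read migration can be pictured as a per‑request routing decision driven by a configuration value. The sketch below is an assumption‑laden stand‑in: the `config` dict plays the role of a configuration center such as DUCC, the wrapper plays the role of an AOP interceptor, and the key name and bucketing rule are hypothetical.

```python
import zlib

# Hypothetical stand-in for a value held in a configuration center.
config = {"new_cluster_read_percent": 0}


def use_new_cluster(request_key: str) -> bool:
    """Place each key in a stable bucket 0-99 and compare it with the rollout percentage."""
    bucket = zlib.crc32(request_key.encode("utf-8")) % 100
    return bucket < config["new_cluster_read_percent"]


def routed_read(old_reader, new_reader):
    """Wrap two read functions so each call goes to the cluster chosen for its key."""

    def read(request_key: str):
        reader = new_reader if use_new_cluster(request_key) else old_reader
        return reader(request_key)

    return read
```

Raising `new_cluster_read_percent` step by step shifts more keys to the new cluster, and setting it back to 0 acts as a rollback switch. Stable bucketing keeps a given key on the same cluster between changes, which makes discrepancies easier to trace.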
5. Post‑Launch Effects
Since 23 Aug, the system has migrated 260 million products, supports 316 million product‑dimension records, with the largest DB table holding 284 million rows and Elasticsearch storing 43.56 million records. Query time improved from 9 seconds to 1 second for a sample ERP account.
6. Summary
The comprehensive system assessment, clear rollout plan, and phased execution enabled successful scaling of storage capacity and performance improvement.
Recommendations
Perform a thorough, clear system status inventory to reduce complexity and improve quality.
Maintain a clear rollout schedule to guide team division of labor, shorten implementation time, and lower difficulty.
Unresolved Issues
Distributed DB transactions are weak across shards; cross‑shard multi‑record modifications require post‑commit verification.
When a user owns millions of products, query latency remains high and needs further optimization.
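Post‑commit verification for cross‑shard modifications can be sketched as a write pass followed by a read‑back pass. This is an illustration under stated assumptions, not the team's implementation: shards are modeled as plain dicts, `write` is a caller‑supplied function, and both function names are hypothetical.

```python
def apply_updates(shards, updates, write):
    """Apply each update on its own shard; a failure on one shard does not undo the others."""
    for shard_name, key, value in updates:
        try:
            write(shards[shard_name], key, value)
        except Exception:
            # No cross-shard rollback here; the verification pass reports the gap.
            continue


def verify_updates(shards, updates):
    """Re-read every record and return the (shard, key) pairs whose value did not land."""
    return [
        (shard_name, key)
        for shard_name, key, value in updates
        if shards[shard_name].get(key) != value
    ]
```

The list returned by `verify_updates` is what a retry or compensation job would work from.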
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
