
Practical Case Study of System Storage Expansion, Upgrade, and Optimization

This article presents a detailed technical case study on expanding and optimizing a system's storage capacity, covering business background, current architecture, implementation plans, technology selection, data synchronization strategies, phased rollout steps, results, and remaining challenges, with concrete metrics and diagrams.

JD Retail Technology

Jan 8, 2024


1. Business Background

The system processes and integrates internal data and provides initialization (write) and query services to external systems.

System Network Architecture

Impact of the deployment architecture on the rollout: launching the internal management system does not affect the read operations of other systems.

Distributed cache can be scaled independently, unrelated to storage and query upgrades.

During system expansion, external systems remain unchanged; only the internal management system is upgraded.

During internal system verification, read services continue to be provided, reducing launch impact.

2. Overall Implementation Plan

Goal Setting

Target a ten-fold increase in data volume, with a maximum supported capacity of 900 million records: about 670 million after full rollout, plus 25 % redundancy, gives 837.5 million, rounded up to 900 million. This should cover the next five years.
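The capacity target can be reproduced with a few lines of arithmetic. This is a minimal sketch that takes the post-rollout volume as 670 million records (the figure consistent with 837.5 million after 25 % redundancy). Rounding up to the next 100 million is an assumption about how the 900 million figure was reached.

```python
import math

# Figures from the goal statement (in records).
projected_records = 670_000_000   # expected volume after full rollout
redundancy = 0.25                 # 25 % headroom

required = projected_records * (1 + redundancy)

# Assumption: the planning target is the next multiple of 100 million.
step = 100_000_000
target = math.ceil(required / step) * step

print(f"required with redundancy: {required:,.0f}")
print(f"planning target:          {target:,}")
```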

Timeline: plan definition early August, rollout and verification 17‑22 Aug, full data migration starting 24 Aug.

Current System Status

Resource Usage

Deployment: MySQL 1 master + 4 slaves (Room A: 1 master, 3 slaves; Room B: read‑only slave).

Doris: 32 cores, 63 nodes, 3 replicas.

Docker containers: 62 total (Web 25, Worker 31, MQ 6).

DB max connections: 100 per container.

No read‑write separation; most operations require immediate consistency.

Background tasks can tolerate master‑slave lag.

External service interfaces are not affected; short‑term delay is acceptable.

Team has limited Elasticsearch experience.
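The container counts and the per-container connection limit above give an upper bound on database connections. A minimal sketch of that arithmetic follows. It assumes the worst case where every container opens its full allowance at once, which the article does not claim actually happens.

```python
# Container counts and the per-container connection limit listed above.
containers = {"web": 25, "worker": 31, "mq": 6}
max_connections_per_container = 100

total_containers = sum(containers.values())

# Worst case: every container uses its full connection allowance.
worst_case_by_role = {
    role: count * max_connections_per_container
    for role, count in containers.items()
}
worst_case_total = total_containers * max_connections_per_container

print(f"containers: {total_containers}")
print(f"worst-case connections by role: {worst_case_by_role}")
print(f"worst-case connections in total: {worst_case_total}")
```

Because there is no read-write separation, this bound is worth checking against the connection limits of whichever nodes serve the traffic before and after sharding.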

Database Usage

Current tables exceed 50 million rows, some reaching 60 million, hitting MySQL capacity limits; the goal is to support up to 900 million rows.

3. Technical Solution Selection

System characteristics: high‑concurrency writes on single tables, complex reads.

Storage Selection

Expand the distributed DB from a single shard to multiple shards to handle the large data volume and simple queries.

Introduce Elasticsearch for complex (full‑text) queries and global sorting.

Retain Redis with required scaling.

Retain Doris with increased capacity.

The complex queries are multi-table joins over tables with tens of millions of rows, which is what degrades performance.
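The article does not describe how rows are assigned to shards. One common approach is hash-modulo routing on the sharding field, sketched below. The shard count, the function name, and the key format are all hypothetical, and the distributed DB used in the article may route differently.

```python
import zlib

SHARD_COUNT = 8  # hypothetical; the article does not give the shard count


def shard_for(sharding_key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a sharding-field value to a shard index using a stable hash."""
    # crc32 is deterministic across processes, unlike Python's built-in
    # hash() for strings, which is randomized per interpreter run.
    return zlib.crc32(sharding_key.encode("utf-8")) % shard_count


# The same key always lands on the same shard, so a simple query that
# carries the sharding field only needs to touch one shard.
for key in ("merchant-1001", "merchant-1002", "merchant-1003"):
    print(key, "->", shard_for(key))
```

Queries that do not carry the sharding field have to fan out to every shard, which is one reason to send complex and globally sorted queries to a search index instead.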

Data Synchronization Options

A. Near‑real‑time sync: use internal DRC platform to sync distributed DB to Elasticsearch (simple, no code, but may have consistency risk for write‑then‑read scenarios).

B. Dual‑write strong consistency: write to both distributed DB and Elasticsearch (ensures consistency but higher development cost).

Recommendation: start with A, validate, then consider B if needed.
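The trade-off between options A and B can be illustrated with in-memory stand-ins. This is a toy sketch, not the DRC platform or a real Elasticsearch client: every class and method name here is hypothetical, and the `drain` call stands in for the replication delay.

```python
from collections import deque


class InMemoryStore:
    """Stand-in for either the distributed DB or the search index."""

    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class AsyncSync:
    """Option A: writes go to the DB; a separate step copies them to the index."""

    def __init__(self, db, index):
        self.db, self.index = db, index
        self.pending = deque()

    def write(self, key, value):
        self.db.put(key, value)
        self.pending.append(key)  # replicated later, outside the write path

    def drain(self):
        while self.pending:
            key = self.pending.popleft()
            self.index.put(key, self.db.get(key))


class DualWrite:
    """Option B: the application writes to both stores in the write path."""

    def __init__(self, db, index):
        self.db, self.index = db, index

    def write(self, key, value):
        self.db.put(key, value)
        # A real system must decide what to do if this second write fails.
        self.index.put(key, value)


a = AsyncSync(InMemoryStore(), InMemoryStore())
a.write("sku-1", {"name": "demo"})
stale_read = a.index.get("sku-1")  # index has not caught up yet
a.drain()
fresh_read = a.index.get("sku-1")

b = DualWrite(InMemoryStore(), InMemoryStore())
b.write("sku-1", {"name": "demo"})
dual_read = b.index.get("sku-1")

print(stale_read, fresh_read, dual_read)
```

The stale read under option A is the write-then-read risk mentioned above. Option B closes that window at the cost of extra code in every write path.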

Challenges & Solutions

The results of join queries cannot be synced directly via DRC; they need a custom sync-module JAR or code-based sync.

Elasticsearch index size and duplicate records increase query complexity.

Workflow tables require redundant fields for Elasticsearch tokenization; add reviewer fields separated by spaces and leverage ES tokenization for efficient queries.

Solution cost includes adding fields to DB tables and developing a historical data refresh tool.
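The redundant reviewer field can be sketched without a search cluster. The example below mimics whitespace tokenization in plain Python. The field name `reviewers` and the sample data are hypothetical, and a real Elasticsearch analyzer may also lowercase or otherwise normalize tokens.

```python
def build_reviewers_field(reviewers):
    """Join reviewer IDs into one space-separated string for the redundant column."""
    return " ".join(reviewers)


def tokenize(text):
    """Mimic whitespace tokenization: split on runs of whitespace."""
    return text.split()


def matches_reviewer(doc, reviewer):
    """True if `reviewer` is one of the tokens in the document's reviewers field."""
    return reviewer in tokenize(doc["reviewers"])


# Hypothetical workflow records, each carrying the redundant field.
docs = [
    {"id": 1, "reviewers": build_reviewers_field(["alice", "bob"])},
    {"id": 2, "reviewers": build_reviewers_field(["carol"])},
    {"id": 3, "reviewers": build_reviewers_field(["bob", "dave"])},
]

bob_docs = [d["id"] for d in docs if matches_reviewer(d, "bob")]
print(bob_docs)
```

Storing the reviewers on the record itself replaces a join against the workflow table with a single-field match. The historical data refresh tool is what backfills this field for rows that already exist.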

4. Phased Development & Rollout Steps

Business table schema changes (add sharding field, ES redundant fields) – 10 Aug.

Distributed DB sharding and ES initialization; configure DRC for full and incremental sync from single DB to sharded DB and from sharded DB to ES; verify data consistency.

Read traffic migration using AOP interceptors and DUCC configuration; gradually shift reads to new cluster.

Write traffic migration: notify stakeholders, display static upgrade page, disable writes on old DB, ensure full sync, switch to read‑write account, restart workers and MQ.
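The gradual read migration in the steps above can be sketched as a percentage switch. Here a decorator plays the role of the AOP interceptor and a plain dictionary stands in for the DUCC configuration. The config key, the function names, and the bucketing scheme are all assumptions, not the article's implementation.

```python
import functools
import zlib

# Hypothetical stand-in for a dynamic configuration value.
config = {"new_cluster_read_percent": 0}


def bucket(key: str) -> int:
    """Stable bucket in [0, 100) so the same key always routes the same way."""
    return zlib.crc32(key.encode("utf-8")) % 100


def route_reads(new_impl):
    """Decorator acting as an interceptor around a read method."""

    def decorator(old_impl):
        @functools.wraps(old_impl)
        def wrapper(key):
            # The percentage is read on every call, so changing the config
            # takes effect without redeploying.
            if bucket(key) < config["new_cluster_read_percent"]:
                return new_impl(key)
            return old_impl(key)

        return wrapper

    return decorator


def read_from_new_cluster(key):
    return ("new", key)


@route_reads(read_from_new_cluster)
def read_product(key):
    return ("old", key)


for percent in (0, 50, 100):
    config["new_cluster_read_percent"] = percent
    on_new = sum(read_product(f"sku-{i}")[0] == "new" for i in range(1000))
    print(f"{percent}% configured -> {on_new} of 1000 reads on the new cluster")
```

Setting the percentage back to 0 returns all reads to the old database, which gives a quick rollback path while the old database is still receiving writes.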

5. Post‑Launch Effects

Since 23 Aug, the system has migrated 260 million products, supports 316 million product‑dimension records, with the largest DB table holding 284 million rows and Elasticsearch storing 43.56 million records. Query time improved from 9 seconds to 1 second for a sample ERP account.

6. Summary

The comprehensive system assessment, clear rollout plan, and phased execution enabled successful scaling of storage capacity and performance improvement.

Recommendations

Perform a thorough, clear system status inventory to reduce complexity and improve quality.

Maintain a clear rollout schedule to guide team division of labor, shorten implementation time, and lower difficulty.

Unresolved Issues

Distributed DB transactions are weak across shards; cross‑shard multi‑record modifications require post‑commit verification.

When a user owns millions of products, query latency remains high and needs further optimization.
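Post-commit verification for cross-shard modifications can be as simple as reading back what was written. The sketch below uses dictionaries as stand-in shards. The function names and the retry-or-alert follow-up are assumptions, since the article does not describe how verification is done.

```python
def apply_updates(shards, updates):
    """Apply each (shard_id, key, value) update; each shard commits independently."""
    for shard_id, key, value in updates:
        shards[shard_id][key] = value


def verify_updates(shards, updates):
    """Re-read every updated record and return the ones that do not match."""
    return [
        (shard_id, key)
        for shard_id, key, value in updates
        if shards[shard_id].get(key) != value
    ]


shards = {0: {}, 1: {}}
updates = [(0, "sku-1", "approved"), (1, "sku-2", "approved")]

apply_updates(shards, updates)
clean = verify_updates(shards, updates)

# Simulate one shard losing its write, for example after a partial failure.
del shards[1]["sku-2"]
failed = verify_updates(shards, updates)

print("mismatches after normal commit:", clean)
print("mismatches after partial failure:", failed)
```

The mismatched records returned by the check are the ones to retry or flag for manual repair.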
