How I Cut a 2‑Month Data Migration to 4 Hours: A Backend Performance Journey
This article chronicles a Java backend data-migration project in which the author cut a two-month, 20-million-record migration down to a four-hour job. The architecture was redesigned iteratively, from a single-threaded procedural script to a fully decoupled, multithreaded, queue-driven pipeline, and each iteration exposed a performance bottleneck and its fix.
1. Project Description
The task was to read 20 million user records from database A, generate a GUID for each user, insert the users into database B through an SDK registration interface, and create an association table in database A. The migration had to be recoverable, keep the data consistent, and finish within one day.
2. First Version: Procedural – 2 Months
Features: single-threaded, tightly coupled, each record processed sequentially, no recoverability. The workflow read one record, processed it, called the SDK to insert it into B, then executed an SQL statement to insert the association into A. Because every step waited for the previous one, the whole pipeline ran at the speed of its slowest step, giving an estimated runtime of two months for 20 million records.
Key issues:
Slowest-link bottleneck: If any step (e.g., inserting the association) stalled for a minute, the whole JVM waited idle.
SDK HTTP calls per record: Each insertion required a separate HTTP request, analogous to delivering a single apple at a time.
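To put these numbers in perspective, the estimated runtime can be converted into a per-record time budget. The sketch below only does that arithmetic. The class and method names are illustrative, and "two months" is assumed to mean 60 days.

```java
import java.time.Duration;

class ThroughputBudget {
    static final long TOTAL_RECORDS = 20_000_000L; // record count from the project description

    /** Average time available per record, in milliseconds, for a given total runtime. */
    static double millisPerRecord(Duration totalRuntime) {
        return (double) totalRuntime.toMillis() / TOTAL_RECORDS;
    }

    /** Sustained throughput (records per second) needed to finish within the given runtime. */
    static double recordsPerSecond(Duration totalRuntime) {
        return (double) TOTAL_RECORDS / totalRuntime.getSeconds();
    }

    public static void main(String[] args) {
        Duration firstVersion = Duration.ofDays(60); // assumption: "two months" taken as 60 days
        Duration target = Duration.ofDays(1);        // the one-day requirement

        System.out.printf("first version: %.1f ms/record, %.1f records/s%n",
                millisPerRecord(firstVersion), recordsPerSecond(firstVersion));
        System.out.printf("one-day target: %.2f ms/record, %.1f records/s%n",
                millisPerRecord(target), recordsPerSecond(target));
    }
}
```

Under these assumptions the first version spends roughly 259 ms on each record (under 4 records per second), while the one-day requirement calls for about 231 records per second.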
3. Second Version: Object‑Oriented – 21 Days
Features: object-oriented, still single-threaded, more extensible, loosely coupled, batch inserts, data recoverable.
Improvements:
Introduced a BatchStrategy configuration object to hold total count, batch sizes, source table/column info.
Split the workflow into three dedicated objects: Reader (read data), Processor (process and forward), Writer (write data).
Added an ErrorHandler to log or handle failed records, decoupling error processing.
Efficiency gains came from batch HTTP calls to the SDK and JDBC batch operations, reducing the runtime to 21 days, but the pipeline remained limited by the slowest component.
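The division of labour described above might look roughly like the following single-threaded sketch. Only the names BatchStrategy, Reader, Processor, Writer, and ErrorHandler come from the article. Every field and method signature here is an assumption, and in-memory lambdas stand in for the database and SDK calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

class BatchPipelineSketch {

    /** Hypothetical stand-in for the article's BatchStrategy configuration object. */
    static final class BatchStrategy {
        final int totalCount;
        final int batchSize;
        BatchStrategy(int totalCount, int batchSize) {
            this.totalCount = totalCount;
            this.batchSize = batchSize;
        }
    }

    interface Reader<T> { List<T> read(int offset, int limit); }
    interface Processor<T, R> { List<R> process(List<T> batch); }
    interface Writer<R> { void write(List<R> batch) throws Exception; }
    interface ErrorHandler<R> { void onFailure(List<R> batch, Exception cause); }

    /** Reads, processes, and writes one batch at a time; failed batches go to the handler. */
    static <T, R> int run(BatchStrategy strategy, Reader<T> reader, Processor<T, R> processor,
                          Writer<R> writer, ErrorHandler<R> errors) {
        int written = 0;
        for (int offset = 0; offset < strategy.totalCount; offset += strategy.batchSize) {
            int limit = Math.min(strategy.batchSize, strategy.totalCount - offset);
            List<R> out = processor.process(reader.read(offset, limit));
            try {
                writer.write(out);           // one call per batch instead of one per record
                written += out.size();
            } catch (Exception e) {
                errors.onFailure(out, e);    // error handling is decoupled from the main loop
            }
        }
        return written;
    }

    /** In-memory demo: "users" are strings and the "database" is a list. */
    static List<String> demo(int total, int batchSize) {
        List<String> sink = new ArrayList<>();
        Reader<String> reader = (offset, limit) -> {
            List<String> users = new ArrayList<>();
            for (int i = offset; i < offset + limit; i++) users.add("user-" + i);
            return users;
        };
        Processor<String, String> processor = batch -> {
            List<String> out = new ArrayList<>();
            for (String user : batch) out.add(user + ":" + UUID.randomUUID()); // attach a GUID
            return out;
        };
        Writer<String> writer = sink::addAll;
        ErrorHandler<String> errors = (batch, cause) ->
                System.err.println("failed batch of " + batch.size() + ": " + cause);
        run(new BatchStrategy(total, batchSize), reader, processor, writer, errors);
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(demo(10, 4).size() + " records written");
    }
}
```

Note that the loop is still sequential: the Reader cannot fetch the next batch until the Writer has finished the current one, which is the limitation the next version removes.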
4. Third Version: Fully Decoupled (Queue + Multithreading) – 3 Days
Features: object‑oriented, multithreaded, fully decoupled, batch inserts, data recoverable.
Key changes:
Queue: Replaced direct method calls with a thread-safe ConcurrentLinkedQueue, allowing the Reader to enqueue data, the Processor to dequeue and process it, and the Writer to dequeue and write it, enabling true asynchronous execution.
Multithreading: The Processor and Writer run in parallel threads, each pulling from its own queue.
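A minimal sketch of this queue-driven layout follows, using two ConcurrentLinkedQueue instances as in the article. The thread counts, the completion flags, and all method names are assumptions made for illustration. The real Reader, Processor, and Writer talk to databases and the SDK rather than to in-memory strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

class QueuePipelineSketch {

    /** Consumes items until the queue is empty AND the upstream stage has finished. */
    static <T> void drain(Queue<T> queue, AtomicBoolean upstreamDone, Consumer<T> action) {
        while (true) {
            boolean done = upstreamDone.get(); // read the flag BEFORE polling so late items are not lost
            T item = queue.poll();             // non-blocking: returns null when the queue is empty
            if (item != null) {
                action.accept(item);
            } else if (done) {
                return;
            } else {
                Thread.yield();                // nothing to do yet; let other threads run
            }
        }
    }

    static List<String> migrate(int totalUsers, int processorThreads) {
        Queue<String> toProcess = new ConcurrentLinkedQueue<>();
        Queue<String> toWrite = new ConcurrentLinkedQueue<>();
        AtomicBoolean readerDone = new AtomicBoolean(false);
        AtomicBoolean processorsDone = new AtomicBoolean(false);
        List<String> written = new ArrayList<>(); // only the writer thread adds to this list

        List<Thread> processors = new ArrayList<>();
        for (int i = 0; i < processorThreads; i++) {
            Thread p = new Thread(() -> drain(toProcess, readerDone,
                    user -> toWrite.offer(user + ":" + UUID.randomUUID())));
            p.start();
            processors.add(p);
        }
        Thread writer = new Thread(() -> drain(toWrite, processorsDone, written::add));
        writer.start();

        // The calling thread plays the Reader: it only enqueues and never waits for a write.
        for (int i = 0; i < totalUsers; i++) {
            toProcess.offer("user-" + i);
        }
        readerDone.set(true);

        try {
            for (Thread p : processors) p.join();
            processorsDone.set(true);          // safe: every processor has finished enqueuing
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pipeline", e);
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(migrate(1000, 4).size() + " records written");
    }
}
```

Because ConcurrentLinkedQueue never blocks, idle consumers here spin with Thread.yield(). A BlockingQueue would let them sleep until data arrives, which may be a better fit when stages run at very different speeds.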
Because the Processor was now slower than the Writer, batch inserts were no longer beneficial: the Writer would presumably spend its time waiting for a batch to fill, so writing each record as soon as it arrived proved faster.
An additional bottleneck was identified: MySQL LIMIT queries became increasingly slow at large offsets, because the server has to read and discard every row before the offset. This prompted a redesign of the paging strategy around the indexed phone number field.
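The difference between offset paging and cursor-style paging on an indexed column can be sketched as follows. The table and column names in the SQL strings are hypothetical, phone numbers are assumed to be unique, and a sorted in-memory set stands in for the MySQL index so the example runs without a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class KeysetPagingSketch {

    // Offset paging: the server must walk past `offset` rows before returning a page.
    static final String OFFSET_SQL =
            "SELECT id, phone FROM user ORDER BY phone LIMIT ?, ?";

    // Cursor (keyset) paging: the index on phone is used to seek just past the last row seen.
    static final String KEYSET_SQL =
            "SELECT id, phone FROM user WHERE phone > ? ORDER BY phone LIMIT ?";

    /** In-memory equivalent of KEYSET_SQL; a null cursor means "start from the beginning". */
    static List<String> nextPage(NavigableSet<String> phoneIndex, String lastPhone, int pageSize) {
        NavigableSet<String> rest =
                lastPhone == null ? phoneIndex : phoneIndex.tailSet(lastPhone, false);
        List<String> page = new ArrayList<>();
        for (String phone : rest) {
            if (page.size() == pageSize) break;
            page.add(phone);
        }
        return page;
    }

    /** Reads every row page by page, carrying the last phone number forward as the cursor. */
    static List<String> readAll(NavigableSet<String> phoneIndex, int pageSize) {
        List<String> all = new ArrayList<>();
        String cursor = null;
        while (true) {
            List<String> page = nextPage(phoneIndex, cursor, pageSize);
            if (page.isEmpty()) break;
            all.addAll(page);
            cursor = page.get(page.size() - 1);
        }
        return all;
    }

    public static void main(String[] args) {
        NavigableSet<String> index = new TreeSet<>(Arrays.asList(
                "13800000003", "13800000001", "13800000005", "13800000002", "13800000004"));
        System.out.println(readAll(index, 2));
    }
}
```

If the cursor column were not unique, rows sharing the boundary value could be skipped, so a tie-breaker such as the primary key would be needed in the WHERE and ORDER BY clauses.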
5. Fourth Version: Highly Abstracted (One‑Click Start) – 4 Hours
Features: interface‑driven, multithreaded, extensible, fully decoupled, supports both batch and single inserts, optimized LIMIT queries.
Design highlights:
Unified Job interface with receive, process, and closeInteractive methods.
Reader reads data in batches, Processor handles business logic and forwards results, Writer persists data.
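Implemented a paging technique that sorts by phone number (the only indexed column) and uses the last processed phone number as a cursor, so each query seeks directly to the next page through the index instead of scanning and discarding all the rows before a large offset.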
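The article names the three methods of the unified Job interface but not their signatures, so the following is only one possible shape. The generic parameters, the "empty batch means done" convention, the meaning given to closeInteractive, and the start helper are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

class JobSketch {

    /** Hypothetical shape of the unified Job interface; only the method names come from the article. */
    interface Job<I, O> {
        List<I> receive();              // fetch the next batch of input; an empty list means "no more data"
        List<O> process(List<I> batch); // apply this stage's logic to one batch
        void closeInteractive();        // assumed to release resources once the job is finished
    }

    /** "One-click" entry point: drives any Job to completion and always closes it. */
    static <I, O> List<O> start(Job<I, O> job) {
        List<O> results = new ArrayList<>();
        try {
            List<I> batch;
            while (!(batch = job.receive()).isEmpty()) {
                results.addAll(job.process(batch));
            }
        } finally {
            job.closeInteractive();
        }
        return results;
    }

    /** Demo job: hands out fixed-size batches of names and upper-cases them. */
    static final class UpperCaseJob implements Job<String, String> {
        private final Iterator<String> source;
        private final int batchSize;
        boolean closed = false;

        UpperCaseJob(List<String> names, int batchSize) {
            this.source = names.iterator();
            this.batchSize = batchSize;
        }

        @Override
        public List<String> receive() {
            List<String> batch = new ArrayList<>();
            while (source.hasNext() && batch.size() < batchSize) batch.add(source.next());
            return batch;
        }

        @Override
        public List<String> process(List<String> batch) {
            List<String> out = new ArrayList<>();
            for (String name : batch) out.add(name.toUpperCase(Locale.ROOT));
            return out;
        }

        @Override
        public void closeInteractive() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        UpperCaseJob job = new UpperCaseJob(Arrays.asList("ann", "bob", "cy"), 2);
        System.out.println(start(job)); // prints [ANN, BOB, CY]
    }
}
```

With a single interface like this, Reader, Processor, and Writer stages can be driven by the same runner code, which is one way to get the extensibility the design aims for.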
6. Thoughts on Further Optimization
Potential improvements include parallelizing the Reader (despite database access constraints) and making logging asynchronous to reduce the overhead of millions of log statements.
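As an illustration of the asynchronous-logging idea, the sketch below hands log messages to a background thread through a queue, so pipeline threads never wait on log I/O. It is a hand-rolled example with hypothetical names, not the author's implementation. Production logging frameworks typically ship their own asynchronous appenders.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class AsyncLogSketch implements AutoCloseable {
    private static final String STOP = new String("stop"); // sentinel, compared by identity

    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final Consumer<String> sink;
    private final Thread worker;

    /** The sink does the slow part (e.g., writing to a file); it runs only on the worker thread. */
    AsyncLogSketch(Consumer<String> sink) {
        this.sink = sink;
        this.worker = new Thread(this::drain, "async-log");
        this.worker.start();
    }

    /** Called by pipeline threads: enqueues the message and returns immediately. */
    void log(String message) {
        pending.offer(message); // the queue is unbounded, so offer never blocks
    }

    private void drain() {
        try {
            while (true) {
                String message = pending.take(); // sleeps until a message is available
                if (message == STOP) return;
                sink.accept(message);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Flushes everything logged so far, then stops the worker thread. */
    @Override
    public void close() {
        pending.offer(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        try (AsyncLogSketch log = new AsyncLogSketch(System.out::println)) {
            for (int i = 0; i < 3; i++) {
                log.log("migrated user-" + i);
            }
        }
    }
}
```

The unbounded queue keeps callers from ever blocking, at the cost of memory growth if the sink falls far behind. A bounded queue would trade that for back-pressure.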
Java Backend Technology
