Achieving Efficient Real‑Time Search for Massive Data in Spring Boot Applications

The article analyzes why massive tables become a bottleneck in Spring Boot systems, outlines the drawbacks of sharding, and presents a layered solution—data archiving, read‑write separation with caching, heterogeneous source synchronization via Elasticsearch and Canal, and selective sharding—to enable high‑performance real‑time search.

Shepherd Advanced Notes

Aug 30, 2023

1. Overview

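A commonly cited rule of thumb is that a single MySQL table starts to slow down, in both computation and I/O, once it exceeds roughly 5 million rows or 2 GB; the real threshold depends on row width, indexing, and hardware. Sharding is the obvious remedy, but it is a heavy operation: it requires extensive code refactoring, data migration, backup, and capacity expansion, and it introduces new problems such as distributed transactions, cross-shard pagination, sorting and aggregation, globally unique primary keys, and further data migration whenever the shard layout changes.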

2. Solutions for Large Data Volumes

2.1 Data Archiving (Cold/Hot Separation)

In a Meituan‑style scenario, rarely accessed historical orders are moved to a cold archive table while recent orders remain hot. The process consists of:

Migrating qualifying rows to a designated archive table.

Batch‑deleting the original rows by primary key (e.g., 500–1000 rows per batch) to avoid timeouts and lock contention.

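First find the upper bound of the rows to archive (this assumes id grows with create_time, as with an auto-increment key):

SELECT MAX(id) AS maxId FROM t WHERE create_time < 'specified_time';

Then, starting with startId = 0, repeat until no more rows are copied:

-- copy the next batch in primary-key order
INSERT INTO t_bak SELECT * FROM t WHERE id > startId AND id <= maxId ORDER BY id LIMIT 500;
-- highest id archived so far
SELECT MAX(id) AS maxBakId FROM t_bak;
-- delete only the rows just copied, then set startId = maxBakId for the next batch
DELETE FROM t WHERE id > startId AND id <= maxBakId;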
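
The batch loop above can be sketched in Java. This is a minimal in-memory illustration, not a production job: the two sorted maps stand in for the tables t and t_bak, and the class and method names (ArchiveSketch, archive) are hypothetical. A real implementation would run the corresponding SQL statements, ideally one transaction per batch.

```java
import java.util.Map;
import java.util.NavigableMap;

/** In-memory stand-in for the archiving loop; the maps play the roles of tables t and t_bak. */
final class ArchiveSketch {

    /**
     * Moves every row with id <= maxId from hot to archive, batchSize rows at a time.
     * Returns the number of rows moved.
     */
    static int archive(NavigableMap<Long, String> hot,
                       NavigableMap<Long, String> archive,
                       long maxId, int batchSize) {
        long startId = 0L;   // exclusive lower bound of the next batch
        int moved = 0;
        while (startId < maxId) {
            int copied = 0;
            long maxBakId = startId;
            // Equivalent of: WHERE id > startId AND id <= maxId ORDER BY id LIMIT batchSize
            for (Map.Entry<Long, String> row
                    : hot.subMap(startId, false, maxId, true).entrySet()) {
                if (copied == batchSize) {
                    break;
                }
                archive.put(row.getKey(), row.getValue());
                maxBakId = row.getKey();   // highest id archived so far
                copied++;
            }
            if (copied == 0) {
                break;   // nothing left in range
            }
            // Equivalent of: DELETE ... WHERE id > startId AND id <= maxBakId
            hot.subMap(startId, false, maxBakId, true).clear();
            moved += copied;
            startId = maxBakId;
        }
        return moved;
    }
}
```

Bounding the delete by both startId and maxBakId keeps each batch small and guarantees that only rows already copied are removed.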

2.2 Read‑Write Separation and Hot‑Data Caching

Most workloads are read-heavy (the read/write ratio often exceeds 10:1). A primary-replica setup sends all writes to the primary and distributes read queries across the replicas, reducing load on the primary. Because MySQL replication is asynchronous by default, reads that must see a just-written row should still go to the primary. Frequently accessed hot data can be preloaded into Redis to offload the database further.
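
As a rough illustration of both ideas, the sketch below routes writes to the primary and spreads reads across replicas round-robin, and adds a cache-aside lookup in front of the database. All names here (ReadWriteRouter, HotDataCache, the data source labels) are hypothetical, and a ConcurrentHashMap stands in for Redis; in a Spring Boot application the routing would normally live behind a routing DataSource and the cache behind a Redis client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Picks a data source name: the primary for writes, replicas in rotation for reads. */
final class ReadWriteRouter {
    private final String primary;
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    ReadWriteRouter(String primary, List<String> replicas) {
        this.primary = primary;
        this.replicas = new ArrayList<>(replicas);
    }

    String route(boolean readOnly) {
        if (!readOnly || replicas.isEmpty()) {
            return primary;   // writes (and reads when no replica exists) hit the primary
        }
        // floorMod keeps the index valid even after the counter overflows
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }
}

/** Cache-aside lookup: serve from the cache, fall back to the loader on a miss. */
final class HotDataCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();   // stand-in for Redis
    private final Function<K, V> loader;                         // e.g. a database query

    HotDataCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }

    /** Call after the underlying row changes so the next read reloads it. */
    void evict(K key) {
        cache.remove(key);
    }
}
```

Evicting on write rather than updating the cached value keeps the write path simple; the next read repopulates the entry from the database.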

2.3 Synchronizing Heterogeneous Data Sources

For OLAP-style queries (complex filtering, full-text search, aggregation), offload the work to a store built for it, such as Elasticsearch or HBase. Elasticsearch, built on Lucene, offers distributed, near-real-time full-text search through a RESTful API. To keep Elasticsearch in sync with MySQL, Alibaba's open-source project Canal can be used. Canal presents itself to MySQL as a replica, receives the binlog, parses it, and exposes the resulting row-level change events through a server interface. It integrates with ZooKeeper for high availability and scalability, so downstream consumers receive change events without handling raw binlog formats.

2.4 Sharding (Vertical & Horizontal)

If the previous techniques are insufficient, sharding can be applied. Vertical sharding separates databases by business domain (e.g., orders, products, payments). Horizontal sharding distributes the rows of a single logical table across multiple databases and tables based on a chosen sharding key (e.g., creation time or the hash of a distributed ID). For example, an order database can be split into four databases with four tables each, using distributed_id % 4 as the rule for choosing the database. The table inside that database then needs a second rule that is independent of the first; reusing distributed_id % 4 would leave only one table per database in use.
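
A possible two-level routing rule for the four-database, four-table example is sketched below. The database rule (distributed_id % 4) comes from the example above; the table rule ((distributed_id / 4) % 4) and the names order_db_N and t_order_N are assumptions made for illustration only.

```java
/** Maps a distributed ID to one of 4 databases x 4 tables. */
final class ShardRouter {
    static final int DB_COUNT = 4;
    static final int TABLES_PER_DB = 4;

    /** Database index: distributed_id % 4. */
    static int dbIndex(long id) {
        return (int) Math.floorMod(id, (long) DB_COUNT);
    }

    /** Table index (assumed rule): (distributed_id / 4) % 4, independent of dbIndex. */
    static int tableIndex(long id) {
        return (int) Math.floorMod(Math.floorDiv(id, (long) DB_COUNT), (long) TABLES_PER_DB);
    }

    /** Hypothetical physical name, e.g. "order_db_1.t_order_0". */
    static String route(long id) {
        return "order_db_" + dbIndex(id) + ".t_order_" + tableIndex(id);
    }
}
```

With this rule, 16 consecutive IDs land on 16 different physical tables, so load spreads evenly as long as the IDs themselves are evenly distributed.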

3. Real‑Time Sync to Elasticsearch

The core solution combines data archiving, read‑write separation, and heterogeneous source sync. Two implementation paths for MySQL‑to‑Elasticsearch synchronization are discussed:

Application-level CRUD mirroring: each insert/update/delete in MySQL also triggers the corresponding operation in Elasticsearch. This gives the lowest latency but tightly couples business code to the search layer, and because the two writes are not atomic, a failure between them leaves MySQL and Elasticsearch inconsistent.

Binlog‑based sync using Canal: Canal parses binlog events, which are then streamed (often via Kafka) to a consumer that writes to Elasticsearch. This method is non‑intrusive, highly available, and can batch operations for better throughput, though it may introduce slight latency when joins are required to construct wide index rows.
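
To make the consumer side of the binlog path more concrete, the sketch below turns a batch of row-level change events into the newline-delimited body expected by the Elasticsearch _bulk endpoint. RowChange and BulkRequestBuilder are hypothetical types, not part of Canal or the Elasticsearch client; the sketch assumes the event has already been decoded into an ID and a JSON document, and that the index name and IDs contain no characters that need JSON escaping.

```java
import java.util.List;

/** Simplified row-level change event, as a consumer might see it after decoding. */
final class RowChange {
    enum Type { INSERT, UPDATE, DELETE }

    final Type type;
    final String id;        // primary key, used as the Elasticsearch _id
    final String docJson;   // full document as JSON; ignored for DELETE

    RowChange(Type type, String id, String docJson) {
        this.type = type;
        this.id = id;
        this.docJson = docJson;
    }
}

/** Builds the NDJSON payload for a bulk request from a batch of changes. */
final class BulkRequestBuilder {

    static String toBulkBody(String index, List<RowChange> changes) {
        StringBuilder body = new StringBuilder();
        for (RowChange change : changes) {
            String action = change.type == RowChange.Type.DELETE ? "delete" : "index";
            // Action line: {"index":{"_index":"...","_id":"..."}} or {"delete":{...}}
            body.append("{\"").append(action).append("\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(change.id).append("\"}}\n");
            if (change.type != RowChange.Type.DELETE) {
                // Source line: the whole document replaces any existing one
                body.append(change.docJson).append('\n');
            }
        }
        return body.toString();   // every line, including the last, ends with a newline
    }
}
```

Mapping both inserts and updates to the index action replaces the whole document, which makes replaying the same event harmless; that matters because a consumer reading from a queue may see an event more than once.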

4. Conclusion

All presented techniques have trade‑offs. Evaluate the specific data characteristics of a system and select a combination of archiving, read‑write separation, cache, and Elasticsearch sync before resorting to sharding. A balanced approach avoids unnecessary complexity while achieving efficient real‑time search on massive datasets.

Repository: https://github.com/plasticene/plasticene-boot-starter-parent
