Three Storage Solutions for Cross-Database Aggregated Full-Text Search
The article compares three approaches—synchronous dual write, asynchronous dual write with a message queue, and CDC via Canal—to keep Elasticsearch and a relational database consistent for cross‑database aggregated full‑text search, outlining their steps, advantages, and drawbacks.
Scenario Assumption
Assume a product (SPU) has multiple specifications (SKU). The business requires paginated queries by SKU, aggregation at the SPU level, additional filter conditions, and paginated display of product information.
Because cross‑database pagination is needed, traditional relational databases become inadequate, so data redundancy is introduced using Elasticsearch (ES). ES aggregates SPU+SKU data and the application queries ES directly, which raises the need to keep ES and the database consistent. The following solutions are presented.
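The redundant document described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical index layout and field names: each ES document is one SKU that also carries its parent SPU's fields, so SKU-level filtering and SPU-level grouping both run against a single index (the `collapse` clause in the sample query is Elasticsearch's field-collapse feature, which returns one hit per SPU).

```python
def sku_doc(spu, sku):
    """Flatten one SKU with its parent SPU's fields into a single ES document."""
    return {
        "spu_id": spu["id"],
        "spu_name": spu["name"],
        "brand": spu["brand"],
        "sku_id": sku["id"],
        "spec": sku["spec"],
        "price": sku["price"],
    }

spu = {"id": "SPU-1", "name": "T-Shirt", "brand": "Acme"}
skus = [{"id": "SKU-1", "spec": "M/Red", "price": 19.9},
        {"id": "SKU-2", "spec": "L/Blue", "price": 21.9}]
docs = [sku_doc(spu, s) for s in skus]

# Sample search body: filter on SKU-level fields, collapse to one hit
# per SPU so the result page shows products, not individual SKUs.
query = {
    "query": {"bool": {"filter": [{"term": {"brand": "Acme"}}]}},
    "collapse": {"field": "spu_id"},
    "from": 0, "size": 20,
}
```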
Solution 1: Synchronous Dual Write
Application writes to the database and stores SPU‑aggregated SKU in ES.
Back‑door batch refresh of ES data.
Scheduled full‑refresh of ES.
This is the simplest approach: every update to the database is mirrored to ES in the same request path. However, ES initialization, index rebuilding, and data anomalies (e.g., partial failures where only one of the two writes succeeds) may require back-door batch fixes and possibly a scheduled full refresh. The drawbacks are frequent data conflicts, overwrites, and loss, which makes the approach risky.
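The steps above can be sketched as below, using in-memory dictionaries as stand-ins for the database and the ES index (the real clients are omitted). The key point is that the ES write sits in the same request path as the DB write, so a failure on either side leaves the two stores divergent until a back-door refresh repairs them.

```python
db = {}   # stands in for the relational database
es = {}   # stands in for the Elasticsearch index

def db_save(sku):
    db[sku["sku_id"]] = sku

def es_index(sku):
    es[sku["sku_id"]] = sku

def save_sku(sku):
    db_save(sku)    # 1. write the source of truth
    es_index(sku)   # 2. synchronously mirror the write to ES
    # If step 2 fails after step 1 succeeds, DB and ES are now
    # inconsistent until a back-door batch refresh runs.

save_sku({"sku_id": "SKU-1", "spu_id": "SPU-1", "price": 19.9})
```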
Solution 2: Asynchronous Dual Write
SPU aggregates SKU into ES.
Use a message queue (MQ) to asynchronously notify the application for incremental ES refresh.
Back‑door batch refresh of ES.
Scheduled full‑refresh of ES.
This approach introduces an MQ, which must be highly available so that messages are reliably consumed, and the consumer must tolerate duplicate delivery (i.e., the ES refresh must be idempotent). Incremental updates also need a feature switch, so that an ES index being rebuilt in a pre-release environment is not overwritten by production incremental updates. The drawbacks are similar to Solution 1: strong coupling with the business system, custom code per business, and reduced agility.
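A minimal sketch of the asynchronous flow, using an in-process `queue.Queue` as a stand-in for the real MQ: the consumer performs idempotent upserts (so duplicate delivery is harmless) and honors the refresh switch mentioned above.

```python
import queue

mq = queue.Queue()                  # stand-in for the real message queue
es = {}                             # stand-in for the ES index
incremental_refresh_enabled = True  # the "switch": off while rebuilding the index

def save_sku(sku):
    # Write to the database here (omitted), then notify asynchronously.
    mq.put(sku)

def consume():
    while not mq.empty():
        sku = mq.get()
        if incremental_refresh_enabled:
            es[sku["sku_id"]] = sku  # idempotent upsert: duplicates are safe

save_sku({"sku_id": "SKU-1", "price": 19.9})
save_sku({"sku_id": "SKU-1", "price": 19.9})  # duplicate delivery
consume()
```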
Solution 3: CDC Technique
Change Data Capture (CDC) captures database changes from the binlog via the Canal middleware, pushes them to a middle program, and synchronizes them to ES.
SPU aggregates SKU into ES.
Canal captures binlog changes of relevant tables.
The database change monitoring service sends captured records to MQ.
MQ asynchronously notifies the application for incremental ES refresh.
Back‑door batch rebuild of ES index.
The application only consumes incremental MQ messages. CDC is fast and precise, has low coupling to the application, and is suitable for large-scale use. It requires high availability of both MQ and Canal, and a back-door batch index rebuild may still occasionally be needed.
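The consumer side of this pipeline can be sketched as follows. This assumes a simplified Canal-style binlog message (real Canal flat messages carry additional fields such as `database`, `ts`, and `old` values): the application maps row changes to ES upserts or deletes without any coupling to the code that wrote the database.

```python
import json

es = {}  # stand-in for the ES index

def handle_binlog_message(raw):
    """Map one Canal-style change message to ES operations."""
    msg = json.loads(raw)
    for row in msg["data"]:
        if msg["type"] in ("INSERT", "UPDATE"):
            es[row["sku_id"]] = row      # upsert the changed row into ES
        elif msg["type"] == "DELETE":
            es.pop(row["sku_id"], None)  # remove the deleted row from ES

handle_binlog_message(json.dumps({
    "type": "INSERT", "table": "t_sku",
    "data": [{"sku_id": "SKU-1", "spu_id": "SPU-1", "price": "19.9"}],
}))
handle_binlog_message(json.dumps({
    "type": "DELETE", "table": "t_sku",
    "data": [{"sku_id": "SKU-1"}],
}))
```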
Conclusion
Solution 1 is not recommended. In simple business scenarios, Solution 2 can handle data volumes up to roughly one million records with moderate intrusion into the application. Solution 3 adds a middleware layer but has low intrusion and suits complex scenarios. The final choice should follow a thorough evaluation of the business trade-offs.
Architect's Journey
E‑commerce, SaaS, AI architect; DDD enthusiast; SKILL enthusiast