Boost Large Table Queries with Query Separation: Strategies, Implementation, and Storage Choices
Query separation stores a copy of data in a dedicated query system to accelerate large‑table reads, addressing cold‑data latency, write limitations, and scalability, with detailed guidance on when to adopt it, implementation methods (sync, async, binlog), storage options, MQ handling, and consistency considerations.
What Is Query Separation?
Query separation stores a copy of the primary data in a separate, query‑optimized storage system. Writes go to the primary database, while reads are served from the copy, dramatically reducing latency for large‑table queries.
Applicable Scenarios
Very large data volumes (tens of millions of rows or more).
Write performance is acceptable, but read performance is poor.
Data may be updated at any time; cold‑hot separation is not feasible.
The business requires faster query response times.
Example : A SaaS ticket‑service system stored tens of millions of tickets and required joins with dozens of sub‑tables, each containing hundreds of millions of rows. Even with indexes and SQL tuning, queries took dozens of seconds. Because some old tickets could not be archived (legal hold), cold‑hot separation was impossible. By separating writes (primary DB) from reads (a dedicated query store), response time dropped to under 500 ms .
When to Trigger Query Separation?
Trigger query separation whenever a copy of the data should be kept in a query‑optimized database. Three common construction methods are:
Synchronous build
Asynchronous build
Binlog‑based build
1. Synchronous Build
Modify business code to write the primary record and immediately create the query copy.
Advantages : Guarantees real‑time consistency of query data.
Disadvantages : Intrusive code changes and slower write throughput.
2. Asynchronous Build
Business code writes the primary record first, then enqueues a task to build the query copy later.
Advantages : Does not block the main transaction flow.
Disadvantages : Potential temporary inconsistency between primary and query stores.
3. Binlog‑Based Build
Listen to the database binary log (binlog) and construct the query copy from change events. This approach requires no code changes in the business layer.
Advantages : Zero intrusion on business code; main flow remains unaffected.
Disadvantages : Consistency guarantees are more complex and the architecture is heavier.
Implementing Asynchronous Query Separation with a Message Queue (MQ)
Pure in‑memory async processing is limited by memory size and loses data on service restart. Using a durable MQ solves these issues.
If an MQ is already in use, extend the existing pipeline.
If no MQ exists, evaluate and adopt a suitable solution (e.g., Kafka, RabbitMQ, Pulsar).
MQ Outage Handling
When the MQ crashes, query data cannot be built. Add a flag column (e.g., migrated / not_migrated) to each primary record. After the MQ recovers, scan for rows with not_migrated and replay the missing messages.
Idempotent Consumption
Ensure each message is processed exactly once. Typical techniques include:
Deduplication tables keyed by a unique message ID.
Database upserts that are safe to repeat.
This prevents duplicate inserts in the query store (e.g., an order update being applied twice).
Message Ordering
Updates for the same entity must be applied in the order they occurred. If a later update is processed before an earlier one, stale data may overwrite newer data. Strategies:
Partition the MQ by entity ID so that all messages for the same entity are processed sequentially.
Include a monotonically increasing version or timestamp in the message and let the consumer discard out‑of‑order updates.
Choosing a Storage System for Query Data
Relational databases typically cannot handle the read‑heavy workload at massive scale. Common alternatives are:
MongoDB – document store with flexible schema and secondary indexes.
HBase – wide‑column store optimized for massive write/read throughput.
Elasticsearch – full‑text search engine with powerful aggregation and near‑real‑time indexing.
The final choice should consider existing infrastructure, operational expertise, and query requirements. In the cited case, Elasticsearch was selected for its rich query DSL and the team’s familiarity.
Consuming Query Data
Each query store provides its own API (REST, native client libraries, etc.). Because the query copy may lag behind the primary database, two practical consistency strategies are:
Block user queries until the latest update is reflected (rarely used due to latency impact).
Show a warning that the displayed data may be up to one second stale and allow the user to refresh if needed.
Conclusion
Query separation is an effective technique for accelerating reads on very large tables by offloading queries to a dedicated store. It improves read latency dramatically but introduces trade‑offs such as potential write‑path overhead (especially for synchronous builds) and added system complexity (e.g., MQ management, consistency handling). These trade‑offs should be weighed against business requirements, and further optimisations (e.g., write‑performance tuning) may be needed in subsequent work.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
