Optimizing Distributed Database Sorting: In-Memory Merge to Disk Buffering
This article examines the challenges of sorting queries in distributed databases, outlines the limitations of in‑memory proxy sorting, and presents a step‑by‑step optimized approach that leverages per‑shard sorting, disk‑based buffering, and priority‑queue merging to reduce memory pressure and I/O overhead.
Background
Typical distributed database architecture consists of a global transaction manager (gtm) providing a global clock, a metadata service (catalog), horizontally sharded data groups, and a stateless proxy that routes client requests to the appropriate shards and coordinates distributed transactions.
gtm : global transaction manager (global clock), primary‑backup.
catalog : metadata management, primary‑backup.
group : horizontal shard, each group has a primary and replicas.
proxy : stateless coordinator that forwards queries, aggregates results, and ensures transactional consistency.
Sorting Problem
A query such as SELECT * FROM t1 ORDER BY field1 may need rows from many shards. The proxy must merge the ordered streams from each shard into a globally ordered result. Keeping all rows in memory works only for small result sets; large result sets cause out‑of‑memory (OOM) or excessive disk usage.
Solution Overview
1. In‑Memory Merge Sorting
Each shard sorts locally before sending data to the proxy, so the proxy receives already ordered streams. The proxy merges these streams using a heap‑based k‑way merge.
Client sends SELECT * FROM t1 ORDER BY id to the proxy.
Proxy forwards the query to the relevant shard groups.
Each shard sorts locally and streams the ordered rows back.
Proxy stores each shard’s stream in a per‑shard sort buffer (default 10 MB, split proportionally when multiple shards are involved).
Proxy merges the buffers with a priority‑queue (heap) and returns the globally ordered result.
2. Disk‑Based Sorting for Large Result Sets
If the total result exceeds available memory, the proxy writes each shard’s ordered stream to a temporary disk file and performs a heap‑based merge directly from disk.
Client issues SELECT * FROM t1 ORDER BY id.
Proxy forwards the query to each shard.
Shards sort locally and stream ordered rows.
Proxy writes each shard’s stream to a separate disk file.
Proxy builds a heap where each node holds the next row from a file; the heap size equals the number of shards.
When the heap top is emitted, the proxy reads the next row from the same file and pushes it into the heap.
Process continues until all rows are sent.
3. Final Optimized Scheme (Batch Pull & Merge)
The optimized design avoids persisting any shard data on the proxy’s disk. The proxy repeatedly pulls a fixed‑size batch of ordered rows from each shard, fills per‑shard sort buffers, and merges them with a priority queue. This loop continues until the client’s requested row count is satisfied, minimizing both memory and I/O.
Client sends SELECT * FROM t1 ORDER BY id to the proxy.
Proxy forwards the query to the relevant shard groups.
Each shard returns a fixed‑size ordered batch.
Proxy stores the batch in the shard’s sort buffer.
Proxy merges buffers using a heap: the smallest row is sent to the client, and the corresponding buffer is refilled from its shard when depleted.
The loop repeats until the client’s limit (e.g., 1 million rows) is reached.
Analysis of the Final Scheme
The design addresses three limitations of the previous disk‑based approach:
Disk capacity : No shard data is persisted on the proxy’s disk, eliminating storage exhaustion.
Disk I/O : Data stays in memory buffers; only small batches are read from shards, virtually removing disk I/O.
Network bandwidth : The proxy pulls only as many rows as needed to fill the buffers, avoiding the transfer of the full result set from every shard.
Assuming a per‑shard sort buffer of 2 MB and 50 shards, the worst‑case data transferred is 2 MB × 50 + 1 million rows, while the best case approaches 1 million + 50 rows when data is evenly distributed.
Usage Constraints
Shard nodes must support local sorting (e.g., MySQL, PostgreSQL).
Shards must allow batch reads, such as streaming or cursor queries.
For databases that retain query context across execution, batch reads can be even more efficient.
References
JDBC操作MySQL(3)—查询
MySQL JDBC StreamResult通信原理浅析
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
