Databases 10 min read

Optimizing Distributed Database Sorting with Proxy Buffers and Priority Queues

This article examines the architecture of distributed databases, identifies the challenges of global sorting across shards, and presents both in‑memory and disk‑based proxy sorting solutions, detailing buffer configuration, merge‑sort and priority‑queue techniques, performance trade‑offs, and practical constraints for large‑scale queries.

dbaplus Community

Mar 11, 2023

Optimizing Distributed Database Sorting with Proxy Buffers and Priority Queues

Background

Current distributed database architectures consist of coordinator nodes, data shards, metadata nodes, and a global clock. Typical components include GTM (global transaction manager), catalog, groups (horizontal shards), and proxy (stateless coordinator).

Sorting Challenge

Sorting queries across shards requires the proxy to merge ordered results from each shard. For small result sets, data can be kept in memory; for large sets, memory limits cause OOM and disk I/O becomes a bottleneck.

Initial In‑Memory Solution

Each shard performs local sorting; the proxy merges using merge sort or a priority‑queue. The proxy’s sort buffer (default 10 MiB) is divided among N shards (10 MiB/N). This works only for modest data volumes.

Limitations of In‑Memory Approach

Increasing the sort buffer consumes more memory, which cannot be scaled indefinitely.

Optimized Disk‑Based Sorting

Store each shard’s ordered output on disk, then perform external merge using a priority‑queue heap. The proxy reads a fixed amount from each shard into a per‑shard sort buffer, builds a heap of one entry per shard, and repeatedly extracts the smallest entry, refilling buffers as needed.

Steps:

Client sends SELECT * FROM t1 ORDER BY id to proxy.

Proxy forwards the query to relevant shards.

Each shard sorts locally and returns a fixed‑size ordered batch.

Proxy stores batches in per‑shard sort buffers on disk.

Proxy merges using a priority‑queue heap, sending results to client.

When a buffer is exhausted, proxy reads the next batch from the shard.

Process continues until all data are sent.

Analysis of Drawbacks

Proxy still needs enough disk space for buffered data; extremely large datasets may exceed capacity.

Disk I/O can be high if buffers are small.

Fetching full result sets from all shards wastes bandwidth; the optimized method fetches only as much as needed to satisfy the client limit.

Performance Scenarios

Assuming a 2 MiB sort buffer per shard and 50 shards, the worst case pulls 2 MiB × 50 + 100 W rows, while the best case pulls roughly 100 W + 50 rows when data are evenly distributed.

Applicability Conditions

Shards must support ordered queries.

Shards must allow batch reads (e.g., MySQL streaming or cursor queries).

References

JDBC操作MySQL（3）—查询 https://www.jianshu.com/p/c7c5dbe63019

MySQL JDBC StreamResult通信原理浅析 https://blog.csdn.net/xieyuooo/article/details/83109971/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance proxy priority-queue distributed databases merge sort sorting optimization external merge

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.