Databases 10 min read

Optimizing Distributed Database Sorting with Proxy Buffers and Priority Queues

This article examines the architecture of distributed databases, identifies the challenges of global sorting across shards, and presents both in‑memory and disk‑based proxy sorting solutions, detailing buffer configuration, merge‑sort and priority‑queue techniques, performance trade‑offs, and practical constraints for large‑scale queries.

dbaplus Community
dbaplus Community
dbaplus Community
Optimizing Distributed Database Sorting with Proxy Buffers and Priority Queues

Background

Current distributed database architectures consist of coordinator nodes, data shards, metadata nodes, and a global clock. Typical components include GTM (global transaction manager), catalog, groups (horizontal shards), and proxy (stateless coordinator).

Distributed architecture diagram
Distributed architecture diagram

Sorting Challenge

Sorting queries across shards requires the proxy to merge ordered results from each shard. For small result sets, data can be kept in memory; for large sets, memory limits cause OOM and disk I/O becomes a bottleneck.

Initial In‑Memory Solution

Each shard performs local sorting; the proxy merges using merge sort or a priority‑queue. The proxy’s sort buffer (default 10 MiB) is divided among N shards (10 MiB/N). This works only for modest data volumes.

Limitations of In‑Memory Approach

Increasing the sort buffer consumes more memory, which cannot be scaled indefinitely.

Optimized Disk‑Based Sorting

Store each shard’s ordered output on disk, then perform external merge using a priority‑queue heap. The proxy reads a fixed amount from each shard into a per‑shard sort buffer, builds a heap of one entry per shard, and repeatedly extracts the smallest entry, refilling buffers as needed.

Steps:

Client sends SELECT * FROM t1 ORDER BY id to proxy.

Proxy forwards the query to relevant shards.

Each shard sorts locally and returns a fixed‑size ordered batch.

Proxy stores batches in per‑shard sort buffers on disk.

Proxy merges using a priority‑queue heap, sending results to client.

When a buffer is exhausted, proxy reads the next batch from the shard.

Process continues until all data are sent.

In‑memory sorting flow
In‑memory sorting flow
Disk‑based sorting flow
Disk‑based sorting flow

Analysis of Drawbacks

Proxy still needs enough disk space for buffered data; extremely large datasets may exceed capacity.

Disk I/O can be high if buffers are small.

Fetching full result sets from all shards wastes bandwidth; the optimized method fetches only as much as needed to satisfy the client limit.

Performance Scenarios

Assuming a 2 MiB sort buffer per shard and 50 shards, the worst case pulls 2 MiB × 50 + 100 W rows, while the best case pulls roughly 100 W + 50 rows when data are evenly distributed.

Applicability Conditions

Shards must support ordered queries.

Shards must allow batch reads (e.g., MySQL streaming or cursor queries).

References

JDBC操作MySQL(3)—查询 https://www.jianshu.com/p/c7c5dbe63019

MySQL JDBC StreamResult通信原理浅析 https://blog.csdn.net/xieyuooo/article/details/83109971/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performanceProxypriority-queuedistributed databasesmerge sortsorting optimizationexternal merge
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.