Optimizing Large‑Scale Data Pagination with ClickHouse, Elasticsearch, HBase, and Redis
This article analyzes several optimization strategies (multithreaded ClickHouse pagination, Elasticsearch scroll-scan, an ES + HBase hybrid approach, and RediSearch + RedisJSON) for efficiently filtering and sorting up to 100 K records from a pool of tens of millions while reducing query latency and system complexity.
In response to a business requirement to select no more than 100 K items from a pool of tens of millions and then sort and de‑duplicate them, several design schemes are explored.
Initial Design : The workflow first filters the raw pool using configurable rules, then applies sorting according to weight configurations, finally producing the result set.
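The filter-then-sort workflow can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the article does not show how rules or weights are represented, so the `Predicate` rules, the weight map, the `sales`, `rating`, and `in_stock` field names, and the `item` helper are all hypothetical. Only `sku_id` comes from the article.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class SelectionSketch {
    /** Weighted score: sum of weight * numeric field value; missing or non-numeric fields count as 0. */
    static double score(Map<String, Object> item, Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            Object value = item.get(w.getKey());
            if (value instanceof Number) {
                total += w.getValue() * ((Number) value).doubleValue();
            }
        }
        return total;
    }

    /** Filters the pool by all rules, de-duplicates by sku_id, sorts by descending score, caps at limit (limit >= 0). */
    static List<Map<String, Object>> select(List<Map<String, Object>> pool,
                                            List<Predicate<Map<String, Object>>> rules,
                                            Map<String, Double> weights,
                                            int limit) {
        // 1. Keep items that pass every rule; the first occurrence of each sku_id wins.
        Map<Object, Map<String, Object>> bySku = new LinkedHashMap<>();
        for (Map<String, Object> item : pool) {
            boolean keep = true;
            for (Predicate<Map<String, Object>> rule : rules) {
                if (!rule.test(item)) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                bySku.putIfAbsent(item.get("sku_id"), item);
            }
        }
        // 2. Sort by descending weighted score (List.sort is stable, so ties keep pool order).
        List<Map<String, Object>> selected = new ArrayList<>(bySku.values());
        selected.sort((a, b) -> Double.compare(score(b, weights), score(a, weights)));
        // 3. Cap the result set.
        return new ArrayList<>(selected.subList(0, Math.min(limit, selected.size())));
    }

    /** Hypothetical helper that builds a sample item; the field names other than sku_id are assumptions. */
    static Map<String, Object> item(String skuId, double sales, double rating, boolean inStock) {
        Map<String, Object> m = new HashMap<>();
        m.put("sku_id", skuId);
        m.put("sales", sales);
        m.put("rating", rating);
        m.put("in_stock", inStock);
        return m;
    }
}
```

In a real deployment the filtering would be pushed down to the storage engine rather than run in application memory; the sketch only makes the order of the steps concrete.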
Multithreaded ClickHouse Pagination : Data is imported nightly from Hive into ClickHouse. A SelectionQueryCondition object encapsulates filter and sort criteria. Multiple threads execute paginated queries against ClickHouse, collecting partial results into a shared list. Sample pagination code:
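    // Needs java.util.{ArrayList, List, Map}, java.util.concurrent.{Future, TimeUnit},
    // and a CollectionUtils with isNotEmpty (e.g. Apache Commons Collections).
    int pageSize = this.getPageSize();
    // Ceiling division: avoids an empty extra page when totalNum is a multiple of pageSize.
    int pageCnt = (totalNum + pageSize - 1) / pageSize;
    List<Future<List<Map<String, Object>>>> futureList = new ArrayList<>(pageCnt);
    // Submit one query task per page (pages are 1-based).
    for (int i = 1; i <= pageCnt; i++) {
        SelectionQueryCondition condition = buildSelectionQueryCondition(ruleData);
        condition.setPageSize(pageSize);
        condition.setPage(i);
        futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(condition)));
    }
    // Collect pages in submission order, waiting at most 20 seconds for each.
    // Future.get throws InterruptedException, ExecutionException, and TimeoutException,
    // which the enclosing method must handle or declare.
    for (Future<List<Map<String, Object>>> future : futureList) {
        List<Map<String, Object>> queryRes = future.get(20, TimeUnit.SECONDS);
        if (CollectionUtils.isNotEmpty(queryRes)) {
            result.addAll(queryRes);
        }
    }

The final step sorts the aggregated result list to obtain the ordered output.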
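ClickHouse Pagination Limitations : The paginated query uses LIMIT #{limitStart}, #{limitEnd}. In ClickHouse the two-argument form is LIMIT offset, count, so the second parameter is a row count despite its name. Because the server must still produce and discard every skipped row, latency grows with page depth; the worst case observed is 10-18 seconds on a 10 M pool.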
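A little arithmetic shows why deep pages are expensive. The sketch below assumes the engine has to produce every skipped row before it can return a page, which is the usual behavior for a sorted offset query; the class and method names are illustrative and not taken from the article's code.

```java
class OffsetPaginationCost {
    /** Row offset for a 1-based page number. */
    static long offsetFor(int page, int pageSize) {
        return (long) (page - 1) * pageSize;
    }

    /** Rows the engine must produce for one page: the skipped rows plus the returned rows. */
    static long rowsTouchedForPage(int page, int pageSize) {
        return offsetFor(page, pageSize) + pageSize;
    }

    /** Total rows produced when every page of a result set is fetched with offset pagination. */
    static long rowsTouchedForAllPages(int totalRows, int pageSize) {
        int pageCnt = (totalRows + pageSize - 1) / pageSize; // ceiling division
        long total = 0;
        for (int page = 1; page <= pageCnt; page++) {
            total += rowsTouchedForPage(page, pageSize);
        }
        return total;
    }

    /** ClickHouse-style clause: LIMIT offset, count. */
    static String limitClause(int page, int pageSize) {
        return "LIMIT " + offsetFor(page, pageSize) + ", " + pageSize;
    }
}
```

Under that assumption, fetching 100,000 rows in pages of 1,000 makes the engine produce 5,050,000 rows in total, roughly 50 times the size of the result set, and the overhead grows quadratically with the number of pages.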
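Elasticsearch Scroll-Scan Optimization : To mitigate deep-pagination costs, the scroll API with scan mode is employed; it walks the result set through a server-side cursor instead of re-evaluating an offset for each page. (The scan search type was removed in Elasticsearch 5.0; on later versions a scroll sorted by _doc serves the same purpose.) However, because the scroll is consumed by a single thread, its latency is similar to that of the ClickHouse approach.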
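The scroll pattern itself is a simple loop: open a cursor, keep requesting the next batch with the returned scroll id until a batch comes back empty, then release the cursor. The sketch below shows that loop against a hypothetical `ScrollClient` interface with an in-memory stand-in; it is not the real Elasticsearch client API, and every name in it is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ScrollSketch {
    /** One batch of hits plus the cursor id needed to request the next batch. */
    static final class Page<T> {
        final String scrollId;
        final List<T> hits;

        Page(String scrollId, List<T> hits) {
            this.scrollId = scrollId;
            this.hits = hits;
        }
    }

    /** Hypothetical minimal view of a scroll-capable search client (NOT the real Elasticsearch API). */
    interface ScrollClient<T> {
        Page<T> open(int batchSize);   // first request: first batch plus cursor id
        Page<T> next(String scrollId); // later requests: continue from the cursor
        void clear(String scrollId);   // release server-side cursor resources
    }

    /** Reads batches sequentially until the cursor is exhausted or maxResults hits are collected. */
    static <T> List<T> drain(ScrollClient<T> client, int batchSize, int maxResults) {
        List<T> out = new ArrayList<>();
        Page<T> page = client.open(batchSize);
        try {
            while (!page.hits.isEmpty() && out.size() < maxResults) {
                for (T hit : page.hits) {
                    if (out.size() == maxResults) {
                        break;
                    }
                    out.add(hit);
                }
                if (out.size() < maxResults) {
                    page = client.next(page.scrollId);
                }
            }
        } finally {
            client.clear(page.scrollId);
        }
        return out;
    }

    /** In-memory stand-in so the loop can be exercised without a cluster. */
    static final class InMemoryScrollClient<T> implements ScrollClient<T> {
        private final List<T> data;
        private int cursor = 0;
        private int batchSize = 1;
        boolean cleared = false;

        InMemoryScrollClient(List<T> data) {
            this.data = data;
        }

        public Page<T> open(int batchSize) {
            this.batchSize = batchSize;
            this.cursor = 0;
            return next("scroll-1");
        }

        public Page<T> next(String scrollId) {
            int end = Math.min(cursor + batchSize, data.size());
            List<T> hits = new ArrayList<>(data.subList(cursor, end));
            cursor = end;
            return new Page<>(scrollId, hits);
        }

        public void clear(String scrollId) {
            cleared = true;
        }
    }
}
```

Each batch depends on the cursor state left by the previous one, so a single scroll cannot be fanned out across threads the way independent pages can. That is consistent with the single-threaded latency described above.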
ES + HBase Hybrid Scheme : ES is used solely for filtering and returns only the unique sku_id values. HBase then retrieves the required fields by row key, a point lookup whose cost is roughly constant per key (the O(1) of the original design) and does not depend on page depth. This combination reduces worst-case latency to 3-6 seconds for the 10 M pool scenario.
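The second half of the hybrid scheme, turning a list of sku_id values into full rows, can be sketched as a batched multi-get. The sketch models the store as a `Function` from a batch of keys to the rows found, backed here by an in-memory map. This is a stand-in chosen for illustration, not the HBase client API, and the batch size and helper names are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class HybridLookupSketch {
    /**
     * Fetches rows for the given ids in batches, preserving the order in which
     * the filter stage returned them and skipping ids with no stored row.
     */
    static List<Map<String, Object>> fetchDetails(
            List<String> skuIds,
            Function<List<String>, Map<String, Map<String, Object>>> multiGet,
            int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        // De-duplicate while keeping first-seen order.
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(skuIds));
        List<Map<String, Object>> rows = new ArrayList<>(unique.size());
        for (int start = 0; start < unique.size(); start += batchSize) {
            List<String> batch = unique.subList(start, Math.min(start + batchSize, unique.size()));
            Map<String, Map<String, Object>> found = multiGet.apply(batch);
            for (String id : batch) {
                Map<String, Object> row = found.get(id);
                if (row != null) {
                    rows.add(row);
                }
            }
        }
        return rows;
    }

    /** In-memory stand-in for a key-value store keyed by row key (hypothetical). */
    static Function<List<String>, Map<String, Map<String, Object>>> inMemoryMultiGet(
            Map<String, Map<String, Object>> table) {
        return keys -> {
            Map<String, Map<String, Object>> found = new HashMap<>();
            for (String key : keys) {
                Map<String, Object> row = table.get(key);
                if (row != null) {
                    found.put(key, row);
                }
            }
            return found;
        };
    }
}
```

Because the search engine returns only identifiers, the expensive part of each hit (the document payload) is fetched by key instead of being carried through the paginated query.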
RediSearch + RedisJSON Proposal : RediSearch provides fast full‑text search, while RedisJSON stores the document payload. Benchmarks show RediSearch indexing 5.6 M documents 58 % faster than Elasticsearch and delivering 4× higher query throughput. RedisJSON further outperforms MongoDB and ES in both isolated reads/writes and mixed workloads.
Conclusion : For massive data sets, multithreaded ClickHouse pagination works but suffers from deep‑page penalties; Elasticsearch scroll‑scan offers limited gains; the ES + HBase hybrid approach delivers the best latency at the cost of added system complexity; and RediSearch + RedisJSON shows promise for future iterations.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.