Design and Optimization of Querying 100K Records from Tens of Millions of Data Using ClickHouse, Elasticsearch, HBase, and RediSearch
This article presents a comprehensive design and performance‑optimization study for extracting up to 100 000 records from a pool of tens of millions, comparing multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, ES + HBase, and RediSearch + RedisJSON solutions, and provides practical recommendations based on measured latency and throughput.
In response to a business requirement to filter no more than 100 000 items from a pool of tens of millions and then sort and de‑duplicate them, several technical solutions are explored and evaluated.
Initial design separates the process into two steps: (1) select target data according to configurable filter rules, and (2) sort the selected data according to configurable ranking rules.
The first implementation imports the Hive pool table into ClickHouse daily and performs multithreaded pagination on the ClickHouse table. The core Java code creates a SelectionQueryCondition object, launches a thread pool, and aggregates results into a result list:
int pageSize = this.getPageSize();
int pageCnt = totalNum / this.getPageSize() + 1;
List
> result = Lists.newArrayList();
List
>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
selectionQueryCondition.setPageSize(pageSize);
selectionQueryCondition.setPage(i);
futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}
for (Future
>> future : futureList) {
List
> queryRes = future.get(20, TimeUnit.SECONDS);
if (CollectionUtils.isNotEmpty(queryRes)) {
result.addAll(queryRes);
}
}ClickHouse pagination relies on the SQL limit #{limitStart},#{limitEnd} clause, which shows performance degradation (10–18 s) when deep paging over 10 M‑level pools.
Elasticsearch scroll‑scan optimization replaces the deep‑paging limit with the scroll scan API. Various ES paging strategies (from/size, scroll, scroll‑scan, search‑after) are listed, and benchmark images illustrate that ES outperforms ClickHouse for result sets under 30 K, while ClickHouse multithreading wins for larger result sets.
ES + HBase combination addresses the limitation of single‑threaded ES scroll‑scan by using ES only for filtering (returning sku_id and doc_id ) and HBase for fast O(1) retrieval of the remaining fields via the rowKey. This reduces query time on a 10 M‑level pool from 10–18 s to 3–6 s.
RediSearch + RedisJSON is introduced as a future high‑performance alternative. Reported benchmarks show RediSearch indexing 5.6 M documents in 221 s (58 % faster than ES) and achieving 12.5 K ops/sec throughput versus 3.1 K ops/sec for ES. RedisJSON demonstrates orders‑of‑magnitude speed advantages over MongoDB and ES for isolated reads/writes.
Conclusion summarizes the measured worst‑case latencies for each approach: multithreaded ClickHouse (10–18 s), single‑threaded ES scroll‑scan (no clear gain), ES + HBase (3–6 s), and RediSearch + RedisJSON (to be measured). The article also notes the increased system complexity of the ES + HBase solution.
Selected Java Interview Questions
A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.