Optimizing Large‑Scale Data Retrieval with ClickHouse, Elasticsearch Scroll Scan, ES+HBase, and RediSearch+RedisJSON
This article examines a business requirement to filter up to 100 000 records from a pool of tens of millions, presenting and evaluating four backend solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑HBase hybrid, and RediSearch + RedisJSON—along with performance data and implementation details.
Introduction
When a business need arises to select no more than 100 K items from a pool of tens of millions and then sort and de‑duplicate them according to weight rules, several backend strategies can be employed. The article outlines four candidate solutions and their optimizations.
Initial Design
The basic workflow consists of two steps: first, filter the target data from the pool table based on configurable selection rules; second, sort the filtered data according to configurable ranking rules.
The technical plan includes:
Import the massive pool data (stored in a Hive table) into ClickHouse daily for fast querying.
Encapsulate the selection and sorting criteria into a SelectionQueryCondition object.
Use multithreaded pagination on the ClickHouse pool table to retrieve target records and store them in a result list.
Sort the result list to produce the final output.
int pageSize = this.getPageSize();
int pageCnt = totalNum / this.getPageSize() + 1;
List
> result = Lists.newArrayList();
List
>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
selectionQueryCondition.setPageSize(pageSize);
selectionQueryCondition.setPage(i);
futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}
for (Future
>> future : futureList) {
List
> queryRes = future.get(20, TimeUnit.SECONDS);
if (CollectionUtils.isNotEmpty(queryRes)) {
result.addAll(queryRes);
}
}ClickHouse Pagination
The multithreaded pagination method wraps a queryPoolSkuList call that ultimately invokes sqlSession.selectList on the ClickHouse table. Pagination is achieved with the limit #{limitStart}, #{limitEnd} clause, which can become a performance bottleneck for deep pagination (10 s–18 s for 10 % of a 10 M‑row pool).
public List
> queryPoolSkuList(Map
params) {
List
> resultMaps = new ArrayList<>();
QueryCondition queryCondition = parseQueryCondition(params);
List
> mapList = lianNuDao.queryPoolSkuList(getCkDt(), queryCondition);
if (CollectionUtils.isNotEmpty(mapList)) {
for (Map
data : mapList) {
resultMaps.add(camelKey(data));
}
}
return resultMaps;
}Elasticsearch Scroll‑Scan Optimization
To alleviate ClickHouse deep‑pagination latency, an Elasticsearch scroll scan approach is introduced. Various ES paging methods (from+size, scroll, scroll‑scan, search‑after) are compared, showing that for result sets under 30 K, ES outperforms ClickHouse, while for larger sets ClickHouse multithreading is faster.
ES + HBase Hybrid Query
The hybrid solution uses Elasticsearch solely for filtering, returning only the unique identifiers ( sku_id and doc_id ). HBase then retrieves the remaining fields in O(1) time using the sku_id as the row key. Field reduction (e.g., fetching only sku_id and price ) dramatically cuts query time from 9.3 s to 2.4 s.
RediSearch + RedisJSON
RediSearch, built on Redis, together with RedisJSON, offers superior indexing and query performance. Benchmarks show RediSearch indexing 5.6 M documents in 221 s (58 % faster than ES) and achieving 12.5 K ops/sec throughput versus 3.1 K ops/sec for ES, with lower latency (8 ms vs 10 ms). RedisJSON also outperforms MongoDB and ES by large margins in isolated read/write workloads.
Conclusion
The article compares four solutions for extracting ~100 K records from a 10 M‑row pool:
Multithreaded ClickHouse pagination: worst‑case 10 s–18 s.
Single‑threaded ES scroll‑scan: no clear advantage over ClickHouse.
ES + HBase hybrid: worst‑case reduced to 3 s–6 s.
RediSearch + RedisJSON: promising performance, pending real‑world testing.
While the hybrid approach yields the best latency, it introduces additional system complexity due to dual data storage.
Selected Java Interview Questions
A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.