Backend Development 11 min read

Optimizing Large‑Scale Data Retrieval with ClickHouse, Elasticsearch Scroll Scan, ES+HBase, and RediSearch+RedisJSON

This article examines a business requirement to filter up to 100 000 records from a pool of tens of millions, presenting and evaluating four backend solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑HBase hybrid, and RediSearch + RedisJSON—along with performance data and implementation details.

Selected Java Interview Questions

Dec 29, 2022

Optimizing Large‑Scale Data Retrieval with ClickHouse, Elasticsearch Scroll Scan, ES+HBase, and RediSearch+RedisJSON

Introduction

When a business need arises to select no more than 100 K items from a pool of tens of millions and then sort and de‑duplicate them according to weight rules, several backend strategies can be employed. The article outlines four candidate solutions and their optimizations.

Initial Design

The basic workflow consists of two steps: first, filter the target data from the pool table based on configurable selection rules; second, sort the filtered data according to configurable ranking rules.

The technical plan includes:

Import the massive pool data (stored in a Hive table) into ClickHouse daily for fast querying.

Encapsulate the selection and sorting criteria into a SelectionQueryCondition object.

Use multithreaded pagination on the ClickHouse pool table to retrieve target records and store them in a result list.

Sort the result list to produce the final output.

int pageSize = this.getPageSize();
int pageCnt = totalNum / this.getPageSize() + 1;
List<Map<String, Object>> result = Lists.newArrayList();
List<Future<List<Map<String, Object>>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
    SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
    selectionQueryCondition.setPageSize(pageSize);
    selectionQueryCondition.setPage(i);
    futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}
for (Future<List<Map<String, Object>>> future : futureList) {
    List<Map<String, Object>> queryRes = future.get(20, TimeUnit.SECONDS);
    if (CollectionUtils.isNotEmpty(queryRes)) {
        result.addAll(queryRes);
    }
}

ClickHouse Pagination

The multithreaded pagination method wraps a queryPoolSkuList call that ultimately invokes sqlSession.selectList on the ClickHouse table. Pagination is achieved with the limit #{limitStart}, #{limitEnd} clause, which can become a performance bottleneck for deep pagination (10 s–18 s for 10 % of a 10 M‑row pool).

public List<Map<String, Object>> queryPoolSkuList(Map<String, Object> params) {
    List<Map<String, Object>> resultMaps = new ArrayList<>();
    QueryCondition queryCondition = parseQueryCondition(params);
    List<Map<String, Object>> mapList = lianNuDao.queryPoolSkuList(getCkDt(), queryCondition);
    if (CollectionUtils.isNotEmpty(mapList)) {
        for (Map<String, Object> data : mapList) {
            resultMaps.add(camelKey(data));
        }
    }
    return resultMaps;
}

Elasticsearch Scroll‑Scan Optimization

To alleviate ClickHouse deep‑pagination latency, an Elasticsearch scroll scan approach is introduced. Various ES paging methods (from+size, scroll, scroll‑scan, search‑after) are compared, showing that for result sets under 30 K, ES outperforms ClickHouse, while for larger sets ClickHouse multithreading is faster.

ES + HBase Hybrid Query

The hybrid solution uses Elasticsearch solely for filtering, returning only the unique identifiers ( sku_id and doc_id). HBase then retrieves the remaining fields in O(1) time using the sku_id as the row key. Field reduction (e.g., fetching only sku_id and price) dramatically cuts query time from 9.3 s to 2.4 s.

RediSearch + RedisJSON

RediSearch, built on Redis, together with RedisJSON, offers superior indexing and query performance. Benchmarks show RediSearch indexing 5.6 M documents in 221 s (58 % faster than ES) and achieving 12.5 K ops/sec throughput versus 3.1 K ops/sec for ES, with lower latency (8 ms vs 10 ms). RedisJSON also outperforms MongoDB and ES by large margins in isolated read/write workloads.

Conclusion

The article compares four solutions for extracting ~100 K records from a 10 M‑row pool:

Multithreaded ClickHouse pagination: worst‑case 10 s–18 s.

Single‑threaded ES scroll‑scan: no clear advantage over ClickHouse.

ES + HBase hybrid: worst‑case reduced to 3 s–6 s.

RediSearch + RedisJSON: promising performance, pending real‑world testing.

While the hybrid approach yields the best latency, it introduces additional system complexity due to dual data storage.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

backend ClickHouse HBase Data Retrieval

Written by

Selected Java Interview Questions

A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Introduction

Initial Design

ClickHouse Pagination

Elasticsearch Scroll‑Scan Optimization

ES + HBase Hybrid Query

RediSearch + RedisJSON

Conclusion

Selected Java Interview Questions

How this landed with the community

Was this worth your time?

0 Comments

ES + HBase Hybrid Query

RediSearch + RedisJSON