Big Data 12 min read

Optimizing Large-Scale Data Retrieval: ClickHouse Pagination, Elasticsearch Scroll Scan, ES+HBase, and RediSearch + RedisJSON Solutions

This article examines a business requirement to filter and rank up to 100,000 records from a pool of tens of millions, presenting and evaluating four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan deep paging, an ES‑HBase combined query, and a RediSearch + RedisJSON approach—along with performance data and code examples.

Java Architect Essentials
Java Architect Essentials
Java Architect Essentials
Optimizing Large-Scale Data Retrieval: ClickHouse Pagination, Elasticsearch Scroll Scan, ES+HBase, and RediSearch + RedisJSON Solutions

In response to a business need to select no more than 100K items from a pool of tens of millions and then sort and de‑duplicate them according to configurable weight rules, several design ideas and optimizations are introduced.

Initial Design

The baseline plan consists of two steps: first filter the pool table according to the configured selection rules to obtain the "target data", then sort the target data according to the configured ordering rules to produce the final result set.

The technical implementation uses ClickHouse (CK) as the storage engine and proceeds as follows:

Import the raw pool data (a Hive table) into CK on a daily basis.

Wrap the business selection and ordering rules into a SelectionQueryCondition object.

When reading the CK pool table, start multiple threads, each fetching a page of data and storing the results in a shared result list.

Finally sort the result list to obtain the ordered output.

Multithreaded pagination code (simplified):

// page size, default 5000
int pageSize = this.getPageSize();
int pageCnt = totalNum / this.getPageSize() + 1;

List
> result = Lists.newArrayList();
List
>>> futureList = new ArrayList<>(pageCnt);

for (int i = 1; i <= pageCnt; i++) {
    SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
    selectionQueryCondition.setPageSize(pageSize);
    selectionQueryCondition.setPage(i);
    futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}

for (Future
>> future : futureList) {
    List
> queryRes = future.get(20, TimeUnit.SECONDS);
    if (CollectionUtils.isNotEmpty(queryRes)) {
        result.addAll(queryRes);
    }
}

CK pagination relies on LIMIT #{limitStart}, #{limitEnd} . While functional, the limit‑based approach suffers from severe performance degradation on deep paging: selecting 100K rows from a 10M‑row pool can take 10–18 seconds.

Elasticsearch Scroll‑Scan Deep Paging

To address CK's deep‑paging bottleneck, an Elasticsearch scroll‑scan strategy is evaluated. Elasticsearch offers several paging methods (from/size, scroll, scroll‑scan, search‑after). The scroll‑scan approach keeps a server‑side cursor, allowing efficient retrieval of large result sets.

Performance tests show that for result sets under ~30K rows, ES outperforms CK, but beyond ~50K rows the multithreaded CK solution becomes faster.

ES + HBase Combined Query

Because ES returns full documents, the fetch phase can dominate when many fields are requested. The combined solution uses ES solely for filtering (returning only sku_id and doc_id ), then looks up the remaining fields in HBase using the sku_id as the row key (O(1) access).

In a gray‑scale test on a 10M‑row pool, this hybrid approach reduced worst‑case latency from 10–18 seconds (CK) to 3–6 seconds, at the cost of added system complexity and dual data storage.

RediSearch + RedisJSON Optimization

RediSearch, a full‑text search engine built on Redis, together with RedisJSON (JSON support module), can deliver dramatically higher indexing and query performance. Reported benchmarks include:

Indexing 5.6 M documents: RediSearch 221 s vs Elasticsearch 349 s (58 % faster).

Search throughput: 12.5 K ops/s (RediSearch) vs 3.1 K ops/s (ES).

Latency: 8 ms (RediSearch) vs 10 ms (ES).

RedisJSON read/write speeds exceed MongoDB by >5× (writes) and >12× (reads), and surpass ES by >200× (writes) and >500× (reads).

These figures suggest RediSearch + RedisJSON could further improve the pipeline, though it has not yet been measured in the target scenario.

Summary

The article compares four solutions for extracting ~100K rows from a 10M‑row pool:

Multithreaded ClickHouse pagination – worst‑case 10 s ~ 18 s.

Single‑threaded Elasticsearch scroll‑scan – no clear advantage over CK.

ES + HBase combined query – worst‑case 3 s ~ 6 s.

RediSearch + RedisJSON – promising performance, pending real‑world testing.

Choosing the optimal approach depends on data volume, field selection, and system complexity considerations.

PerformanceOptimizationElasticsearchRedisClickHouseHBaseBigData
Java Architect Essentials
Written by

Java Architect Essentials

Committed to sharing quality articles and tutorials to help Java programmers progress from junior to mid-level to senior architect. We curate high-quality learning resources, interview questions, videos, and projects from across the internet to help you systematically improve your Java architecture skills. Follow and reply '1024' to get Java programming resources. Learn together, grow together.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.