Optimizing 100K‑Record Queries from Tens of Millions: CK, ES, HBase & Redis Strategies

This article examines a real‑world requirement to extract no more than 100 000 rows from a pool of tens of millions, comparing multithreaded ClickHouse pagination, Elasticsearch scroll‑scan deep paging, an ES‑HBase hybrid query, and a RediSearch‑RedisJSON approach, and presents performance measurements and practical conclusions.

IT Architects Alliance

Problem Statement

A production system needs to select at most 100 000 items from a pool of tens of millions of records and then sort and de‑duplicate them according to configurable weight rules (e.g., no three consecutive items of the same category).
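
The example rule (no three consecutive items of the same category) can be illustrated with a greedy pass over a list that is already sorted by weight. This is only a sketch under stated assumptions: the `Item` type, its `weight` field, and the `scatter` method are hypothetical names, and the source does not describe how its rule engine is actually implemented.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class CategoryScatter {

    // Hypothetical item shape; only sku_id comes from the article.
    record Item(String skuId, String category, double weight) {}

    // Re-orders a weight-sorted list so that at most maxRun consecutive
    // items share a category, keeping the original order where possible.
    static List<Item> scatter(List<Item> sortedByWeight, int maxRun) {
        LinkedList<Item> pending = new LinkedList<>(sortedByWeight);
        List<Item> out = new ArrayList<>(sortedByWeight.size());
        String lastCategory = null;
        int run = 0;
        while (!pending.isEmpty()) {
            Item chosen = null;
            // Take the highest-ranked pending item that does not extend the run too far.
            for (Iterator<Item> it = pending.iterator(); it.hasNext(); ) {
                Item candidate = it.next();
                boolean wouldExceed = candidate.category().equals(lastCategory) && run >= maxRun;
                if (!wouldExceed) {
                    chosen = candidate;
                    it.remove();
                    break;
                }
            }
            // Only one category is left: the rule cannot be met, so keep the order.
            if (chosen == null) {
                chosen = pending.removeFirst();
            }
            if (chosen.category().equals(lastCategory)) {
                run++;
            } else {
                lastCategory = chosen.category();
                run = 1;
            }
            out.add(chosen);
        }
        return out;
    }
}
```

Calling `CategoryScatter.scatter(items, 2)` enforces the "no three in a row" rule. The scan is quadratic in the worst case, which is acceptable for a sketch but would need a per-category queue for 100 000 items.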

Initial Design – Multithreaded ClickHouse Pagination

Data from a Hive table is loaded daily into ClickHouse (CK). A SelectionQueryCondition object encapsulates filtering and sorting rules. The CK pool table is read with multiple threads, each thread fetching a page of results. All pages are merged into a result list and finally sorted.

// Page size, default 5000
int pageSize = this.getPageSize();
// Ceiling division: avoids querying an extra, empty page when totalNum is a multiple of pageSize
int pageCnt = (totalNum + pageSize - 1) / pageSize;
List<Map<String,Object>> result = Lists.newArrayList();
List<Future<List<Map<String,Object>>>> futureList = new ArrayList<>(pageCnt);
// Submit one query task per page (pages are 1-based)
for (int i = 1; i <= pageCnt; i++) {
    SelectionQueryCondition cond = buildSelectionQueryCondition(selectionQueryRuleData);
    cond.setPageSize(pageSize);
    cond.setPage(i);
    futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(cond)));
}
// Collect the pages in submission order; each page may wait up to 20 s
for (Future<List<Map<String,Object>>> f : futureList) {
    List<Map<String,Object>> page = f.get(20, TimeUnit.SECONDS);
    if (CollectionUtils.isNotEmpty(page)) {
        result.addAll(page);
    }
}

Performance on a 10 M‑row pool: worst‑case latency 10 s – 18 s.
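
The fan-out pattern used here can be tried in isolation with a self-contained sketch. It assumes a hypothetical `fetchPage` function in place of the real ClickHouse query, and the class and method names are illustrative only; it shows the ceiling-division page count and the ordered collection of futures, not the author's actual classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BiFunction;

class ParallelPageFetch {

    // Ceiling division (pageSize must be > 0): 10 000 rows at 5 000 per page is 2 pages.
    static int pageCount(int totalNum, int pageSize) {
        return (totalNum + pageSize - 1) / pageSize;
    }

    // fetchPage is a hypothetical stand-in for the CK query: (page, pageSize) -> rows.
    static <T> List<T> fetchAll(int totalNum, int pageSize, int threads,
                                BiFunction<Integer, Integer, List<T>> fetchPage) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (int page = 1; page <= pageCount(totalNum, pageSize); page++) {
                final int p = page;
                Callable<List<T>> task = () -> fetchPage.apply(p, pageSize);
                futures.add(pool.submit(task));
            }
            List<T> result = new ArrayList<>();
            for (Future<List<T>> f : futures) {
                // Futures are read in submission order, so pages stay in order.
                result.addAll(f.get(20, TimeUnit.SECONDS));
            }
            return result;
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            throw new IllegalStateException("page fetch failed", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because every page is still an offset query, later pages cost more than earlier ones; threads hide that cost but do not remove it.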

Optimization 1 – Elasticsearch Scroll‑Scan Deep Paging

To avoid the deep‑paging overhead of CK, the Elasticsearch (ES) scroll API was used instead. Scroll keeps a search context open on the server and returns a large result set batch by batch, so no request has to skip over an offset. The implementation remained single‑threaded, however, so the latency improvement was modest.
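
The scroll loop itself is short. The sketch below assumes a hypothetical `ScrollTransport` interface rather than a real client library; the comments note the REST calls a real transport would issue. It collects `sku_id` values until the result set is exhausted or a row limit is reached, then releases the search context.

```java
import java.util.ArrayList;
import java.util.List;

class ScrollSketch {

    // One batch of hits plus the cursor for the next batch.
    record Batch(String scrollId, List<String> skuIds) {}

    // Hypothetical transport; a real one would issue the REST calls noted here.
    interface ScrollTransport {
        // POST /<index>/_search?scroll=1m   body: {"size": batchSize, "sort": ["_doc"], "query": ...}
        Batch open(int batchSize);
        // POST /_search/scroll              body: {"scroll": "1m", "scroll_id": scrollId}
        Batch next(String scrollId);
        // DELETE /_search/scroll            body: {"scroll_id": scrollId}
        void clear(String scrollId);
    }

    static List<String> collect(ScrollTransport es, int batchSize, int limit) {
        List<String> out = new ArrayList<>();
        Batch batch = es.open(batchSize);
        String scrollId = batch.scrollId();
        try {
            while (!batch.skuIds().isEmpty() && out.size() < limit) {
                for (String id : batch.skuIds()) {
                    if (out.size() == limit) {
                        break;
                    }
                    out.add(id);
                }
                if (out.size() < limit) {
                    batch = es.next(scrollId);
                    scrollId = batch.scrollId(); // the id may change between calls
                }
            }
        } finally {
            es.clear(scrollId); // release the server-side search context
        }
        return out;
    }
}
```

Each `next` call depends on the previous one, which is why a single scroll cannot be parallelized the way offset pages can.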

ES Pagination Options

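from + size: offset paging; by default limited to the first 10 000 hits (index.max_result_window).

scroll: keeps a search context open and returns the result set batch by batch.

scroll + scan: scroll without sorting, available as the scan search type in early ES releases; later versions get the same effect by sorting on _doc.

search_after: stateless cursor paging that passes the sort values of the last hit into the next request.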
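
To make the last option concrete, the sketch below builds the JSON body of a `search_after` request. The `weight` sort field and the helper name `build` are assumptions for illustration; `sku_id` serves as the tiebreaker so that the sort order is total. String values are not JSON-escaped here, which a real implementation would need.

```java
import java.util.List;

class SearchAfterBody {

    // Builds the request body for one page. Pass null for the first page,
    // then the "sort" values of the last hit of the previous page.
    static String build(int size, List<Object> lastSortValues) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"size\":").append(size)
          .append(",\"sort\":[{\"weight\":\"desc\"},{\"sku_id\":\"asc\"}]");
        if (lastSortValues != null && !lastSortValues.isEmpty()) {
            sb.append(",\"search_after\":[");
            for (int i = 0; i < lastSortValues.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                Object v = lastSortValues.get(i);
                if (v instanceof Number) {
                    sb.append(v);                          // numbers are written bare
                } else {
                    sb.append('"').append(v).append('"');  // sketch only: no escaping
                }
            }
            sb.append(']');
        }
        return sb.append('}').toString();
    }
}
```

Unlike scroll, this keeps no state on the server, so a request can be retried independently.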

Benchmarks (not shown) indicated that ES outperforms CK for result sets below 30 K rows, while CK’s multithreaded pagination is faster above 50 K rows.

Optimization 2 – ES + HBase Hybrid Query

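ES is used only for filtering and returns just the unique identifier sku_id (plus its internal doc_id). HBase, a wide‑column store, then supplies the remaining fields through row‑key lookups, whose cost stays close to O(1) in practice because it does not depend on how many rows the filter had to scan.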

Workflow

Execute an ES query to obtain a list of sku_id.

For each sku_id, issue an HBase Get to fetch required columns such as price, stock, etc.
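
The two steps above can be sketched with two small interfaces. `IdIndex` and `RowStore` are hypothetical stand-ins for the ES and HBase clients, not real library types, and the column names are examples; the sketch only shows how the ids from the filter step drive the key lookups.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HybridQuerySketch {

    // Hypothetical stand-in for the ES filter query: returns matching sku_ids only.
    interface IdIndex {
        List<String> searchSkuIds(String filter, int limit);
    }

    // Hypothetical stand-in for an HBase Get by row key.
    interface RowStore {
        Map<String, String> get(String rowKey, List<String> columns);
    }

    static List<Map<String, String>> query(IdIndex es, RowStore hbase,
                                           String filter, int limit, List<String> columns) {
        List<Map<String, String>> rows = new ArrayList<>();
        // Step 1: filter in ES, fetching identifiers only.
        for (String skuId : es.searchSkuIds(filter, limit)) {
            // Step 2: fetch the remaining fields by row key.
            Map<String, String> cols = hbase.get(skuId, columns);
            if (cols == null || cols.isEmpty()) {
                continue; // indexed in ES but missing in HBase
            }
            Map<String, String> row = new LinkedHashMap<>();
            row.put("sku_id", skuId);
            row.putAll(cols);
            rows.add(row);
        }
        return rows;
    }
}
```

With the real HBase client, many lookups can be sent in one call through `Table.get(List<Get>)`, which would cut round trips; the source does not say whether its implementation batches the Gets.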

In a canary (gray‑release) production test on a 10 M‑row pool, worst‑case latency dropped to 3 s – 6 s, a significant improvement over the 10 s – 18 s of the original CK approach.

Optimization 3 – RediSearch + RedisJSON

RediSearch provides full‑text search and aggregation on top of Redis, while RedisJSON adds native JSON storage and indexing.
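
As a rough illustration of how the two modules fit together, the sketch below assembles the arguments of an `FT.CREATE` command over JSON documents and of a matching `FT.SEARCH` query. The index name `idx:sku`, the `sku:` key prefix, and the `category` and `price` fields are assumptions made for this example; the source does not show its schema. The argument lists could be passed to any Redis client that can send raw commands.

```java
import java.util.List;

class RediSearchCommands {

    // Index every JSON document whose key starts with "sku:".
    static List<String> createIndex() {
        return List.of(
            "FT.CREATE", "idx:sku", "ON", "JSON", "PREFIX", "1", "sku:",
            "SCHEMA",
            "$.category", "AS", "category", "TAG",
            "$.price", "AS", "price", "NUMERIC", "SORTABLE");
    }

    // Filter by category tag and price range, sorted by price, with paging.
    // Sketch only: special characters in the tag value are not escaped.
    static List<String> search(String category, int minPrice, int maxPrice,
                               int offset, int count) {
        String query = "@category:{" + category + "} @price:[" + minPrice + " " + maxPrice + "]";
        return List.of(
            "FT.SEARCH", "idx:sku", query,
            "SORTBY", "price", "DESC",
            "LIMIT", String.valueOf(offset), String.valueOf(count));
    }
}
```

Documents would be written with `JSON.SET`, for example under the key `sku:1001`, and are picked up by the index automatically because of the prefix.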

Performance Highlights

Indexing 5.6 M documents: RediSearch 221 s vs. ES 349 s (≈ 58 % faster).

Throughput for two‑term queries with 32 clients: RediSearch 12.5 K ops/s vs. ES 3.1 K ops/s (≈ 4×).

Latency: RediSearch 8 ms vs. ES 10 ms.

In mixed‑load scenarios, RedisJSON outperforms MongoDB and ES in both throughput and latency.

Adopting this solution increases system complexity and requires data duplication across ES and Redis.

Conclusions

Multithreaded ClickHouse pagination is reliable for result sets > 50 K but suffers from deep‑paging overhead.

Elasticsearch scroll‑scan reduces latency for smaller result sets (below roughly 30 K rows), but single‑threaded execution limits its benefit.

Combining ES for filtering with HBase for fast key‑value lookups gave the lowest measured worst‑case latency on the 10 M‑row pool (3 s – 6 s).

RediSearch + RedisJSON shows promising raw performance, though integration cost must be weighed.
