Big Data 11 min read

Efficient 100K‑Record Queries on 10M‑Scale Data: ClickHouse, ES Scroll, ES+HBase

To retrieve up to 100 000 items from a pool of tens of millions, the article compares multi‑threaded ClickHouse pagination, Elasticsearch scroll‑scan deep paging, a combined ES‑HBase approach, and RediSearch + RedisJSON, detailing design, implementation code, performance benchmarks, and trade‑offs.

Architect's Guide
Architect's Guide
Architect's Guide
Efficient 100K‑Record Queries on 10M‑Scale Data: ClickHouse, ES Scroll, ES+HBase

Introduction

When a business requirement demands selecting no more than 100 000 records from a pool of tens of millions and then sorting and shuffling them according to weighted rules, several implementation designs and optimizations are explored.

Initial Design

The overall design consists of two steps: first filter the "target data" from the pool based on configurable filter rules, then sort the target data according to configurable sort rules to obtain the "result data".

Technical steps:

Run a daily extraction task that imports the existing tens‑of‑millions‑row Hive pool into ClickHouse for later filtering.

Build a SelectionQueryCondition object that encapsulates filter and sort rules.

When fetching target data from the ClickHouse pool, enable multithreading and pagination, storing the retrieved records in a result list.

Sort the result list to produce the final result set.

// pagination size, default 5000
int pageSize = this.getPageSize();
// page count
int pageCnt = totalNum / this.getPageSize() + 1;
List<Map<String, Object>> result = Lists.newArrayList();
List<Future<List<Map<String, Object>>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
    SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
    selectionQueryCondition.setPageSize(pageSize);
    selectionQueryCondition.setPage(i);
    futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}
for (Future<List<Map<String, Object>>> future : futureList) {
    List<Map<String, Object>> queryRes = future.get(20, TimeUnit.SECONDS);
    if (CollectionUtils.isNotEmpty(queryRes)) {
        result.addAll(queryRes);
    }
}

ClickHouse Pagination

The multithreaded pagination method queries ClickHouse with LIMIT #{limitStart}, #{limitEnd}. In deep pagination scenarios, this approach can take 10–18 seconds for 10 M‑scale pools.

Elasticsearch Scroll‑Scan Optimization

To address ClickHouse deep‑paging performance issues, an Elasticsearch scroll‑scan pagination scheme is introduced. ES provides several pagination methods: from + size , scroll , scroll‑scan , and search after .

Performance tests show that for result sets under 30 K, ES outperforms ClickHouse, while for result sets above 50 K, ClickHouse’s multithreaded approach is faster.

ES + HBase Combined Query

Because each record can contain many fields (up to 150), reducing the number of fetched fields significantly cuts query time. The combined approach uses ES solely for filtering, returning only the unique sku_id (and ES doc_id), then retrieves remaining fields from HBase using the sku_id as the row key (O(1) lookup).

In a gray‑scale test on a 10 M‑scale pool, this method reduced worst‑case latency from 10–18 seconds to 3–6 seconds.

RediSearch + RedisJSON Optimization

RediSearch, a distributed full‑text search engine built on Redis, together with RedisJSON, offers high‑performance search on JSON documents. Benchmarks show RediSearch indexing is 58 % faster than Elasticsearch, and query throughput is up to 4 × higher with lower latency.

Performance data:

Indexing 5.6 M documents: RediSearch 221 s vs Elasticsearch 349 s.

Throughput: 12.5 K ops/s (RediSearch) vs 3.1 K ops/s (ES).

Latency: 8 ms (RediSearch) vs 10 ms (ES).

RedisJSON outperforms MongoDB and ES by large margins in isolated reads/writes.

Summary

The article evaluates four solutions for extracting 100 K records from a 10 M‑scale pool:

Multithreaded ClickHouse pagination – worst‑case 10 s ~ 18 s.

Single‑threaded ES scroll‑scan – no clear advantage over ClickHouse.

ES + HBase combined approach – worst‑case reduced to 3 s ~ 6 s.

RediSearch + RedisJSON – promising performance, to be validated in future tests.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

redisquery optimizationClickHouseHBase
Architect's Guide
Written by

Architect's Guide

Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.