Efficient 100K‑Record Queries on 10M‑Scale Data: ClickHouse, ES Scroll, ES+HBase
To retrieve up to 100 000 items from a pool of tens of millions, the article compares multi‑threaded ClickHouse pagination, Elasticsearch scroll‑scan deep paging, a combined ES‑HBase approach, and RediSearch + RedisJSON, detailing design, implementation code, performance benchmarks, and trade‑offs.
Introduction
When a business requirement demands selecting no more than 100 000 records from a pool of tens of millions and then sorting and shuffling them according to weighted rules, several implementation designs and optimizations are explored.
Initial Design
The overall design consists of two steps: first filter the "target data" from the pool based on configurable filter rules, then sort the target data according to configurable sort rules to obtain the "result data".
Technical steps:
Run a daily extraction task that imports the existing tens‑of‑millions‑row Hive pool into ClickHouse for later filtering.
Build a SelectionQueryCondition object that encapsulates filter and sort rules.
When fetching target data from the ClickHouse pool, enable multithreading and pagination, storing the retrieved records in a result list.
Sort the result list to produce the final result set.
// pagination size, default 5000
int pageSize = this.getPageSize();
// page count
int pageCnt = totalNum / this.getPageSize() + 1;
List<Map<String, Object>> result = Lists.newArrayList();
List<Future<List<Map<String, Object>>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
selectionQueryCondition.setPageSize(pageSize);
selectionQueryCondition.setPage(i);
futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}
for (Future<List<Map<String, Object>>> future : futureList) {
List<Map<String, Object>> queryRes = future.get(20, TimeUnit.SECONDS);
if (CollectionUtils.isNotEmpty(queryRes)) {
result.addAll(queryRes);
}
}ClickHouse Pagination
The multithreaded pagination method queries ClickHouse with LIMIT #{limitStart}, #{limitEnd}. In deep pagination scenarios, this approach can take 10–18 seconds for 10 M‑scale pools.
Elasticsearch Scroll‑Scan Optimization
To address ClickHouse deep‑paging performance issues, an Elasticsearch scroll‑scan pagination scheme is introduced. ES provides several pagination methods: from + size , scroll , scroll‑scan , and search after .
Performance tests show that for result sets under 30 K, ES outperforms ClickHouse, while for result sets above 50 K, ClickHouse’s multithreaded approach is faster.
ES + HBase Combined Query
Because each record can contain many fields (up to 150), reducing the number of fetched fields significantly cuts query time. The combined approach uses ES solely for filtering, returning only the unique sku_id (and ES doc_id), then retrieves remaining fields from HBase using the sku_id as the row key (O(1) lookup).
In a gray‑scale test on a 10 M‑scale pool, this method reduced worst‑case latency from 10–18 seconds to 3–6 seconds.
RediSearch + RedisJSON Optimization
RediSearch, a distributed full‑text search engine built on Redis, together with RedisJSON, offers high‑performance search on JSON documents. Benchmarks show RediSearch indexing is 58 % faster than Elasticsearch, and query throughput is up to 4 × higher with lower latency.
Performance data:
Indexing 5.6 M documents: RediSearch 221 s vs Elasticsearch 349 s.
Throughput: 12.5 K ops/s (RediSearch) vs 3.1 K ops/s (ES).
Latency: 8 ms (RediSearch) vs 10 ms (ES).
RedisJSON outperforms MongoDB and ES by large margins in isolated reads/writes.
Summary
The article evaluates four solutions for extracting 100 K records from a 10 M‑scale pool:
Multithreaded ClickHouse pagination – worst‑case 10 s ~ 18 s.
Single‑threaded ES scroll‑scan – no clear advantage over ClickHouse.
ES + HBase combined approach – worst‑case reduced to 3 s ~ 6 s.
RediSearch + RedisJSON – promising performance, to be validated in future tests.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architect's Guide
Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
