Big Data 12 min read

Optimizing 100K Record Retrieval from 10M‑Row Pools: ClickHouse, ES Scroll, ES+HBase, RediSearch

This article examines several engineering solutions for extracting up to 100,000 records from a ten‑million‑row pool, comparing multi‑threaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑plus‑HBase hybrid, and RediSearch + RedisJSON, and presents performance measurements and practical trade‑offs.

Code Ape Tech Column

Dec 5, 2025

Optimizing 100K Record Retrieval from 10M‑Row Pools: ClickHouse, ES Scroll, ES+HBase, RediSearch

Problem Statement

Retrieve at most 100 000 rows from a pool table containing tens of millions of records, then sort and shuffle the result according to configurable weight rules.

Initial Design – ClickHouse Multithreaded Pagination

1. A daily ETL job copies the Hive pool (≈10 M rows) into ClickHouse for low‑latency queries.

2. Business filter and sort rules are encapsulated in a SelectionQueryCondition object.

3. When reading from ClickHouse, a thread pool launches one task per page. Each task executes a paginated query and appends its result to a shared result list.

4. After all pages are collected, the result list is sorted to produce the final ordered output.

int pageSize = this.getPageSize();
int pageCnt = totalNum / pageSize + 1;
List<Map<String, Object>> result = new ArrayList<>();
List<Future<List<Map<String, Object>>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
    SelectionQueryCondition cond = buildSelectionQueryCondition(ruleData);
    cond.setPageSize(pageSize);
    cond.setPage(i);
    futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(cond)));
}
for (Future<List<Map<String, Object>>> f : futureList) {
    List<Map<String, Object>> page = f.get(20, TimeUnit.SECONDS);
    if (page != null && !page.isEmpty()) {
        result.addAll(page);
    }
}
// sort result according to weight rules

ClickHouse pagination uses LIMIT #{limitStart}, #{limitEnd}. Deep pagination (e.g., retrieving 100 K rows from a 10 M pool) incurs 10–18 s latency, which is sub‑optimal.

Elasticsearch Scroll‑Scan Optimization

Elasticsearch offers several deep‑pagination strategies:

from + size

scroll

scroll‑scan

search‑after

Benchmark results (illustrated in the images) show that for result sets under ~30 K rows, scroll‑scan outperforms ClickHouse; for larger sets, ClickHouse multithreading is faster.

Performance Comparison

Tests on 100 K, 1 M and 10 M‑row pools show:

Latency grows with pool size for a fixed result size.

ES scroll‑scan is faster when the result set is ≤30 K rows.

ClickHouse multithreaded pagination is faster when the result set exceeds ~50 K rows.

Hybrid ES + HBase Query

To reduce the amount of data transferred from the pool, Elasticsearch is used only for filtering. The query returns the unique identifier sku_id (and the internal doc_id). HBase, a column‑store with O(1) row‑key lookup, retrieves the remaining fields (price, stock, etc.) using sku_id as the row key.

Gray‑testing on a 10 M‑row pool with a 100 K target set reduced worst‑case latency from 10–18 s (ClickHouse) to 3–6 s.

Note: This approach requires data duplication in both ES and HBase, increasing system complexity.

RediSearch + RedisJSON Optimization

RediSearch provides a distributed full‑text search engine on top of Redis, while RedisJSON adds native JSON storage and indexing. Together they enable very high‑throughput search operations.

RediSearch Performance

Indexing 5.6 M documents (5.3 GB) took 221 s with RediSearch versus 349 s with Elasticsearch (58 % faster). Under a 32‑client load, RediSearch achieved 12.5 K ops/sec versus 3.1 K ops/sec for Elasticsearch, with latency 8 ms vs 10 ms.

RediSearch vs Elasticsearch indexing and throughput

RedisJSON Performance

Isolated writes: 5.4× faster than MongoDB, >200× faster than Elasticsearch.

Isolated reads: 12.7× faster than MongoDB, >500× faster than Elasticsearch.

Mixed workloads: ~50× higher ops/sec than MongoDB and 7× higher than Elasticsearch; latency 90× lower than MongoDB and 23.7× lower than Elasticsearch.

RedisJSON maintains stable latency under high write ratios, whereas Elasticsearch throughput degrades.

Summary of Latency for 100 K Target Set

ClickHouse multithreaded pagination: 10 s – 18 s (worst case).

Elasticsearch scroll‑scan (single‑thread): comparable to ClickHouse, no clear advantage.

ES + HBase hybrid: 3 s – 6 s (worst case).

RediSearch + RedisJSON: promising benchmarks; real‑world latency pending.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

big data Elasticsearch clickhouse HBase pagination RediSearch

Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Problem Statement

Initial Design – ClickHouse Multithreaded Pagination

Elasticsearch Scroll‑Scan Optimization

Performance Comparison

Hybrid ES + HBase Query

RediSearch + RedisJSON Optimization

RediSearch Performance

RedisJSON Performance

Summary of Latency for 100 K Target Set

Code Ape Tech Column

How this landed with the community

Was this worth your time?

0 Comments

Hybrid ES + HBase Query

RediSearch + RedisJSON Optimization

Summary of Latency for 100 K Target Set