Big Data 12 min read

Optimizing 100K Record Retrieval from 10M‑Row Pools: ClickHouse, ES Scroll, ES+HBase, RediSearch

This article examines several engineering solutions for extracting up to 100,000 records from a ten‑million‑row pool, comparing multi‑threaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑plus‑HBase hybrid, and RediSearch + RedisJSON, and presents performance measurements and practical trade‑offs.

Code Ape Tech Column
Code Ape Tech Column
Code Ape Tech Column
Optimizing 100K Record Retrieval from 10M‑Row Pools: ClickHouse, ES Scroll, ES+HBase, RediSearch

Problem Statement

Retrieve at most 100 000 rows from a pool table containing tens of millions of records, then sort and shuffle the result according to configurable weight rules.

Initial Design – ClickHouse Multithreaded Pagination

1. A daily ETL job copies the Hive pool (≈10 M rows) into ClickHouse for low‑latency queries.

2. Business filter and sort rules are encapsulated in a SelectionQueryCondition object.

3. When reading from ClickHouse, a thread pool launches one task per page. Each task executes a paginated query and appends its result to a shared result list.

4. After all pages are collected, the result list is sorted to produce the final ordered output.

int pageSize = this.getPageSize();
int pageCnt = totalNum / pageSize + 1;
List<Map<String, Object>> result = new ArrayList<>();
List<Future<List<Map<String, Object>>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
    SelectionQueryCondition cond = buildSelectionQueryCondition(ruleData);
    cond.setPageSize(pageSize);
    cond.setPage(i);
    futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(cond)));
}
for (Future<List<Map<String, Object>>> f : futureList) {
    List<Map<String, Object>> page = f.get(20, TimeUnit.SECONDS);
    if (page != null && !page.isEmpty()) {
        result.addAll(page);
    }
}
// sort result according to weight rules

ClickHouse pagination uses LIMIT #{limitStart}, #{limitEnd}. Deep pagination (e.g., retrieving 100 K rows from a 10 M pool) incurs 10–18 s latency, which is sub‑optimal.

Elasticsearch Scroll‑Scan Optimization

Elasticsearch offers several deep‑pagination strategies:

from + size

scroll

scroll‑scan

search‑after

Benchmark results (illustrated in the images) show that for result sets under ~30 K rows, scroll‑scan outperforms ClickHouse; for larger sets, ClickHouse multithreading is faster.

Elasticsearch pagination options
Elasticsearch pagination options
Performance of ES pagination strategies
Performance of ES pagination strategies

Performance Comparison

Tests on 100 K, 1 M and 10 M‑row pools show:

Latency grows with pool size for a fixed result size.

ES scroll‑scan is faster when the result set is ≤30 K rows.

ClickHouse multithreaded pagination is faster when the result set exceeds ~50 K rows.

Latency comparison chart
Latency comparison chart

Hybrid ES + HBase Query

To reduce the amount of data transferred from the pool, Elasticsearch is used only for filtering. The query returns the unique identifier sku_id (and the internal doc_id). HBase, a column‑store with O(1) row‑key lookup, retrieves the remaining fields (price, stock, etc.) using sku_id as the row key.

ES + HBase architecture
ES + HBase architecture

Gray‑testing on a 10 M‑row pool with a 100 K target set reduced worst‑case latency from 10–18 s (ClickHouse) to 3–6 s.

Note: This approach requires data duplication in both ES and HBase, increasing system complexity.

RediSearch + RedisJSON Optimization

RediSearch provides a distributed full‑text search engine on top of Redis, while RedisJSON adds native JSON storage and indexing. Together they enable very high‑throughput search operations.

RediSearch Performance

Indexing 5.6 M documents (5.3 GB) took 221 s with RediSearch versus 349 s with Elasticsearch (58 % faster). Under a 32‑client load, RediSearch achieved 12.5 K ops/sec versus 3.1 K ops/sec for Elasticsearch, with latency 8 ms vs 10 ms.

RediSearch vs Elasticsearch indexing and throughput
RediSearch vs Elasticsearch indexing and throughput

RedisJSON Performance

Isolated writes: 5.4× faster than MongoDB, >200× faster than Elasticsearch.

Isolated reads: 12.7× faster than MongoDB, >500× faster than Elasticsearch.

Mixed workloads: ~50× higher ops/sec than MongoDB and 7× higher than Elasticsearch; latency 90× lower than MongoDB and 23.7× lower than Elasticsearch.

RedisJSON maintains stable latency under high write ratios, whereas Elasticsearch throughput degrades.

Summary of Latency for 100 K Target Set

ClickHouse multithreaded pagination: 10 s – 18 s (worst case).

Elasticsearch scroll‑scan (single‑thread): comparable to ClickHouse, no clear advantage.

ES + HBase hybrid: 3 s – 6 s (worst case).

RediSearch + RedisJSON: promising benchmarks; real‑world latency pending.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataElasticsearchClickHouseHBasePaginationRediSearch
Code Ape Tech Column
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.