Optimizing 100K Record Retrieval from 10M‑Row Pools: ClickHouse, ES Scroll, ES+HBase, RediSearch
This article examines several engineering solutions for extracting up to 100,000 records from a ten‑million‑row pool, comparing multi‑threaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑plus‑HBase hybrid, and RediSearch + RedisJSON, and presents performance measurements and practical trade‑offs.
Problem Statement
Retrieve at most 100 000 rows from a pool table containing tens of millions of records, then sort and shuffle the result according to configurable weight rules.
Initial Design – ClickHouse Multithreaded Pagination
1. A daily ETL job copies the Hive pool (≈10 M rows) into ClickHouse for low‑latency queries.
2. Business filter and sort rules are encapsulated in a SelectionQueryCondition object.
3. When reading from ClickHouse, a thread pool launches one task per page. Each task executes a paginated query and appends its result to a shared result list.
4. After all pages are collected, the result list is sorted to produce the final ordered output.
int pageSize = this.getPageSize();
int pageCnt = totalNum / pageSize + 1;
List<Map<String, Object>> result = new ArrayList<>();
List<Future<List<Map<String, Object>>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
SelectionQueryCondition cond = buildSelectionQueryCondition(ruleData);
cond.setPageSize(pageSize);
cond.setPage(i);
futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(cond)));
}
for (Future<List<Map<String, Object>>> f : futureList) {
List<Map<String, Object>> page = f.get(20, TimeUnit.SECONDS);
if (page != null && !page.isEmpty()) {
result.addAll(page);
}
}
// sort result according to weight rulesClickHouse pagination uses LIMIT #{limitStart}, #{limitEnd}. Deep pagination (e.g., retrieving 100 K rows from a 10 M pool) incurs 10–18 s latency, which is sub‑optimal.
Elasticsearch Scroll‑Scan Optimization
Elasticsearch offers several deep‑pagination strategies:
from + size
scroll
scroll‑scan
search‑after
Benchmark results (illustrated in the images) show that for result sets under ~30 K rows, scroll‑scan outperforms ClickHouse; for larger sets, ClickHouse multithreading is faster.
Performance Comparison
Tests on 100 K, 1 M and 10 M‑row pools show:
Latency grows with pool size for a fixed result size.
ES scroll‑scan is faster when the result set is ≤30 K rows.
ClickHouse multithreaded pagination is faster when the result set exceeds ~50 K rows.
Hybrid ES + HBase Query
To reduce the amount of data transferred from the pool, Elasticsearch is used only for filtering. The query returns the unique identifier sku_id (and the internal doc_id). HBase, a column‑store with O(1) row‑key lookup, retrieves the remaining fields (price, stock, etc.) using sku_id as the row key.
Gray‑testing on a 10 M‑row pool with a 100 K target set reduced worst‑case latency from 10–18 s (ClickHouse) to 3–6 s.
Note: This approach requires data duplication in both ES and HBase, increasing system complexity.
RediSearch + RedisJSON Optimization
RediSearch provides a distributed full‑text search engine on top of Redis, while RedisJSON adds native JSON storage and indexing. Together they enable very high‑throughput search operations.
RediSearch Performance
Indexing 5.6 M documents (5.3 GB) took 221 s with RediSearch versus 349 s with Elasticsearch (58 % faster). Under a 32‑client load, RediSearch achieved 12.5 K ops/sec versus 3.1 K ops/sec for Elasticsearch, with latency 8 ms vs 10 ms.
RedisJSON Performance
Isolated writes: 5.4× faster than MongoDB, >200× faster than Elasticsearch.
Isolated reads: 12.7× faster than MongoDB, >500× faster than Elasticsearch.
Mixed workloads: ~50× higher ops/sec than MongoDB and 7× higher than Elasticsearch; latency 90× lower than MongoDB and 23.7× lower than Elasticsearch.
RedisJSON maintains stable latency under high write ratios, whereas Elasticsearch throughput degrades.
Summary of Latency for 100 K Target Set
ClickHouse multithreaded pagination: 10 s – 18 s (worst case).
Elasticsearch scroll‑scan (single‑thread): comparable to ClickHouse, no clear advantage.
ES + HBase hybrid: 3 s – 6 s (worst case).
RediSearch + RedisJSON: promising benchmarks; real‑world latency pending.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
