Design and Optimization of Querying 100k Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and Redis
This article presents a comprehensive analysis and multiple design alternatives—including multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, ES+HBase hybrid, and RediSearch+RedisJSON—to efficiently filter, sort, and de‑duplicate up to 100,000 records from a pool of tens of millions, with detailed performance comparisons and code examples.
The author describes a business requirement to select no more than 100 K records from a pool of tens of millions and to sort and de‑duplicate them according to configurable weight rules.
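The sort and de-duplicate step can be sketched as follows. This is only an illustration: the Item shape, its field names (skuId, sales, rating), and the linear weighted score are assumptions, since the source does not show the actual weight rules.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WeightedSelection {
    /** Hypothetical record shape; the field names are illustrative, not from the source. */
    static final class Item {
        final String skuId;
        final double sales;
        final double rating;

        Item(String skuId, double sales, double rating) {
            this.skuId = skuId;
            this.sales = sales;
            this.rating = rating;
        }
    }

    /** Linear score; the two weights stand in for the configurable rules (an assumption). */
    static double score(Item item, double salesWeight, double ratingWeight) {
        return item.sales * salesWeight + item.rating * ratingWeight;
    }

    /** Sorts by descending score, keeps the best-scoring row per skuId, and caps the result at limit. */
    static List<Item> select(List<Item> pool, double salesWeight, double ratingWeight, int limit) {
        List<Item> sorted = new ArrayList<>(pool);
        sorted.sort(Comparator.comparingDouble(
                (Item i) -> score(i, salesWeight, ratingWeight)).reversed());

        // LinkedHashMap preserves the sorted order while dropping later duplicates.
        Map<String, Item> bestPerSku = new LinkedHashMap<>();
        for (Item item : sorted) {
            if (bestPerSku.size() >= limit) {
                break;
            }
            bestPerSku.putIfAbsent(item.skuId, item);
        }
        return new ArrayList<>(bestPerSku.values());
    }
}
```

Sorting before de-duplicating means the surviving row for each skuId is always its highest-scoring one; in the scenario described, limit would be 100,000.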
Four initial solution ideas are listed: multithreaded ClickHouse pagination, Elasticsearch scroll‑scan deep pagination, an ES+HBase combination, and a RediSearch+RedisJSON combination.
Initial design with ClickHouse: a daily job imports the selection pool from Hive into ClickHouse. At query time, the code builds a SelectionQueryCondition object from the business rules and uses multithreaded pagination to fetch the target data into a result list.
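    int pageSize = this.getPageSize();
    // Ceiling division: avoids submitting an extra, empty page when totalNum is an exact multiple of pageSize.
    int pageCnt = (totalNum + pageSize - 1) / pageSize;
    List<Map<String, Object>> result = Lists.newArrayList();
    List<Future<List<Map<String, Object>>>> futureList = new ArrayList<>(pageCnt);
    // Submit one query task per page; page numbers are 1-based.
    for (int i = 1; i <= pageCnt; i++) {
        SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
        selectionQueryCondition.setPageSize(pageSize);
        selectionQueryCondition.setPage(i);
        futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
    }
    // Collect pages in submission order; each wait is capped at 20 seconds.
    for (Future<List<Map<String, Object>>> future : futureList) {
        // get() can throw InterruptedException, ExecutionException, or TimeoutException; the caller must handle them.
        List<Map<String, Object>> queryRes = future.get(20, TimeUnit.SECONDS);
        if (CollectionUtils.isNotEmpty(queryRes)) {
            result.addAll(queryRes);
        }
    }

The generated SQL paginates with LIMIT #{limitStart}, #{limitEnd}. In ClickHouse's LIMIT n, m form the second value is a row count rather than an end position, so despite its name #{limitEnd} should carry the page size. With deep pagination this approach can take 10–18 seconds to return 100 K results from a 10 M-row pool, since each query still has to read past every row before its offset.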
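The page arithmetic behind this loop can be sketched as below. The PageMath class and its method names are hypothetical helpers, not part of the original code; the sketch only assumes 1-based page numbers and the LIMIT offset, count form.

```java
final class PageMath {
    /** Ceiling division: the number of pages needed to cover totalNum rows. */
    static int pageCount(int totalNum, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (totalNum + pageSize - 1) / pageSize;
    }

    /** Rows to skip for a 1-based page number; long avoids int overflow on deep pages. */
    static long offset(int page, int pageSize) {
        return (long) (page - 1) * pageSize;
    }

    /** Renders the clause in the "LIMIT offset, count" form. */
    static String limitClause(int page, int pageSize) {
        return "LIMIT " + offset(page, pageSize) + ", " + pageSize;
    }
}
```

Ceiling division avoids scheduling an extra, empty page when the total is an exact multiple of the page size: 100,000 rows at 1,000 per page need exactly 100 queries.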
Elasticsearch scroll-scan optimization: the author evaluates four ES pagination methods (from+size, scroll, scroll-scan, search-after). The performance tables show that scroll-scan improves deep pagination but is still slower than the multithreaded ClickHouse approach.
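To show why cursor-style methods such as search-after cope better with deep pages than from+size, here is an in-memory illustration. It is not Elasticsearch client code: the CursorPaging class is hypothetical, and a sorted list of unique ids stands in for the index. Each page is located from the last sort value of the previous page instead of by skipping a growing offset.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CursorPaging {
    /**
     * Returns up to size ids strictly greater than afterId.
     * Assumes sortedIds is ascending and contains no duplicates.
     */
    static List<Long> nextPage(List<Long> sortedIds, long afterId, int size) {
        int pos = Collections.binarySearch(sortedIds, afterId);
        // Found: start just after it. Not found: start at the insertion point.
        int start = pos >= 0 ? pos + 1 : -(pos + 1);
        int end = Math.min(sortedIds.size(), start + size);
        return new ArrayList<>(sortedIds.subList(start, end));
    }

    /** Reads every id page by page, carrying the last id of each page as the cursor. */
    static List<Long> readAll(List<Long> sortedIds, int size) {
        List<Long> out = new ArrayList<>();
        long cursor = Long.MIN_VALUE; // assumes no real id equals Long.MIN_VALUE
        while (true) {
            List<Long> page = nextPage(sortedIds, cursor, size);
            if (page.isEmpty()) {
                return out;
            }
            out.addAll(page);
            cursor = page.get(page.size() - 1);
        }
    }
}
```

The cost of locating a page here does not grow with how many pages were already read, which is the property that from+size lacks. The trade-off is that pages must be read sequentially, so this style does not parallelize across page numbers the way the multithreaded offset queries do.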
ES+HBase hybrid solution: ES is used only for filtering and returns just the unique sku_id of each match. HBase, a wide-column store with fast row-key lookups, then retrieves the remaining fields by key. This reduces fetch time dramatically: in testing on a 10 M-row pool, worst-case latency dropped from 10–18 s to 3–6 s.
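The two-stage flow can be sketched with in-memory stand-ins. Nothing here uses the real ES or HBase clients: the TwoStageFetch class, its method names, and the map-based stores are assumptions for illustration. Stage 1 plays the role of the search index and returns ids only; stage 2 plays the role of the row-key store and fetches full rows by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class TwoStageFetch {
    /** Stage 1 (index stand-in): evaluates the filter and returns only matching sku_id values. */
    static List<String> filterIds(Map<String, Map<String, Object>> indexDocs,
                                  Predicate<Map<String, Object>> condition) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> entry : indexDocs.entrySet()) {
            if (condition.test(entry.getValue())) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    /** Stage 2 (row-key store stand-in): one keyed lookup per id; ids missing from the store are skipped. */
    static List<Map<String, Object>> fetchRows(List<String> skuIds,
                                               Map<String, Map<String, Object>> rowStore) {
        List<Map<String, Object>> rows = new ArrayList<>(skuIds.size());
        for (String skuId : skuIds) {
            Map<String, Object> row = rowStore.get(skuId);
            if (row != null) {
                rows.add(row);
            }
        }
        return rows;
    }
}
```

The point of the split is that the index only needs to hold the filterable fields, while the wide rows live in the key-value store and are read by key after filtering.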
RediSearch+RedisJSON: the article introduces RediSearch (full-text search and indexing on Redis) and RedisJSON (native JSON document support) as a high-performance alternative. The benchmarks it cites indicate that RediSearch builds indexes 58 % faster than ES, achieves 4× higher query throughput, and offers lower latency, and that RedisJSON outperforms MongoDB and ES by large margins in isolated read and write tests.
Overall, the comparison of the four approaches demonstrates trade‑offs between implementation complexity, storage duplication, and query latency, guiding practitioners to choose the most suitable stack for large‑scale data selection tasks.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
