Design and Optimization of Querying 100k Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and RediSearch
This article presents a business-driven requirement to extract no more than 100,000 records from a pool of tens of millions, evaluates four technical solutions—including multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑HBase hybrid, and RediSearch + RedisJSON—provides implementation details, performance measurements, and practical recommendations for large‑scale data querying.
The business scenario requires filtering up to 100 K items from a pool of tens of millions and then sorting and shuffling them according to configurable weight rules, ensuring that items from the same category do not appear consecutively three times.
Four candidate solutions are proposed: (1) multithreaded pagination on ClickHouse (CK), (2) deep pagination using Elasticsearch scroll‑scan, (3) a combination of Elasticsearch for filtering and HBase for field retrieval, and (4) RediSearch + RedisJSON.
Initial design follows a two‑step process: first filter the pool according to business rules, then sort the result. The pipeline runs a daily job that imports the Hive pool into ClickHouse, builds a SelectionQueryCondition object from the configuration, and uses multiple threads to page through CK, collecting matching rows into a result list.
Example Java pagination code:
// page size (default 5000)
int pageSize = this.getPageSize();
// total pages
int pageCnt = totalNum / this.getPageSize() + 1;
List
> result = Lists.newArrayList();
List
>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
selectionQueryCondition.setPageSize(pageSize);
selectionQueryCondition.setPage(i);
futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}
for (Future
>> future : futureList) {
List
> queryRes = future.get(20, TimeUnit.SECONDS);
if (CollectionUtils.isNotEmpty(queryRes)) {
result.addAll(queryRes);
}
}The ClickHouse query method encapsulated in MyBatis is shown below:
public List
> queryPoolSkuList(Map
params) {
List
> resultMaps = new ArrayList<>();
QueryCondition queryCondition = parseQueryCondition(params);
List
> mapList = lianNuDao.queryPoolSkuList(getCkDt(), queryCondition);
if (CollectionUtils.isNotEmpty(mapList)) {
for (Map
data : mapList) {
resultMaps.add(camelKey(data));
}
}
return resultMaps;
}
@Autocomplete
private SqlSession sqlSession;
public List
> queryPoolSkuList(String dt, QueryCondition queryCondition) {
queryCondition.setDt(dt);
queryCondition.checkMultiQueryItems();
return sqlSession.selectList("LianNu.queryPoolSkuList", queryCondition);
}ClickHouse pagination uses LIMIT #{limitStart}, #{limitEnd} . In deep pagination tests on a 10 M‑row pool, the worst‑case latency ranged from 10 s to 18 s.
To address this, Elasticsearch scroll‑scan pagination was evaluated. ES offers several pagination strategies: from + size , scroll , scroll‑scan , and search‑after . Performance tests showed that for result sets under 30 K, ES outperforms CK, while for larger result sets (over 50 K) CK’s multithreaded approach is faster.
Further analysis revealed that the fetch phase dominates latency when many fields are retrieved. Reducing the number of returned fields (e.g., from 100 to 2) cut query time from 9.3 s to 2.4 s on the same dataset.
The ES + HBase hybrid leverages ES solely for ID filtering (returning only sku_id and doc_id ) and uses HBase’s O(1) row‑key lookup to fetch the remaining attributes. In a gray‑scale production test on a 10 M pool, the worst‑case latency improved to 3 s–6 s, at the cost of added system complexity and dual data storage.
RediSearch + RedisJSON was also examined. Benchmarks indicate that RediSearch builds indexes 58 % faster than ES (221 s vs 349 s for 5.6 M documents) and achieves 4× higher query throughput (12.5 K ops/s vs 3.1 K ops/s). RedisJSON outperforms MongoDB and ES by large margins in both isolated reads and writes, and maintains low latency under mixed workloads.
Summary : For extracting 100 K records from a 10 M‑row pool, the observed worst‑case latencies are: • Multithreaded CK pagination – 10 s ~ 18 s • Single‑thread ES scroll‑scan – similar to CK, no clear gain • ES + HBase hybrid – 3 s ~ 6 s • RediSearch + RedisJSON – pending real‑world timing but shows promising performance.
The article concludes with a reminder to support the author’s future work.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.