Query 100K Items from 10M+ Records: CK, ES Scroll, HBase, RediSearch
When faced with a business requirement to filter up to 100 000 records from a pool of tens of millions and then sort and de‑duplicate them, this article explores four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, a combined Elasticsearch‑HBase approach, and RediSearch with RedisJSON—detailing their design, implementation, performance testing, and trade‑offs.
Problem statement
The business requirement is to retrieve no more than 100 000 items from a pool containing tens of millions of rows, then apply weight‑based sorting and a de‑duplication rule that prevents three consecutive items of the same category from appearing.
Baseline solution – multithreaded ClickHouse pagination
Data are exported nightly from Hive to a ClickHouse (CK) table. A SelectionQueryCondition object holds filter and sort criteria. The implementation launches a thread pool, queries each page with LIMIT #{limitStart}, #{limitEnd}, collects the results into a List<Map<String,Object>>, and finally sorts the aggregated list.
int pageSize = this.getPageSize();
int pageCnt = totalNum / pageSize + 1;
List<Map<String,Object>> result = Lists.newArrayList();
List<Future<List<Map<String,Object>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
SelectionQueryCondition cond = buildSelectionQueryCondition(ruleData);
cond.setPageSize(pageSize);
cond.setPage(i);
futureList.add(selectionEventPool.submit(new QuerySelectionDataThread(cond)));
}
for (Future<List<Map<String,Object>>> f : futureList) {
List<Map<String,Object>> part = f.get(20, TimeUnit.SECONDS);
if (CollectionUtils.isNotEmpty(part)) {
result.addAll(part);
}
}Although multithreading reduces latency, deep‑page LIMIT queries degrade sharply, taking 10–18 seconds for a 10 M‑row pool.
Elasticsearch pagination strategies
from + size
scroll
scroll‑scan
search‑after
Single‑threaded ES scroll‑scan
Using the scroll‑scan API in a single thread yields latency comparable to the CK solution, indicating that parallelism is essential for any advantage.
Hybrid ES + HBase query
To avoid deep‑page penalties, a two‑stage scheme is introduced:
Elasticsearch performs only the filter step and returns the unique identifier sku_id (and the internal doc_id).
HBase stores the full attribute set keyed by sku_id. A lookup by sku_id is O(1), equivalent to a HashMap get.
The IDs from ES are used as HBase row keys to fetch required fields (price, stock, etc.) in constant time.
Grey‑scale testing on a 10 M‑row pool reduced worst‑case latency from 10–18 s (CK) to 3–6 s (ES + HBase).
Key observations from ES queries
The fetch phase dominates latency when many fields are requested.
Reducing the number of returned fields (e.g., from 32 to 1 identifier) cuts query time dramatically (9.3 s → 2.4 s in the test).
RediSearch + RedisJSON evaluation
RediSearch is a distributed full‑text search engine built on Redis; RedisJSON adds native JSON storage. Reported benchmark figures (same hardware) are:
Indexing 5.6 M documents: RediSearch 221 s vs. Elasticsearch 349 s (≈58 % faster).
Search throughput: 12.5 K ops/s (RediSearch) vs. 3.1 K ops/s (Elasticsearch) – about 4×.
Average latency: 8 ms (RediSearch) vs. 10 ms (Elasticsearch).
Isolated writes: RedisJSON 5.4× faster than MongoDB and >200× faster than Elasticsearch.
Isolated reads: RedisJSON 12.7× faster than MongoDB and >500× faster than Elasticsearch.
Under mixed workloads RedisJSON maintains higher throughput and lower latency than both MongoDB and Elasticsearch.
The stack promises higher performance but adds system complexity and requires data duplication across ES and Redis.
Overall performance comparison (worst‑case latency for 10 M‑row pool, 100 k target rows)
Multithreaded ClickHouse pagination: 10 s – 18 s.
Single‑threaded ES scroll‑scan: similar to CK, no clear gain.
ES + HBase hybrid: 3 s – 6 s.
RediSearch + RedisJSON: real‑world latency not yet measured; benchmark data suggest potential sub‑second response.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
