Big Data 12 min read

Query 100K Items from 10M+ Records: CK, ES Scroll, HBase, RediSearch

When faced with a business requirement to filter up to 100 000 records from a pool of tens of millions and then sort and de‑duplicate them, this article explores four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, a combined Elasticsearch‑HBase approach, and RediSearch with RedisJSON—detailing their design, implementation, performance testing, and trade‑offs.

ITPUB
ITPUB
ITPUB
Query 100K Items from 10M+ Records: CK, ES Scroll, HBase, RediSearch

Problem statement

The business requirement is to retrieve no more than 100 000 items from a pool containing tens of millions of rows, then apply weight‑based sorting and a de‑duplication rule that prevents three consecutive items of the same category from appearing.

Baseline solution – multithreaded ClickHouse pagination

Data are exported nightly from Hive to a ClickHouse (CK) table. A SelectionQueryCondition object holds filter and sort criteria. The implementation launches a thread pool, queries each page with LIMIT #{limitStart}, #{limitEnd}, collects the results into a List<Map<String,Object>>, and finally sorts the aggregated list.

int pageSize = this.getPageSize();
int pageCnt = totalNum / pageSize + 1;
List<Map<String,Object>> result = Lists.newArrayList();
List<Future<List<Map<String,Object>>> futureList = new ArrayList<>(pageCnt);
for (int i = 1; i <= pageCnt; i++) {
    SelectionQueryCondition cond = buildSelectionQueryCondition(ruleData);
    cond.setPageSize(pageSize);
    cond.setPage(i);
    futureList.add(selectionEventPool.submit(new QuerySelectionDataThread(cond)));
}
for (Future<List<Map<String,Object>>> f : futureList) {
    List<Map<String,Object>> part = f.get(20, TimeUnit.SECONDS);
    if (CollectionUtils.isNotEmpty(part)) {
        result.addAll(part);
    }
}

Although multithreading reduces latency, deep‑page LIMIT queries degrade sharply, taking 10–18 seconds for a 10 M‑row pool.

Elasticsearch pagination strategies

from + size

scroll

scroll‑scan

search‑after

Single‑threaded ES scroll‑scan

Using the scroll‑scan API in a single thread yields latency comparable to the CK solution, indicating that parallelism is essential for any advantage.

Hybrid ES + HBase query

To avoid deep‑page penalties, a two‑stage scheme is introduced:

Elasticsearch performs only the filter step and returns the unique identifier sku_id (and the internal doc_id).

HBase stores the full attribute set keyed by sku_id. A lookup by sku_id is O(1), equivalent to a HashMap get.

The IDs from ES are used as HBase row keys to fetch required fields (price, stock, etc.) in constant time.

Grey‑scale testing on a 10 M‑row pool reduced worst‑case latency from 10–18 s (CK) to 3–6 s (ES + HBase).

Key observations from ES queries

The fetch phase dominates latency when many fields are requested.

Reducing the number of returned fields (e.g., from 32 to 1 identifier) cuts query time dramatically (9.3 s → 2.4 s in the test).

RediSearch + RedisJSON evaluation

RediSearch is a distributed full‑text search engine built on Redis; RedisJSON adds native JSON storage. Reported benchmark figures (same hardware) are:

Indexing 5.6 M documents: RediSearch 221 s vs. Elasticsearch 349 s (≈58 % faster).

Search throughput: 12.5 K ops/s (RediSearch) vs. 3.1 K ops/s (Elasticsearch) – about 4×.

Average latency: 8 ms (RediSearch) vs. 10 ms (Elasticsearch).

Isolated writes: RedisJSON 5.4× faster than MongoDB and >200× faster than Elasticsearch.

Isolated reads: RedisJSON 12.7× faster than MongoDB and >500× faster than Elasticsearch.

Under mixed workloads RedisJSON maintains higher throughput and lower latency than both MongoDB and Elasticsearch.

The stack promises higher performance but adds system complexity and requires data duplication across ES and Redis.

Overall performance comparison (worst‑case latency for 10 M‑row pool, 100 k target rows)

Multithreaded ClickHouse pagination: 10 s – 18 s.

Single‑threaded ES scroll‑scan: similar to CK, no clear gain.

ES + HBase hybrid: 3 s – 6 s.

RediSearch + RedisJSON: real‑world latency not yet measured; benchmark data suggest potential sub‑second response.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataElasticsearchredisquery optimizationclickhouseHBase
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.