Export 1 Billion Elasticsearch Docs in 3 Hours Using PIT + Slice

This guide explains how to reliably export over a billion Elasticsearch documents within a few hours using Point‑In‑Time (PIT) snapshots combined with parallel Slice processing. It covers diagnostics, performance modeling, consistency levels, failure recovery, and resource isolation.

Ray's Galactic Tech

Background and Problem

Exporting billions of records directly from a production Elasticsearch cluster often fails:

from/size beyond 10,000 documents hits the index.max_result_window limit, triggering errors or OOM.

The Scroll API is slow, holds search contexts open, and puts heavy load on the cluster.

Single‑threaded SearchAfter works in theory but can take 50+ hours, which is unacceptable.

How can we export massive data efficiently, consistently, and without impacting online services?

Interview‑Style Answer Structure (Three Layers)

1️⃣ Quick Diagnosis

- Total data volume? Document size?
- One‑time or periodic export?
- Consistency requirement? Strong or eventual?
- Available CPU / memory / network?
- Is it allowed to affect live ES query performance?

2️⃣ Solution Evolution

1. Break the limit: avoid from/size.
2. Evolve: Scroll → SearchAfter.
3. Final: PIT + Slice parallelism.

3️⃣ Solution Comparison

Scroll (data < 10M): 5‑10 h, high context overhead, ★★.

Single‑threaded SearchAfter (data < 50M): 50 h+, no parallelism, ★★.

PIT + Slice parallel (data > 1B): 2‑3 h, requires throttling, ★★★★★.

Core Solution: PIT + Slice Parallel Export

1️⃣ Create PIT

CreatePitRequest request = new CreatePitRequest("my_index");
request.keepAlive(TimeValue.timeValueMinutes(30)); // snapshot stays valid for 30 min
CreatePitResponse response = client.createPit(request, RequestOptions.DEFAULT);
String pitId = response.getId(); // pass this id to every subsequent search

Creating the PIT freezes a point‑in‑time view of the index, so the export reads a consistent snapshot regardless of concurrent writes.

2️⃣ Slice Parallel Strategy

int primaryShards = 10;
int slices = Math.min(primaryShards * 2, 20); // 1-2 slices per primary shard, capped

Guideline: use 1‑2 slices per primary shard, with a total upper limit of 20‑32 slices.

3️⃣ Search Configuration

SearchSourceBuilder sourceBuilder = new SearchSourceBuilder()
    .size(5000)
    .sort("_shard_doc") // required for PIT + SearchAfter
    .fetchSource(new String[]{"id","create_time","amount"}, null)
    .timeout(TimeValue.timeValueMinutes(5));

The sort field for PIT + SearchAfter must be _shard_doc; _doc is used only with Scroll.

4️⃣ Slice + SearchAfter Loop

SliceBuilder slice = new SliceBuilder(sliceId, maxSlices);
// slice and PIT are set on SearchSourceBuilder, not on SearchRequest
sourceBuilder.slice(slice)
    .pointInTimeBuilder(new PointInTimeBuilder(pitId)
        .setKeepAlive(TimeValue.timeValueMinutes(5)));

while (true) {
    // a PIT search must not name indices; the PIT id already pins them
    SearchRequest request = new SearchRequest().source(sourceBuilder);
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    SearchHit[] hits = response.getHits().getHits();
    if (hits.length == 0) break; // this slice is fully drained
    writeToFile(response.getHits());
    // carry the last hit's sort values into the next page
    sourceBuilder.searchAfter(hits[hits.length - 1].getSortValues());
}
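
To run all slices concurrently, wrap the loop above in a per‑slice worker and fan it out over a thread pool. A minimal sketch, assuming a hypothetical exportSlice(sliceId, maxSlices) method that builds its own SearchSourceBuilder so each slice keeps an independent searchAfter cursor:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative driver: one worker per slice, all sharing the same PIT id.
void runAllSlices(int slices) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(slices);
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < slices; i++) {
        final int sliceId = i;
        futures.add(pool.submit(() -> { exportSlice(sliceId, slices); return null; }));
    }
    for (Future<?> f : futures) {
        f.get(); // propagates the first slice failure, if any
    }
    pool.shutdown();
}

Because every worker reuses one PIT id, all slices read the same snapshot, and a failed slice can be retried independently.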

Index Design Impact

Field mappings largely determine export speed. Rough impact by field type:

keyword / date fields – optimal.

text fields – slow and large.

nested fields – performance disaster.

Consistency Level Design

Strong consistency (no data loss tolerated) – use PIT.

Eventual consistency (seconds‑level lag acceptable) – use plain SearchAfter.

Report‑level consistency (reporting) – partition the export by time; no strict snapshot needed (see the sketch below).
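
For the report‑level tier, a time‑partitioned export can simply range‑scan one partition at a time. A sketch assuming the create_time field from the search configuration above (the date bounds are placeholders):

import org.elasticsearch.index.query.QueryBuilders;

// Hypothetical daily partition: one bounded range query per day, no snapshot required.
SearchSourceBuilder daily = new SearchSourceBuilder()
    .query(QueryBuilders.rangeQuery("create_time")
        .gte("2024-01-01").lt("2024-01-02"))
    .size(5000)
    .sort("create_time");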

Performance and Capacity Model

Assumptions: 1 KB per document → 1 TB raw for 1 B docs; gzip reduces to ~300 GB.

Network: 1 Gbps ≈ 125 MB/s → theoretical 40 min for 300 GB; real‑world 2‑3 h.
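
A quick back‑of‑envelope check of those figures, using only the stated assumptions:

// 1B docs x 1 KB each, ~3:1 gzip, 1 Gbps link
long docs = 1_000_000_000L;
long rawBytes = docs * 1_024L;               // ~1 TB raw
long gzBytes = rawBytes * 3 / 10;            // ~300 GB compressed
double bytesPerSec = 125_000_000.0;          // 1 Gbps ≈ 125 MB/s
double minutes = gzBytes / bytesPerSec / 60; // ≈ 41 min, the theoretical floor

The gap to the real‑world 2‑3 h comes from ES fetch latency, compression CPU, throttling, and retries.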

Progress Tracking & Resume (Redis)

{
  "status": "running|completed|failed",
  "exported": 12000000,
  "searchAfter": [...]
}

On failure, resume directly from the last searchAfter value without a full restart.
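
A minimal checkpoint writer, assuming Jedis and an illustrative per‑slice key layout of export:{taskId}:slice:{sliceId}:

import redis.clients.jedis.Jedis;

// Persist the cursor after each page. Numeric _shard_doc sort values serialize
// cleanly via Arrays.toString; use a real JSON library for anything richer.
void saveCheckpoint(Jedis jedis, String taskId, int sliceId,
                    long exported, Object[] searchAfter) {
    String key = "export:" + taskId + ":slice:" + sliceId;
    jedis.set(key, String.format(
        "{\"status\":\"running\",\"exported\":%d,\"searchAfter\":%s}",
        exported, java.util.Arrays.toString(searchAfter)));
}

On restart, each slice reads its key back and seeds sourceBuilder.searchAfter(...) with the stored cursor before re‑entering the loop.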

Retry with Exponential Backoff

long retryDelayMs = (1L << retryCount) * 1000; // 2^retryCount seconds, in ms

Delays: 1→2 s, 2→4 s, 3→8 s, 4→16 s, etc.
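
A minimal retry wrapper implementing this schedule (the helper name is illustrative):

import java.util.concurrent.Callable;

// Runs one export page, retrying transient failures with exponential backoff.
<T> T withRetry(Callable<T> action, int maxRetries) throws Exception {
    for (int attempt = 1; ; attempt++) {
        try {
            return action.call();
        } catch (Exception e) {
            if (attempt > maxRetries) throw e;     // retries exhausted, surface the error
            Thread.sleep((1L << attempt) * 1000L); // 2 s, 4 s, 8 s, 16 s, ...
        }
    }
}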

Resource Isolation & Business Protection

network:
  throttle_mbps: 100
es:
  search_priority: low
storage:
  local_ssd: true

Data export may be slower, but online services must remain stable.
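
On the client side, the throttle_mbps cap can be enforced with a token‑bucket limiter. A sketch assuming Guava's RateLimiter and interpreting the cap as megabytes per second (writeChunkToFile is a hypothetical byte‑level sink):

import com.google.common.util.concurrent.RateLimiter;

// Permits are bytes per second; 100 * 1024 * 1024 matches throttle_mbps: 100 above.
RateLimiter limiter = RateLimiter.create(100.0 * 1024 * 1024);

void writeThrottled(byte[] chunk) {
    limiter.acquire(chunk.length); // blocks until the bandwidth budget allows this chunk
    writeChunkToFile(chunk);       // hypothetical sink
}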

Data‑Security Boundaries

Mis‑export prevention – dual‑approval workflow.

Sensitive fields – whitelist only.

File leakage – AES encryption (see the sketch after this list).

Host compromise – isolate in a dedicated subnet.
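
For the file‑leakage boundary, a minimal AES‑GCM sketch using the standard Java crypto API (key management is omitted; encryptChunk is an illustrative helper):

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// Encrypts one export chunk; the random IV is prepended so a reader can decrypt.
byte[] encryptChunk(SecretKey key, byte[] plain) throws Exception {
    byte[] iv = new byte[12];
    new SecureRandom().nextBytes(iv);
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
    byte[] ciphertext = cipher.doFinal(plain);
    byte[] out = new byte[iv.length + ciphertext.length];
    System.arraycopy(iv, 0, out, 0, iv.length);
    System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
    return out;
}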

Platform‑Level Design (Enterprise‑Scale)

Modules: Task Management, Permission Approval, PIT Management, Slice Scheduling, Checkpoint‑Resume, Audit Logging.

Why PIT Beats Scroll?

PIT keeps a lightweight, reusable view of the index instead of tying a heavy search context to a single query.

Multiple searches can share one PIT, which avoids context accumulation.

PIT natively supports Slice for parallelism.

PIT expires automatically after its keep‑alive, so no manual cleanup is strictly required (closing it early simply frees resources sooner).

One‑Line Summary

This is a complete transformation from a simple export script to an enterprise‑grade Elasticsearch offline data extraction platform.