
Boost Elasticsearch Performance: Bulk API, Gateway & Caching Secrets

This article explains how to dramatically improve Elasticsearch throughput by using the bulk API, tuning bulk request sizes, configuring gateway settings, optimizing cluster state updates, managing caches, leveraging fielddata and doc values, and employing tools like Curator and the Profiler for efficient cluster operations.


Batch Submission

In the CRUD chapter we learned how data is written to Elasticsearch. Simple programs that index documents one by one achieve only a few hundred writes per second, far from Elasticsearch's potential. Each document requires a full HTTP POST request, which is inefficient, so Elasticsearch provides a bulk API for batch indexing and an mget API for batch reads.

The bulk request uses a line‑delimited format where each action line (metadata) is followed by the source document line. This format allows ES nodes to process each line without parsing a full JSON array, reducing memory usage and GC pressure. Production tools like Logstash, rsyslog, and Spark use the bulk API by default. For custom programs, Perl's Search::Elasticsearch::Bulk or Python's elasticsearch.helpers libraries are recommended.
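As an illustrative sketch (not the helper libraries' actual implementation), assembling a line-delimited bulk body by hand looks like this; the index name and documents are made up:

```python
import json

def build_bulk_body(index, docs):
    """Build a line-delimited (NDJSON) bulk request body.

    Each document becomes two lines: an action/metadata line followed by
    the document source. The body must end with a trailing newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = build_bulk_body("logstash-2024.01.01", [
    {"message": "GET /index.html 200"},
    {"message": "POST /login 302"},
])
# A client would send this as POST /_bulk
# with Content-Type: application/x-ndjson
```

Because each line is a complete JSON object, a node can stream and dispatch documents one line at a time instead of materializing one giant JSON array in memory.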

Bulk Size

When configuring bulk indexing, the request body size must stay below http.max_content_length. However, the bulk size should not be set close to this limit because the entire request body must fit into JVM heap. Oversized requests can exhaust heap and degrade performance. A practical recommendation is to keep bulk request bodies around 15 MB, adjusting based on actual document size and testing.
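A sketch of size-based batching under that guideline; `iter_bulk_chunks` is a hypothetical helper (production clients such as Python's elasticsearch.helpers handle batching for you):

```python
import json

def iter_bulk_chunks(actions, max_bytes=15 * 1024 * 1024):
    """Yield bulk request bodies no larger than max_bytes (default ~15 MB).

    `actions` is an iterable of (metadata, source) dict pairs. A single
    pair larger than max_bytes is still emitted alone rather than dropped.
    """
    buf, size = [], 0
    for meta, source in actions:
        pair = json.dumps(meta) + "\n" + json.dumps(source) + "\n"
        pair_bytes = len(pair.encode("utf-8"))
        if buf and size + pair_bytes > max_bytes:
            yield "".join(buf)   # flush the current batch before it overflows
            buf, size = [], 0
        buf.append(pair)
        size += pair_bytes
    if buf:
        yield "".join(buf)       # flush the final partial batch
```

Batching by bytes rather than by document count is the safer policy when document sizes vary widely, since a fixed count can silently blow past the heap-friendly limit.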

Gateway

Elasticsearch stores index data via a gateway. By default gateway.type is local, using local disks. Several gateway settings control when shard recovery begins after a full cluster restart:

gateway.recover_after_nodes: start recovery only after this many nodes have joined the cluster.
gateway.recover_after_time: once the node-count condition is met, wait this long before recovering, in case more nodes are still starting up.
gateway.expected_nodes: if this many nodes have joined, skip the wait and begin recovery immediately.

More granular settings such as gateway.recover_after_data_nodes and gateway.recover_after_master_nodes are also available.
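As a concrete illustration, a hypothetical elasticsearch.yml fragment for a 10-node cluster might look like this (the node counts and wait time are made-up examples):

```yaml
# elasticsearch.yml
gateway.recover_after_nodes: 8    # do not start recovery until 8 nodes are up
gateway.expected_nodes: 10        # recover immediately once all 10 have joined
gateway.recover_after_time: 5m    # otherwise wait 5 minutes past the 8-node mark
```

The intent is to avoid a recovery storm: without these settings, the first few nodes to start would begin rebalancing shards that the still-booting nodes already hold locally.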

Shadow Replicas on Shared Storage

Although Elasticsearch discourages NFS/iSCSI storage for the gateway, version 1.5 and later support shadow replicas, which read index segments directly from shared storage instead of replicating them. Enable the feature by setting node.enable_custom_paths: true and creating the index with "shadow_replicas": true. Shadow replicas reduce write pressure on replica shards and avoid network copies during recovery, but the primary shard still writes all data, so CPU savings are limited. For most cases, a local gateway plus snapshots to HDFS or other backup storage is preferred.
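A sketch of the index-creation request, assuming a shared filesystem mounted at /mnt/shared on every node (the mount path and index name are hypothetical):

```json
PUT /my_index
{
  "index": {
    "number_of_replicas": 2,
    "data_path": "/mnt/shared/my_index",
    "shadow_replicas": true
  }
}
```

This only works once node.enable_custom_paths: true is set in elasticsearch.yml on every node, as described above.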

Cluster State Maintenance

The master node manages the cluster state, which includes cluster‑wide settings, node list, index mappings, and shard allocations. All nodes store a copy of the state and can retrieve it via /_cluster/state. Only the master can modify the state, and most changes are lightweight except mapping updates, which occur when new fields appear in documents. Bulk index creation (e.g., daily time‑based indices) can cause noticeable cluster‑state blocking under high load.

Bulk New Index Creation

Creating many new indices at once can block writes while the master propagates the updated state. Scheduling index creation during off‑peak hours (e.g., 3–4 am) mitigates this issue.
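One way to implement this, sketched as a crontab entry (the host, port, and logstash-%Y.%m.%d naming pattern are assumptions to adapt to your setup):

```shell
# Pre-create tomorrow's daily index at 03:00, when write load is lowest.
# Note: % must be escaped as \% inside crontab entries.
0 3 * * * curl -s -XPUT "http://10.0.0.100:9200/logstash-$(date -d tomorrow +\%Y.\%m.\%d)"
```

Pre-creating the index means the mapping-free cluster-state update happens off-peak, so the first documents of the new day do not trigger index creation under full load.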

Excessive Field Updates

Storing every URL parameter as a separate field (e.g., via Logstash kv filter) inflates the mapping and consumes heap memory, potentially causing OOM. Using a nested object to store key/value pairs reduces mapping explosion but requires nested queries and aggregations.

Nested Object Example

{
  "urlargs": [
    {"key": "uid", "value": "1234567890"},
    {"key": "action", "value": "payload"}
  ]
}

When indexed as a nested type, queries must use the nested query to correctly match key/value pairs.
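A sketch of the corresponding mapping and query (the index name is illustrative, and the typeless syntax shown here applies to newer versions; older clusters would nest the properties under a type name):

```json
PUT /logs
{
  "mappings": {
    "properties": {
      "urlargs": {
        "type": "nested",
        "properties": {
          "key":   {"type": "keyword"},
          "value": {"type": "keyword"}
        }
      }
    }
  }
}

GET /logs/_search
{
  "query": {
    "nested": {
      "path": "urlargs",
      "query": {
        "bool": {
          "must": [
            {"term": {"urlargs.key": "uid"}},
            {"term": {"urlargs.value": "1234567890"}}
          ]
        }
      }
    }
  }
}
```

Without the nested type, the key and value arrays are flattened independently, so a query for key=uid AND value=payload would incorrectly match the document above.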

Cache

Filter Cache

Before ES 2.0, queries and filters were separate APIs, and filter results could be cached as bitsets. Since 2.0, filters are merged into the query DSL, but the engine still distinguishes query context (scoring) from filter context (no scoring) to decide what is cacheable. The filter cache is node-level; before 2.0 it was sized with indices.cache.filter.size (default 10% of heap), which was renamed to indices.queries.cache.size in 2.0.
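For example, moving non-scoring clauses into the filter section of a bool query makes them candidates for caching (the index and field names are made up):

```json
GET /logs/_search
{
  "query": {
    "bool": {
      "must":   [{"match": {"message": "error"}}],
      "filter": [{"term": {"status": 500}}]
    }
  }
}
```

The match clause contributes to scoring, while the term clause runs in filter context: no score, and its bitset can be cached and reused across requests.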

Shard Request Cache

The shard request cache stores complete per-shard results of search requests. By default it only caches requests with size: 0 (aggregations and hit counts), and entries are invalidated whenever the shard refreshes, so it is most effective on indices that no longer receive writes, such as older time-based indices. It also keys on the raw request JSON, so the body must be byte-identical between calls: time ranges must use fixed or rounded timestamps rather than a moving now. The cache size is set with indices.requests.cache.size (default 1% of heap).
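A sketch of a cache-friendly request against a closed daily index (index and field names are illustrative); note the fixed day boundaries, which keep the JSON identical across repeated dashboard refreshes:

```json
GET /logs-2024.01.01/_search?request_cache=true
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {"gte": "2024-01-01T00:00:00", "lt": "2024-01-02T00:00:00"}
    }
  },
  "aggs": {
    "status_codes": {"terms": {"field": "status"}}
  }
}
```

Using "now-1h" instead of fixed timestamps would produce a different request body every second and make the cache useless.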

Field Data

Fielddata (an uninverted, in-memory index) accelerates sorting and aggregations on analyzed text fields but consumes heap memory. Its size can be capped with indices.fielddata.cache.size; the companion setting indices.fielddata.cache.expire should not be used, because time-based eviction forces expensive periodic reloads without actually bounding heap usage. Elasticsearch also provides a circuit breaker (indices.breaker.fielddata.limit) that rejects loads which would exceed the limit, preventing OOM.
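A hypothetical elasticsearch.yml fragment; the percentages are illustrative, and the breaker limit must stay above the cache size, or loads will trip the breaker before the cache ever evicts:

```yaml
# elasticsearch.yml
indices.fielddata.cache.size: 20%       # evict least-recently-used fielddata beyond 20% of heap
indices.breaker.fielddata.limit: 60%    # reject any load that would push fielddata past 60% of heap
```

The cache size bounds steady-state usage, while the breaker is the hard backstop against a single pathological aggregation.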

Doc Values

Doc values store column-oriented field data on disk, built at index time, which greatly reduces heap usage compared with fielddata. Since ES 5.0, keyword and other exact-value fields use doc values by default, while fielddata on text fields is disabled by default and must be explicitly enabled in the mapping.

Enabling Fielddata on Text Fields

{
  "mappings": {
    "my_type": {
      "properties": {
        "message": {
          "type": "text",
          "fielddata": true,
          "fielddata_frequency_filter": {
            "min": 0.1,
            "max": 1.0,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
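With fielddata enabled as above, a terms aggregation on the text field becomes possible (the index name is illustrative); note that it buckets the analyzed tokens of message, not whole messages:

```json
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "top_terms": {
      "terms": {"field": "message", "size": 10}
    }
  }
}
```

The fielddata_frequency_filter in the mapping keeps rare terms (below 0.1% document frequency) out of memory, which is usually what makes such aggregations affordable at all.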

Curator

When indices exceed cluster capacity, the elasticsearch‑curator tool automates index deletion, closing, and optimization. Example commands:

curator --host 10.0.0.100 delete indices --older-than 5 --time-unit days --timestring '%Y.%m.%d' --prefix logstash-mweibo-nginx-
curator --host 10.0.0.100 close indices --older-than 7 --time-unit days --timestring '%Y.%m.%d' --prefix logstash-
curator --host 10.0.0.100 optimize --max_num_segments 1 indices --older-than 1 --newer-than 7 --time-unit days --timestring '%Y.%m.%d' --prefix logstash-

These commands keep recent indices, close older ones, and force segment merges.

Profiler

Elasticsearch 5.0 introduced the profile API to break down query and aggregation execution times. Adding "profile": true to a search request returns detailed timing for collectors, rewrites, scoring, and aggregation phases, helping to tune collect_mode and execution_hint parameters.
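A minimal example; any search body works, with "profile": true added at the top level (index and field names are made up):

```json
GET /logs/_search
{
  "profile": true,
  "query": {
    "match": {"message": "timeout"}
  }
}
```

The response gains a "profile" section with a per-shard breakdown of query rewriting, scoring, and collector time, which is where slow aggregations and expensive query clauses show up.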

Source: http://wangnan.tech/post/elkstack-es03/
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Elasticsearch, Caching, Cluster Management, Bulk API, Profiler, Curator, Fielddata
Written by Architecture Talk

Rooted in the "Dao" of architecture, we provide pragmatic, implementation-focused architecture content.