
Elasticsearch Indexing Performance Optimization - Part 3

Part 3 of the Elasticsearch indexing optimization guide advises balancing shard and replica counts, using index aliases, leveraging the Bulk API with appropriately sized batches, dedicating data nodes, and upgrading storage (SSD/RAID 0) while monitoring resources to achieve higher throughput and scalable, reliable clusters.

vivo Internet Technology

This is the third part of the Elasticsearch indexing optimization series, following the first and second parts. This tutorial series aims to improve indexing performance and reduce monitoring and management pressure through Elasticsearch configuration tuning. The article is translated from the QBox official blog, with copyright belonging to the original author Adam Vanderbush.

Elasticsearch recommends using the shard and replica mechanisms to scale out and keep indices highly available. Having slightly more replicas is beneficial, but too many shards can hurt performance. It is often hard to tell whether there are too many shards, because it depends on shard size and usage patterns. An index with many shards but low usage may perform acceptably, while an index with only two heavily used shards can perform terribly. Monitor your nodes to ensure there are enough idle resources to absorb sudden spikes.

Horizontal scaling should cover the following aspects. First, reserve enough resources for the next stage of expansion; once there, you will have enough time to plan the changes needed for the stages after it. You can also find optimization opportunities in the requests sent to Elasticsearch: do you really need to send a separate request for each document, or can you buffer multiple documents and index them all in a single request through the Bulk API?

We previously focused on indexing performance such as updates, refreshes, segment merges, and automatic throttling. This article will list some strategies regarding shards, replicas, requests, clients, and storage to improve Elasticsearch throughput.

Elasticsearch is built to scale. It runs well both on a personal computer and in clusters of hundreds of nodes, and that experience carries over. Growing from a small cluster to a large one is almost completely automatic and painless. Growing from a large cluster to an even larger one requires some planning and design, but is still relatively easy.

Elasticsearch's default settings are sufficient for many scenarios, but if you want better performance, you need to consider how data flows through the system. It could be time-series data (such as log times or social network streams related to recent times) or user-based data (such as collecting a large number of documents by segmenting users or customers).

The create-index API instantiates an index. Elasticsearch supports multiple indices, including cross-index operations, and each index has its own configuration. The number of shards must be set at index creation and cannot be changed afterward. If you don't know exactly how much data to expect, consider allocating slightly more shards than you need (but not too many, since each shard has a cost) to leave room for growth. The number of replicas, by contrast, can be changed after the index is created.
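Since the shard count is fixed at creation time while the replica count is not, it helps to keep that split explicit when building the creation request. Here is a minimal sketch in Python that constructs the settings body for the create-index API; the function name and parameter choices are illustrative, not part of any official client:

```python
import json

def create_index_body(shards: int, replicas: int) -> str:
    """Build the JSON settings body for index creation.

    number_of_shards is fixed once the index exists;
    number_of_replicas can be changed later via the
    update-index-settings API.
    """
    return json.dumps({
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    })

# A body you could PUT to localhost:9200/my_index
body = create_index_body(shards=5, replicas=1)
```

The same body can be sent with curl or any HTTP client; only the shard number needs to be decided up front.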

Index aliases provide a way to keep an index flexible after it has been created. The alias API lets you give an index an alias, and every API automatically resolves the alias to the underlying index. An alias can also map to multiple indices at once; when such an alias is used, it automatically expands to all of its indices. When searching or specifying routing, aliases also support automatically applying associated filters. An alias cannot share a name with an index.

Here's an example of assigning the alias 'alias1' to the index 'test1':

curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '{
  "actions" : [
    { "add" : { "index" : "test1", "alias" : "alias1" } }
  ]
}'

Here's an example of removing an alias:

curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '{
  "actions" : [
    { "remove" : { "index" : "test1", "alias" : "alias1" } }
  ]
}'

Renaming an alias is done by removing the old alias and adding the new one. When both actions are sent in the same request, the change is atomic, so there is no window during which the alias maps to no index.
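The atomic swap above can be sketched as a single `_aliases` request body combining the remove and add actions. A minimal Python sketch, assuming hypothetical index names `test1` and `test2`:

```python
import json

def alias_swap_actions(alias: str, old_index: str, new_index: str) -> str:
    """Build an _aliases request body that moves `alias` from
    old_index to new_index in one request, so the change is atomic."""
    return json.dumps({
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    })

# POST this body to localhost:9200/_aliases
body = alias_swap_actions("alias1", "test1", "test2")
```

Sending the two actions separately would work too, but only the combined request guarantees the alias never dangles.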

Replica shards are crucial for two reasons: 1. They keep the index available after a shard or node fails; for that reason, a replica shard is never placed on the same node as its primary shard. 2. They increase search throughput, because searches can run in parallel across all replicas.

Although replicas are very important for handling failures, the more replicas there are, the longer indexing takes. It is therefore best to avoid replicas while bulk indexing. Also, unlike the number of shards, the number of replicas can be changed at any time, which gives us more options.

In specific situations, such as initializing a new index or migrating data from one index to another, time is often tight. In these cases, it is best to disable replicas for the duration of the operation and add them back once it completes.

The replica count can be changed through the update-index-settings API:

curl -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{
  "index" : {
    "number_of_replicas" : 0
  }
}'

Once the bulk operation completes, the replica count can be restored to its intended value.

The main benefit of dedicated data nodes is separating the master role from the data role, so that cluster management is not disturbed by heavy indexing and search load.

A dedicated data node is configured as follows:

node.master: false
node.data: true
node.ingest: false

When coordinating nodes handle search requests, they forward each request only to the data nodes holding the relevant shards. This reduces the overall load on data nodes, leaving enough capacity to handle indexing requests.

If all data nodes are running low on disk space, add more data nodes to the cluster, and make sure the index has enough primary shards to balance data across them. Elasticsearch's shard allocation takes each node's available disk space into account: by default, once a node's disk usage exceeds 85%, no more shards are allocated to that node.

For low disk space, there are two remedies. The first is to delete expired data or move it out of the cluster. This does not suit every user, but if data is time-based, you can snapshot old indices to storage outside the cluster for backup and remove their replicas through the index settings.

If all data must stay in the current cluster, the second option is the only choice: scale vertically or horizontally. Vertical scaling means upgrading hardware, but to avoid hitting the ceiling again later, it is usually better to lean on Elasticsearch's natural strength, horizontal scaling. To prepare for future growth, reindex the data into a new index configured with more primary shards.

The Bulk API makes it possible to execute many index or delete operations in a single API request, which greatly increases indexing speed. Each sub-request executes independently, so one sub-request's failure does not affect the others. If any sub-request fails, the top-level errors flag is set to true, and the error details are reported under the corresponding sub-request.

The operations allowed in bulk requests are index, create, delete, and update. index and create expect the document source on the next line and follow the same op_type semantics as the standard index API (index adds or replaces a document, while create fails if a document with the same index and type already exists). delete takes no source line and follows the same syntax as the standard delete API. update expects the next line to carry a partial document, an upsert, or a script.
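The bulk body is newline-delimited JSON: an action/metadata line, followed (except for delete) by a source line. A minimal sketch of building such a body, assuming a hypothetical index `test1` and made-up documents; in practice, the official client libraries provide bulk helpers that do this for you:

```python
import json

def bulk_body(operations):
    """Serialize (action, metadata, source) tuples into the NDJSON
    body expected by the Bulk API. index, create, and update take a
    source line on the next row; delete does not."""
    lines = []
    for action, meta, source in operations:
        lines.append(json.dumps({action: meta}))
        if action != "delete":          # delete has no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"      # body must end with a newline

# POST this body to localhost:9200/_bulk
body = bulk_body([
    ("index",  {"_index": "test1", "_id": "1"}, {"user": "kim"}),
    ("delete", {"_index": "test1", "_id": "2"}, None),
])
```

Note that the payload is plain NDJSON rather than a single JSON document, which is what lets the node stream and dispatch sub-requests independently.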

The entire bulk request needs to be loaded into memory by the node that receives these requests, so the larger the bulk request, the less memory is available for other requests. Therefore, a reasonable value needs to be set for the size of bulk requests. If this value is exceeded, performance will not increase but decrease. This reasonable value is not a fixed value. It completely depends on hardware, document size and complexity, as well as index and search load.

Fortunately, finding this sweet spot is not difficult: gradually increase the number of typical documents per batch and measure indexing performance. When performance drops, the batch is too large. A reasonable starting point is 1,000 documents per batch, increasing toward 5,000; if documents are very large, use smaller batches. The optimal size depends on the documents, whether they are analyzed, and the cluster configuration, but a reasonable size for a single bulk request is 5-15 MB. Note that this is a physical size: counting documents alone is not a precise way to size bulk requests. For example, if you index 1,000 documents per batch, keep the following arithmetic in mind:

1,000 documents of 1 KB each is only 1 MB, while 1,000 documents of 100 KB each is 100 MB: two vastly different bulk sizes. Bulk requests must be loaded into the memory of the receiving node, so the physical size of the request matters more than the document count.

Gradually increase the bulk size within roughly the 5-15 MB range until performance stops improving. At that point, consider increasing concurrency (multiple threads importing in parallel, etc.).
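Capping batches by serialized bytes rather than document count, as described above, can be sketched like this. The 10 MB default is an assumption picked from the middle of the 5-15 MB guideline; tune it against your own measurements:

```python
import json

MAX_BATCH_BYTES = 10 * 1024 * 1024   # assumed mid-point of the 5-15 MB guideline

def size_batches(docs, max_bytes=MAX_BATCH_BYTES):
    """Group documents into bulk batches capped by serialized size,
    not by document count."""
    batch, batch_bytes = [], 0
    for doc in docs:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if batch and batch_bytes + doc_bytes > max_bytes:
            yield batch                  # current batch is full
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch                      # flush the final partial batch
```

With this scheme, a stream of 1 KB documents and a stream of 100 KB documents automatically end up in batches of very different document counts but similar physical size.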

You can monitor whether node resources have hit a bottleneck through Marvel or commands such as iostat, top, and ps. If you start receiving EsRejectedExecutionException, the cluster is saturated and at least one resource has reached its limit. Either reduce concurrency, upgrade the constrained resource (for example, switching from spinning disks to SSDs), or add more nodes.

When importing data, make sure bulk requests are rotated across the data nodes. Don't send every request to a single node, because that node must hold all of those requests in memory while processing them.
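The rotation above can be sketched with a simple round-robin over the data nodes. The node addresses here are hypothetical placeholders; real deployments would list their own data nodes, or rely on an official client's connection pooling, which does this for you:

```python
from itertools import cycle

# Hypothetical data-node addresses; substitute your cluster's own.
DATA_NODES = [
    "http://node1:9200",
    "http://node2:9200",
    "http://node3:9200",
]

def assign_targets(batches, nodes=DATA_NODES):
    """Pair each bulk batch with a data node in round-robin order,
    so no single node buffers every request in memory."""
    node_cycle = cycle(nodes)
    return [(next(node_cycle), batch) for batch in batches]
```

Each `(node, batch)` pair can then be POSTed to that node's `_bulk` endpoint.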

We usually deploy test environments on personal computers or small clusters; when moving Elasticsearch to production, some suggestions are worth consulting. Because Elasticsearch is widely used and runs on many kinds of machines, there are no fixed rules, but the following recommendations, drawn from our production experience, provide a reasonable reference.

Hard drives are usually the bottleneck in modern servers. Elasticsearch uses hard drives extensively, and the higher the disk throughput, the more stable the node. Here are some tips for optimizing disk I/O:

If you can afford SSDs, their performance is superior to any mechanical hard drive. Nodes based on SSDs will have a significant improvement in both query and indexing performance. Of course, if you use mechanical hard drives, you can also try upgrading to faster hard drives (high-performance server hard drives, 15k RPM).

Use RAID 0. Striping increases disk I/O, at the price that a single failed disk takes down the whole array. There is no need for mirrored or parity RAID, because Elasticsearch provides high availability through its own replica mechanism. RAID 0 is an effective way to increase disk speed, for mechanical disks and SSDs alike.

Avoid using EFS, despite its appeal as persistent shared storage with elastic scaling. Its file-system semantics can cause indexing errors, and since Elasticsearch already provides distribution and replication, the advantages EFS offers are unnecessary.

Do not use remote-mounted storage, such as NFS, SMB/CIFS. The latency caused by this method will directly affect cluster performance.

If the service is deployed on EC2, pay attention to EBS performance. Even SSD-backed EBS volumes are often slower than local storage. EBS works well in small clusters (one or two nodes), but for large clusters under heavy search and indexing load, its performance is poor. If you must use EBS, use provisioned IOPS volumes to guarantee performance.

Finally, avoid using NAS. People often claim that their NAS solutions are faster and more stable than local storage, but we have never seen NAS live up to those claims. NAS is usually slower, with higher latency and much larger latency variance, and it can also become a single point of failure.

Tags: Performance optimization, Indexing, Scalability, Elasticsearch, Storage, Bulk API, Replicas, Shards
Written by vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
