Backend Development

Investigation of Elasticsearch RestClient Load‑Balancing and Traffic Skew Issues

The investigation found that the Elasticsearch RestClient's built-in round-robin and dead-node blacklisting shifted traffic away from failed data-node addresses (mistakenly included in the static IP list) onto a single client node, causing severe load imbalance and timeouts. Correcting the IP list eliminated the problem.


Background: Load balancing is a common capability in distributed systems, typically implemented via round-robin, random, weighted round-robin, or consistent hashing. The author describes a real production incident involving Elasticsearch (ES) client-side load balancing.

Original architecture: Java application → SLB (domain name) → ES ingest node (client‑only, no data) → ES data node. The SLB added an extra RPC hop and cost. After verifying that ES itself can handle a list of IPs with built‑in load‑balancing, the SLB was removed, resulting in the path Java application → ES ingest node → ES data node.

During a scale‑down, the IP list in the ES client was updated. The system experienced timeouts and a noticeable traffic imbalance: one client node handled far more requests than the others.

Step 1: At 16:xx the ES client machine list was updated; the system remained normal.

Step 2: At 20:5x data nodes were taken offline; a few timeouts appeared and one client node showed much higher traffic than the others.

Step 3: At 21:1x the high-traffic client node was taken offline.

Step 4: At 21:2x another client node became overloaded, causing periodic timeouts.

Step 5: Suspecting an ES client SDK initialization issue, the team restarted the Java applications in six batches (~10 machines each).

Step 6: Timeouts persisted but became a steady low-level stream rather than bursts.

The root cause was discovered: the IP list mistakenly contained data‑node addresses, some of which had been shut down, leading to failed connections.

After correcting the IP list, errors disappeared and traffic balanced again.

Problem analysis

The load‑balancing behavior was traced to Elasticsearch's RestClient implementation, which uses a round‑robin algorithm with dead‑node blacklisting.

Key excerpts from the RestClient source:

/**
 * Sends a request to the Elasticsearch cluster that the client points to.
 * Blocks until the request is completed and returns its response or fails
 * by throwing an exception. Selects a host out of the provided ones in a
 * round‑robin fashion. Failing hosts are marked dead and retried after a
 * certain amount of time (minimum 1 minute, maximum 30 minutes), depending
 * on how many times they previously failed (the more failures, the later
 * they will be retried). In case of failures all of the alive nodes (or
 * dead nodes that deserve a retry) are retried until one responds or none
 * of them does, in which case an {@link IOException} will be thrown.
 */
public Response performRequest(Request request) throws IOException {
    InternalRequest internalRequest = new InternalRequest(request);
    return performRequest(nextNodes(), internalRequest, null);
}

The method nextNodes() returns a freshly built, rotated list of living nodes, leaving the configured NodeTuple untouched. The selection logic is:

static Iterable<Node> selectNodes(NodeTuple<List<Node>> nodeTuple,
                                  Map<HttpHost, DeadHostState> blacklist,
                                  AtomicInteger lastNodeIndex,
                                  NodeSelector nodeSelector) throws IOException {
    // partition the configured nodes into living and dead
    List<Node> livingNodes = new ArrayList<>(Math.max(0, nodeTuple.nodes.size() - blacklist.size()));
    List<DeadNode> deadNodes = new ArrayList<>(blacklist.size());
    for (Node node : nodeTuple.nodes) {
        DeadHostState deadness = blacklist.get(node.getHost());
        if (deadness == null || deadness.shallBeRetried()) {
            livingNodes.add(node);
        } else {
            deadNodes.add(new DeadNode(node, deadness));
        }
    }
    if (false == livingNodes.isEmpty()) {
        List<Node> selectedLivingNodes = new ArrayList<>(livingNodes);
        nodeSelector.select(selectedLivingNodes); // default selector does nothing
        if (false == selectedLivingNodes.isEmpty()) {
            // rotate using a global counter so subsequent requests try nodes in a different order
            Collections.rotate(selectedLivingNodes, lastNodeIndex.getAndIncrement());
            return selectedLivingNodes;
        }
    }
    // no living node is available: fall back to retrying the
    // least-recently-dead node (omitted here)
}

Important observations:

The returned list is a new collection, leaving the original configuration untouched.

Dead nodes are filtered out based on deadness.shallBeRetried().

Round‑robin is achieved via Collections.rotate() with an AtomicInteger counter shared across threads.

The first element of the rotated list is used for the request, so when a node fails, its traffic is shifted to the next living node, causing the observed imbalance.
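The rotate-based selection can be sketched in isolation with the standard library. This is a simplified model of the mechanism, not the RestClient code itself; the node names and the pick helper are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RotateDemo {
    // Pick the node for the next request: copy the configured list, rotate it by a
    // shared counter, and take the head. The copy leaves the original list untouched.
    static String pick(List<String> configuredNodes, AtomicInteger lastNodeIndex) {
        List<String> copy = new ArrayList<>(configuredNodes);
        Collections.rotate(copy, lastNodeIndex.getAndIncrement());
        return copy.get(0);
    }

    public static void main(String[] args) {
        AtomicInteger lastNodeIndex = new AtomicInteger(); // shared across threads in the real client
        List<String> nodes = Arrays.asList("client-1", "client-2", "client-3");
        for (int i = 0; i < 6; i++) {
            System.out.println(pick(nodes, lastNodeIndex));
        }
        // prints client-1, client-3, client-2, then the same cycle again:
        // every node gets an equal share as long as all of them stay alive
    }
}
```

If one of the three nodes is filtered out as dead, its turn in the cycle collapses onto the node that follows it in the rotated order, which is exactly the traffic concentration observed in the incident.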

Dead‑node handling

// a dead node becomes eligible for retry once its deadline has passed
boolean shallBeRetried() {
    return timeSupplier.get() - deadUntilNanos > 0;
}

// the back-off starts at 1 minute, grows with consecutive failures
// (roughly doubling every two attempts), and is capped at 30 minutes
DeadHostState(DeadHostState previousDeadHostState) {
    long timeoutNanos = (long) Math.min(MIN_CONNECTION_TIMEOUT_NANOS * 2 *
        Math.pow(2, previousDeadHostState.failedAttempts * 0.5 - 1),
        MAX_CONNECTION_TIMEOUT_NANOS);
    this.deadUntilNanos = previousDeadHostState.timeSupplier.get() + timeoutNanos;
    this.failedAttempts = previousDeadHostState.failedAttempts + 1;
    this.timeSupplier = previousDeadHostState.timeSupplier;
}

When a request fails, onFailure(context.node) adds the node to the blacklist, and the client retries against the next candidate node until one responds or all are exhausted, in which case an IOException is thrown.
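Plugging the 1-minute minimum and 30-minute cap into the formula above gives the retry schedule. A quick stdlib sketch of the computation; the constant names mirror the source, the rest is illustrative:

```java
import java.util.concurrent.TimeUnit;

public class BackoffSchedule {
    static final long MIN_CONNECTION_TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(1);
    static final long MAX_CONNECTION_TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(30);

    // back-off applied after the n-th consecutive failure (n >= 1),
    // using the same formula as DeadHostState's copy constructor
    static long timeoutNanos(int failedAttempts) {
        return (long) Math.min(
            MIN_CONNECTION_TIMEOUT_NANOS * 2 * Math.pow(2, failedAttempts * 0.5 - 1),
            MAX_CONNECTION_TIMEOUT_NANOS);
    }

    public static void main(String[] args) {
        // the very first blacklisting uses the 1-minute minimum; afterwards
        // the window grows until the 30-minute cap takes over
        for (int n = 1; n <= 12; n++) {
            System.out.printf("failure %2d -> blacklisted for %d s%n",
                n, TimeUnit.NANOSECONDS.toSeconds(timeoutNanos(n)));
        }
    }
}
```

The schedule grows by a factor of about 1.4 per failure (120 s after the second failure, 240 s after the fourth), hitting the 30-minute cap around the tenth consecutive failure.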

Answers to the three questions

Why load imbalance? A failed node’s traffic is redirected to the next healthy node by the round‑robin algorithm, concentrating load.

Why the bursty timeouts and later dispersion? Updating the IP list across many Java instances introduced several dead nodes simultaneously, causing a wave of immediate failures. After each node entered the blacklist for a minute (or longer with repeated failures), the retry attempts were staggered, dispersing the timeout pattern.

Why does blacklisting still produce errors? A node is only blacklisted after a request to it has already failed, so the first hit on a freshly dead node always surfaces an error. And once a node's back-off expires it becomes eligible for retry again; if it is still down, that retry fails, produces another error, and re-blacklists it with a longer back-off.
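This timing can be seen in a toy simulation. It is purely illustrative; the node name, the fake clock, and the fixed one-minute back-off are assumptions that stand in for the real exponential schedule:

```java
import java.util.HashMap;
import java.util.Map;

public class BlacklistTiming {
    enum Outcome { OK, ERROR, SKIPPED }

    static final long BACKOFF = 60; // seconds; fixed back-off for the sketch
    final Map<String, Long> deadUntil = new HashMap<>(); // node -> earliest retry time (s)

    Outcome request(String node, boolean nodeIsDown, long now) {
        Long until = deadUntil.get(node);
        if (until != null && now < until) {
            return Outcome.SKIPPED;             // blacklisted: not attempted at all
        }
        if (nodeIsDown) {
            deadUntil.put(node, now + BACKOFF); // the attempt itself fails first,
            return Outcome.ERROR;               // then the node is (re-)blacklisted
        }
        deadUntil.remove(node);
        return Outcome.OK;
    }

    public static void main(String[] args) {
        BlacklistTiming t = new BlacklistTiming();
        System.out.println(t.request("data-1", true, 0));  // ERROR: first hit precedes the blacklist
        System.out.println(t.request("data-1", true, 30)); // SKIPPED: inside the back-off window
        System.out.println(t.request("data-1", true, 61)); // ERROR: retried after expiry, fails again
    }
}
```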

Conclusion

When using the Elasticsearch RestClient with a static IP list (bypassing an external load balancer), the built‑in round‑robin and dead‑node logic can cause traffic spikes and periodic timeouts if the list contains unavailable nodes. Properly validating the IP list and understanding the retry/back‑off behavior are essential to avoid unexpected latency and load‑skew.
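One way to validate the list is a lightweight pre-flight check before the client is built. A minimal sketch using a plain TCP connect; the host list, port, and timeout below are placeholders, not values from the incident:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;
import java.util.List;

public class IpListCheck {
    // Attempt a TCP connection; a node that refuses or times out is reported as unreachable.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> configured = Arrays.asList("10.0.0.1", "10.0.0.2"); // placeholder IPs
        for (String host : configured) {
            if (!isReachable(host, 9200, 1000)) {
                System.err.println("WARN: " + host + ":9200 unreachable - check the IP list");
            }
        }
    }
}
```

A TCP connect only proves the port is open, not that the node is a client node rather than a data node; querying each address's node role (for example via the _nodes API) would catch the exact mistake from this incident.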

Tags: Backend Development, Elasticsearch, Load Balancing, Java Client, Round Robin
Written by HelloTech, the official Hello technology account, sharing tech insights and developments.
