
Investigation and Resolution of Elasticsearch Write Timeout Issues in a Real-Time Flink Data Sync Pipeline

The team diagnosed intermittent Elasticsearch write‑timeout failures in their real‑time Flink‑to‑Elasticsearch pipeline as lock contention from frequent duplicate updates to the same document IDs, and eliminated the issue by aggregating binlog events in a 5‑second sliding window to deduplicate writes, adjusting refresh intervals, using async translog durability, and disabling non‑essential fields.

HelloTech

Our team built a search platform that synchronizes data from a relational database to Elasticsearch in real time using Flink. Historical data is backfilled by a batch Flink job, and subsequent binlog changes are streamed to Elasticsearch as they occur.

The AIoT team maintains a large index on this platform with an average write throughput of 2k‑3k TPS and several hundred QPS for queries. Because the index is critical and resource‑intensive, it is deployed on dedicated machines using Elasticsearch templates.

Since late May, the Flink job writing to this index occasionally failed and restarted. Investigation revealed request timeouts to Elasticsearch, traced to a write TPS higher than the cluster could absorb. We scaled the index from two to three machines; a write-load test confirmed the added capacity, and the issue disappeared for a period.

In early June the problem resurfaced without an increase in write rate. We examined several metrics:

Index write rate (2k‑3k TPS)

Query rate

JVM GC count

Node merge rate

Node write‑queue length

These metrics showed that a single node’s write queue had grown excessively long, causing write latency. However, GC, merge, and query rates remained stable, suggesting that certain documents were causing unusually slow writes.

We attempted to accelerate indexing by adjusting Elasticsearch settings (increasing refresh_interval, setting index.translog.durability to async, disabling indexing on some fields) and by reducing the Flink job's write frequency, but none of these resolved the timeouts.

Further analysis of the problematic node's thread dump showed eight write threads stuck in the WAITING state, each blocked on a document-level lock. During indexing, Elasticsearch acquires a lock on a document's ID to prevent concurrent updates to the same document. This pattern indicated that many real-time updates were targeting the same document IDs, producing severe lock contention.
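To see why per-ID locking hurts here, consider this small illustrative sketch (plain Python, not Elasticsearch internals): writers targeting the same document ID serialize behind a single lock, while the same number of writers spread across distinct IDs proceed in parallel.

```python
import threading
import time
from collections import defaultdict

# Illustrative only: Elasticsearch takes a per-_id lock while indexing a
# document; we model that here with one threading.Lock per document ID.
doc_locks = defaultdict(threading.Lock)

def index_doc(doc_id):
    with doc_locks[doc_id]:   # updates to the same ID serialize here
        time.sleep(0.05)      # simulated indexing work

def run_writers(ids):
    """Start one writer thread per ID and return the wall-clock time."""
    threads = [threading.Thread(target=index_doc, args=(i,)) for i in ids]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

# Eight writers hammering one hot ID take roughly 8 x 50 ms ...
hot = run_writers(["device-1"] * 8)
# ... while eight writers on distinct IDs finish in roughly one 50 ms slice.
spread = run_writers([f"device-{i}" for i in range(8)])
print(f"hot={hot:.2f}s spread={spread:.2f}s")
```

The same effect in Elasticsearch keeps write threads waiting rather than working, which is exactly what showed up as a growing write queue on the affected node.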

We verified this hypothesis by inspecting the binlog: the real-time stream contains multiple updates for the same primary key, while the batch job only writes the latest snapshot. Moreover, which node exhibited the backlog varied over time, because different hot IDs were routed to different shards.

To resolve the issue, we modified the Flink job to aggregate binlog events in a 5-second sliding window and deduplicate them by primary key before writing to Elasticsearch. This reduced the frequency of duplicate writes to the same document ID, at the cost of a slight increase in data latency (acceptable to the business). After deployment, the write-timeout errors ceased.
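The deduplication step can be sketched as follows. This is a minimal stand-in for the Flink logic, not the team's actual job: in Flink terms it would correspond to a keyBy on the primary key plus a 5-second window whose reduce keeps only the latest record per key.

```python
def deduplicate_window(events):
    """Collapse one window of binlog events to a single write per key.

    events: list of (primary_key, payload) tuples in binlog order.
    Later events overwrite earlier ones, mirroring a last-write-wins
    reduce inside the window.
    """
    latest = {}
    for key, payload in events:
        latest[key] = payload
    return list(latest.items())

# Four raw binlog events, three of them updates to the same key 42 ...
window = [(42, {"v": 1}), (7, {"v": 1}), (42, {"v": 2}), (42, {"v": 3})]
# ... become two Elasticsearch writes, carrying only the latest versions.
print(deduplicate_window(window))  # [(42, {'v': 3}), (7, {'v': 1})]
```

The key design choice is that intermediate versions of a hot document are never sent to Elasticsearch at all, so at most one writer contends for a given document lock per window.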

In addition to the window-based deduplication, we kept the earlier tuning in place: increasing refresh_interval, setting index.translog.durability to async, and disabling indexing on non-essential fields.
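For reference, the refresh and translog changes correspond to the index settings API (`PUT <index>/_settings`); the `30s` value below is illustrative, not the exact value the team used:

```json
{
  "index": {
    "refresh_interval": "30s",
    "translog": {
      "durability": "async"
    }
  }
}
```

Disabling indexing on non-essential fields is a separate mapping change (`"index": false` on the field), which for an existing field generally requires a reindex. Note that `durability: async` trades a small window of translog durability for write throughput.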

Tags: Big Data, Flink, Elasticsearch, Performance Tuning, Data Synchronization, Indexing, Write Timeout
Written by

HelloTech

Official Hello technology account, sharing tech insights and developments.
