
Investigation and Resolution of Elasticsearch Write Timeout Issues in a Real-Time Flink Data Sync Pipeline

The team diagnosed intermittent Elasticsearch write‑timeout failures in their real‑time Flink‑to‑Elasticsearch pipeline as lock contention from frequent duplicate updates to the same document IDs, and eliminated the issue by aggregating binlog events in a 5‑second sliding window to deduplicate writes, adjusting refresh intervals, using async translog durability, and disabling non‑essential fields.

HelloTech

Our team built a search platform that synchronizes data from a relational database to Elasticsearch in real time using Flink. Historical data is backfilled by a batch Flink job, and subsequent binlog changes are streamed to Elasticsearch as they occur.

The AIoT team maintains a large index on this platform with an average write throughput of 2k‑3k TPS and several hundred QPS for queries. Because the index is critical and resource‑intensive, it is deployed on dedicated machines using Elasticsearch templates.

Since late May, the Flink job writing to this index occasionally failed and restarted. Investigation revealed request timeouts to Elasticsearch, traced to a write TPS higher than the cluster could absorb. We scaled the index from two to three machines; a write-load test confirmed the added capacity, and the issue disappeared for a period.

In early June the problem resurfaced without an increase in write rate. We examined several metrics:

Index write rate (2k‑3k TPS)

Query rate

JVM GC count

Node merge rate

Node write‑queue length

These metrics showed that a single node’s write queue had grown excessively long, causing write latency. However, GC, merge, and query rates remained stable, suggesting that certain documents were causing unusually slow writes.

We attempted to accelerate indexing by adjusting Elasticsearch settings (increasing refresh_interval, setting index.translog.durability to async, disabling indexing on some fields) and by reducing the Flink job's write frequency, but none of these resolved the timeouts.

Further analysis of the problematic node's thread dump showed eight write threads stuck in the WAITING state, each blocked on a document-level lock. During indexing, Elasticsearch acquires a lock on a document's ID to prevent concurrent updates to the same document. This pattern indicated that many real-time updates were targeting the same document IDs, producing severe lock contention.
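To see why per-ID locking hurts here, consider this small illustrative sketch (plain Python, not Elasticsearch internals): writers targeting the same document ID serialize behind a single lock, while the same number of writers spread across distinct IDs proceed in parallel.

```python
import threading
import time
from collections import defaultdict

# Illustrative only: Elasticsearch takes a per-_id lock while indexing a
# document; we model that here with one threading.Lock per document ID.
doc_locks = defaultdict(threading.Lock)

def index_doc(doc_id):
    with doc_locks[doc_id]:   # updates to the same ID serialize here
        time.sleep(0.05)      # simulated indexing work

def run_writers(ids):
    """Start one writer thread per ID and return the wall-clock time."""
    threads = [threading.Thread(target=index_doc, args=(i,)) for i in ids]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

# Eight writers hammering one hot ID take roughly 8 x 50 ms ...
hot = run_writers(["device-1"] * 8)
# ... while eight writers on distinct IDs finish in roughly one 50 ms slice.
spread = run_writers([f"device-{i}" for i in range(8)])
print(f"hot={hot:.2f}s spread={spread:.2f}s")
```

The same effect in Elasticsearch keeps write threads waiting rather than working, which is exactly what showed up as a growing write queue on the affected node.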

We verified this hypothesis by inspecting the binlog: the real-time stream contains multiple updates for the same primary key, while the batch job only writes the latest snapshot. Moreover, which node exhibited the backlog varied over time, because different hot IDs were routed to different shards.

To resolve the issue, we modified the Flink job to aggregate binlog events in a 5-second sliding window and deduplicate them by primary key before writing to Elasticsearch. This reduced the frequency of duplicate writes to the same document ID, at the cost of a slight increase in data latency (acceptable to the business). After deployment, the write-timeout errors ceased.
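The deduplication step can be sketched as follows. This is a minimal stand-in for the Flink logic, not the team's actual job: in Flink terms it would correspond to a keyBy on the primary key plus a 5-second window whose reduce keeps only the latest record per key.

```python
def deduplicate_window(events):
    """Collapse one window of binlog events to a single write per key.

    events: list of (primary_key, payload) tuples in binlog order.
    Later events overwrite earlier ones, mirroring a last-write-wins
    reduce inside the window.
    """
    latest = {}
    for key, payload in events:
        latest[key] = payload
    return list(latest.items())

# Four raw binlog events, three of them updates to the same key 42 ...
window = [(42, {"v": 1}), (7, {"v": 1}), (42, {"v": 2}), (42, {"v": 3})]
# ... become two Elasticsearch writes, carrying only the latest versions.
print(deduplicate_window(window))  # [(42, {'v': 3}), (7, {'v': 1})]
```

The key design choice is that intermediate versions of a hot document are never sent to Elasticsearch at all, so at most one writer contends for a given document lock per window.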

In addition to the window-based deduplication, we kept the earlier tuning in place: increasing refresh_interval, setting index.translog.durability to async, and disabling indexing on non-essential fields.
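For reference, the refresh and translog changes correspond to the index settings API (`PUT <index>/_settings`); the `30s` value below is illustrative, not the exact value the team used:

```json
{
  "index": {
    "refresh_interval": "30s",
    "translog": {
      "durability": "async"
    }
  }
}
```

Disabling indexing on non-essential fields is a separate mapping change (`"index": false` on the field), which for an existing field generally requires a reindex. Note that `durability: async` trades a small window of translog durability for write throughput.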

Tags: Big Data, Flink, Elasticsearch, Performance Tuning, Data Synchronization, Indexing, Write Timeout
Written by

HelloTech

Official Hello technology account, sharing tech insights and developments.
