Why Did My Flink Kafka Job Lose Data? Uncovering Misconfigured Bootstrap Servers
A Flink job that reads from Kafka and writes to Elasticsearch was losing data because the bootstrap.servers list mixed production and pre‑release clusters, causing random server selection, partition discovery failures, and offset mismatches, which were resolved by correcting the server configuration.
We ran a Flink job that consumes Kafka messages and writes them to Elasticsearch. Although the logic is simple, we observed data loss without any exceptions, and initial guesses pointed to conversion bugs or dirty data.
Log analysis showed that many messages never entered Flink's internal pipeline, indicating that the consumer was not reading certain partitions. A count revealed that only three out of five Kafka partitions were being consumed.
Further investigation discovered that the bootstrap.servers configuration mixed addresses from the production cluster and a pre‑release cluster:
bootstrap.servers=[business-s2-002-kafka1.xxxx.com.cn:9092,business-s2-002-kafka2.xxxx.com.cn:9092,business-s2-002-kafka3.xxxx.com.cn:9092,pre-kafka1.xxxx.com.cn:9092,pre-kafka2.xxxx.com.cn:9092,pre-kafka3.xxxx.com.cn:9092]According to Kafka documentation, a consumer should try the first address and fall back to the next if it fails, but in practice the client randomly selects one of the listed servers for each request.
The consumer group sp2_group_name_G_2023_10_08_16_58_03 reads the test_topic_A topic, which has five partitions (0‑4) in the production cluster and three partitions (0‑2) in the pre cluster.
Examining FlinkKafkaConsumerBase.java, the key steps are:
In the open method, the consumer reads the initial offsets for each partition based on a timestamp.
During run, the consumer calls partitionDiscoverer.discoverPartitions() to assign partitions to each subtask.
A createFetcher is built to pull data from the selected Kafka server.
The relevant log line (shown in the image) confirms the variables subscribedPartitionsToStartOffsets and specificStartupOffsets that drive the offset logic.
Because the bootstrap list contains both clusters, each subtask may connect to either the production or the pre cluster when fetching offsets and discovering partitions. If a subtask connects to the pre cluster, it only sees partitions 0‑2, causing the missing partition 3 and 4 in the Flink job.
Furthermore, when the subtask later creates a fetcher, it may use offsets obtained from the production cluster (large values) while actually connecting to the pre cluster (which has far fewer messages). This mismatch results in no data being returned for those partitions.
To verify the hypothesis, we launched a second identical Flink job. Its logs showed that the offset‑reading phase only retrieved partitions 0‑2, confirming a connection to the pre cluster, while the subsequent discoverPartitions step assigned partitions from the production cluster, illustrating the random server selection.
After correcting the bootstrap.servers configuration to include only the production Kafka addresses, the consumer consistently discovered all five partitions and the data loss issue disappeared.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
