How a ‘Broken Pipe’ Error Revealed Hidden Performance Bottlenecks in Production
An unexpected ‘Broken pipe’ error halted production. A deep dive into logs, trace IDs, and monitoring tools (Kibana and SkyWalking) uncovered a Feign client timeout, a costly SQL UPDATE, and redundant microservice calls; targeted fixes restored normal device operation.
Background
During a routine coding session, a production device on the line stopped working. The symptom was a continuous retry loop: every request failed, the device retried, and the entire production line stalled.
Investigation Records
Initial Check
Log analysis with Kibana showed normal business logic and data generation, but the system reported an error after processing the request.
Using the trace id to retrieve the full trace revealed a network‑related exception unrelated to business code:
Unhandled exception in Controller - Broken pipe
--sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
--sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
AI‑Assisted Explanation
“Broken pipe” is a common network error indicating an attempt to write to a closed socket, often seen in Java NIO applications.
Possible causes include client disconnection, network instability, server‑side closure, or resource limits.
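To make the failure mode concrete, here is a minimal, self-contained sketch (not from the original incident) in which one side closes its socket while the other keeps writing; on most platforms a later write then fails with an IOException whose message is "Broken pipe" or "Connection reset":
import java.io.IOException;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class BrokenPipeDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("localhost", server.getLocalPort());
            Socket accepted = server.accept();
            client.close();                         // peer closes the connection
            OutputStream out = accepted.getOutputStream();
            try {
                for (int i = 0; i < 100; i++) {     // the first write may only trigger an RST;
                    out.write(new byte[8192]);      // a subsequent write surfaces the error
                    out.flush();
                }
            } catch (IOException e) {
                System.out.println(e.getMessage()); // e.g. "Broken pipe"
            }
        }
    }
}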
Elimination Process
Alternative devices succeeded, and the same device worked in a pilot environment, suggesting the issue lay in the production system rather than the hardware.
Log‑Level Insight
Searching by trace id and ERROR level uncovered a Feign client timeout log, indicating that an upstream microservice timed out and closed the connection while the downstream service continued processing.
“Broken pipe” was thrown when the downstream service eventually wrote its response to the connection that the upstream caller had already closed.
Why Did Feign Time Out?
Performance monitoring with SkyWalking showed that the downstream service took about 20 seconds, exceeding the Feign timeout setting of 10 seconds.
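For context, Feign's read timeout is configurable. Below is a minimal sketch of where a 10-second limit typically lives in a Spring Cloud OpenFeign setup; the bean style and constructor are assumptions about the project's Feign version, not the article's actual configuration:
import java.util.concurrent.TimeUnit;
import feign.Request;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FeignTimeoutConfig {
    // connect timeout 5 s, read timeout 10 s: a response that takes longer
    // than 10 s makes Feign abort and close the connection, which the
    // still-running downstream service later sees as "Broken pipe"
    @Bean
    public Request.Options feignOptions() {
        return new Request.Options(5, TimeUnit.SECONDS, 10, TimeUnit.SECONDS, true);
    }
}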
Stage S2 Problem – Heavy SQL
The slow part was an UPDATE statement that modified a large number of rows (≈17,000), adding an 8.5-second delay:
update table_c c set c.del_flag = 1 where c.p = ?
Adding a condition to exclude rows that are already soft-deleted fixes the issue:
update table_c c set c.del_flag = 1 where c.p = ? and c.del_flag = 0
With the extra filter, the statement touches only the rows whose flag actually changes, instead of rewriting all ≈17,000 rows on every call.
Stage S1 Problem – Redundant Calls
SkyWalking also revealed 133 repeated calls to microservice C at ~80 ms each, adding roughly another 10 seconds (133 × 80 ms ≈ 10.6 s). The calls came from a stream that invoked the interface once per element:
list.stream().map(vo -> {
    // call interface C – one remote call per element
    return clientC.query(vo); // clientC is a placeholder for the Feign client
}).collect(Collectors.toList());
Consolidating identical calls or caching results eliminates the unnecessary overhead, as sketched below.
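A possible shape of the fix, assuming the per-element call can be keyed and memoized; queryOncePerKey and the remote-call lambda are illustrative stand-ins, not the original code:
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DedupCalls {
    // remoteCall stands in for the Feign call to microservice C (hypothetical)
    static List<String> queryOncePerKey(List<String> keys, Function<String, String> remoteCall) {
        Map<String, String> cache = new HashMap<>();
        return keys.stream() // sequential stream, so mutating the cache is safe
                .map(key -> cache.computeIfAbsent(key, remoteCall)) // reuse cached result
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "a", "c", "b");
        // five elements but only three distinct keys -> three remote calls
        System.out.println(queryOncePerKey(keys, k -> "result-for-" + k));
    }
}
If microservice C also offers a batch endpoint, a single call with the distinct keys would cut the overhead even further.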
Root Cause Summary
The production halt was caused by a Feign client timeout (the upstream service closed the connection) combined with two performance bottlenecks in the downstream service: an unfiltered bulk UPDATE (~8.5 s) and 133 redundant microservice calls (~10.6 s), which together account for essentially all of the observed ~20-second response time. After applying the SQL filter and deduplicating the calls, the device resumed normal production.
Key Takeaways
Analyze both business and system logs; include ERROR level entries.
Use monitoring tools (Kibana, SkyWalking) to pinpoint latency hotspots.
Ensure SQL statements have precise WHERE clauses to avoid massive data scans.
Reduce redundant microservice invocations via caching or call consolidation.
When searching logs, combine trace id with log level for efficient troubleshooting.
Tools Used
Kibana: Log aggregation and search.
SkyWalking: Distributed tracing and performance monitoring.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.