How a ‘Broken Pipe’ Error Revealed Hidden Performance Bottlenecks in Production
An unexpected ‘Broken pipe’ error halted production. A deep dive into logs, trace IDs, and monitoring tools (Kibana and SkyWalking) uncovered a Feign client timeout, a costly SQL UPDATE, and redundant microservice calls; targeted fixes restored normal device operation.
Background
During a routine coding session, a production device on the line stopped working. The symptom was a continuous retry loop: every request failed, the device retried, and the entire production line stalled.
Investigation Records
Initial Check
Log analysis with Kibana showed normal business logic and data generation, but the system reported an error after processing the request.
Using the trace id to retrieve the full trace revealed a network‑related exception unrelated to business code:
Unhandled exception in Controller - Broken pipe
--sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
--sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
AI‑Assisted Explanation
“Broken pipe” is a common network error indicating an attempt to write to a closed socket, often seen in Java NIO applications.
Possible causes include client disconnection, network instability, server‑side closure, or resource limits.
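To make the failure mode concrete, here is a minimal, self-contained sketch (not from the original incident) in which one side closes its socket while the other keeps writing; on most platforms a later write then fails with an IOException whose message is "Broken pipe" or "Connection reset":
import java.io.IOException;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class BrokenPipeDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("localhost", server.getLocalPort());
            Socket accepted = server.accept();
            client.close();                         // peer closes the connection
            OutputStream out = accepted.getOutputStream();
            try {
                for (int i = 0; i < 100; i++) {     // the first write may only trigger an RST;
                    out.write(new byte[8192]);      // a subsequent write surfaces the error
                    out.flush();
                }
            } catch (IOException e) {
                System.out.println(e.getMessage()); // e.g. "Broken pipe"
            }
        }
    }
}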
Elimination Process
Alternative devices succeeded, and the same device worked in a pilot environment, suggesting the issue lay in the production system rather than the hardware.
Log‑Level Insight
Searching by trace id and ERROR level uncovered a Feign client timeout log, indicating that an upstream microservice timed out and closed the connection while the downstream service continued processing.
“Broken pipe” was thrown when the downstream service eventually wrote its response to the connection that the upstream caller had already closed.
Why Did Feign Time Out?
Performance monitoring with SkyWalking showed that the downstream service took about 20 seconds, exceeding the Feign timeout setting of 10 seconds.
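For context, Feign's read timeout is configurable. Below is a minimal sketch of where a 10-second limit typically lives in a Spring Cloud OpenFeign setup; the bean style and constructor are assumptions about the project's Feign version, not the article's actual configuration:
import java.util.concurrent.TimeUnit;
import feign.Request;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FeignTimeoutConfig {
    // connect timeout 5 s, read timeout 10 s: a response that takes longer
    // than 10 s makes Feign abort and close the connection, which the
    // still-running downstream service later sees as "Broken pipe"
    @Bean
    public Request.Options feignOptions() {
        return new Request.Options(5, TimeUnit.SECONDS, 10, TimeUnit.SECONDS, true);
    }
}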
Stage S2 Problem – Heavy SQL
The slow part was an UPDATE statement that modified a large number of rows (≈17,000), adding an 8.5-second delay:
update table_c c set c.del_flag = 1 where c.p = ?
Adding a condition to exclude rows that are already soft-deleted fixes the issue:
update table_c c set c.del_flag = 1 where c.p = ? and c.del_flag = 0
With the extra filter, the statement touches only the rows whose flag actually changes, instead of rewriting all ≈17,000 rows on every call.
Stage S1 Problem – Redundant Calls
SkyWalking also revealed 133 repeated calls to microservice C at ~80 ms each, adding roughly another 10 seconds (133 × 80 ms ≈ 10.6 s). The calls came from a stream that invoked the interface once per element:
list.stream().map(vo -> {
    // call interface C – one remote call per element
    return clientC.query(vo); // clientC is a placeholder for the Feign client
}).collect(Collectors.toList());
Consolidating identical calls or caching results eliminates the unnecessary overhead, as sketched below.
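A possible shape of the fix, assuming the per-element call can be keyed and memoized; queryOncePerKey and the remote-call lambda are illustrative stand-ins, not the original code:
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DedupCalls {
    // remoteCall stands in for the Feign call to microservice C (hypothetical)
    static List<String> queryOncePerKey(List<String> keys, Function<String, String> remoteCall) {
        Map<String, String> cache = new HashMap<>();
        return keys.stream() // sequential stream, so mutating the cache is safe
                .map(key -> cache.computeIfAbsent(key, remoteCall)) // reuse cached result
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "a", "c", "b");
        // five elements but only three distinct keys -> three remote calls
        System.out.println(queryOncePerKey(keys, k -> "result-for-" + k));
    }
}
If microservice C also offers a batch endpoint, a single call with the distinct keys would cut the overhead even further.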
Root Cause Summary
The production halt was caused by a Feign client timeout (the upstream service closed the connection) combined with two performance bottlenecks in the downstream service: an unfiltered bulk UPDATE (~8.5 s) and 133 redundant microservice calls (~10.6 s), which together account for essentially all of the observed ~20-second response time. After applying the SQL filter and deduplicating the calls, the device resumed normal production.
Key Takeaways
Analyze both business and system logs; include ERROR level entries.
Use monitoring tools (Kibana, SkyWalking) to pinpoint latency hotspots.
Ensure SQL statements have precise WHERE clauses to avoid massive data scans.
Reduce redundant microservice invocations via caching or call consolidation.
When searching logs, combine trace id with log level for efficient troubleshooting.
Tools Used
Kibana: Log aggregation and search.
SkyWalking: Distributed tracing and performance monitoring.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.