Debugging a TCP Communication Bug That Stops Device Production at 70% Due to Length Field Misalignment
This article describes a production stall at 70% that occurs only when the configuration name is "rabbit‑TD". It walks through the investigation using server logs and packet captures, traces the root cause to a D5 length field that is split across the server's 1024‑byte read buffer when D4 and D5 arrive merged in one packet, and proposes two concrete fixes for the length handling.
Background
A device registration process communicates with a server over TCP to obtain a configuration file and send a key. Production progress halts at 70% when the configuration name is rabbit‑TD, while rabbit works fine.
Investigation Process
2.1 Check Code
No length restrictions were found on the name field in either client or server code.
2.2 Server Logs
Only D4 stage logs appear; D5 stage logs are missing, suggesting the device does not send D5 data.
2.3 Server Packet Capture
Captured TCP traffic shows the device sends the configuration file (D4) and receives an ACK, but no D5 packet is observed.
2.4 Device Packet Capture
Running tcpdump -i fetho host 192.168.1.253 on the device reveals that the D4 and D5 data are actually merged into a single packet. The device has therefore sent D5 and is waiting for a P6 response.
2.5 Re‑examining Server Packets
Looking at the server capture again confirms that the merged packet contains both D4 and D5 data. The data arrives intact, but the server misinterprets the length fields.
2.6 Packet Analysis
Each stage follows the format: 0x1234abcd, length, type, data. For rabbit, the D4 length is 1011 bytes and the D5 length is 256 bytes, and the D5 length field still falls inside the server's 1024‑byte read buffer. For rabbit‑TD, the D4 length grows to 1014 bytes, three bytes more, which pushes the D5 length field across the 1024‑byte boundary. The server reads only the first byte of the D5 length field, and the next three bytes are then taken as part of the data. The result is an incorrect length value of 65538, which does not match the actual 256‑byte payload and causes a parsing error.
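The boundary arithmetic can be checked with a short sketch. It assumes a 4‑byte length field (one byte read plus three left over) and infers the field offsets from the description: 1023 for rabbit‑TD, since only one byte fits before the boundary, and 1020 for rabbit, which is three bytes shorter. The function names are illustrative and not taken from the original code.

```python
READ_SIZE = 1024         # server's fixed read buffer size, per the article
LENGTH_FIELD_SIZE = 4    # assumed width of the length field


def straddles_read_boundary(field_offset, field_size=LENGTH_FIELD_SIZE,
                            read_size=READ_SIZE):
    """Return True if the field spans two consecutive fixed-size reads."""
    first_read = field_offset // read_size
    last_read = (field_offset + field_size - 1) // read_size
    return first_read != last_read


def bytes_in_first_read(field_offset, field_size=LENGTH_FIELD_SIZE,
                        read_size=READ_SIZE):
    """Return how many bytes of the field land in the read that contains its start."""
    room_left = read_size - (field_offset % read_size)
    return min(field_size, room_left)


# Offsets inferred from the article's description (assumptions, not captured values).
RABBIT_OFFSET = 1020
RABBIT_TD_OFFSET = 1023

for name, offset in (("rabbit", RABBIT_OFFSET), ("rabbit-TD", RABBIT_TD_OFFSET)):
    print(name, straddles_read_boundary(offset), bytes_in_first_read(offset))
```

With these offsets, the rabbit field occupies bytes 1020 to 1023 and stays inside the first read, while the rabbit‑TD field occupies bytes 1023 to 1026 and leaves only its first byte in the first read.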
Root Cause
The server’s fixed 1024‑byte read buffer splits the length field of the D5 stage, leading to a misaligned length calculation and subsequent failure to process the D5 payload.
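The sketch below shows one way a value such as 65538 can arise from this kind of split. It is an illustration under assumptions, not a reconstruction of the server code: the length is assumed to be a 4‑byte big‑endian integer, and the byte that follows the length field is assumed to be 0x02, a hypothetical value chosen so that the arithmetic matches the reported figure.

```python
import struct

# Assumption: the length field is a 4-byte big-endian unsigned integer.
true_length_bytes = struct.pack(">I", 256)      # b"\x00\x00\x01\x00"

# Hypothetical: the first byte after the length field is 0x02.
following_byte = b"\x02"

tail_of_stream = true_length_bytes + following_byte

# The first 1024-byte read ends after one byte of the length field.
end_of_first_read = tail_of_stream[:1]           # b"\x00"
start_of_second_read = tail_of_stream[1:]        # b"\x00\x01\x00\x02"

# A parser that restarts at the second read sees a shifted 4-byte window.
misread_length = struct.unpack(">I", start_of_second_read[:4])[0]

# Joining the leftover byte with the next three bytes restores the real value.
correct_length = struct.unpack(
    ">I", end_of_first_read + start_of_second_read[:3])[0]

print(misread_length, correct_length)
```

Under these assumptions the shifted window decodes as 0x00010002, while the rejoined bytes decode as 0x00000100.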
Solutions
Solution 1
When reading the second length field, combine the previously read single byte with the next three bytes to reconstruct the correct length.
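A more general form of this fix is to keep unparsed bytes between reads, so that any field split by a read boundary is rejoined automatically. The following is a minimal sketch, not the server's actual code. It assumes a header of three 4‑byte big‑endian fields (start marker, length, type) and that the length counts only the data bytes. FrameParser, build_frame, and the type values 4 and 5 are hypothetical names chosen for illustration.

```python
import struct

MAGIC = 0x1234ABCD
HEADER = struct.Struct(">III")   # assumed layout: magic, length, type


class FrameParser:
    """Accumulates bytes across reads and yields only complete frames."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Add the bytes from one read; return a list of (type, data) frames."""
        self._buf += chunk
        frames = []
        while len(self._buf) >= HEADER.size:
            magic, length, frame_type = HEADER.unpack_from(self._buf, 0)
            if magic != MAGIC:
                raise ValueError("unexpected start marker")
            end = HEADER.size + length
            if len(self._buf) < end:
                break                      # wait for more data
            frames.append((frame_type, bytes(self._buf[HEADER.size:end])))
            del self._buf[:end]            # keep any leftover bytes for next time
        return frames


def build_frame(frame_type, data):
    """Serialize one frame using the assumed header layout."""
    return HEADER.pack(MAGIC, len(data), frame_type) + data


# Sizes chosen so the second frame's length field starts at offset 1023
# under the assumed layout; they are not the article's exact numbers.
stream = build_frame(4, b"c" * 1007) + build_frame(5, b"k" * 256)

parser = FrameParser()
received = []
for start in range(0, len(stream), 1024):   # simulate fixed 1024-byte reads
    received.extend(parser.feed(stream[start:start + 1024]))

print([(t, len(d)) for t, d in received])
```

Because TCP delivers a byte stream and a read may end at any offset, buffering in this way handles every split position rather than only the one‑byte case seen here.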
Solution 2
Pad the D4 configuration file so that its total size plus the D5 start marker pushes the length field entirely past the 1024‑byte boundary (e.g., make the D4 content 1015 or 1019 bytes), preventing the split.
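The two example sizes can be derived with a small calculation. This sketch assumes a 4‑byte start marker immediately followed by a 4‑byte length field, with the rabbit‑TD length field starting at offset 1023 as inferred from the analysis. The function name is illustrative.

```python
def padding_to_clear_boundary(field_offset, field_size, read_size=1024):
    """Return the padding bytes needed so the field no longer spans two reads."""
    room_left = read_size - (field_offset % read_size)
    return 0 if room_left >= field_size else room_left


D4_LENGTH = 1014                 # rabbit-TD D4 length, per the article
LENGTH_FIELD_OFFSET = 1023       # inferred: only one byte fits before the boundary
MARKER_OFFSET = LENGTH_FIELD_OFFSET - 4   # assumed 4-byte start marker

# Option A: move only the 4-byte length field past the boundary.
pad_length_only = padding_to_clear_boundary(LENGTH_FIELD_OFFSET, 4)

# Option B: move the start marker and length field (8 bytes) past it together.
pad_marker_and_length = padding_to_clear_boundary(MARKER_OFFSET, 8)

print(D4_LENGTH + pad_length_only, D4_LENGTH + pad_marker_and_length)
```

Under these assumptions the results are 1015 and 1019, matching the two sizes suggested above. Padding depends on the sender's data always arriving in the same layout, so it is best treated as a workaround rather than a general fix.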
Conclusion
The bug is caused by the server’s 1024‑byte read limit cutting the D5 length field; fixing the length reconstruction or adjusting payload sizes resolves the production stall.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.