
Why rsync Fails in Unstable Networks and How to Fix It

A string of QA deployment failures traced back to rsync: an unstable cross-datacenter network, a missing timeout, and zombie server processes combined to break file synchronization. Adding the --timeout option, together with network diagnostics, mitigated the issue.


Background

On a Wednesday afternoon, a QA environment deployment failed; three more failures followed the next day. All of them used rsync, which prompted an immediate investigation.

Problem

rsync client: exception exit

The failure screenshots showed rsync synchronization errors rather than anything Salt-specific. Manually running the same rsync command on the salt-master reproduced the hang, ruling out Salt and salt-api as the cause.

```
# rsync -avz --delete --exclude='.git' --exclude='.svn' rsync://<rsync_srv>:<rsync_port>/path/to/folder /tmp/rsync-test
receiving incremental file list
...
rsync: read error: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(759) [receiver=3.0.6]
rsync: connection unexpectedly closed (99 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [generator=3.0.6]
```

rsync client: zombie process

Later reports showed the progress bar stuck with no visible error. Logs indicated the rsync command had been sent but never returned a result, suggesting the server-side rsync process had become a zombie. Terminating the client manually (Ctrl+C) produced the following exception:

```
# rsync -avzP --delete --exclude='.git' --exclude='.svn' rsync://<rsync_srv>:<rsync_port>/path/to/folder /tmp/rsync-test
Password: 
receiving incremental file list
./
<rsync-test-pkg>-SNAPSHOT.jar
^C
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(551) [generator=3.0.9]
rsync error: received SIGUSR1 (code 19) at main.c(1298) [receiver=3.0.9]
```
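When the server side is suspected of leaving zombie or long-stalled rsync processes, a quick `ps`-based check can confirm it. This is a sketch: the function name, the one-hour threshold, and the field list are assumptions, not part of the original investigation.

```shell
# find_stuck_rsync: print PIDs of rsync processes that are zombies
# (STAT contains 'Z') or have been alive longer than a threshold.
# The 3600-second limit is an illustrative assumption.
find_stuck_rsync() {
    ps -eo pid,stat,etimes,comm | awk -v limit=3600 \
        '$4 == "rsync" && ($2 ~ /Z/ || $3 > limit) { print $1 }'
}
```

Any PID it prints is a candidate for closer inspection (or for killing, if the transfer it belonged to has already been reported as failed).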

Subsequent attempts eventually succeeded after multiple retries.

```
# rsync -avzP --delete --exclude='.git' --exclude='.svn' rsync://<rsync_srv>:<rsync_port>/path/to/folder /tmp/rsync-test
Password: 
receiving incremental file list
./
<rsync-test-pkg>-SNAPSHOT.jar
    60606801 100%   16.13MB/s    0:00:03 (xfer#1, to-check=0/3)
sent 21035 bytes  received 43167123 bytes  5758421.07 bytes/sec
total size is 60607326  speedup is 1.40
```

Reference

Several blog posts report that under highly unstable networks the timeout option in rsyncd.conf alone may not take effect, leaving rsync processes as zombies. rsync's default timeout is 0, meaning no timeout at all. Setting a timeout both in the daemon configuration and via --timeout on the client command line can prevent the issue.
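On the daemon side, the setting lives in rsyncd.conf. A minimal sketch follows; the module name, path, and the 60-second value are illustrative, not taken from the environment described here.

```
# /etc/rsyncd.conf (fragment) -- values are illustrative
timeout = 60        # drop connections idle for more than 60s (default 0 = never)

[app]
    path = /path/to/folder
    read only = yes
    timeout = 60    # can also be overridden per module
```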

Adding --timeout to the client command made the failure exit as expected:

```
# rsync -avzP --timeout=60 --delete --exclude='.git' --exclude='.svn' rsync://<rsync_srv>:<rsync_port>/path/to/folder /tmp/rsync-test
Password: 
receiving incremental file list
./
<rsync-test-pkg>-SNAPSHOT.jar
[receiver] io timeout after 60 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(140) [receiver=3.0.9]
rsync: connection unexpectedly closed (115 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(605) [generator=3.0.9]
```

rsync server: logs

```
2018/10/26 14:40:30 [4228] name lookup failed for <rsync-client>: Name or service not known
2018/10/26 14:40:30 [4228] connect from UNKNOWN (<rsync-client>)
2018/10/26 14:40:30 [4228] rsync on path/to/folder from UNKNOWN (<rsync-client>)
2018/10/26 14:40:30 [4228] building file list
2018/10/26 14:40:35 [4228] rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Connection timed out (110)
2018/10/26 14:40:35 [4228] rsync error: error in rsync protocol data stream (code 12) at io.c(1525) [sender=3.0.6]
```

rsync client: strace

Tracing the client with strace showed no obvious system-level anomalies, but it did confirm the timeout and the signal handling:

<code>lstat("&lt;rsync-test-pkg&gt;-SNAPSHOT.jar", 0x7fff7d6e0000) = -1 ENOENT (No such file or directory)</code><code>select(5, [4], [3], [3], {30, 0}) = 2</code><code>... [receiver] io timeout after 60 seconds -- exiting</code><code>rsync error: timeout in data send/receive (code 30) at io.c(140) [receiver=3.0.9]</code><code>+++ exited with 12 +++</code>

Root cause: network quality

The QA environment and the deployment system are in different data centers, connected via a public VPN. Packet captures showed significant loss, reordering, and retransmissions on the cross‑datacenter link, while captures within the same data center showed clean, ordered TCP streams.
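Before reaching for full packet captures, a lightweight way to quantify loss on such a link is to parse ping's summary line. This is a sketch; the function name, default probe count, and example host are assumptions.

```shell
# loss_pct HOST [COUNT]: print the packet-loss percentage reported by ping.
# Default probe count (20) is an illustrative assumption.
loss_pct() {
    ping -c "${2:-20}" "$1" 2>/dev/null |
        sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example (placeholder host): loss_pct <rsync_srv> 50
```

A consistently non-zero figure on the cross-datacenter host, against a near-zero figure inside the same data center, would line up with what the packet captures showed.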

Over the same link, wget over HTTP transferred the file without problems, suggesting that rsync's protocol is more sensitive to packet loss and reordering than plain HTTP.

```
# wget http://<sitename>/<rsync-test-pkg>-SNAPSHOT.jar
--2018-10-26 10:57:45--  http://<sitename>/<rsync-test-pkg>-SNAPSHOT.jar
Connecting to <sitename>|10.20.51.127|:80... connected.
Length: 60606801 (58M) [application/java-archive]
100%[==================================================================================================================================================================>] 60,606,801  9.81MB/s   in 5.2s
```

Even HTTP traffic showed occasional packet loss, but the higher‑level protocol tolerated it.

Mitigation

Network upgrades (e.g., a dedicated line) are costly, so the immediate workaround is to add --timeout to the rsync command, forcing the client to exit on prolonged stalls instead of hanging. When a failure does occur, restarting the rsync daemon on the server side clears any stuck processes.
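Until the link improves, a thin retry wrapper around the timed-out rsync call keeps deployments moving past transient stalls. This is a sketch: the wrapper name, the attempt count, the sleep interval, and the commented-out rsync arguments (which mirror the commands shown earlier) are all assumptions.

```shell
# retry N CMD...: run CMD up to N times, pausing between failed attempts.
# RETRY_SLEEP (default 5s) controls the pause; both values are illustrative.
retry() {
    max=$1; shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        "$@" && return 0
        echo "attempt $attempt/$max failed" >&2
        attempt=$((attempt + 1))
        [ "$attempt" -le "$max" ] && sleep "${RETRY_SLEEP:-5}"
    done
    return 1
}

# Illustrative usage, paths as in the earlier commands:
# retry 3 rsync -avz --timeout=60 --delete \
#     --exclude='.git' --exclude='.svn' \
#     rsync://<rsync_srv>:<rsync_port>/path/to/folder /tmp/rsync-test
```

Because --timeout guarantees the client eventually exits, the wrapper never waits on a zombie; each attempt either finishes or fails within a bounded time.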

Postscript

For long-term stability the issue remains unresolved: alternative file-sync mechanisms need to be evaluated, and further reading of rsync's source code is required to understand its failure modes on weak networks.

Tags: operations, deployment, DevOps, network troubleshooting, timeout, rsync
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
