
When a Massive File Transfer Crashed My K8s Master: A Real‑World Docker Recovery Tale

The author recounts a sudden overload caused by copying hundreds of gigabytes of small files to an Alibaba Cloud NAS, which crashed the master node of a Kubernetes cluster, leading to Docker failures, and describes step‑by‑step troubleshooting, configuration changes, and lessons learned about backups, cautious operations, and calm analysis.

Ops Development Stories
Hello everyone, I am Joker, an ops engineer and cloud‑native enthusiast.

An ops career without incidents is not a complete one; handling failures is part of the job.

Even after many years in ops, I still live with the anxiety that an unexpected failure will upend the daily routine.

Most incidents stem from seemingly reasonable actions; this time the cause was copying several hundred gigabytes of data to Alibaba Cloud NAS via an external mount.

The data consisted of a huge number of small files, which taxed local disk I/O and network I/O at the same time. The load average on the 8-core server climbed past 500, and the master node of a Kubernetes cluster went down with it.
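One way to keep a bulk copy like this from starving the node is to run it at low priority, cap its bandwidth, and hold off while the load is already high. The following is a minimal sketch, not what was run during the incident: the function names, the factor of 2 load per core, and the 20000 KiB/s cap are all assumptions to tune. Note that ionice only has an effect with I/O schedulers that honor it, and it does nothing for the network side of a NAS mount, which is why the rsync bandwidth limit is also set.

```shell
#!/usr/bin/env bash
# Sketch only: names and thresholds below are hypothetical, not from the incident.

# Succeeds (exit 0) when load is greater than cores * factor.
load_exceeds() {
    local load="$1" cores="$2" factor="${3:-2}"
    awk -v l="$load" -v c="$cores" -v f="$factor" 'BEGIN { exit !(l > c * f) }'
}

# Block until the 1-minute load average (Linux /proc/loadavg) drops
# below cores * factor.
wait_for_headroom() {
    local factor="${1:-2}" cores
    cores="$(nproc)"
    while load_exceeds "$(cut -d ' ' -f1 /proc/loadavg)" "$cores" "$factor"; do
        sleep 30
    done
}

# Copy src to dest at low CPU and I/O priority with a bandwidth cap.
# rsync's --bwlimit is expressed in KiB/s by default.
throttled_copy() {
    local src="$1" dest="$2" limit_kib="${3:-20000}"
    wait_for_headroom
    nice -n 19 ionice -c 3 rsync -a --bwlimit="$limit_kib" "$src" "$dest"
}
```

A call might look like throttled_copy /data/export/ /mnt/nas/export/, where both paths are placeholders. For very large trees, copying one subdirectory per call gives wait_for_headroom a chance to pause between batches.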

With the server unresponsive, the only option was a reboot. After the restart, Docker failed to start and reported an Input/Output error on /var/lib/docker/overlays.

The error affected only part of the directory; the overall filesystem remained intact.

To get away from the damaged directory, I pointed Docker’s data‑root at a new location by creating a daemon configuration:

cat > /etc/docker/daemon.json << EOF
{
    "data-root": "/data/docker"
}
EOF
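A heredoc with > replaces whatever /etc/docker/daemon.json held before. The sketch below writes the same content but first keeps a copy of any existing file. The function name set_data_root and the .bak suffix are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: set_data_root is a hypothetical helper, not part of Docker.

# Write a daemon.json containing only data-root, keeping a .bak copy of
# any file that was already there. Like the heredoc above, this replaces
# the whole file; other keys survive only in the backup.
set_data_root() {
    local config="$1" data_root="$2"
    if [ -f "$config" ]; then
        cp -p "$config" "$config.bak"
    fi
    mkdir -p "$(dirname "$config")"
    cat > "$config" << EOF
{
    "data-root": "$data_root"
}
EOF
}
```

After set_data_root /etc/docker/daemon.json /data/docker and a restart of the docker service, docker info --format '{{.DockerRootDir}}' prints the directory the daemon is actually using, which is a quick way to confirm that the setting took effect.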

Docker started, but remained unusable. (The original post showed the resulting error in a screenshot.)

Further investigation revealed that docker-options.conf also set a data‑root. After modifying that file and removing the previous /etc/docker/daemon.json, Docker started normally.
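When the data directory can be set in more than one place, it helps to list every file that mentions it before editing any of them. The following is a rough sketch: the function name is hypothetical, and the directories in the usage note are common locations rather than the ones confirmed in this incident. The pattern also matches --graph, the older flag for the same setting.

```shell
#!/usr/bin/env bash
# Sketch only: find_data_root_settings is a hypothetical helper.

# Print file:line:text for every place under the given paths that
# configures Docker's data directory, via daemon.json ("data-root"),
# a command-line flag (--data-root), or the legacy --graph flag.
# Exits non-zero when nothing matches.
find_data_root_settings() {
    grep -rnE -e 'data-root|--graph' "$@" 2>/dev/null
}
```

For example, find_data_root_settings /etc/docker /etc/systemd/system/docker.service.d searches the daemon configuration and any systemd drop-ins, assuming those are where the settings live on the host. On systemd hosts, systemctl cat docker also prints the unit file together with its drop-ins.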

With Docker functional, I restored Etcd from the latest backup, then brought up apiserver, controller‑manager, and other control‑plane components, returning the cluster to normal operation.
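Restoring Etcd from a snapshot can be sketched as follows. This assumes a v3 snapshot file and a single-member cluster; the function name and the paths are hypothetical, and the exact commands used during this recovery are not recorded here. Newer etcd releases move the restore subcommand to etcdutl, so check which tool your version ships.

```shell
#!/usr/bin/env bash
# Sketch only: restore_etcd_snapshot is a hypothetical wrapper.

# Restore a v3 snapshot into a fresh data directory. Refuses to run if
# the snapshot is missing or empty, or if the target already exists.
restore_etcd_snapshot() {
    local snapshot="$1" data_dir="$2"
    if [ ! -s "$snapshot" ]; then
        echo "snapshot missing or empty: $snapshot" >&2
        return 1
    fi
    if [ -e "$data_dir" ]; then
        echo "refusing to overwrite existing data dir: $data_dir" >&2
        return 1
    fi
    # Multi-member clusters also need --name, --initial-cluster and
    # --initial-advertise-peer-urls for each member.
    ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" --data-dir "$data_dir"
}
```

A call such as restore_etcd_snapshot /backup/etcd-snapshot.db /var/lib/etcd-restored (both paths are placeholders) leaves the old data directory untouched. Etcd then has to be pointed at the new directory before it and the rest of the control plane are started.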

The problem’s cause and solution were both unexpected.

Key takeaways: 1) Keep reliable backups; 2) Operate cautiously; 3) Remain calm and analyze thoroughly during incidents.

Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
