From Midnight Outage to Zero Downtime: Mastering NFS High‑Availability
This article recounts a critical NFS failure that cost hundreds of thousands of CNY, then walks through practical high‑availability designs (Keepalived + DRBD, GlusterFS migration, and cloud‑native CSI storage) while sharing real‑world pitfalls, monitoring strategies, and forward‑looking recommendations for resilient file‑system operations.
Introduction: The midnight NFS outage
At 11 pm on a Friday night, every web server lost access to user‑uploaded files: the core NFS server had suffered a disk failure, and there was no failover mechanism. Restoring service took four hours and cost over 500,000 CNY, a textbook illustration of the single point of failure lurking in many infrastructures.
Why NFS both loves and haunts ops engineers
NFS (Network File System) acts as a shared drive between servers, supporting static assets, log collection, and configuration sharing in micro‑service architectures. Its drawbacks include single‑point‑of‑failure risk, performance bottlenecks under heavy concurrency, network dependency, and complex lock handling for data consistency.
Practical solutions: Building truly highly‑available storage
Solution 1: NFS HA cluster (traditional upgrade)
1. Keepalived + DRBD dual‑node hot standby
# /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mypassword
    }
    virtual_ipaddress {
        192.168.1.100  # VIP address
    }
    notify_master /etc/keepalived/scripts/nfs_master.sh
    notify_backup /etc/keepalived/scripts/nfs_backup.sh
}
DRBD provides block‑level data sync, while Keepalived handles IP failover, keeping recovery time under 30 seconds.
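The notify scripts referenced above do the actual role switch. A minimal sketch of nfs_master.sh, assuming a DRBD resource named r0 backing /data and a standard nfs-server systemd unit (all names are illustrative):
#!/bin/bash
# nfs_master.sh - run when Keepalived promotes this node to MASTER
drbdadm primary r0           # take over the replicated block device
mount /dev/drbd0 /data       # mount the now-writable DRBD volume
systemctl start nfs-server   # start serving exports
exportfs -ra                 # (re)publish everything in /etc/exports
nfs_backup.sh would do the reverse: stop nfs-server, unmount /data, and demote with drbdadm secondary r0.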
2. NFS optimization
# /etc/exports optimization example
/data/nfs 192.168.1.0/24(rw,sync,no_root_squash,no_all_squash,wdelay,rsize=1048576,wsize=1048576)
# Client mount optimization
mount -t nfs -o vers=3,rsize=1048576,wsize=1048576,hard,intr,timeo=14,retrans=2 192.168.1.100:/data/nfs /mnt/nfs
Key parameters:
rsize/wsize=1048576 – enlarge read/write buffers for large‑file throughput.
hard,intr – hard mount for reliability (note that intr has been ignored by the kernel since Linux 2.6.25; it is kept here only for readability).
timeo=14,retrans=2 – a reasonable timeout (1.4 s) and retry policy.
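Servers can silently cap rsize/wsize, so verify what was actually negotiated after mounting:
# Show per-mount NFS options as negotiated with the server
nfsstat -m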
Solution 2: Migrate to a distributed file system (future‑proof)
1. GlusterFS cluster deployment
# Create a 3‑node replicated volume
gluster volume create web-data replica 3 \
server1:/data/brick1/web \
server2:/data/brick1/web \
server3:/data/brick1/web
# Start and tune performance
gluster volume start web-data
gluster volume set web-data performance.cache-size 256MB
gluster volume set web-data performance.write-behind-window-size 1MB
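Before pointing clients at the volume, it's worth confirming that all three bricks are online:
# Inspect volume configuration and brick health
gluster volume info web-data
gluster volume status web-data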
2. Transparent client failover using systemd
# /etc/systemd/system/glusterfs-mount.service
[Unit]
Description=Mount GlusterFS Volume
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
# backup-volfile-servers lets the client fall back to server2/server3 for the
# volume layout if server1 is down, so the mount survives the loss of a node
ExecStart=/bin/mount -t glusterfs -o backup-volfile-servers=server2:server3 server1:/web-data /mnt/web-data
ExecStop=/bin/umount /mnt/web-data
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
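Enable the unit like any other service; an fstab entry with _netdev would also work, but the service wrapper makes the restart policy explicit:
# Load and enable the mount unit
systemctl daemon-reload
systemctl enable --now glusterfs-mount.service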
Solution 3: Hybrid – cloud‑native storage
For Kubernetes environments, use a CSI driver:
# StorageClass for NFS CSI
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-server.example.com
  share: /data/k8s-volumes
reclaimPolicy: Retain
allowVolumeExpansion: true
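Workloads then request space through an ordinary PersistentVolumeClaim. A minimal sketch (the claim name and size are illustrative):
# Claim NFS-backed storage from the nfs-csi class
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: web-uploads
spec:
  accessModes: ["ReadWriteMany"]   # NFS supports shared read-write
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 50Gi
EOF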
Lessons learned: Common pitfalls
Pitfall 1 – Ignoring network tuning leads to “ghost” hangs
Symptoms: the NFS mount becomes inaccessible and df -h hangs, even though the server itself is healthy. Root cause: network latency triggering RPC timeouts.
# Network tuning
echo 'net.core.rmem_default = 262144' >> /etc/sysctl.conf
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_default = 262144' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p  # apply the new values without a reboot

# NFS health check script
#!/bin/bash
check_nfs_health() {
    # A directory listing slower than 10 s counts as unhealthy
    if ! timeout 10 ls /mnt/nfs > /dev/null 2>&1; then
        echo "NFS mount unhealthy, remounting..."
        umount -fl /mnt/nfs    # force + lazy unmount of the dead handle
        mount -t nfs server:/path /mnt/nfs
    fi
}
check_nfs_health
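Run the check on a schedule so a hung mount heals itself without paging anyone; a crontab entry is the simplest option (the script path is illustrative):
# Re-check the mount every 5 minutes
*/5 * * * * /usr/local/bin/check_nfs_health.sh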
Pitfall 2 – File‑lock contention causing performance collapse
When many applications write to the same directory, fcntl locks over NFS become the bottleneck. Mitigations:
Separate storage directories per application.
Use distributed locks (Redis, etcd) instead of file locks; see the sketch after this list.
Introduce an operation queue to serialize high‑frequency writes.
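A minimal sketch of the Redis approach using redis-cli; the key name and TTL are illustrative, and a production version should release the lock atomically (for example, with a small Lua script) rather than with the get/del pair shown here:
#!/bin/bash
# Acquire: SET ... NX EX succeeds only if the key is absent; the 30 s TTL
# guarantees the lock dies with a crashed holder instead of wedging writers.
LOCK_KEY="lock:upload-dir"   # illustrative key name
TOKEN="$(hostname)-$$"       # identifies this holder

if [ "$(redis-cli set "$LOCK_KEY" "$TOKEN" NX EX 30)" = "OK" ]; then
    # ... perform the write that previously relied on an fcntl lock ...
    # Release only if we still hold the lock (not atomic; fine for a sketch)
    [ "$(redis-cli get "$LOCK_KEY")" = "$TOKEN" ] && redis-cli del "$LOCK_KEY"
else
    echo "another writer holds the lock; retry or queue the operation" >&2
fi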
Pitfall 3 – Naïve backup strategy
A plain scheduled rsync of a live export copies files mid‑write and slows dramatically at scale. An improved incremental backup using LVM snapshots:
# Incremental backup script
#!/bin/bash
BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="/backup/nfs"
SNAP_MNT="/mnt/nfs_snap"

# Create an LVM snapshot so we back up a frozen, consistent view
lvcreate -L1G -s -n nfs_snap_$BACKUP_DATE /dev/vg0/nfs_data

# A snapshot is a block device, not a directory - mount it before copying
mkdir -p $SNAP_MNT
mount -o ro /dev/vg0/nfs_snap_$BACKUP_DATE $SNAP_MNT

# Hard-link unchanged files against the previous run for cheap increments
rsync -av --link-dest=$BACKUP_PATH/latest \
    $SNAP_MNT/ \
    $BACKUP_PATH/$BACKUP_DATE/
ln -sfn $BACKUP_PATH/$BACKUP_DATE $BACKUP_PATH/latest

# Clean up the snapshot
umount $SNAP_MNT
lvremove -y /dev/vg0/nfs_snap_$BACKUP_DATE
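Pair the script with a retention policy so the backup volume doesn't quietly fill up (the 14‑day window is illustrative):
# Drop dated backup directories older than 14 days
find /backup/nfs -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +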
Trend outlook: Storage evolution in the cloud‑native era
1. Rise of object storage
For cloud‑native workloads especially, object stores such as MinIO and Ceph are steadily displacing traditional NFS.
# MinIO cluster on Kubernetes (StatefulSet example)
apiVersion: apps/v1    # StatefulSet lives in apps/v1, not v1
kind: StatefulSet
metadata:
  name: minio
spec:
  serviceName: minio   # requires a matching headless Service named "minio"
  replicas: 4
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
      - name: minio
        image: minio/minio:latest
        command: ["/bin/bash", "-c"]
        args:
        - minio server http://minio-{0...3}.minio.default.svc.cluster.local/data --console-address ":9001"
        # production use also needs volumeClaimTemplates backing /data
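Once the pods are up, MinIO's mc client can manage the cluster; the endpoint and credentials below are placeholders:
# Register the deployment with mc and create a bucket for web assets
mc alias set k8s-minio http://minio.default.svc.cluster.local:9000 ACCESS_KEY SECRET_KEY
mc mb k8s-minio/web-assets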
2. Edge computing challenges
Multi‑region data sync.
Optimization for weak networks.
Autonomous edge nodes.
3. AI/ML workload requirements
Efficient transfer of massive files.
Parallel reads across multiple GPUs.
Versioned training data.
Forward‑looking advice
Follow the evolution of the Kubernetes CSI ecosystem.
Learn operational management of object storage.
Master multi‑cloud data governance.
Monitoring & alerts: leave problems nowhere to hide
Building a complete monitoring system is the foundation of high availability.
# Key metrics a Prometheus NFS exporter should expose
nfs_operations_total{type="read"}
nfs_operations_total{type="write"}
nfs_response_time_seconds
nfs_client_connections_total
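If no exporter is deployed yet, a small probe can publish a basic health metric through node_exporter's textfile collector; the directory and metric name below are assumptions:
#!/bin/bash
# Expose NFS mount health for node_exporter (started with
# --collector.textfile.directory=/var/lib/node_exporter)
DIR=/var/lib/node_exporter
if timeout 5 stat /mnt/nfs > /dev/null 2>&1; then VALUE=1; else VALUE=0; fi
# Write to a temp file and rename so Prometheus never reads a partial file
echo "nfs_mount_up{mount=\"/mnt/nfs\"} $VALUE" > $DIR/nfs_health.prom.$$
mv $DIR/nfs_health.prom.$$ $DIR/nfs_health.prom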
Core alert rule
groups:
- name: nfs.rules
  rules:
  - alert: NFSHighLatency
    expr: nfs_response_time_seconds > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "NFS response latency is too high"
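Validate rule files before loading them so a syntax error doesn't silently disable alerting (the path is illustrative):
# Check alerting rule syntax offline
promtool check rules /etc/prometheus/rules/nfs.yml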
Conclusion: From reactive firefighting to proactive protection
Had any of the high‑availability designs above been in place during that midnight outage, the incident would never have happened. File‑system stability directly determines business continuity; whether you upgrade traditional NFS or move to distributed storage, the essentials are the same: plan ahead, optimize continuously, monitor proactively, and embrace change.