
From Midnight Outage to Zero Downtime: Mastering NFS High‑Availability

This article recounts a critical NFS failure and the four-hour outage it caused, then walks through practical high-availability designs (Keepalived + DRBD, GlusterFS migration, and cloud-native CSI storage) while sharing real-world pitfalls, monitoring strategies, and forward-looking recommendations for resilient file-system operations.

Ops Community

Introduction: The midnight NFS outage

At 11 pm on a Friday, every web server lost access to user-uploaded files: the core NFS server had suffered a disk failure, and there was no failover mechanism. Restoring service took four hours and cost over 500,000 CNY in lost business. The root cause was a risk common to many infrastructures: a single point of failure.

Why NFS both loves and haunts ops engineers

NFS (Network File System) acts as a shared drive between servers, supporting static assets, log collection, and configuration sharing in micro‑service architectures. Its drawbacks include single‑point‑of‑failure risk, performance bottlenecks under heavy concurrency, network dependency, and complex lock handling for data consistency.
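For context, a classic single-server setup is only a few lines of configuration, which is exactly why it spreads so easily before anyone thinks about failover. A minimal sketch (hostnames and paths here are placeholders, not from any particular environment):

```
# /etc/exports on the (single) NFS server
/data/shared 192.168.1.0/24(rw,sync,no_root_squash)

# /etc/fstab on each web server; _netdev delays the mount until networking is up
nfs-server:/data/shared  /mnt/shared  nfs  defaults,_netdev  0  0
```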

Practical solutions: Building truly highly‑available storage

Solution 1: NFS HA cluster (traditional upgrade)

1. Keepalived + DRBD dual‑node hot standby

# /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mypassword
    }
    virtual_ipaddress {
        192.168.1.100  # VIP address
    }
    notify_master /etc/keepalived/scripts/nfs_master.sh
    notify_backup /etc/keepalived/scripts/nfs_backup.sh
}

DRBD provides block‑level data sync, while Keepalived handles IP failover, keeping recovery time under 30 seconds.
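The keepalived.conf above calls nfs_master.sh on promotion, but the script itself is not shown. Below is a hedged sketch of what such a hook typically does; the resource name nfs_data, the device /dev/drbd0, and the mount point are assumptions, and the script defaults to a dry run that echoes each command so the steps can be inspected safely (set RUN= to empty on a real node):

```shell
#!/bin/sh
# Hypothetical notify_master hook for Keepalived + DRBD failover.
# RUN=echo makes this a dry run; set RUN= (empty) on a real node.
RUN=${RUN:-echo}

promote_and_serve() {
    $RUN drbdadm primary nfs_data      # take over the replicated resource
    $RUN mount /dev/drbd0 /data/nfs    # mount the now-primary block device
    $RUN systemctl start nfs-server    # start serving from this node
    $RUN exportfs -ra                  # re-publish /etc/exports
}

promote_and_serve
```

A matching nfs_backup.sh would perform the steps in reverse order (stop nfs-server, unmount, demote to secondary).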

2. NFS optimization

# /etc/exports optimization example
/data/nfs 192.168.1.0/24(rw,sync,no_root_squash,no_all_squash,wdelay,rsize=1048576,wsize=1048576)

# Client mount optimization
mount -t nfs -o vers=3,rsize=1048576,wsize=1048576,hard,intr,timeo=14,retrans=2 192.168.1.100:/data/nfs /mnt/nfs

Key parameters:
rsize/wsize=1048576 – larger read/write buffers improve large-file throughput.
hard – hard mount: operations retry indefinitely instead of returning I/O errors to the application. (The intr option shown above is still accepted for compatibility but has been a no-op since Linux 2.6.25; a hung NFS operation can always be interrupted with SIGKILL.)
timeo=14,retrans=2 – a moderate timeout and retry policy.
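To make these options survive reboots, the same mount belongs in /etc/fstab, pointed at the Keepalived VIP from the examples above; a sketch:

```
# /etc/fstab entry using the Keepalived VIP
192.168.1.100:/data/nfs  /mnt/nfs  nfs  vers=3,rsize=1048576,wsize=1048576,hard,timeo=14,retrans=2,_netdev  0  0
```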

Solution 2: Migrate to a distributed file system (future‑proof)

1. GlusterFS cluster deployment

# Create a 3‑node replicated volume
gluster volume create web-data replica 3 \
    server1:/data/brick1/web \
    server2:/data/brick1/web \
    server3:/data/brick1/web

# Start and tune performance
gluster volume start web-data
gluster volume set web-data performance.cache-size 256MB
gluster volume set web-data performance.write-behind-window-size 1MB

2. Transparent client failover using systemd

# /etc/systemd/system/glusterfs-mount.service
[Unit]
Description=Mount GlusterFS Volume
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
ExecStart=/bin/mount -t glusterfs server1:web-data /mnt/web-data
ExecStop=/bin/umount /mnt/web-data
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
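One caveat about the unit above: it fetches the volume layout only from server1, so a client (re)mounting while server1 is down will fail even though the data itself is replicated. GlusterFS clients accept backup volfile servers for exactly this case; a hedged fragment (the option name may vary slightly across GlusterFS versions, so verify against your release):

```
# Drop-in change to the ExecStart line above
ExecStart=/bin/mount -t glusterfs \
    -o backup-volfile-servers=server2:server3 \
    server1:/web-data /mnt/web-data
```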

Solution 3: Hybrid – cloud‑native storage

For Kubernetes environments, use a CSI driver:

# StorageClass for NFS CSI
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-server.example.com
  share: /data/k8s-volumes
reclaimPolicy: Retain
allowVolumeExpansion: true
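A workload then requests storage from this class through an ordinary PersistentVolumeClaim; a minimal sketch (the claim name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: web-uploads
spec:
  accessModes:
    - ReadWriteMany        # NFS supports shared read-write across pods
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 10Gi
```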

Lessons learned: Common pitfalls

Pitfall 1 – Ignoring network tuning leads to “ghost” hangs

Symptoms: NFS mount becomes inaccessible; df -h hangs while the server is fine. Root cause: network latency causing RPC timeout.

# Network tuning
echo 'net.core.rmem_default = 262144' >> /etc/sysctl.conf
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_default = 262144' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p  # apply immediately without a reboot

#!/bin/bash
# NFS health check script
check_nfs_health() {
    if ! timeout 10 ls /mnt/nfs > /dev/null 2>&1; then
        echo "NFS mount unhealthy, remounting..."
        umount -fl /mnt/nfs
        mount -t nfs server:/path /mnt/nfs
    fi
}

check_nfs_health
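The check only helps if something runs it on a schedule; a hedged crontab entry (the script path is an assumption):

```
# Run the NFS health check every minute and keep a log of remount attempts
* * * * * /usr/local/bin/check_nfs_health.sh >> /var/log/nfs_health.log 2>&1
```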

Pitfall 2 – File‑lock contention causing performance collapse

When many applications write to the same directory, fcntl locks become a bottleneck.

Separate storage directories per application.

Use distributed locks (Redis, etcd) instead of file locks.

Introduce an operation queue to serialize high‑frequency writes.
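The third mitigation can be surprisingly simple. Below is a hedged sketch of a spool-directory write queue: producers drop job files, and a single worker applies them oldest-first, so concurrent writers never touch the shared file at the same time (the directory layout and two-line job format are assumptions for illustration):

```shell
#!/bin/sh
# Minimal write-queue sketch: one worker serializes all appends.
SPOOL=${SPOOL:-/tmp/nfs-write-spool}
mkdir -p "$SPOOL"

enqueue() {   # enqueue <dest-file> <line-to-append>
    # nanosecond timestamp + PID makes job names unique and time-sortable
    printf '%s\n%s\n' "$1" "$2" > "$SPOOL/$(date +%s%N).$$"
}

drain() {     # the single worker: apply jobs oldest-first
    for job in $(ls "$SPOOL" 2>/dev/null | sort); do
        dest=$(sed -n 1p "$SPOOL/$job")       # line 1: destination file
        sed -n 2p "$SPOOL/$job" >> "$dest"    # line 2: payload to append
        rm -f "$SPOOL/$job"
    done
}
```

In production the drain loop would run continuously (or from a systemd timer) on exactly one node, which is what removes the lock contention.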

Pitfall 3 – Naïve backup strategy

Simple rsync schedules break at scale. Improved incremental backup using LVM snapshots:

#!/bin/bash
# Incremental backup script
BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
SOURCE_PATH="/data/nfs"
BACKUP_PATH="/backup/nfs"

# Create an LVM snapshot (if the data sits on LVM)
lvcreate -L1G -s -n nfs_snap_$BACKUP_DATE /dev/vg0/nfs_data

# Mount the snapshot read-only so the backup sees a consistent view
SNAP_MNT=$(mktemp -d)
mount -o ro /dev/vg0/nfs_snap_$BACKUP_DATE $SNAP_MNT

rsync -av --link-dest=$BACKUP_PATH/latest \
    $SNAP_MNT/ \
    $BACKUP_PATH/$BACKUP_DATE/

ln -sfn $BACKUP_PATH/$BACKUP_DATE $BACKUP_PATH/latest

# Unmount and remove the snapshot
umount $SNAP_MNT && rmdir $SNAP_MNT
lvremove -y /dev/vg0/nfs_snap_$BACKUP_DATE
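Snapshots solve consistency but not disk growth; hard-linked backups still need pruning. A hedged retention sketch matching the paths above (KEEP_DAYS is an assumption, so tune it to your recovery objectives):

```shell
#!/bin/sh
# Hypothetical retention sketch: delete backup directories older than
# KEEP_DAYS under BACKUP_PATH; the 'latest' symlink is left alone.
BACKUP_PATH=${BACKUP_PATH:-/backup/nfs}
KEEP_DAYS=${KEEP_DAYS:-14}

prune_backups() {
    # -type d does not follow symlinks, so the 'latest' link is skipped
    find "$BACKUP_PATH" -mindepth 1 -maxdepth 1 -type d \
        -mtime +"$KEEP_DAYS" -exec rm -rf {} +
}
```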

Trend outlook: Storage evolution in the cloud‑native era

1. Rise of object storage

Traditional NFS is being replaced by MinIO, Ceph, etc., especially for cloud‑native workloads.

# MinIO cluster on Kubernetes (StatefulSet example)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
spec:
  serviceName: minio
  replicas: 4
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
      - name: minio
        image: minio/minio:latest
        command:
        - /bin/bash
        - -c
        args:
        - minio server http://minio-{0...3}.minio.default.svc.cluster.local/data --console-address ":9001"
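The StatefulSet references serviceName: minio, and the per-pod DNS names in the args (minio-0.minio…) only resolve if a matching headless Service exists; a minimal sketch (the app: minio selector is an assumption about the pod template's labels):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minio
spec:
  clusterIP: None        # headless: gives each pod a stable DNS name
  selector:
    app: minio
  ports:
  - port: 9000
    name: api
```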

2. Edge computing challenges

Multi‑region data sync.

Optimization for weak networks.

Autonomous edge nodes.

3. AI/ML workload requirements

Efficient transfer of massive files.

Parallel reads across multiple GPUs.

Versioned training data.

Forward‑looking advice

Follow the evolution of the Kubernetes CSI ecosystem.

Learn operational management of object storage.

Master multi‑cloud data governance.

Monitoring & alerts: Make problems disappear

Building a complete monitoring system is the foundation of high availability.

# Key metrics exposed by an NFS exporter for Prometheus to scrape
nfs_operations_total{type="read"}
nfs_operations_total{type="write"}
nfs_response_time_seconds
nfs_client_connections_total
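If a dedicated exporter isn't available, node_exporter's textfile collector can serve similar metrics. A hedged sketch that formats NFS client counters into Prometheus exposition format; the field offsets in /proc/net/rpc/nfs vary by kernel and protocol version, so verify them locally, and the script falls back to zeros when the file is absent so it stays inspectable anywhere:

```shell
#!/bin/sh
# Hypothetical textfile-collector feeder for node_exporter.
emit_metric() {    # emit_metric <name> <labels> <value>
    printf '%s%s %s\n' "$1" "$2" "$3"
}

nfs_metrics() {
    reads=0; writes=0
    if [ -r /proc/net/rpc/nfs ]; then
        # proc3 line holds NFSv3 op counters; the field offsets below are
        # an assumption -- check your kernel's layout before relying on them
        set -- $(grep '^proc3' /proc/net/rpc/nfs)
        reads=${8:-0}; writes=${9:-0}
    fi
    emit_metric nfs_operations_total '{type="read"}' "$reads"
    emit_metric nfs_operations_total '{type="write"}' "$writes"
}

nfs_metrics
```

Redirect the output to a .prom file in node_exporter's textfile directory from cron to complete the pipeline.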

Core alert rule

groups:
- name: nfs.rules
  rules:
  - alert: NFSHighLatency
    expr: nfs_response_time_seconds > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "NFS response latency is too high"
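Latency alone won't catch a dead exporter or server; a hedged companion rule using Prometheus's built-in up series (the nfs_exporter job label is an assumption that must match your scrape config):

```yaml
  - alert: NFSExporterDown
    expr: up{job="nfs_exporter"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "NFS exporter target is unreachable"
```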

Conclusion: From reactive firefighting to proactive protection

If the high‑availability architecture described had been in place during the midnight outage, the incident would have been avoided. The stability of the file system directly determines business continuity; whether you upgrade traditional NFS or adopt distributed storage, the key steps are planning ahead, continuous optimization, proactive monitoring, and embracing change.

Tags: Monitoring, high availability, Storage, distributed file system, NFS
Written by

Ops Community

A leading IT operations community where professionals share and grow together.
