Operations 9 min read

How DRBD Can Save Your Production Data from Disasters

This article explains why most companies suffer long recovery times after data loss, introduces DRBD's real‑time block replication as a solution, and provides detailed architecture designs, deployment steps, monitoring scripts, performance tuning, cost analysis, common pitfalls, and future trends for reliable disaster recovery.

Ops Community

DRBD: Real‑time Data Replication to Prevent Disasters

Statistics show that 78% of companies need more than 24 hours to recover from data loss and 43% never fully recover. Traditional backups suffer from long backup windows, an unpredictable recovery time objective (RTO), a recovery point objective (RPO) measured in hours, and high cost.

Why DRBD?

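DRBD (Distributed Replicated Block Device) mirrors a block device over the network between two servers. In synchronous mode (protocol C) a write is acknowledged only after it has reached the disks of both nodes, so the standby can take over within seconds without losing committed data.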

Key Advantages

RTO: < 30 seconds vs 2‑24 hours for traditional backup

RPO: near‑zero vs hour‑level

Automatic failover when paired with a cluster manager such as Pacemaker or keepalived

Low cost: ordinary servers instead of expensive storage arrays

Enterprise DRBD Architecture

Primary‑Secondary Hot Standby

      Application servers
              ↓
      VIP: 192.168.1.100
              ↓
┌─────────────────┐   DRBD sync   ┌─────────────────┐
│    Node A       │ ←───────────→ │    Node B       │
│  192.168.1.10   │ dedicated net │  192.168.1.11   │
│    Primary      │               │   Secondary     │
└─────────────────┘               └─────────────────┘

Dual‑Primary Cluster

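    Load balancer
    ┌─────────────┐
    │     LB      │
    └─────────────┘
      ↙           ↘
┌─────────┐    ┌─────────┐
│ Node A  │    │ Node B  │
│ Primary │ ←→ │ Primary │
│ Active  │    │ Active  │
└─────────┘    └─────────┘
     ↓              ↓
 Storage A      Storage B

Dual‑primary mode requires a cluster‑aware file system such as GFS2 or OCFS2. A conventional file system like ext4 must never be mounted on both nodes at the same time.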

Deployment Checklist

Hardware

Two identical servers

Dedicated gigabit NIC for replication

Equal‑capacity storage
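
The hardware requirements above can be checked before any DRBD package is installed. The following is a minimal sketch rather than part of the original procedure: the function names are made up for illustration, and the device /dev/sdb1, the interface eth1, and SSH access to node2 in the usage comments are assumptions.

```bash
# Preflight helpers for the hardware checklist (illustrative names).

# Succeeds when both backing devices report the same size in bytes.
same_capacity() {
    [ "$1" -eq "$2" ]
}

# Succeeds when the link speed, given in Mb/s, is at least gigabit.
is_gigabit() {
    [ "$1" -ge 1000 ]
}

# Example usage on node1 (device, interface, and peer name are assumptions):
#   local_size=$(blockdev --getsize64 /dev/sdb1)
#   peer_size=$(ssh node2 blockdev --getsize64 /dev/sdb1)
#   same_capacity "$local_size" "$peer_size" || echo "backing devices differ in size"
#   is_gigabit "$(cat /sys/class/net/eth1/speed)" || echo "replication NIC is below 1 Gb/s"
```

Keeping the checks as small functions that take plain numbers makes them easy to run against values collected from both nodes.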

Software

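# CentOS 7/8: install the DRBD kernel module and tools
# (both packages come from the ELRepo repository, which must be enabled first)
yum install drbd90-utils kmod-drbd90 -y
# Ubuntu 18.04+
# apt-get install drbd-utils -y

# Verify installation
modprobe drbd
lsmod | grep drbd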

Configuration File (/etc/drbd.d/data.res)

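resource data {
  net {
    protocol C;                      # synchronous replication
    fencing resource-only;           # in DRBD 9 this option lives in the net section; needs a fence-peer handler
    after-sb-0pri discard-younger-primary;
    after-sb-1pri discard-secondary;
    after-sb-2pri call-pri-lost-after-sb;
  }
  disk {
    on-io-error detach;              # drop the failing backing disk and keep running diskless
  }
  on node1 {                         # host names must match the output of `uname -n`
    device /dev/drbd0;
    disk /dev/sdb1;
    address 192.168.1.10:7789;
    meta-disk internal;
  }
  on node2 {
    device /dev/drbd0;
    disk /dev/sdb1;
    address 192.168.1.11:7789;
    meta-disk internal;
  }
}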
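
A common reason for drbdadm refusing to bring a resource up is that the local host name does not match any "on" section in the resource file. The sketch below checks for that. The helper names are hypothetical, and the parsing assumes each section opens with "on <name> {" on its own line, as in the file above.

```bash
# Print the host names declared in "on <name> {" sections of a DRBD resource file.
resource_hosts() {
    sed -n 's/^[[:space:]]*on[[:space:]]\{1,\}\([^[:space:]{]\{1,\}\).*/\1/p' "$1"
}

# Succeeds when host name $1 is declared in resource file $2.
host_in_resource() {
    resource_hosts "$2" | grep -qxF -- "$1"
}

# Example usage:
#   host_in_resource "$(uname -n)" /etc/drbd.d/data.res \
#       || echo "this host is not listed in data.res"
#   drbdadm dump data    # let drbdadm parse the file and print the result
```

Running `drbdadm dump data` afterwards is a useful second check, since it reports syntax errors in the file itself.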

One‑click Deployment Script

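#!/bin/bash
# DRBD automated deployment; run on both nodes
set -euo pipefail

echo "🚀 Starting DRBD deployment..."
drbdadm create-md data        # initialize DRBD metadata on the backing device
systemctl enable drbd
systemctl start drbd
if [ "$(hostname)" = "node1" ]; then
    # --force is for the very first promotion only: it makes this node the initial sync source
    drbdadm primary --force data
    mkfs.ext4 /dev/drbd0
    mkdir -p /data
    mount /dev/drbd0 /data
    echo "✅ Primary node configured"
else
    echo "✅ Standby node configured"
fi
drbdadm status data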
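
The deployment script returns as soon as the initial synchronization has started, but the standby cannot take over safely until that synchronization has finished. Below is a minimal sketch of a wait loop. It assumes `drbdadm dstate` prints the local and peer disk states separated by a slash (for example UpToDate/UpToDate), and the function names and the 10-second polling interval are illustrative choices.

```bash
# Succeeds when every "/"-separated disk state in $1 is UpToDate.
is_fully_synced() {
    [ -n "$1" ] || return 1
    for state in $(printf '%s' "$1" | tr '/' ' '); do
        [ "$state" = "UpToDate" ] || return 1
    done
    return 0
}

# Poll resource $1 until it is in sync or $2 seconds have passed.
wait_for_sync() {
    resource=$1
    timeout=$2
    waited=0
    until is_fully_synced "$(drbdadm dstate "$resource")"; do
        [ "$waited" -lt "$timeout" ] || return 1
        sleep 10
        waited=$((waited + 10))
    done
    return 0
}

# Example usage: wait up to one hour before declaring the cluster ready.
#   wait_for_sync data 3600 && echo "initial sync complete"
```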

Monitoring & Alerting

Basic Monitoring Script

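#!/bin/bash
# Basic health check for the DRBD resource "data"
CONN_STATE=$(drbdadm cstate data)
if [ "$CONN_STATE" != "Connected" ]; then
    echo "🚨 DRBD connection problem: $CONN_STATE"
fi
# DRBD 9 reports resync progress as "done:NN.NN" in the status output;
# /proc/drbd carried a percentage only up to DRBD 8.4
SYNC_STATUS=$(drbdadm status data | grep -o 'done:[0-9.]*')
if [ -n "$SYNC_STATUS" ]; then
    echo "📊 Sync progress: ${SYNC_STATUS#done:}%"
fi
DISK_STATE=$(drbdadm dstate data)
echo "💾 Disk state: $DISK_STATE"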

Prometheus Exporter

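# prometheus.yml snippet (goes under scrape_configs)
# Port 9100 is node_exporter; its drbd collector is disabled by default (enable with --collector.drbd)
- job_name: 'drbd'
  metrics_path: /metrics
  scrape_interval: 30s
  static_configs:
    - targets: ['192.168.1.10:9100','192.168.1.11:9100']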
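
One way to get DRBD state into the scrape job above is node_exporter's textfile collector, which exposes any *.prom file found in the directory given by --collector.textfile.directory. The sketch below formats two gauges in the Prometheus text format. The metric names, the function name, and the output directory in the usage comment are assumptions made for illustration, not names defined by DRBD or node_exporter.

```bash
# Emit DRBD state as Prometheus text-format gauges.
# $1: resource name, $2: connection state, $3: disk states as "local/peer".
drbd_metrics() {
    connected=0
    if [ "$2" = "Connected" ]; then connected=1; fi
    uptodate=0
    if [ "${3%%/*}" = "UpToDate" ]; then uptodate=1; fi
    printf 'drbd_connected{resource="%s"} %d\n' "$1" "$connected"
    printf 'drbd_local_disk_uptodate{resource="%s"} %d\n' "$1" "$uptodate"
}

# Example usage from cron (the directory is an assumption):
#   drbd_metrics data "$(drbdadm cstate data)" "$(drbdadm dstate data)" \
#       > /var/lib/node_exporter/textfile/drbd.prom
```

An alert rule can then fire when either gauge stays at 0 for longer than the scrape interval.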

Performance Tuning

Network Buffer

echo 'net.core.rmem_max = 67108864' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 67108864' >> /etc/sysctl.conf
sysctl -p              # load the new 64 MiB limits without a reboot
drbdadm adjust data    # re-apply the DRBD resource configuration
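
A rough way to judge whether a socket buffer limit is large enough is the bandwidth-delay product of the replication link: the number of bytes that can be in flight at once. The helper below is a sketch with a hypothetical name. It takes the link speed in Mb/s and the round-trip time in milliseconds, both of which must be measured on the actual network.

```bash
# Bandwidth-delay product in bytes.
# $1: link speed in Mb/s, $2: round-trip time in ms.
# bytes = ($1 * 1,000,000 / 8) * ($2 / 1000) = $1 * $2 * 125
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

# Example usage:
#   bdp_bytes 1000 1     # gigabit link, 1 ms round trip -> 125000
```

For a gigabit link with a 1 ms round trip the product is 125000 bytes, so the 64 MiB ceiling set above leaves ample headroom. Long-distance links with higher latency need proportionally more.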

Sync Rate

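# Adjust sync speed to match bandwidth (DRBD 8.4/9 syntax)
# Fixed rate: disable the dynamic controller and resync at 100 MB/s
drbdadm disk-options --c-plan-ahead=0 --resync-rate=100M data
# Alternatively, keep the dynamic controller and cap it at 150 MB/s:
# drbdadm disk-options --c-max-rate=150M data
# Runtime changes are temporary; `drbdadm adjust data` restores the configured values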
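
The DRBD User's Guide suggests, as a rule of thumb, giving resynchronization about 30% of the available replication bandwidth so that background resync does not starve live replication. The sketch below applies that rule and estimates how long a full resync would take. The function names are illustrative and the arithmetic is integer-only.

```bash
# Suggested resync rate in MB/s: about 30% of the usable bandwidth ($1, in MB/s).
suggested_resync_rate() {
    echo $(( $1 * 30 / 100 ))
}

# Estimated full resync time in seconds.
# $1: device size in MB, $2: resync rate in MB/s.
resync_seconds() {
    echo $(( $1 / $2 ))
}

# Example usage:
#   suggested_resync_rate 110    # gigabit link at roughly 110 MB/s -> 33
#   resync_seconds 3300 33       # 3300 MB at 33 MB/s -> 100
```

A higher rate shortens the window in which the standby is inconsistent, at the price of slower application writes while the resync runs.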

Failure‑Injection Drills

Primary Failover

# On the standby node, after confirming that the old primary is down
drbdadm primary data
mount /dev/drbd0 /data
ip addr add 192.168.1.100/24 dev eth0
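
Promoting a node whose data is stale or still resynchronizing would serve old data to clients. The following sketch wraps the manual steps above in a guard that checks the local disk state first. It assumes `drbdadm dstate` prints "local/peer", and the function names are hypothetical. The resource, mount point, VIP, and interface are the ones used in this article.

```bash
# Succeeds when the local disk state (the text before the first "/") is UpToDate.
can_promote() {
    [ "${1%%/*}" = "UpToDate" ]
}

# Promote, mount, and bring up the VIP, but only when local data is current.
failover() {
    if ! can_promote "$(drbdadm dstate data)"; then
        echo "local disk is not UpToDate, refusing to promote" >&2
        return 1
    fi
    drbdadm primary data &&
        mount /dev/drbd0 /data &&
        ip addr add 192.168.1.100/24 dev eth0
}

# Example usage on the standby node:
#   failover || echo "failover aborted"
```

Chaining the three commands with && stops the sequence at the first failure, so the VIP is never announced for a file system that did not mount.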

Split‑Brain Resolution

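# On the split-brain victim (the node whose changes will be discarded)
drbdadm disconnect data
drbdadm secondary data
drbdadm connect --discard-my-data data

# On the surviving node, only if it is in StandAlone state
drbdadm connect data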
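
Before running the recovery commands it helps to confirm that the node really is in a split-brain situation and not just disconnected. A minimal sketch follows, assuming the node has dropped to the StandAlone connection state and that the kernel log contains the "Split-Brain detected" message that DRBD is documented to write. The function name is hypothetical.

```bash
# Succeeds when the connection state ($1) is StandAlone and the log text
# on stdin contains DRBD's split-brain message.
needs_split_brain_recovery() {
    [ "$1" = "StandAlone" ] && grep -q 'Split-Brain detected'
}

# Example usage:
#   dmesg | needs_split_brain_recovery "$(drbdadm cstate data)" \
#       && echo "split brain: choose a victim and recover manually"
```

Choosing which node's changes to discard remains a human decision. The check only tells the operator that the decision is needed.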

Cost‑Benefit Analysis

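DRBD: two servers plus network hardware cost about 100k CNY. With 2 person‑days of initial setup and 0.5 person‑day per month of maintenance, the total comes to roughly 120k CNY per year. Traditional approach: enterprise storage at 300‑500k CNY per year plus backup software at 200‑300k CNY per year, or 500‑800k CNY per year in total. On these figures DRBD cuts costs by roughly 76% to 85%.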
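
The savings figure can be reproduced from the numbers in the paragraph above. The short sketch below uses a hypothetical helper with integer arithmetic that rounds down, and compares the DRBD estimate of 120k CNY per year against the low and high ends of the traditional range (storage plus backup software, 500k to 800k CNY per year).

```bash
# Percentage saved, rounded down.
# $1: yearly DRBD cost, $2: yearly traditional cost (same unit).
savings_pct() {
    echo $(( ($2 - $1) * 100 / $2 ))
}

# Example usage with the article's figures (in thousands of CNY):
#   savings_pct 120 500    # -> 76
#   savings_pct 120 800    # -> 85
```

The saving therefore depends heavily on which traditional setup is used as the baseline.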

Common Pitfalls & Solutions

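Split‑brain: configure fencing through a cluster manager such as Pacemaker (the fence‑peer handler with the crm‑fence‑peer and crm‑unfence‑peer scripts shipped with DRBD), or add a third node as an arbitrator

Slow sync: use a dedicated gigabit network and tune the TCP buffers and the resync rate. Choose the replication protocol deliberately: protocol C is the safest but adds network latency to every write

Long failover: prepare the mount point and service configuration on the standby in advance (the device cannot be mounted while the node is Secondary), use keepalived for VIP failover, and make sure the application reconnects quickly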

Future Trends

Cloud‑native integration with Kubernetes

AI‑driven intelligent operations

Multi‑cloud disaster recovery

Edge‑computing replication for 5G

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
