LVS Load Balancing: Deep Dive into Four Modes and Step‑by‑Step Deployment
This guide explains the four operating modes of Linux Virtual Server (LVS)—NAT, DR, TUN, and FULLNAT—detailing packet flows, configuration steps, required kernel parameters, health checks, troubleshooting tips, and best‑practice deployment scripts for building a reliable, high‑performance load‑balancing cluster.
Purpose and Audience
The article is written for junior and intermediate operations engineers and backend developers who need to understand and deploy LVS (Linux Virtual Server) in production. It aims to clarify the principles of the four LVS working modes, show how packets travel, and provide a complete, reproducible deployment workflow.
Four LVS Working Modes Overview
NAT – The Director rewrites the destination address of each request (DNAT) and the source address of each reply, so traffic in both directions passes through it. This puts the highest CPU and bandwidth load on the Director. Real Servers must route their replies back through the Director, typically by using it as their default gateway on a shared internal subnet.
DR (Direct Routing) – Only the destination MAC address is rewritten. Real Servers send responses directly to the client, so the Director handles only the inbound direction. This mode offers the best performance and is the usual choice for high-traffic scenarios, provided the Director and Real Servers share a layer-2 segment.
TUN – Encapsulates packets in an IPIP tunnel, allowing the Director to forward traffic across different subnets or VLANs. Both the Director and Real Servers must support the IPIP module.
FULLNAT – Extends NAT by translating both source and destination IPs (double NAT). It enables cross‑VLAN or cross‑subnet deployments but requires kernel patches (lvs‑fullnat) or vendor‑specific patches and is not part of the mainline kernel.
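As a rough illustration of how the modes differ at the rule level, the sketch below prints (but does not run) the ipvsadm commands for one virtual service. The function name ipvs_rules and the fixed weight of 100 are assumptions made for this example; -m, -g, and -i are ipvsadm's forwarding-method flags for NAT, DR, and TUN. FULLNAT is rejected because mainline ipvsadm has no flag for it.

```shell
#!/bin/sh
# Print the ipvsadm commands for one virtual service without executing them.
# Usage: ipvs_rules MODE VIP:PORT RIP:PORT [RIP:PORT ...]
# ipvs_rules is a hypothetical helper, not part of ipvsadm.
ipvs_rules() {
    mode=$1
    vip=$2
    shift 2
    case "$mode" in
        NAT) flag=-m ;;   # masquerading
        DR)  flag=-g ;;   # gatewaying (direct routing)
        TUN) flag=-i ;;   # IPIP encapsulation
        *)   echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
    echo "ipvsadm -A -t $vip -s wlc"
    for rs in "$@"; do
        echo "ipvsadm -a -t $vip -r $rs $flag -w 100"
    done
}

ipvs_rules DR 10.0.0.100:80 10.0.0.11:80 10.0.0.12:80
```

Printing the commands first makes it easy to review them before piping the output to a shell on the Director.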
Mode‑by‑Mode Details
1. NAT Mode
Network topology – The client sends traffic to the VIP on the Director. The Director rewrites the destination to a Real Server's IP and, on the way back, rewrites the reply's source address to the VIP before returning it to the client.
client --> Director (eth0: public IP, VIP 10.0.0.100)
               |
               +---> Real Server 1 (10.0.0.11)
               +---> Real Server 2 (10.0.0.12)
Packet transformation
# Request
src=1.2.3.4:5000 dst=10.0.0.100:80   # as sent by the client
src=1.2.3.4:5000 dst=10.0.0.11:80    # after the Director rewrites the destination (DNAT)
# Response
src=10.0.0.11:80 dst=1.2.3.4:5000    # Real Server reply, routed via the Director
src=10.0.0.100:80 dst=1.2.3.4:5000   # after the Director rewrites the source back to the VIP
Typical scenarios
Few Real Servers (5‑10)
Need to isolate internal and external networks
Temporary test environments where Real Server network changes are undesirable
2. DR Mode
Network topology – The Director holds the VIP on its network interface, and each Real Server holds the same VIP on a loopback alias (lo:0). The Director rewrites only the destination MAC address; the Real Server replies directly to the client, using the VIP as the source address.
client --> Director (eth0, VIP 10.0.0.100)
           Real Server 1 (eth0: 10.0.0.11, lo:0: 10.0.0.100)
           Real Server 2 (eth0: 10.0.0.12, lo:0: 10.0.0.100)
Packet flow
# Request
src=client_ip dst=10.0.0.100:80   # Director changes the destination MAC only
# Response
src=10.0.0.100:80 dst=client_ip   # Real Server replies directly
Key requirement – ARP suppression
# On each Real Server
# arp_ignore=1: answer ARP only for addresses configured on the receiving interface
# arp_announce=2: pick the ARP source address from the outgoing interface's own addresses
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
Typical scenarios
High‑throughput web, API, cache, or MySQL read‑load balancing
Dozens of Real Servers within the same VLAN
When the Director and Real Servers are on the same layer‑2 network
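Since DR mode depends on the four ARP settings shown above, a quick check on each Real Server can catch a missed value. The following is a minimal sketch; the function name check_arp_suppression and its optional directory argument (used only to point the check at a test tree) are assumptions for illustration.

```shell
#!/bin/sh
# Verify arp_ignore=1 and arp_announce=2 for both "all" and "lo".
# The optional first argument overrides the sysctl directory (hypothetical, for testing).
check_arp_suppression() {
    root=${1:-/proc/sys/net/ipv4/conf}
    rc=0
    for iface in all lo; do
        ignore=$(cat "$root/$iface/arp_ignore" 2>/dev/null)
        announce=$(cat "$root/$iface/arp_announce" 2>/dev/null)
        if [ "$ignore" != 1 ] || [ "$announce" != 2 ]; then
            echo "FAIL $iface: arp_ignore=${ignore:-?} arp_announce=${announce:-?}"
            rc=1
        else
            echo "OK   $iface"
        fi
    done
    return $rc
}

check_arp_suppression || echo "ARP suppression is incomplete on this host"
```

A non-zero return code makes the function easy to call from a provisioning script or a monitoring probe.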
3. TUN Mode
Network topology – The Director encapsulates each client packet in IPIP and sends it through the tunnel to the chosen Real Server. The Real Server decapsulates it and replies directly to the client.
client --> Director (VIP 1.1.1.1)
               | IPIP tunnel (IP protocol 4)
               +--> Real Server 1 (VIP 1.1.1.1 on tunl0)
               +--> Real Server 2 (VIP 1.1.1.1 on tunl0)
Packet transformation
# Outer IPIP header
src=Director_DIP dst=RS1_IP
# Inner original packet
src=client_ip dst=VIP:80
# Response after decapsulation
src=VIP:80 dst=client_ip
Typical scenarios
Real Servers located in different subnets, data centers, or regions
When you need to avoid making the Director a bottleneck but still require cross‑subnet traffic
Large numbers of Real Servers where NAT would overload the Director
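Performance note – Encapsulation adds some CPU cost on the Director. The figures quoted for it are roughly 5-10 % on 1 Gbps links and under 1 % on 10 Gbps links; treat them as rough estimates and measure on your own hardware.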
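Besides CPU cost, IPIP leaves less room per packet, because the outer IPv4 header takes 20 bytes. The arithmetic below is a small sketch of the resulting inner MTU and TCP MSS, assuming IPv4 and TCP headers without options; the function name ipip_mss is made up for this example.

```shell
#!/bin/sh
# Effective MTU and TCP MSS inside an IPIP tunnel.
# Assumes IPv4 without options: 20-byte outer header, 20-byte inner IP header, 20-byte TCP header.
ipip_mss() {
    link_mtu=$1
    inner_mtu=$((link_mtu - 20))   # room left after the outer IPv4 header
    mss=$((inner_mtu - 20 - 20))   # minus the inner IPv4 and TCP headers
    echo "inner_mtu=$inner_mtu mss=$mss"
}

ipip_mss 1500   # prints: inner_mtu=1480 mss=1440
```

If clients negotiate a larger MSS than this, full-size packets may be fragmented or dropped on the tunnel path.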
4. FULLNAT Mode
FULLNAT behaves like NAT but also rewrites the source IP, so Real Servers see the Director’s address instead of the client’s. It enables cross‑VLAN or cross‑subnet deployments without requiring the Real Server to have the VIP on a loopback interface. FULLNAT is not part of the upstream kernel; it requires the lvs‑fullnat patch or vendor‑specific binaries (e.g., Alibaba Cloud). It is rarely used for new projects.
Environment Preparation (CentOS 7 / RHEL 7 example)
# Install required packages
yum -y install ipvsadm keepalived
systemctl enable keepalived
# Verify IPVS kernel module
lsmod | grep ip_vs || modprobe ip_vs
# Disable NetworkManager interference (optional but recommended)
systemctl stop NetworkManager
systemctl disable NetworkManager
systemctl restart network
Important sysctl settings (saved in /etc/sysctl.d/99-lvs.conf)
# Core LVS settings
# DR mode: keep forwarding disabled (set to 1 for NAT mode)
net.ipv4.ip_forward = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
# Required for DR and FULLNAT
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_syn_backlog = 65535
net.core.somaxconn = 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_fastopen = 3
Apply the settings:
sysctl -p /etc/sysctl.d/99-lvs.conf
IP Planning
Director (master) – hostname lvs-master, IP 10.0.0.10 (DIP)
Director (backup) – hostname lvs-backup, IP 10.0.0.20
VIP – 10.0.0.100/24 (public address presented to clients)
Real Server 1 – hostname web-01, IP 10.0.0.11
Real Server 2 – hostname web-02, IP 10.0.0.12
DR Mode Full Deployment
Real Server Configuration
# Create lo:0 with /32 mask for the VIP
cat > /etc/sysconfig/network-scripts/ifcfg-lo:0 <<'EOF'
DEVICE=lo:0
IPADDR=10.0.0.100
NETMASK=255.255.255.255
ONBOOT=yes
NAME=loopback
EOF
ifup lo:0
# ARP suppression script (lvs-rs)
cat > /etc/init.d/lvs-rs <<'EOF'
#!/bin/bash
# chkconfig: 2345 90 60
# description: LVS Real Server ARP suppression
VIP=10.0.0.100
case "$1" in
    start)
        echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
        echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
        echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
        echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
        ;;
    stop)
        echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
        echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
        echo 0 > /proc/sys/net/ipv4/conf/lo/arp_ignore
        echo 0 > /proc/sys/net/ipv4/conf/lo/arp_announce
        ;;
    *)
        echo "Usage: $0 {start|stop}"
        exit 1
        ;;
esac
EOF
chmod +x /etc/init.d/lvs-rs
chkconfig --add lvs-rs
service lvs-rs start
Director keepalived Configuration (DR)
# /etc/keepalived/keepalived.conf (master example)
! Configuration File for keepalived
global_defs {
    router_id LVS_MASTER
    notification_email { [email protected] }
    notification_email_from [email protected]
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
}
vrrp_script check_lvs {
    script "/usr/local/bin/check_ipvs.sh"
    interval 3
    weight -20
    fall 3
    rise 2
    timeout 5
}
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass YourStrongPass
    }
    unicast_src_ip 10.0.0.10
    unicast_peer { 10.0.0.20 }
    virtual_ipaddress {
        10.0.0.100/24 dev eth0 label eth0:1
    }
    notify_master "/usr/local/bin/notify.sh master"
    notify_backup "/usr/local/bin/notify.sh backup"
    notify_fault "/usr/local/bin/notify.sh fault"
    track_script { check_lvs }
}
virtual_server 10.0.0.100 80 {
    delay_loop 6
    lb_algo wlc
    lb_kind DR
    persistence_timeout 50
    persistence_granularity 255.255.255.0
    protocol TCP
    sorry_server 127.0.0.1 80
    real_server 10.0.0.11 80 {
        weight 100
        TCP_CHECK {
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 2
            connect_port 80
        }
    }
    real_server 10.0.0.12 80 {
        weight 100
        TCP_CHECK {
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 2
            connect_port 80
        }
    }
}
Validate the configuration:
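# -t (config test) is available in keepalived 2.x; older 1.3.x packages, such as the one shipped with CentOS 7, may not support it
keepalived -t -f /etc/keepalived/keepalived.conf
systemctl reload keepalived
Health-Check Scripts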
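#!/bin/bash
# /usr/local/bin/check_ipvs.sh
# Succeeds only if at least one Real Server entry is present in the IPVS table.
# Packet counters from --stats are not used: in DR mode replies bypass the
# Director, so OutPkts stays at 0 even when the service is healthy.
COUNT=$(ipvsadm -Ln 2>/dev/null | awk '$1 == "->" && $2 ~ /:[0-9]+$/ {n++} END {print n+0}')
if [ "${COUNT:-0}" -lt 1 ]; then
    exit 1
fi
exit 0

#!/bin/bash
# /usr/local/bin/notify.sh
# Called by keepalived on state changes; $1 is master, backup, or fault.
TYPE=$1
SUBJECT="[LVS] $TYPE @ $(hostname) @ $(date +%FT%T)"
echo "$SUBJECT" | mail -s "$SUBJECT" [email protected]
logger -t keepalived-notify "$SUBJECT"
exit 0
NAT Mode Deployment (Key Differences)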
In NAT mode the Real Servers do not configure the VIP. The Director must enable IP forwarding; it rewrites the destination address of requests and the source address of replies.
# Enable forwarding on the Director
echo 1 > /proc/sys/net/ipv4/ip_forward
# Keep the same sysctl file, but set ip_forward=1 for NAT
Adjust the keepalived.conf block:
virtual_server 10.0.0.100 80 {
    ...
    lb_kind NAT
    ...
}
On each Real Server set the default gateway to the Director's internal IP:
# Example on Real Server 1
ip route replace default via 10.0.0.10 dev eth0
TUN Mode Deployment (Key Differences)
# Load IPIP module on Real Server (loading it creates the tunl0 device)
modprobe ipip
echo ipip >> /etc/modules-load.d/tunnel.conf
# Bring up tunl0 and bind the VIP to it; tunl0 already exists, so no "ip tunnel add" is needed
ip link set tunl0 up
ip addr add 10.0.0.100/32 dev tunl0
# Disable rp_filter on the tunnel interface
echo 0 > /proc/sys/net/ipv4/conf/tunl0/rp_filter
In the Director's keepalived.conf, set lb_kind TUN and keep the same VIP definition.
Scheduling Algorithms (lb_algo) and Their Typical Use‑Cases
rr – Simple round-robin, useful for identical servers and testing.
wrr – Weighted round-robin, for servers with different capacities.
lc – Least connections, best for long-lived connections.
wlc – Weighted least connections, the most common choice for high-throughput web services.
sh – Source-IP hash, provides session persistence without enabling persistence_timeout.
dh – Destination-IP hash, useful for cache clusters.
nq – Never queue, suitable for burst traffic.
wlc is the default scheduler in ipvsadm and the usual choice in production.
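To make the wlc rule concrete, here is a simplified sketch of the selection logic: each server's overhead is its active connections times 256 plus its inactive connections, and the server with the lowest overhead per unit of weight is chosen. The function name wlc_pick and the input format are assumptions for this example; the kernel works on live connection counters rather than text input.

```shell
#!/bin/sh
# Pick a server roughly the way wlc does.
# Input lines on stdin: NAME WEIGHT ACTIVE INACTIVE
# overhead = ACTIVE*256 + INACTIVE; lowest overhead/weight wins, weight 0 is skipped.
wlc_pick() {
    awk '
        $2 > 0 {
            score = ($3 * 256 + $4) / $2
            if (best == "" || score < best_score) { best = $1; best_score = score }
        }
        END { if (best != "") print best }
    '
}

# web-01: (40*256+10)/100 = 102.5, web-02: (60*256+5)/200 = 76.825
printf '%s\n' "web-01 100 40 10" "web-02 200 60 5" | wlc_pick   # prints: web-02
```

The example shows why a server with more connections can still be picked when its weight is higher.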
Troubleshooting Cases
Case 1 – DR Mode SYN Backlog Saturation
Check ipvsadm -Lnc for a large number of SYN_RECV entries.
On Real Servers verify netstat -s | grep -i listen for “SYNs to LISTEN sockets dropped”.
Increase kernel parameters:
echo 65535 > /proc/sys/net/ipv4/tcp_max_syn_backlog
echo 65535 > /proc/sys/net/core/somaxconn
Case 2 – NAT Mode Real Server Unreachable
Check ipvsadm -Ln --rate on the Director to see whether traffic is reaching only some of the Real Servers.
On the problematic Real Server capture packets: tcpdump -i eth0 -nn port 80 and not port 22. In NAT mode the Real Server sees its own IP as the destination, so filtering on the VIP would show nothing.
Check the default gateway; it must point to the Director's internal IP.
If the gateway is wrong, restore it or add a policy route for the VIP source.
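The gateway check in this case can be scripted. The sketch below parses the output of ip route show default from stdin and compares it with the Director's internal IP; the function names default_gw and check_gateway are assumptions for illustration.

```shell
#!/bin/sh
# Compare the default gateway with the Director's internal IP.
# Reads "ip route show default" style output on stdin.
default_gw() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

check_gateway() {
    expected=$1
    actual=$(default_gw)
    if [ "$actual" = "$expected" ]; then
        echo "OK gateway=$actual"
    else
        echo "MISMATCH gateway=${actual:-none} expected=$expected"
        return 1
    fi
}

# Sample input; on a Real Server use: ip route show default | check_gateway 10.0.0.10
echo "default via 10.0.0.1 dev eth0 proto static" | check_gateway 10.0.0.10 || true
```

Reading from stdin keeps the check independent of the host it runs on, so it can also be fed saved output from several Real Servers.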
Case 3 – keepalived VRRP Flapping
Capture VRRP packets: tcpdump -i eth0 vrrp.
Ensure unicast_peer and unicast_src_ip match on both nodes.
Increase vrrp_script interval to ≥3 s and reduce weight magnitude.
Consider disabling preemption with nopreempt if the business tolerates a fixed master.
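When tuning the weight, it helps to work out whether a failed check actually moves the master role. With a negative weight, keepalived lowers the node's priority by that amount while the tracked script is failing. The sketch below does that arithmetic; the function name effective_priority and the backup priority of 90 are assumptions, while priority 100 and weight -20 come from the master configuration shown earlier.

```shell
#!/bin/sh
# Effective VRRP priority when a tracked script with a negative weight fails.
# Arguments: BASE_PRIORITY WEIGHT SCRIPT_OK(yes|no)
effective_priority() {
    base=$1
    weight=$2
    script_ok=$3
    if [ "$script_ok" = no ] && [ "$weight" -lt 0 ]; then
        echo $((base + weight))
    else
        echo "$base"
    fi
}

# Master: priority 100, weight -20, check failing. Backup: priority 90 (assumed), check passing.
master=$(effective_priority 100 -20 no)
backup=$(effective_priority 90 -20 yes)
if [ "$master" -lt "$backup" ]; then
    echo "failover: master=$master backup=$backup"
else
    echo "no failover: master=$master backup=$backup"
fi
```

If the weight magnitude is smaller than the priority gap between the two nodes, a failing check never triggers a failover, so both values need to be chosen together.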
Case 4 – TUN Mode Latency Spike
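Verify encapsulation with tcpdump -i eth0 -nn -p ip proto 4 (IPIP is IP protocol 4).
Run traceroute to the remote Real Server to detect extra hops.
IPIP encapsulation itself adds very little delay; latency comes from the network path between the Director and the Real Server, and from fragmentation when the extra 20-byte header pushes packets over the path MTU. As a rule of thumb, keep TUN deployments within 5-10 ms RTT; otherwise use Anycast BGP or a layer-7 LB.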
Monitoring and Metrics
IPVS does not expose a native metrics endpoint, so a custom exporter is needed. Example node_exporter textfile collector:
#!/bin/bash
# /usr/local/bin/ipvs_metrics.sh
# Export per-VIP rates from ipvsadm for the node_exporter textfile collector.
OUT=/var/lib/node_exporter/textfile_collector/ipvs.prom
TMP=$(mktemp "${OUT}.XXXXXX")
# Service lines look like: TCP 10.0.0.100:80 CPS InPPS OutPPS InBPS OutBPS
# --exact prints plain numbers instead of abbreviated values such as 12K
ipvsadm -Ln --rate --exact | awk '
$1 == "TCP" || $1 == "UDP" {
    proto = tolower($1)
    split($2, a, ":")   # IPv4 address:port
    labels = sprintf("vip=\"%s\",port=\"%s\",proto=\"%s\"", a[1], a[2], proto)
    printf "lvs_rate_cps{%s} %s\n", labels, $3
    printf "lvs_rate_in_pps{%s} %s\n", labels, $4
    printf "lvs_rate_out_pps{%s} %s\n", labels, $5
    printf "lvs_rate_in_bps{%s} %s\n", labels, $6
    printf "lvs_rate_out_bps{%s} %s\n", labels, $7
}
' > "$TMP"
# Rename into place so node_exporter never reads a half-written file
chmod 644 "$TMP"
mv "$TMP" "$OUT"
Prometheus scrape config (example):
scrape_configs:
  - job_name: "ipvs"
    static_configs:
      - targets: ["10.0.0.10:9100"]
        labels:
          role: lvs-director
Typical alerts (Prometheus rule snippets):
- alert: LVSVipDown
  expr: probe_success{job="blackbox_lvs"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "LVS VIP {{ $labels.instance }} unreachable"
- alert: LVSActiveConnHigh
  expr: sum(lvs_active_connections) > 500000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "LVS active connections exceed threshold"
- alert: LVSRealServerDown
  expr: count(up{job="realserver"} == 0) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "One or more Real Servers are down"
Comparison with Other Load-Balancing Solutions
LVS (DR) – Kernel-level layer 4, highest throughput, limited to layer-4 features, moderate learning curve.
Nginx stream – User-space layer 4, good performance, easy L7 extensions, lower learning curve.
HAProxy – User-space layer 4/7, rich health checks and ACLs, suitable for large L7 workloads.
F5 / A10 hardware – Dedicated hardware, highest performance, expensive, steep learning curve.
Common industry practice: combine DR-mode LVS as a fast layer-4 entry point with Nginx/HAProxy for L7 routing.
Best‑Practice Checklist
Deploy Director pair with unicast VRRP (single‑hop heartbeat).
Prefer DR mode unless cross‑VLAN is required.
Configure Real Server loopback alias lo:0 with a /32 mask for the VIP.
Apply full ARP suppression on every Real Server (set arp_ignore and arp_announce under both the all and lo entries).
Disable rp_filter on lo but keep it enabled on the physical interface.
Use keepalived health checks that verify application logic (HTTP_GET, TCP_CHECK with proper timeouts).
Set persistence_timeout ≤ 60 s; avoid session persistence for pure load‑balancing.
Raise the IPVS connection table size (the ip_vs module parameter conn_tab_bits) to 18 or more for over 100 k concurrent connections.
Raise nf_conntrack_max to ≥ 2 M and tune TCP timeouts according to workload.
Monitor VIP reachability, IPVS ActiveConn, Real Server health, Director CPU interrupt usage, and network throughput.
Never clear the rule set with ipvsadm -C in production; always use keepalived for versioned configuration.
Avoid FULLNAT for new projects unless cross‑VLAN is mandatory.
For latency‑sensitive services, enable NIC busy‑poll, GRO, and bind NIC interrupts to dedicated CPU cores.
When running in public clouds, assume multicast is unavailable and use unicast VRRP, and verify that MAC address changes are allowed and that security groups permit VRRP (IP protocol 112).
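The checklist item on monitoring IPVS ActiveConn needs a metric to alert on, and the LVSActiveConnHigh rule earlier expects one named lvs_active_connections. The sketch below turns ipvsadm -Ln output into Prometheus text-format samples under that name; the function name ipvs_active_conns and the vip and rs label names are assumptions for this example.

```shell
#!/bin/sh
# Convert "ipvsadm -Ln" output (on stdin) into Prometheus text-format samples.
# Real Server lines: -> ADDR:PORT Forward Weight ActiveConn InActConn
ipvs_active_conns() {
    awk '
        $1 == "TCP" || $1 == "UDP" { vip = $2; next }
        $1 == "->" && $2 ~ /:[0-9]+$/ {
            printf "lvs_active_connections{vip=\"%s\",rs=\"%s\"} %d\n", vip, $2, $5
        }
    '
}

# Sample input; on a Director use: ipvsadm -Ln | ipvs_active_conns
ipvs_active_conns <<'EOF'
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.100:80 wlc persistent 50
  -> 10.0.0.11:80                 Route   100    12         3
  -> 10.0.0.12:80                 Route   100    8          1
EOF
```

The output can be written to the node_exporter textfile collector directory in the same way as the rate metrics.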
Conclusion
LVS has been a reliable, kernel-level load-balancing solution for over two decades. Running it well comes down to understanding the four modes, configuring ARP suppression correctly, and using keepalived for HA. By following the step-by-step procedures, health-check scripts, and checklist above, operators can build a production-grade LVS cluster that handles tens of millions of connections with minimal latency.
MaGe Linux Operations
