Mastering High-Availability Clusters with Corosync and Pacemaker
This article explains the principles of high‑availability clusters, details Corosync and Pacemaker architecture, provides installation and configuration steps, and demonstrates a practical HA setup using Corosync, Pacemaker, and NFS to ensure continuous service during node failures.
1. Introduction
High‑availability (HA) clusters aim to reduce service interruption caused by server failures. A cluster is a group of computers that provide network resources as a single entity, with each computer acting as a node.
HA clusters minimize losses from hardware and software errors by automatically detecting failures and switching services to standby nodes within seconds, ensuring continuous service. The core function of HA cluster software is automated fault detection and resource failover.
2. Architecture Overview
With the rapid growth of Internet services, companies cannot afford downtime; a few hours of outage for a site like Taobao can be catastrophic. Operations teams must therefore lengthen the mean time between failures (MTBF) and shorten the mean time to repair (MTTR), from both hardware and software perspectives. Corosync is the cluster messaging and membership layer: a simple configuration file defines how the nodes communicate and which protocols they use, which is the foundation for making resources highly available.
Corosync is usually paired with Pacemaker, the cluster resource manager. With Corosync v1, Pacemaker can run as a plugin that is enabled in the Corosync configuration; with Corosync v2 it runs as a separate service. Resources are then managed with a command‑line tool, either crm (crmsh) or pcs; this article uses crm.
3. Common Corosync Configurations
Typical combinations include:
heartbeat v1 + haresources
heartbeat v2 + crm
heartbeat v3 + pacemaker + crmsh
corosync v1 + pacemaker
corosync v2 + pacemaker (corosync v2 adds a built‑in voting system for split‑brain scenarios)
cman + rgmanager
corosync v1 + cman + pacemaker
CRM: Cluster Resource Manager
Resource types:
primitive : basic resource, runs on a single node
group : collection of resources that constitute a service
clone : multiple instances of the same resource across nodes
multi‑state (master/slave) : special clone with master‑slave relationship
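As a rough illustration of the four resource types, the sketch below writes a crmsh configuration fragment to a scratch file. The resource names (webip, webserver, webservice, pinger, demo-state, ms-demo), the IP addresses, and the file path are hypothetical, and nothing is loaded into a cluster.

```shell
# Scratch file only; this does not touch a running cluster.
cfg="${TMPDIR:-/tmp}/ha-resource-types.crm"

# primitive : webip, webserver, pinger, demo-state (one instance each)
# group     : webservice keeps webip and webserver together, started in order
# clone     : pinger-clone runs a copy of pinger on every node
# ms        : ms-demo is a clone in which one instance is promoted to master
cat > "$cfg" <<'EOF'
primitive webip ocf:heartbeat:IPaddr2 params ip=192.168.10.100 cidr_netmask=24
primitive webserver systemd:httpd
group webservice webip webserver
primitive pinger ocf:pacemaker:ping params host_list=192.168.10.1
clone pinger-clone pinger
primitive demo-state ocf:pacemaker:Stateful
ms ms-demo demo-state meta master-max=1 clone-max=2
EOF

echo "wrote $cfg"
# On a cluster node with crmsh installed, the file could be applied with:
#   crm configure load update "$cfg"
```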
Resource agents (RA) categories:
LSB: scripts in /etc/rc.d/init.d/ supporting start/stop/restart/reload/status/force‑reload (must not be enabled to start at boot, because the cluster starts and stops them)
OCF (Open Cluster Framework): located in /usr/lib/ocf/resource.d/, supporting start/stop/status/monitor/meta‑data
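STONITH: agents that drive fencing devices
systemd: unit files are also supported; a cluster‑managed unit is normally left for Pacemaker to start and stop rather than started independently at boot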
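In crm, an agent is named as class:type, or class:provider:type for OCF. The helper below is a hypothetical sketch (agent_location is not part of crmsh) that maps such a name to the locations listed above, assuming the default paths.

```shell
# Map a class:[provider:]type agent name to where its script or unit lives.
agent_location() {
    spec=$1
    class=${spec%%:*}      # text before the first colon
    rest=${spec#*:}        # text after the first colon
    case "$class" in
        lsb)
            echo "/etc/rc.d/init.d/$rest" ;;
        ocf)
            # rest is provider:type
            echo "/usr/lib/ocf/resource.d/${rest%%:*}/${rest#*:}" ;;
        systemd)
            echo "systemd unit $rest.service" ;;
        *)
            echo "unknown class: $class" >&2
            return 1 ;;
    esac
}

agent_location ocf:heartbeat:IPaddr2
agent_location lsb:httpd
agent_location systemd:httpd
```

On a node with crmsh installed, crm ra classes, crm ra list ocf heartbeat, and crm ra info ocf:heartbeat:IPaddr2 show which agents are actually available and what parameters they accept.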
Resource constraints:
Location constraints: preference of resources for specific nodes
Colocation constraints: whether resources can run on the same node
Order constraints: start‑up ordering dependencies
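The three constraint kinds could be written in crmsh syntax roughly as follows. The constraint IDs, the resource names webip and webserver, the node name node1, and the score of 100 are all hypothetical; the fragment is only written to a scratch file.

```shell
cfg="${TMPDIR:-/tmp}/ha-constraints.crm"

# location   : webip prefers node1 with a score of 100
# colocation : webserver must run on the same node as webip (inf = mandatory)
# order      : webip must be started before webserver
cat > "$cfg" <<'EOF'
location webip-prefers-node1 webip 100: node1
colocation webserver-with-webip inf: webserver webip
order webip-before-webserver Mandatory: webip webserver
EOF

echo "wrote $cfg"
```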
Common HA cluster models:
A/P (active/passive): two‑node primary‑backup
A/A (active/active): two‑node primary‑primary
N‑M (N>M): N nodes providing M services, with N‑M standby nodes
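To resolve a split‑brain, fencing can be applied at two levels:
Node‑level fencing (STONITH, "shoot the other node in the head"): the failed node is powered off or rebooted
Resource‑level fencing: the node is cut off from a shared resource, for example by blocking its port on a network or storage switch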
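Which side of a split cluster is allowed to keep running, and to fence the other, is normally decided by quorum: a partition is quorate only if it holds a strict majority of the votes. The function below is a minimal sketch of that arithmetic, assuming one vote per node; has_quorum is a hypothetical name, not a Corosync command.

```shell
# Succeed when a partition holding $1 of $2 total votes has a strict majority.
has_quorum() {
    votes=$1
    total=$2
    [ "$votes" -gt $(( total / 2 )) ]
}

for pair in "2 3" "1 3" "1 2" "2 4"; do
    set -- $pair
    if has_quorum "$1" "$2"; then
        echo "$1 of $2 votes: quorate"
    else
        echo "$1 of $2 votes: not quorate"
    fi
done
```

In a two‑node cluster neither half can reach a majority on its own, which is why Corosync's votequorum offers a two_node option and why such clusters depend on fencing.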
4. Installing and Configuring Corosync
Requirements: hostname resolution between nodes and synchronized time.
Installation (CentOS 7):
yum -y install pacemaker
This also installs corosync as a dependency. The Corosync configuration file, /etc/corosync/corosync.conf, has four main sections: totem, logging, quorum, and nodelist.
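A minimal two‑node configuration with those four sections might look like the sketch below. The cluster name, node names, transport, and crypto settings are assumptions chosen for illustration, and the file is written to a scratch path rather than to /etc/corosync/.

```shell
conf="${TMPDIR:-/tmp}/corosync.conf.example"

cat > "$conf" <<'EOF'
# totem: cluster identity, transport, and message authentication
totem {
    version: 2
    cluster_name: webcluster
    transport: udpu
    crypto_cipher: aes256
    crypto_hash: sha1
}

# logging: where Corosync writes its messages
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}

# quorum: vote counting; two_node relaxes the majority rule for two nodes
quorum {
    provider: corosync_votequorum
    two_node: 1
}

# nodelist: one node block per cluster member
nodelist {
    node {
        ring0_addr: node1
        nodeid: 1
    }
    node {
        ring0_addr: node2
        nodeid: 2
    }
}
EOF

echo "wrote $conf"
```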
Generate an authentication key after configuration with corosync-keygen -l, then copy the configuration file and the key (/etc/corosync/authkey) to the other cluster nodes.
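The copy step could be scripted along these lines. This is a sketch that assumes root SSH access between nodes and the default file locations; push_corosync_files and DRY_RUN are hypothetical names, and by default the function only prints the scp commands instead of running them.

```shell
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Copy corosync.conf and the auth key to every node named as an argument.
push_corosync_files() {
    for node in "$@"; do
        # -p preserves the restrictive permissions on authkey
        cmd="scp -p /etc/corosync/corosync.conf /etc/corosync/authkey root@$node:/etc/corosync/"
        if [ "$DRY_RUN" = 1 ]; then
            echo "$cmd"
        else
            $cmd
        fi
    done
}

push_corosync_files node2
```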
Start services:
systemctl start corosync
systemctl start pacemaker
Install the crmsh tools for resource management:
yum -y install crmsh-2.1.4-1.1.x86_64.rpm pssh-2.3.1-4.2.x86_64.rpm python-pssh-2.3.1-4.2.x86_64.rpm
5. High‑Availability Example: Corosync + Pacemaker + NFS
Set up an NFS server on a separate machine and mount the same web files on both nodes.
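One way to let the cluster handle the NFS mount is to define it as a Filesystem resource and group it with the virtual IP and the web server. The sketch below writes such a crmsh fragment to a scratch file; the NFS export nfs-server:/export/web, the mount point, the IP address, and all resource names are assumptions rather than values from the original setup.

```shell
cfg="${TMPDIR:-/tmp}/ha-web-nfs.crm"

# The group starts its members in the listed order on one node:
# virtual IP first, then the NFS mount, then httpd.
cat > "$cfg" <<'EOF'
primitive webip ocf:heartbeat:IPaddr2 \
    params ip=192.168.10.100 cidr_netmask=24
primitive webstore ocf:heartbeat:Filesystem \
    params device="nfs-server:/export/web" directory="/var/www/html" fstype="nfs"
primitive webserver systemd:httpd
group webservice webip webstore webserver
EOF

echo "wrote $cfg"
# On a cluster node with crmsh installed:
#   crm configure load update "$cfg"
```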
When node 1 is manually set to standby, the resources automatically migrate to node 2, demonstrating location and order constraints as well as node stickiness.
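Whether the resources move back once node 1 comes online again depends on how stickiness compares with the location preference. The function below is a deliberately simplified model of that decision, not Pacemaker's actual placement algorithm: each node's score is its location score, the node currently running the resource also gets the stickiness value, and the highest total wins. choose_node and the scores are hypothetical.

```shell
# Usage: choose_node <current-node> <stickiness> <node=location_score>...
choose_node() {
    current=$1
    stickiness=$2
    shift 2
    best=""
    best_score=""
    for pair in "$@"; do
        node=${pair%%=*}
        score=${pair#*=}
        # The node already running the resource gets the stickiness bonus.
        if [ "$node" = "$current" ]; then
            score=$(( score + stickiness ))
        fi
        if [ -z "$best" ] || [ "$score" -gt "$best_score" ]; then
            best=$node
            best_score=$score
        fi
    done
    echo "$best"
}

# Resource currently on node2 after a failover; node1 is preferred with score 100.
choose_node node2 0   node1=100 node2=0
choose_node node2 200 node1=100 node2=0
```

On a real cluster the corresponding crmsh commands are crm node standby node1 and crm node online node1 to toggle the node, and crm configure rsc_defaults resource-stickiness=100 to set a default stickiness.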
Define a resource monitor for httpd: if the service stops, it is restarted; if restart fails, the resource is moved to another available node.
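A monitor of this kind could be declared roughly as follows. The interval, timeout, migration-threshold, and failure-timeout values are illustrative assumptions, as is the resource name webserver; the fragment is written to a scratch file only.

```shell
cfg="${TMPDIR:-/tmp}/ha-httpd-monitor.crm"

# op monitor          : check httpd every 30s and restart it on failure
# migration-threshold : after 3 failures on a node, move the resource away
# failure-timeout     : forget old failures after 60s without new ones
cat > "$cfg" <<'EOF'
primitive webserver systemd:httpd \
    meta migration-threshold=3 failure-timeout=60s \
    op monitor interval=30s timeout=20s on-fail=restart
EOF

echo "wrote $cfg"
```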
6. Summary
The Corosync + Pacemaker solution provides high availability at the cost of slightly more complexity than LVS. It can also be combined with ldirectord, which monitors the health of the back‑end servers and generates the IPVS rules.
MaGe Linux Operations
