Mastering High-Availability Clusters with Corosync and Pacemaker

This article explains the principles of high‑availability clusters, details Corosync and Pacemaker architecture, provides installation and configuration steps, and demonstrates a practical HA setup using Corosync, Pacemaker, and NFS to ensure continuous service during node failures.

MaGe Linux Operations

Dec 27, 2016

1. Introduction

High‑availability (HA) clusters aim to reduce service interruption caused by server failures. A cluster is a group of computers that provide network resources as a single entity, with each computer acting as a node.

HA clusters minimize losses from hardware and software errors by automatically detecting failures and switching services to standby nodes within seconds, ensuring continuous service. The core function of HA cluster software is automated fault detection and resource failover.

2. Architecture Overview

With the rapid growth of Internet services, companies cannot afford downtime; for example, a few hours of outage for sites like Taobao can be catastrophic. Operations teams must increase mean time between failures (MTBF) and reduce mean time to repair (MTTR), from both hardware and software perspectives. Corosync is the cluster communication layer: through a simple configuration file it defines how nodes exchange messages and track membership, which is the foundation for making resources highly available.

Corosync is usually paired with Pacemaker, the cluster resource manager. With Corosync v1, Pacemaker could run as a plugin enabled in the Corosync configuration; with Corosync v2 it runs as a separate service. Resources are then managed from the command line, in this article with the crm utility from crmsh (pcs is a common alternative).

3. Common Corosync Configurations

Typical combinations include:

heartbeat v1 + haresources

heartbeat v2 + crm

heartbeat v3 + pacemaker + crmsh

corosync v1 + pacemaker

corosync v2 + pacemaker (v2 adds a built-in voting system, votequorum, for split-brain scenarios)

cman + rgmanager

corosync v1 + cman + pacemaker

CRM: Cluster Resource Management

Resource types:

primitive : basic resource, runs on a single node

group : collection of resources that constitute a service

clone : multiple instances of the same resource across nodes

multi‑state (master/slave) : special clone with master‑slave relationship
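The four resource types correspond to the crmsh keywords primitive, group, clone, and ms. The following is a minimal sketch in crmsh syntax, printed from a shell function so it can be inspected before use. Every resource name, the IP addresses, and the choice of the ocf:pacemaker:ping and ocf:linbit:drbd agents are hypothetical placeholders, not taken from the original setup.

```shell
#!/bin/sh
# Print a crmsh configuration fragment with one example per resource type.
# All names and addresses below are hypothetical placeholders.
resource_types_example() {
    cat <<'EOF'
primitive vip ocf:heartbeat:IPaddr2 params ip=192.168.10.100 cidr_netmask=24
primitive web systemd:httpd
group websvc vip web
primitive pinger ocf:pacemaker:ping params host_list=192.168.10.1
clone pinger_clone pinger
primitive drbd0 ocf:linbit:drbd params drbd_resource=r0
ms drbd0_ms drbd0 meta master-max=1 clone-max=2 notify=true
EOF
}

resource_types_example
# On a cluster node with crmsh installed, the fragment could be loaded with:
#   resource_types_example > /tmp/types.crm
#   crm configure load update /tmp/types.crm
```

Here vip and web are primitives, websvc groups them into one service, pinger_clone runs a copy of pinger on each node, and drbd0_ms is a multi-state resource with a single master.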

Resource agents (RA) categories:

LSB: init scripts in /etc/rc.d/init.d/ that support start/stop/restart/reload/status/force-reload; they must not be enabled to start at boot, because the cluster starts and stops them

OCF (Open Cluster Framework): located in /usr/lib/ocf/resource.d/, supporting start/stop/status/monitor/meta‑data

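STONITH: agents that drive fencing devices

systemd: unit files are also supported as a resource class; as with LSB scripts, a unit managed by the cluster should not also be enabled to start at boot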
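A resource agent is referenced as class:provider:type for OCF agents, or class:type for the other classes. As a rough illustration of that naming scheme, the sketch below maps a specification to the place its implementation is expected to live, using the directories listed above. The function name ra_location is hypothetical, and the output strings for the systemd and stonith classes are only descriptive.

```shell
#!/bin/sh
# Map a resource agent specification (class:provider:type or class:type)
# to where its implementation is expected to be found.
ra_location() {
    spec=$1
    class=${spec%%:*}      # text before the first colon
    rest=${spec#*:}        # text after the first colon
    case $class in
        ocf)
            provider=${rest%%:*}
            type=${rest#*:}
            echo "/usr/lib/ocf/resource.d/$provider/$type" ;;
        lsb)
            echo "/etc/rc.d/init.d/$rest" ;;
        systemd)
            echo "systemd unit $rest.service" ;;
        stonith)
            echo "fencing agent $rest" ;;
        *)
            echo "unknown class: $class" >&2
            return 1 ;;
    esac
}

ra_location ocf:heartbeat:IPaddr2
ra_location lsb:httpd
ra_location systemd:httpd
```

On a node with crmsh installed, `crm ra classes` lists the available classes and `crm ra info ocf:heartbeat:IPaddr2` shows an agent's parameters.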

Resource constraints:

Location constraints: preference of resources for specific nodes

Colocation constraints: whether resources can run on the same node

Order constraints: start-up ordering dependencies
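In crmsh these three kinds of constraint are written with the location, colocation, and order keywords. A minimal sketch follows, assuming two already defined primitives named vip and web and a node named node1; those names and the constraint IDs are hypothetical.

```shell
#!/bin/sh
# Print one example of each constraint kind in crmsh syntax.
# vip, web, node1 and the constraint IDs are hypothetical names.
constraints_example() {
    cat <<'EOF'
location web_prefers_node1 web 100: node1
colocation web_with_vip inf: web vip
order vip_before_web Mandatory: vip web
EOF
}

constraints_example
# On a cluster node with crmsh installed, the fragment could be loaded with:
#   constraints_example > /tmp/constraints.crm
#   crm configure load update /tmp/constraints.crm
```

The location line gives web a score of 100 on node1, the colocation line requires web to run where vip runs, and the order line starts vip before web.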

Common HA cluster models:

A/P (active/passive): two‑node primary‑backup

A/A (active/active): two‑node primary‑primary

N‑M (N>M): N nodes providing M services, with N‑M standby nodes

During a split‑brain, two isolation levels are used:

STONITH: node‑level fencing by power‑off or reboot

Fencing: resource‑level isolation via network switches

4. Installing and Configuring Corosync

Requirements: hostname resolution between nodes and synchronized time.

Installation (CentOS 7): yum -y install pacemaker (this pulls in Corosync as a dependency).

The Corosync configuration file, /etc/corosync/corosync.conf, contains four sections: totem, logging, quorum, and nodelist.
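To make the four sections concrete, here is a sketch of a two-node configuration for Corosync 2.x, rendered by a shell function that takes the two node names as arguments. The cluster name, host names, transport, crypto settings, and log path are assumptions chosen for illustration, not values from the original setup.

```shell
#!/bin/sh
# Render a minimal two-node corosync.conf (Corosync 2.x) to stdout.
# Cluster name, transport, crypto and logging values are illustrative.
render_corosync_conf() {
    node1=$1
    node2=$2
    cat <<EOF
totem {
    version: 2
    cluster_name: webcluster
    transport: udpu
    crypto_cipher: aes256
    crypto_hash: sha1
}

nodelist {
    node {
        ring0_addr: $node1
        nodeid: 1
    }
    node {
        ring0_addr: $node2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: no
    timestamp: on
}
EOF
}

# Hypothetical host names; redirect to /etc/corosync/corosync.conf on a real node.
render_corosync_conf node1.example.com node2.example.com
```

The two_node option tells votequorum that the cluster has only two nodes, so a single surviving node can keep quorum.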

Generate an authentication key after configuration: corosync-keygen -l, then copy the configuration file and the key to the other cluster nodes.
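The key generation and copy step might be scripted roughly as follows. This sketch assumes Corosync 2.x default paths under /etc/corosync/, root SSH access to the peers, and a hypothetical peer name; by default it only prints the commands, so it is safe to run on a machine without Corosync installed.

```shell
#!/bin/sh
# Generate the cluster auth key on the first node and copy it, together with
# corosync.conf, to the peers. Commands are only printed unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
PEERS=${PEERS:-"node2.example.com"}   # hypothetical peer host names

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

distribute_corosync_config() {
    # Writes /etc/corosync/authkey; -l uses the non-blocking random source.
    run corosync-keygen -l
    for peer in $PEERS; do
        # -p preserves the restrictive permissions of the key file.
        run scp -p /etc/corosync/corosync.conf /etc/corosync/authkey \
            "root@$peer:/etc/corosync/"
    done
}

distribute_corosync_config
```

Running it with `DRY_RUN=0 PEERS="node2.example.com node3.example.com"` would execute the commands for each listed peer.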

Start services:

systemctl start corosync
systemctl start pacemaker
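After both services are started, it helps to confirm that the ring is healthy and that the nodes have joined. A small sketch, assuming the standard corosync-cfgtool and crm_mon tools; the function name cluster_status is hypothetical, and it returns early on a host where the tools are missing.

```shell
#!/bin/sh
# Show ring status and a one-shot cluster summary, if the tools are present.
cluster_status() {
    for tool in corosync-cfgtool crm_mon; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            return 1
        fi
    done
    corosync-cfgtool -s   # ring status as seen by the local node
    crm_mon -1            # print cluster status once and exit
}

cluster_status || echo "cluster tools not installed on this host"
```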

Install crmsh tools for resource management:

yum -y install crmsh-2.1.4-1.1.x86_64.rpm pssh-2.3.1-4.2.x86_64.rpm python-pssh-2.3.1-4.2.x86_64.rpm

5. High‑Availability Example: Corosync + Pacemaker + NFS

Set up an NFS server on a separate machine and mount the same web files on both nodes.
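One way to let the cluster perform the mount is an ocf:heartbeat:Filesystem resource grouped with a virtual IP and the web server. The sketch below prints such a configuration in crmsh syntax. The NFS host, export path, IP address, and all resource names are hypothetical, and it assumes httpd serves files from /var/www/html.

```shell
#!/bin/sh
# Print a crmsh fragment for a web service backed by an NFS mount.
# Host name, export path, IP address and resource names are hypothetical.
nfs_web_example() {
    cat <<'EOF'
primitive webip ocf:heartbeat:IPaddr2 params ip="192.168.10.100" cidr_netmask="24"
primitive webstore ocf:heartbeat:Filesystem params device="nfs.example.com:/export/web" directory="/var/www/html" fstype="nfs"
primitive webserver systemd:httpd
group webservice webip webstore webserver
EOF
}

nfs_web_example
# On a cluster node with crmsh installed, the fragment could be loaded with:
#   nfs_web_example > /tmp/web.crm
#   crm configure load update /tmp/web.crm
```

Members of a group start in the listed order on the same node, so the address comes up first, then the mount, then httpd.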

When node 1 is manually set to standby, the resources automatically migrate to node 2, demonstrating location and order constraints as well as node stickiness.
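The standby test can be run as a short drill with crmsh. This is a sketch only: node1 is a hypothetical node name, the stickiness value of 100 is an arbitrary example, and the commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Failover drill: put a node in standby, inspect the cluster, bring it back.
# Commands are only printed unless DRY_RUN=0 is set explicitly.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

failover_drill() {
    node=$1                        # hypothetical node name
    run crm configure rsc_defaults resource-stickiness=100
    run crm node standby "$node"   # resources should move to another node
    run crm_mon -1                 # one-shot status to see where they run
    run crm node online "$node"    # node becomes a candidate again
    run crm_mon -1                 # check whether resources stayed or moved back
}

failover_drill node1
```

Whether resources return to node 1 after it comes back online depends on how the stickiness value compares with any location preference for that node.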

Define a resource monitor for httpd: if the service stops, it is restarted; if restart fails, the resource is moved to another available node.
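In crmsh, a monitor is declared as an op on the primitive. The sketch below assumes httpd is managed through its systemd unit and uses a hypothetical resource name with example interval and timeout values. The on-fail=restart and start-failure-is-fatal=true settings are Pacemaker defaults, written out here only to make the behavior visible.

```shell
#!/bin/sh
# Print a crmsh fragment that monitors httpd every 30 seconds.
# The resource name and the timing values are illustrative.
monitor_example() {
    cat <<'EOF'
primitive webserver systemd:httpd op monitor interval=30s timeout=20s on-fail=restart
property start-failure-is-fatal=true
EOF
}

monitor_example
# On a cluster node with crmsh installed, the fragment could be loaded with:
#   monitor_example > /tmp/monitor.crm
#   crm configure load update /tmp/monitor.crm
```

With these settings a failed monitor triggers a restart on the same node, and a failed start makes that node ineligible for the resource, so it is started elsewhere.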

6. Summary

The Corosync + Pacemaker solution provides high availability at the cost of slightly more complexity than LVS alone. The two can also be combined: the cluster can run ldirectord as a resource, which monitors the health of the real servers and generates the corresponding IPVS rules.
