
Mastering High Availability Clusters: Key Concepts, Resource Management, and Failure Handling

This article explains how high‑availability (HA) clusters provide redundancy for directors, RS‑servers, databases and storage, covering active‑passive node roles, resource stickiness, constraints, quorum voting, split‑brain avoidance, failure detection methods, and essential configuration tips.

MaGe Linux Operations

HA (High Availability) clusters provide redundancy for front‑end directors, back‑end real servers (RS), database servers and shared storage by pairing an active (primary) node with a passive (standby) node, so that if the primary fails, the standby promptly takes over its services.

Primary (active) and backup (passive) nodes are defined for each tier: the director, the RS‑servers, and back‑end services such as MySQL and its storage. Each tier has its own failover resources.

HA focuses on reliability and stability. Availability is calculated as uptime divided by (uptime + downtime) and is usually expressed in nines (99%, 99.9%, …); some financial systems require five nines (99.999%).
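As a rough illustration of the formula above, the following Python sketch computes availability from observed uptime and downtime, and converts a target availability into a yearly downtime budget. The function names and the 365-day year are assumptions made for this example, not part of any HA tool.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


def max_downtime_minutes_per_year(target: float) -> float:
    """Downtime budget in minutes for a 365-day year (assumed) at a target availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return (1.0 - target) * minutes_per_year


# Print the yearly downtime budget for two to five nines.
for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.5f} -> {max_downtime_minutes_per_year(target):.2f} min/year")
```

Under these assumptions, five nines leaves a budget of a little over five minutes of downtime per year, which is why failover has to be automatic rather than manual.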

When a node fails, its resources (VIP, services, devices, filesystems) must be transferred to another node. Resource stickiness (preference) and constraints determine which node receives the resources.

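Each resource carries a stickiness score that expresses how strongly it prefers to stay on the node where it is currently running: a score above 0 means it prefers to stay, and a negative score means it prefers to move away. Constraints contribute further scores: colocation constraints state whether resources may (or must not) run on the same node, and location constraints assign node‑specific scores. The node with the highest combined score receives the resource.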
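The interplay of stickiness and location scores can be sketched in a few lines of Python. This is a deliberately simplified model, not how any particular CRM computes placement: it ignores colocation constraints and infinite scores, and the node names and score values are hypothetical.

```python
def pick_node(location_scores, current_node=None, stickiness=0):
    """Return the node with the highest combined score.

    location_scores: dict mapping node name -> location score.
    current_node:    node the resource is running on now (or None).
    stickiness:      score added to current_node only.
    Ties are broken by node name, an assumption made for determinism.
    """
    best_node, best_score = None, None
    for node in sorted(location_scores):
        score = location_scores[node]
        if node == current_node:
            score += stickiness  # reward (or penalize) staying put
        if best_score is None or score > best_score:
            best_node, best_score = node, score
    return best_node


# Hypothetical scenario: the resource prefers node1 (location score 100)
# but is running on node2 after a failover, and node1 has just come back.
scores = {"node1": 100, "node2": 50}
print(pick_node(scores, current_node="node2", stickiness=100))  # stays on node2
print(pick_node(scores, current_node="node2", stickiness=0))    # fails back to node1
```

The second call shows why a stickiness of 0 can cause an extra migration when the preferred node returns, while a sufficiently high stickiness keeps the resource where it is.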

Order constraints define the start and stop sequence for dependent resources, for example bringing up the VIP before loading the IPVS rules.
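One way to picture order constraints is as a dependency graph that is sorted topologically, with the stop sequence being the reverse of the start sequence. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the resource names are hypothetical, and a real CRM expresses ordering in its own configuration rather than in code like this.

```python
from graphlib import TopologicalSorter


def start_order(must_start_after):
    """Return a start sequence that satisfies every order constraint.

    must_start_after: dict mapping a resource to the set of resources
    that must already be started before it.
    """
    return list(TopologicalSorter(must_start_after).static_order())


def stop_order(must_start_after):
    """Stop in the reverse of the start sequence."""
    return list(reversed(start_order(must_start_after)))


# Hypothetical constraint: the IPVS rules depend on the VIP.
constraints = {"ipvs_rules": {"vip"}}
print(start_order(constraints))  # ['vip', 'ipvs_rules']
print(stop_order(constraints))   # ['ipvs_rules', 'vip']
```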

Resource types:

Primitive – runs on a single node.

Clone – runs on every node.

Group – moves as a unit.

Master/Slave – a special clone, typically running on two nodes, in which one instance acts as primary (master) and the other as standby (slave).

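Backup nodes detect a primary failure when its heartbeat messages stop arriving. In clusters with three or more nodes, a quorum vote decides which group of nodes is still valid: only the partition holding more than half of the votes may keep running resources. An odd total number of votes is preferred so that a tie cannot occur, and nodes can be assigned weighted votes.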
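The majority rule can be written down directly. The following Python sketch assumes a strict majority (more than half of all votes) and supports weighted votes; the node names and weights are hypothetical.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """A partition has quorum only with a strict majority of all votes."""
    return 2 * partition_votes > total_votes


def partition_has_quorum(votes_by_node, reachable_nodes) -> bool:
    """Sum the (possibly weighted) votes of the nodes this partition can see."""
    total = sum(votes_by_node.values())
    mine = sum(votes_by_node[node] for node in reachable_nodes)
    return has_quorum(mine, total)


# Three equally weighted nodes: a 2-node partition wins, a lone node does not.
votes = {"node1": 1, "node2": 1, "node3": 1}
print(partition_has_quorum(votes, {"node1", "node2"}))  # True
print(partition_has_quorum(votes, {"node3"}))           # False

# Four nodes split 2/2: neither side has a majority.
print(has_quorum(2, 4))  # False
```

The 2/2 case illustrates why an odd number of votes is preferred: with an even total, a symmetric split leaves no partition able to continue.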

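When a node or partition loses quorum, the possible policies are Freeze (finish serving existing requests but accept no new ones), Stop (halt services so that the resources can be taken over elsewhere), or Ignore (keep running as if nothing had happened; used only for two‑node clusters, where a lone surviving node can never hold a majority).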

For a MySQL service, the required resources are a VIP (floating IP), the MySQL service itself, and a mounted filesystem for its data.

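Split‑brain occurs when nodes lose contact with each other while still running, so that more than one node believes it is the primary and writes to the same shared files, corrupting the data. Preventing this requires fencing at two levels: node isolation (e.g., powering off the suspect node via STONITH) and storage isolation (e.g., cutting the node's access to an FC‑SAN).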

Node failure detection methods include shared‑disk arbitration, gateway ping, and watchdog timers.
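As one illustration of how a gateway ping can complement heartbeats, the sketch below has a standby take over only when the peer has gone silent and the standby itself can still reach the gateway; if the gateway is unreachable too, the standby is more likely the isolated one. The function names, the `dead_time` threshold, and the decision rule are assumptions for this example; the actual ping and heartbeat transport are left to the caller.

```python
def silent_peers(last_heartbeat, now, dead_time):
    """Return peers whose last heartbeat is older than dead_time seconds.

    last_heartbeat: dict mapping peer name -> timestamp of the last heartbeat.
    """
    return sorted(
        peer for peer, seen in last_heartbeat.items() if now - seen > dead_time
    )


def should_take_over(peer_silent: bool, gateway_reachable: bool) -> bool:
    """Take over only if the peer is silent AND our own network path works."""
    return peer_silent and gateway_reachable


# Hypothetical timestamps in seconds: node1 has been silent for 10 s.
heartbeats = {"node1": 100.0, "node2": 109.0}
dead = silent_peers(heartbeats, now=110.0, dead_time=5.0)
print(dead)                                                # ['node1']
print(should_take_over("node1" in dead, gateway_reachable=True))   # True
print(should_take_over("node1" in dead, gateway_reachable=False))  # False
```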

The messaging layer (Heartbeat uses UDP port 694) transports heartbeats and cluster information such as stickiness and constraints between nodes. The Cluster Resource Manager (CRM) decides resource placement and orchestrates migrations through its components: the PE (policy engine), the TE (transition engine), and the LRM (local resource manager). Resource agents (RA) are the scripts that actually start, stop, and check the status of a resource (LSB, OCF, etc.).
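To make the resource agent idea concrete, here is a toy agent in Python that tracks a fake resource with a state file. The return codes follow the OCF convention (0 for success, 7 for "not running", 3 for an unimplemented action); everything else, including the state file path and the `main` dispatcher, is a hypothetical sketch. A real OCF agent also has to implement actions such as `meta-data`, which are omitted here.

```python
import os
import tempfile

# OCF return codes used by this sketch.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7

# Hypothetical marker file standing in for a real service.
STATE_FILE = os.path.join(tempfile.gettempdir(), "demo_ra.state")


def start(state_file=STATE_FILE):
    """'Start' the resource by creating its state file."""
    with open(state_file, "w") as fh:
        fh.write("running\n")
    return OCF_SUCCESS


def stop(state_file=STATE_FILE):
    """'Stop' the resource; stopping an already stopped resource succeeds."""
    try:
        os.remove(state_file)
    except FileNotFoundError:
        pass
    return OCF_SUCCESS


def monitor(state_file=STATE_FILE):
    """Report whether the resource is running."""
    return OCF_SUCCESS if os.path.exists(state_file) else OCF_NOT_RUNNING


ACTIONS = {"start": start, "stop": stop, "monitor": monitor}


def main(argv):
    """Dispatch on the action name passed as the first argument."""
    if len(argv) != 2:
        return OCF_ERR_GENERIC
    handler = ACTIONS.get(argv[1])
    if handler is None:
        return OCF_ERR_UNIMPLEMENTED
    return handler()
```

Used as a script, the agent would end with `sys.exit(main(sys.argv))` so that the return code becomes the process exit status, which is what the local resource manager inspects. Making `stop` idempotent matters because the cluster may stop a resource that is already down.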

Typical HA stack combinations are:

haresources + heartbeat v1/v2

crm + heartbeat v2

pacemaker + corosync

pacemaker + heartbeat v3

cman + rgmanager

A minimal web‑service HA cluster needs at least two nodes running the messaging layer and CRM, and defines four resources: VIP, HTTP service, filesystem, and STONITH device.

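Configuration tips: each node name must match the output of uname -n; use /etc/hosts for name resolution so that the cluster does not depend on DNS; synchronize time across all nodes; set up SSH trust (key‑based login) between nodes; and enable the cluster services (messaging layer and CRM) at boot while leaving application services disabled at boot, so that only the CRM starts and stops them.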
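The first two tips lend themselves to a quick sanity check. The Python sketch below parses hosts-file text and reports cluster nodes that are missing from it, as well as a local name that is not in the node list. The function names and the example host names are hypothetical; on a real node the inputs would come from reading /etc/hosts and from `os.uname().nodename`, which returns the same value as uname -n.

```python
def names_in_hosts(hosts_text):
    """Collect every host name and alias from hosts-file text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # fields[0] is the IP address
    return names


def check_node_names(cluster_nodes, hosts_text, local_name):
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    if local_name not in cluster_nodes:
        problems.append(f"local name {local_name!r} is not in the cluster node list")
    known = names_in_hosts(hosts_text)
    for node in cluster_nodes:
        if node not in known:
            problems.append(f"{node!r} is missing from the hosts file")
    return problems


# Hypothetical two-node cluster where node2 was forgotten in the hosts file.
hosts = "127.0.0.1 localhost\n192.168.0.11 node1.example.com node1  # primary\n"
nodes = ["node1.example.com", "node2.example.com"]
for problem in check_node_names(nodes, hosts, "node1.example.com"):
    print(problem)
```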
