How to Build a Cloud‑Native High‑Availability MySQL with Kubernetes

This article introduces the SlightShift MySQL high‑availability solution, detailing cloud‑native design principles, architecture, key technologies such as Raft‑based leader election, Local PV storage, ProxySQL routing, automated failover, backup/recovery, and the operator‑driven declarative management that enable scalable, low‑cost, resilient MySQL deployments on Kubernetes.

Alibaba Cloud Developer

Aug 31, 2020

Introduction

MySQL remains a popular relational database, but in the cloud‑native era it faces challenges such as fault‑tolerance, elastic scaling, data safety, and strong consistency. The SlightShift MySQL high‑availability solution applies cloud‑native design principles—sandbox isolation and complete separation of compute and storage—to deliver low‑cost, scalable, and highly available Cloud RDS.

1. Requirements & Challenges

Automatic failover with data consistency.

Agile elastic scaling without service interruption.

Data security via periodic cold backup and real‑time hot backup.

Strong data consistency between primary and replica nodes.

2. Goals & Key Considerations

SLA 99.99% (max 52.56 min downtime per year).

Failover time < 2 min.

Horizontal scaling of replicas < 2 min.

Cold‑backup recovery < 10 min.

Additional design considerations include high availability, low resource consumption, extensibility, and maintainability.
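The downtime figure in the SLA goal follows directly from the availability percentage. The short sketch below reproduces that arithmetic; the function names are illustrative and not part of the SlightShift solution, and leap years are ignored.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(sla_percent: float) -> float:
    """Yearly downtime (in minutes) allowed by an availability target."""
    return MINUTES_PER_YEAR * (100 - sla_percent) / 100


def failovers_within_budget(sla_percent: float, failover_seconds: float) -> int:
    """Number of failovers of the given duration that fit in the yearly budget."""
    budget_seconds = downtime_budget_minutes(sla_percent) * 60
    return int(budget_seconds // failover_seconds)


if __name__ == "__main__":
    print(round(downtime_budget_minutes(99.99), 2))
    print(failovers_within_budget(99.99, 120))
```

For a 99.99% target this gives roughly 52.56 minutes per year, which is why a failover bound of 2 minutes leaves room for only a limited number of incidents.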

3. Architecture Design

The solution uses a primary‑replica (one‑master‑many‑slaves) topology with semi‑synchronous replication. An arbiter based on the Raft consensus algorithm provides automatic leader election and failover.

Routing is handled by ProxySQL for read/write splitting and load balancing. Monitoring and alerting rely on Prometheus‑Operator. Declarative management is achieved with a custom Kubernetes Operator that watches custom resources and reconciles the desired state.
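To make the read/write splitting idea concrete, here is a minimal Python sketch of the routing decision. It is not ProxySQL's implementation: the hostgroup ids, the two regex rules, and the round-robin choice of replica are assumptions chosen for illustration (ProxySQL itself is configured through query rules and server weights).

```python
import re
from itertools import cycle

# Hypothetical hostgroup ids, chosen only for this example.
WRITER_HOSTGROUP = 10
READER_HOSTGROUP = 20

# Rules are checked in order; the first match wins.
RULES = [
    # Locking reads must go to the primary.
    (re.compile(r"^\s*SELECT\b.*\bFOR\s+UPDATE\b", re.IGNORECASE | re.DOTALL),
     WRITER_HOSTGROUP),
    # Plain reads can be served by replicas.
    (re.compile(r"^\s*SELECT\b", re.IGNORECASE), READER_HOSTGROUP),
]


def route(query: str) -> int:
    """Return the hostgroup for a query; anything unmatched goes to the writer."""
    for pattern, hostgroup in RULES:
        if pattern.search(query):
            return hostgroup
    return WRITER_HOSTGROUP


class Router:
    """Sends writes to the primary and spreads reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = cycle(replicas) if replicas else None

    def backend_for(self, query: str) -> str:
        if route(query) == READER_HOSTGROUP and self._replicas is not None:
            return next(self._replicas)
        return self.primary
```

Defaulting unmatched statements to the writer is the conservative choice: a misrouted read costs some primary capacity, while a misrouted write would fail on a read-only replica.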

4. Key Technologies

State Persistence

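Stateful applications like MySQL require persistent storage. In Kubernetes, Local PV is a good fit when high‑performance, node‑local storage is needed, at the cost of tying each pod to the node that holds its data. It relies on delayed binding (volumeBindingMode: WaitForFirstConsumer on the StorageClass) and topology‑aware scheduling, so a PVC is bound only after the scheduler has chosen a suitable node for the pod.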

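---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
# Local volumes are created statically, so no dynamic provisioner is used.
provisioner: kubernetes.io/no-provisioner
# Delay binding until a pod that uses the claim has been scheduled.
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-slightshift-mysql-0
  labels:
    app: slightshift-mysql
    mysql-node: "true"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  # Must match the storageClassName of the PV below.
  storageClassName: local-storage
  # Pins the claim to one specific PV; omit it to let the scheduler choose.
  volumeName: slightshift-mysql-data-pv-0
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: slightshift-mysql-data-pv-0
  labels:
    pv-label: slightshift-mysql-data-pv
    type: local
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 500Gi
  local:
    path: /var/lib/ali/mysql
  # Local PVs require node affinity; "node-1" is a placeholder hostname.
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage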

Automatic Leader Election (Raft)

Raft uses heartbeats to maintain leader status. If a follower does not receive a heartbeat within its election timeout, it increments its term, becomes a candidate, votes for itself, and requests votes from the other nodes. The candidate becomes the new leader once it has collected votes from a majority of the cluster.
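The voting rule can be sketched in a few lines of Python. This is a simplified, single-process model written for illustration, not the arbiter's code: the `Node` class and `run_election` function are hypothetical, there is no networking or timeout handling, and Raft's check that the candidate's log is at least as up to date as the voter's is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    reachable: bool = True  # False simulates a crashed or partitioned node

    def handle_vote_request(self, term: int, candidate_id: str) -> bool:
        """Grant at most one vote per term, and never for a stale term."""
        if not self.reachable or term < self.current_term:
            return False
        if term > self.current_term:
            # A newer term resets any vote cast in an older term.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate: Node, peers: List[Node]) -> bool:
    """Return True if the candidate wins a strict majority of the cluster."""
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    votes = 1  # the candidate votes for itself
    for peer in peers:
        if peer.handle_vote_request(candidate.current_term, candidate.node_id):
            votes += 1
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2
```

Because each node grants only one vote per term and a strict majority is required, two candidates cannot both win the same term, which is the property that protects the cluster from electing two masters.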

Failover Process

HA‑Manager detects master failure and triggers failover.

Optionally shut down the dead master to avoid split‑brain.

Copy any missing bin‑log events from the dead master to the most up‑to‑date slave so that its end_log_pos matches the master's.

Run Raft election to select a new master.

Switch traffic to the new master.

Update ProxySQL configuration to reflect the new topology.

Notify stakeholders via email or DingTalk.

The entire failover completes in 10‑30 seconds, with detection and log application each taking 5‑10 seconds.
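One part of the failover steps above, finding the most up-to-date slave before missing bin-log events are applied, can be sketched as follows. The field names mirror `Relay_Master_Log_File` and `Exec_Master_Log_Pos` from MySQL's `SHOW SLAVE STATUS`; the `ReplicaStatus` class and helper functions are hypothetical, and the sketch assumes bin-log file names end in a numeric index such as `mysql-bin.000012`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReplicaStatus:
    name: str
    relay_master_log_file: str  # master bin-log file last executed
    exec_master_log_pos: int    # position executed within that file


def binlog_coordinate(status: ReplicaStatus) -> Tuple[int, int]:
    """Turn (file, position) into a tuple that sorts in replication order."""
    file_index = int(status.relay_master_log_file.rsplit(".", 1)[1])
    return (file_index, status.exec_master_log_pos)


def most_advanced_replica(replicas: List[ReplicaStatus]) -> ReplicaStatus:
    """Pick the replica that has executed the most of the master's bin-log."""
    if not replicas:
        raise ValueError("no replicas available for failover")
    return max(replicas, key=binlog_coordinate)


def replicas_behind(replicas: List[ReplicaStatus]) -> List[str]:
    """Names of replicas that still need events the leading replica has."""
    lead = binlog_coordinate(most_advanced_replica(replicas))
    return [r.name for r in replicas if binlog_coordinate(r) < lead]
```

Comparing the file index before the position matters: a replica at a high offset in an older file is still behind one at a low offset in a newer file.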

Automatic Recovery

When a dead master recovers, Sentinel either re‑adds it as a slave or forces it into read‑only mode, ensuring the cluster remains consistent. Similar logic applies to recovered slave nodes.

Declarative Operations

Using Kubernetes resources (Deployment, StatefulSet, Service, ConfigMap, Secret, etc.) the operator implements a declarative model: users describe the desired MySQL state, and the controller continuously reconciles the actual state to match.
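The reconcile idea can be illustrated with a small, self-contained Python sketch. It is not the SlightShift operator: the function names are hypothetical, pods are represented as plain strings, and pod names follow StatefulSet-style ordinals by assumption.

```python
from typing import List, Set, Tuple

Action = Tuple[str, str]  # (verb, pod name)


def reconcile(cluster: str, desired_replicas: int, running: Set[str]) -> List[Action]:
    """Compare desired and actual state and return the actions that close the gap."""
    expected = [f"{cluster}-{i}" for i in range(desired_replicas)]
    actions: List[Action] = [("create", pod) for pod in expected if pod not in running]
    actions += [("delete", pod) for pod in sorted(running - set(expected))]
    return actions


def apply_actions(actions: List[Action], running: Set[str]) -> Set[str]:
    """Simulate executing the actions against the cluster."""
    result = set(running)
    for verb, pod in actions:
        if verb == "create":
            result.add(pod)
        else:
            result.discard(pod)
    return result


if __name__ == "__main__":
    state = {"slightshift-mysql-0", "slightshift-mysql-3"}
    plan = reconcile("slightshift-mysql", 3, state)
    state = apply_actions(plan, state)
    # A second pass finds nothing to do: reconciliation is idempotent.
    print(plan, reconcile("slightshift-mysql", 3, state))
```

The key property is idempotence: running the loop again after the state has converged produces no actions, so the controller can safely re-run on every event or on a timer.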

Backup & Restore

Create a CronJob that periodically snapshots MySQL data and uploads it to object storage such as Ceph or MinIO.

Use a Job to restore data from a snapshot back to the MySQL master, with slaves automatically syncing.
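As a rough illustration of the periodic backup, the Python sketch below builds a CronJob manifest as a plain dictionary. The top-level fields (`schedule`, `concurrencyPolicy`, `jobTemplate`, `restartPolicy`) are standard Kubernetes CronJob fields, while the function name, the image, the `BACKUP_TARGET` and `MYSQL_HOST` variables, and the bucket URL are placeholders invented for this example. Clusters older than Kubernetes 1.21 would need `batch/v1beta1` instead of `batch/v1`.

```python
import json


def backup_cronjob(cluster: str, schedule: str, image: str, bucket_url: str) -> dict:
    """Build a CronJob manifest that runs a backup container on a schedule."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": f"{cluster}-backup", "labels": {"app": cluster}},
        "spec": {
            "schedule": schedule,
            # Do not start a new backup while the previous one is still running.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "OnFailure",
                "containers": [{
                    "name": "backup",
                    "image": image,  # placeholder backup image
                    "env": [
                        # Hypothetical variables read by the backup image.
                        {"name": "BACKUP_TARGET", "value": bucket_url},
                        {"name": "MYSQL_HOST", "value": f"{cluster}-0"},
                    ],
                }],
            }}}},
        },
    }


if __name__ == "__main__":
    manifest = backup_cronjob(
        "slightshift-mysql", "0 2 * * *",
        "example.invalid/mysql-backup:latest", "s3://example-bucket/mysql",
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

A restore Job would look similar but use `kind: Job` without a schedule, pointing the container at the snapshot to load.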

5. Technical Evolution

Industry forecasts (e.g., Gartner) predict rapid growth of database‑as‑a‑service platforms. Cloud‑native databases will continue evolving to meet diverse workload requirements, emphasizing modularity, automation, and high performance.

6. Future Outlook

Middleware will become more standardized, composable, and platform‑oriented, providing a unified PaaS for enterprise‑grade cloud‑native services. The SlightShift MySQL operator exemplifies this trend by encapsulating operational expertise into reusable, declarative resources.
