How to Build a Cloud‑Native High‑Availability MySQL with Kubernetes
This article introduces the SlightShift MySQL high-availability solution. It covers the cloud-native design principles and architecture, along with the key technologies behind it: Raft-based leader election, Local PV storage, ProxySQL routing, automated failover, backup and recovery, and operator-driven declarative management. Together these enable scalable, low-cost, resilient MySQL deployments on Kubernetes.
Introduction
MySQL remains a popular relational database, but in the cloud-native era it faces challenges in fault tolerance, elastic scaling, data safety, and strong consistency. The SlightShift MySQL high-availability solution applies cloud-native design principles (sandbox isolation and complete separation of compute and storage) to deliver a low-cost, scalable, and highly available cloud RDS service.
1. Requirements & Challenges
Automatic failover with data consistency.
Agile elastic scaling without service interruption.
Data security via periodic cold backup and real‑time hot backup.
Strong data consistency between primary and replica nodes.
2. Goals & Key Considerations
SLA 99.99% (max 52.56 min downtime per year).
Failover time < 2 min.
Horizontal scaling of replicas < 2 min.
Cold‑backup recovery < 10 min.
Additional design considerations include high availability, low resource consumption, extensibility, and maintainability.
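The downtime figure attached to the 99.99% SLA above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are made up for this example, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(sla_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, allowed by an SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


def failovers_within_budget(sla_percent: float, failover_minutes: float) -> int:
    """Return how many worst-case failovers fit into the yearly budget."""
    return int(downtime_budget_minutes(sla_percent) // failover_minutes)


if __name__ == "__main__":
    print(f"{downtime_budget_minutes(99.99):.2f} min/year")
    print(failovers_within_budget(99.99, 2.0), "two-minute failovers")
```

With a 2-minute failover ceiling, the yearly budget of about 52.56 minutes leaves room for roughly 26 worst-case failovers, which is why the failover target matters as much as the SLA itself.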
3. Architecture Design
The solution uses a primary‑replica (one‑master‑many‑slaves) topology with semi‑synchronous replication. An arbiter based on the Raft consensus algorithm provides automatic leader election and failover.
Routing is handled by ProxySQL for read/write splitting and load balancing. Monitoring and alerting rely on Prometheus-Operator. Declarative management is achieved with a custom Kubernetes Operator that watches custom resources and reconciles the actual state toward the desired state.
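To make the read/write splitting idea concrete, here is a minimal sketch of the routing decision. In ProxySQL this is normally configured through query rules and hostgroups rather than written in application code, so the Python below only illustrates the logic. The `Router` class, the host names, and the regular expressions are assumptions made for this example.

```python
import itertools
import re
from typing import List

# Plain reads may go to replicas. Locking reads must stay on the primary.
_READ = re.compile(r"^\s*SELECT\b", re.IGNORECASE)
_LOCKING_READ = re.compile(r"\bFOR\s+(UPDATE|SHARE)\b", re.IGNORECASE)


class Router:
    """Send writes to the primary and spread plain reads across replicas."""

    def __init__(self, writer: str, readers: List[str]) -> None:
        self.writer = writer
        # Round-robin iterator over replicas; None when there are no replicas.
        self._readers = itertools.cycle(readers) if readers else None

    def route(self, sql: str, in_transaction: bool = False) -> str:
        """Return the name of the backend that should run the statement."""
        if in_transaction or self._readers is None:
            # Keep transactions on one writable node; fall back to the
            # primary when no replica is available.
            return self.writer
        if _READ.match(sql) and not _LOCKING_READ.search(sql):
            return next(self._readers)
        return self.writer


if __name__ == "__main__":
    router = Router("mysql-0", ["mysql-1", "mysql-2"])  # placeholder host names
    for statement in ("SELECT 1", "SELECT 2", "UPDATE t SET a = 1"):
        print(statement, "->", router.route(statement))
```

After a failover, only the `writer` value needs to change, which mirrors the step of updating the ProxySQL configuration to reflect the new topology.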
4. Key Technologies
State Persistence
Stateful applications like MySQL require persistent storage. In Kubernetes, Local PV is well suited to high-performance, node-local storage. With delayed binding (volumeBindingMode: WaitForFirstConsumer) and topology-aware scheduling, a PVC is bound only once the scheduler has selected a suitable node for the pod. The following manifests define the StorageClass, the claim, and the volume:
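---
# StorageClass for statically provisioned local volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
# Delay binding until a pod that uses the claim is scheduled.
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
# Claim used by the first MySQL pod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-slightshift-mysql-0
  labels:
    app: slightshift-mysql
    mysql-node: "true"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  # Must match the StorageClass and PersistentVolume names in this file.
  storageClassName: local-storage
  volumeName: slightshift-mysql-data-pv-0
---
# Local volume backed by a directory on one specific node.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: slightshift-mysql-data-pv-0
  labels:
    pv-label: slightshift-mysql-data-pv
    type: local
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 500Gi
  local:
    path: /var/lib/ali/mysql
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  # Local volumes require node affinity; the host name is a placeholder.
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1
Automatic Leader Election (Raft)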
Raft uses heartbeats to maintain leader status. If a follower does not receive a heartbeat within the election timeout, it increments its term, becomes a candidate, votes for itself, and requests votes from the other nodes. The candidate becomes the new leader once it has collected votes from a majority of the cluster.
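The election round described above can be sketched in a few lines. This is a simplified, single-process illustration rather than the arbiter's actual implementation: the `Node` class and `run_election` function are hypothetical, there are no timers or network calls, and the "up-to-date" check compares only the last log index, whereas Raft proper compares the last log term first.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    term: int = 0
    voted_for: Optional[str] = None
    last_log_index: int = 0
    reachable: bool = True

    def handle_vote_request(self, term: int, candidate: str, last_log_index: int) -> bool:
        """Grant at most one vote per term, and only to an up-to-date candidate."""
        if not self.reachable or term < self.term:
            return False
        if term > self.term:
            # A newer term resets any vote cast in an older term.
            self.term, self.voted_for = term, None
        if self.voted_for in (None, candidate) and last_log_index >= self.last_log_index:
            self.voted_for = candidate
            return True
        return False


def run_election(candidate: Node, peers: List[Node]) -> bool:
    """Run one election round; return True if the candidate wins a majority."""
    candidate.term += 1
    candidate.voted_for = candidate.name
    votes = 1  # the candidate votes for itself
    for peer in peers:
        if peer.handle_vote_request(candidate.term, candidate.name, candidate.last_log_index):
            votes += 1
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2


if __name__ == "__main__":
    a = Node("a", last_log_index=5)
    b = Node("b", last_log_index=5)
    c = Node("c", reachable=False)  # simulates the failed node
    print("a elected:", run_election(a, [b, c]))
```

The majority requirement is what prevents two leaders in the same term: in a three-node cluster a candidate needs two votes, and each node votes at most once per term.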
Failover Process
HA‑Manager detects master failure and triggers failover.
Optionally shut down the dead master to avoid split‑brain.
Synchronize the binlog from the dead master to the most up-to-date slave so that both finish at the same end_log_pos.
Run Raft election to select a new master.
Switch traffic to the new master.
Update ProxySQL configuration to reflect the new topology.
Notify stakeholders via email or DingTalk.
The entire failover completes in 10‑30 seconds, with detection and log application each taking 5‑10 seconds.
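The failover steps above depend on knowing which slave is the most up to date. One possible way to decide this, assuming file-and-position based replication (not GTID), is to compare the binlog coordinates each replica has applied. The sketch below is an assumption about how such a comparison could look, not the HA-Manager's actual logic; the function names and sample values are invented, and in practice the coordinates would come from the replica status output.

```python
from typing import Dict, Tuple


def binlog_coordinates(file_name: str, position: int) -> Tuple[int, int]:
    """Turn ('mysql-bin.000012', 4521) into a sortable (12, 4521) tuple."""
    sequence = int(file_name.rsplit(".", 1)[1])
    return sequence, position


def most_advanced_replica(replicas: Dict[str, Tuple[str, int]]) -> str:
    """Return the name of the replica with the highest applied coordinates."""
    if not replicas:
        raise ValueError("no healthy replica available for promotion")
    return max(replicas, key=lambda name: binlog_coordinates(*replicas[name]))


if __name__ == "__main__":
    # Hypothetical applied positions reported by three replicas.
    applied = {
        "mysql-1": ("mysql-bin.000012", 4521),
        "mysql-2": ("mysql-bin.000012", 9800),
        "mysql-3": ("mysql-bin.000011", 99999),
    }
    print(most_advanced_replica(applied))
```

The file sequence number is compared before the position, so a replica that is still on an older binlog file never outranks one that has moved on to a newer file.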
Automatic Recovery
When a dead master recovers, Sentinel either re‑adds it as a slave or forces it into read‑only mode, ensuring the cluster remains consistent. Similar logic applies to recovered slave nodes.
Declarative Operations
Using Kubernetes resources (Deployment, StatefulSet, Service, ConfigMap, Secret, etc.) the operator implements a declarative model: users describe the desired MySQL state, and the controller continuously reconciles the actual state to match.
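The reconcile idea can be illustrated with a small pure function that compares the declared state with the observed state and returns the actions needed to close the gap. This is a sketch under stated assumptions, not the SlightShift operator's code: the `ClusterSpec` and `ClusterStatus` types are hypothetical, only the replica count is reconciled, and pods are assumed to use contiguous StatefulSet-style ordinals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSpec:
    """Desired state, as a user would declare it."""
    replicas: int


@dataclass
class ClusterStatus:
    """Observed state, as the controller would read it from the cluster."""
    running_pods: int


def reconcile(name: str, spec: ClusterSpec, status: ClusterStatus) -> List[str]:
    """Return the actions that move the observed state toward the spec."""
    actions: List[str] = []
    # Scale up: create the missing ordinals in ascending order.
    for ordinal in range(status.running_pods, spec.replicas):
        actions.append(f"create {name}-{ordinal}")
    # Scale down: remove surplus pods, highest ordinal first.
    for ordinal in range(status.running_pods - 1, spec.replicas - 1, -1):
        actions.append(f"delete {name}-{ordinal}")
    return actions


if __name__ == "__main__":
    print(reconcile("mysql", ClusterSpec(replicas=3), ClusterStatus(running_pods=1)))
    print(reconcile("mysql", ClusterSpec(replicas=1), ClusterStatus(running_pods=3)))
```

Because the function depends only on the spec and the observed status, running it repeatedly is safe: once the two match, it returns an empty list.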
Backup & Restore
Create a CronJob that periodically snapshots MySQL data and uploads it to object storage such as Ceph or MinIO.
Use a Job to restore data from a snapshot to the MySQL master; the slaves then resynchronize automatically.
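As an illustration of the periodic backup step, the sketch below builds a `batch/v1` CronJob manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The image name, bucket URL, schedule, and the `BACKUP_BUCKET` environment variable are placeholders chosen for this example; the actual backup tooling is not described in the article.

```python
import json


def backup_cronjob(name: str, schedule: str, image: str, bucket: str) -> dict:
    """Build a batch/v1 CronJob manifest that runs one backup container per tick."""
    container = {
        "name": "backup",
        "image": image,
        # BACKUP_BUCKET is a hypothetical variable the backup image would read.
        "env": [{"name": "BACKUP_BUCKET", "value": bucket}],
    }
    pod_spec = {"restartPolicy": "OnFailure", "containers": [container]}
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Do not start a new backup while the previous one is still running.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {"spec": {"template": {"spec": pod_spec}}},
        },
    }


if __name__ == "__main__":
    manifest = backup_cronjob(
        name="mysql-backup",
        schedule="0 2 * * *",  # every day at 02:00
        image="example.com/mysql-backup:latest",  # placeholder image
        bucket="s3://mysql-backups",  # placeholder bucket
    )
    print(json.dumps(manifest, indent=2))
```

A restore Job could reuse the same container definition with a different command, since a Job's pod template has the same shape as the one nested under `jobTemplate`.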
5. Technical Evolution
Industry forecasts (e.g., Gartner) predict rapid growth of database‑as‑a‑service platforms. Cloud‑native databases will continue evolving to meet diverse workload requirements, emphasizing modularity, automation, and high performance.
6. Future Outlook
Middleware will become more standardized, composable, and platform‑oriented, providing a unified PaaS for enterprise‑grade cloud‑native services. The SlightShift MySQL operator exemplifies this trend by encapsulating operational expertise into reusable, declarative resources.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.