Building Multi‑Active High‑Availability Platforms under Cloud‑Native Architecture – Insights from Ant Group’s SOFAStack
The article presents Ant Group’s SOFAStack experience in designing a cloud‑native, multi‑cluster, high‑availability platform for financial services, covering federation clusters, unified traffic governance with service mesh, unitized hybrid‑cloud evolution, and comprehensive disaster‑recovery mechanisms.
In recent years, cloud‑native concepts such as micro‑services, containers, serverless, and service mesh have spread rapidly, yet strict performance and security requirements keep many financial institutions cautious about adopting them.
The presentation focuses on three key experiences for building a multi‑active, high‑availability platform under a cloud‑native architecture:
Federated clusters and disaster‑recovery construction: managing application lifecycles across multiple Kubernetes clusters, enabling logical unit management, cross‑region active‑active deployments, and sharding at the data layer.
Unified traffic governance and service mesh: providing a Layer 7 ingress model, cross‑region traffic control, and a unified service‑mesh layer built on Pilot, MOSN, and a custom Layer 7 gateway (Spanner) to support heterogeneous workloads.
Unitized architecture and hybrid‑cloud evolution: moving request‑level sharding to the ingress layer, allowing per‑user routing, supporting active‑active multi‑region deployments, and abstracting underlying IaaS differences.
SOFAStack’s PaaS layer adds a federation capability on top of Kubernetes, where each data center runs an independent cluster. A federation control plane coordinates resources, applications, and configurations, offering unified release control, group publishing, image management, traffic allocation, metadata, and cluster resource management via console, CLI, or SDK.
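As a rough illustration of group publishing across federated clusters, the sketch below rolls an image out one group of clusters at a time and stops at the first group that fails a health check. It is not SOFAStack's API: the Cluster class, release_in_groups, the health‑check callback, and the cluster names are all hypothetical, and a real control plane would call each cluster's API instead of updating a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Cluster:
    """Hypothetical model of one independent cluster (for example, one per data center)."""
    name: str
    deployed: Dict[str, str] = field(default_factory=dict)  # app name -> image

    def apply(self, app: str, image: str) -> None:
        # Stand-in for a real rollout against the cluster's API server.
        self.deployed[app] = image


def release_in_groups(
    clusters: List[Cluster],
    app: str,
    image: str,
    groups: List[List[str]],
    healthy: Callable[[Cluster, str], bool] = lambda cluster, app: True,
) -> Tuple[List[List[str]], Optional[List[str]]]:
    """Release `image` group by group.

    Returns (completed_groups, failed_group). `failed_group` is None when
    every group passed its health check. Later groups are never touched
    once a group fails; this sketch does not roll the failed group back.
    """
    by_name = {cluster.name: cluster for cluster in clusters}
    completed: List[List[str]] = []
    for group in groups:
        for name in group:
            by_name[name].apply(app, image)
        if not all(healthy(by_name[name], app) for name in group):
            return completed, group
        completed.append(group)
    return completed, None
```

A caller would typically put a single low‑risk cluster in the first group and the remaining clusters in later groups, so that a bad release is caught before it reaches every data center.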
High‑availability disaster recovery follows a full‑life‑cycle risk‑management process: proactive monitoring and drills before an incident, rapid emergency response with automated failover during it, and post‑incident analysis to continuously improve resilience. The process covers fault control, detection, emergency handling, capital‑security monitoring, capacity optimization, and capability preservation.
Unified traffic governance relies on the custom Spanner gateway for Layer 7 load balancing. It supports private protocols, security, traffic mirroring, replication, LDC forwarding, blue‑green releases, and disaster‑recovery traffic switching. The service mesh lets traditional SDK‑based micro‑services coexist with sidecar‑based mesh services, using Pilot for configuration distribution and MOSN for data‑plane traffic, and it supports both container/K8s and virtual‑machine deployments.
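Blue‑green releases and disaster traffic switching can both be viewed as changes to a weight table at the gateway. The following is a minimal sketch of that idea, not Spanner's actual configuration model: pick_target, drain, and the target names are hypothetical. Hashing the request ID keeps the choice deterministic, so the same request always maps to the same target under the same weights.

```python
import hashlib
from typing import Dict


def pick_target(request_id: str, weights: Dict[str, int]) -> str:
    """Choose a target deterministically, in proportion to integer weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("no target has a positive weight")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") % total
    # Walk the targets in a stable order and find the bucket holding `point`.
    for target, weight in sorted(weights.items()):
        if point < weight:
            return target
        point -= weight
    raise AssertionError("unreachable: point is always below the total weight")


def drain(weights: Dict[str, int], failed: str) -> Dict[str, int]:
    """Return a copy of the weight table with the failed target set to zero."""
    updated = dict(weights)
    updated[failed] = 0
    return updated
```

Under these assumptions, a blue‑green cutover moves the table step by step from {"blue": 100, "green": 0} to {"blue": 0, "green": 100}, while a disaster switch calls drain on the failed target so that all traffic falls to the remaining ones.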
Unitized architecture introduces request‑level sharding at the network edge, routing user requests based on dimensions such as UserID to specific data‑center units, thereby reducing cross‑region latency and enabling true multi‑active deployments without cold‑backup centers, improving cost efficiency and resource utilization.
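The routing step described above can be sketched as a lookup from a shard number to a unit. This is an illustration under stated assumptions rather than the production rule: it assumes numeric user IDs, derives the shard as the ID modulo 100, and uses a hypothetical two‑unit table (unit-a, unit-b). The fallback for a failed unit is likewise only one possible policy.

```python
from typing import Dict, FrozenSet

# Hypothetical shard table: each unit owns a contiguous range of shards 0..99.
UNIT_RANGES: Dict[str, range] = {
    "unit-a": range(0, 50),
    "unit-b": range(50, 100),
}


def shard_of(user_id: str) -> int:
    """Map a numeric user ID to a shard in 0..99 (assumed sharding rule)."""
    return int(user_id) % 100


def route(
    user_id: str,
    unit_ranges: Dict[str, range] = UNIT_RANGES,
    failed: FrozenSet[str] = frozenset(),
) -> str:
    """Return the unit that should serve this user.

    Requests go to the unit owning the user's shard. If that unit is marked
    failed, the shard is spread over the remaining healthy units.
    """
    shard = shard_of(user_id)
    home = next((unit for unit, shards in unit_ranges.items() if shard in shards), None)
    if home is None:
        raise ValueError(f"shard {shard} is not assigned to any unit")
    if home not in failed:
        return home
    healthy = [unit for unit in unit_ranges if unit not in failed]
    if not healthy:
        raise RuntimeError("no healthy unit available")
    return healthy[shard % len(healthy)]
```

Because every request for a given user resolves to the same unit, that user's calls and data can stay inside one unit, which is what lets each unit serve live traffic instead of sitting idle as a cold backup.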
The reference architecture demonstrates how unitization eliminates cross‑data‑center delays, ensures data safety and business continuity, achieves lossless multi‑region disaster recovery, and maximizes data‑center resource usage, laying a solid foundation for hybrid‑cloud scenarios.
Looking forward, SOFAStack aims to become a cross‑cloud operating system for digital finance, offering standardized, out‑of‑the‑box capabilities for hybrid‑cloud management, stability, and security. It has already been adopted by major Ant Group services and dozens of financial institutions.
AntTech
Technology is the core driver of Ant's future creation.