How Major Banks Design Disaster‑Recovery Architecture for Uninterrupted Service
This article reviews the regulatory requirements for banking disaster recovery (DR) and the typical DR architectures, explains system tiering and recovery objectives, and traces the Industrial and Commercial Bank of China's evolution from a two‑site‑two‑center model to a cloud‑native, multi‑center DR framework, offering practical design insights along the way.
1 Background
Banking underpins the national economy, and for commercial banks data loss and service interruption are red lines that must not be crossed. A bank's information‑system disaster‑recovery (DR) architecture must therefore ensure business continuity, meet the required DR level, and achieve its disaster‑avoidance goals.
The DR capability of an information system covers both high availability (HA) at the production site and takeover at the DR site, so that service continues through natural disasters, equipment failures, and human‑caused incidents.
This article outlines the regulatory requirements, the typical commercial‑bank DR architectures, and the key design points, and then shares the Industrial and Commercial Bank of China's (ICBC) DR planning and management experience.
2 Banking Regulatory Requirements
In November 2007, the first national standard for DR, "Information System Disaster Recovery Specification" (GB/T 20988‑2007), was released. It defines six DR levels, the recovery point objective (RPO) and recovery time objective (RTO) required at each level, and a framework for planning, approval, implementation, and management.
In February 2008, the People’s Bank of China issued the financial industry standard "Banking Information System Disaster Recovery Management Specification" (JR/T 0044‑2008), adapting the national standard to banking-specific processes, organization, and supervision.
In April 2010, the China Banking Regulatory Commission (CBRC) released the "Commercial Bank Data Center Supervision Guidelines", requiring banks to establish a production center within two years of licensing and a DR center within two years thereafter, with higher‑asset banks needing an off‑site DR center and a DR capability of at least level 5.
In December 2011, the CBRC issued the "Commercial Bank Business Continuity Supervision Guidelines", setting RTO ≤ 4 hours and RPO ≤ 30 minutes for critical business.
3 Typical Commercial‑Bank DR Architectures
DR construction usually involves storage, compute, and network design, selected according to data‑center deployment and system‑level HA/DR needs.
Typical data‑center deployment structures include two‑site‑two‑center, two‑site‑three‑center, and multi‑site‑multi‑center models.
3.1 Data‑Center Deployment Structures
(1) Two‑site‑two‑center: Consists of a production center and an off‑site DR center.
(2) Two‑site‑three‑center: Adds a same‑city DR center to the production and off‑site DR centers, providing higher continuity and mitigating single‑site limitations.
(3) Multi‑site‑multi‑center: Three or more equally‑ranked data centers, each capable of handling normal operations and taking over critical or all workloads.
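The three deployment structures above can be told apart from a simple inventory of centers. The following is a minimal sketch, not a method from any of the standards: the `DataCenter` record, the role labels ("production", "dr", "peer"), and the city names are all hypothetical, and "site" is assumed to mean "city".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataCenter:
    name: str
    city: str
    role: str  # hypothetical labels: "production", "dr", or "peer" (equally ranked)


def classify_deployment(centers: List[DataCenter]) -> str:
    """Name the deployment structure, following the three definitions above."""
    # Three or more equally ranked centers: multi-site-multi-center.
    if len(centers) >= 3 and all(c.role == "peer" for c in centers):
        return "multi-site-multi-center"

    production = [c for c in centers if c.role == "production"]
    dr = [c for c in centers if c.role == "dr"]
    # The two-site models assume exactly one production center and only DR centers besides it.
    if len(production) != 1 or len(production) + len(dr) != len(centers):
        return "unclassified"

    home_city = production[0].city
    same_city = [c for c in dr if c.city == home_city]
    off_site = [c for c in dr if c.city != home_city]

    if len(same_city) == 0 and len(off_site) == 1:
        return "two-site-two-center"
    if len(same_city) == 1 and len(off_site) == 1:
        return "two-site-three-center"
    return "unclassified"


if __name__ == "__main__":
    centers = [
        DataCenter("prod", "CityA", "production"),
        DataCenter("dr-local", "CityA", "dr"),
        DataCenter("dr-remote", "CityB", "dr"),
    ]
    print(classify_deployment(centers))
```

Anything that does not match one of the three definitions is reported as "unclassified" rather than forced into the nearest category.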
3.2 System Tiered DR Design
1. Information‑system tiering
Based on the impact of an interruption and on continuity goals, systems are classified into three classes (referencing JR/T 0044‑2008):
Class 1 systems: Extremely low tolerance for interruption; require the highest HA and DR capability (e.g., settlement, payment systems).
Class 2 systems: Moderate tolerance; require high but not maximal HA/DR (e.g., loan, customer‑service systems).
Class 3 systems: Higher tolerance; lower HA/DR requirements (e.g., office applications).
2. DR capability level design
According to system class, the minimum DR level, RTO and RPO are defined:
Class 1: DR level ≥ 5, RTO < 6 hours, RPO < 15 minutes.
Class 2: DR level ≥ 3, RTO < 24 hours, RPO < 120 minutes.
Class 3: DR level ≥ 2, RTO < 2 days, RPO < 7 days.
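The per-class minimums above lend themselves to an automated check. The sketch below encodes the three rows as a lookup table and reports which requirements a system misses; the `TierRequirement` type and the `check_system` function are hypothetical names, and the thresholds are treated as strict upper bounds because the list uses "<".

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class TierRequirement:
    min_dr_level: int
    rto_limit: timedelta  # achieved RTO must be strictly below this
    rpo_limit: timedelta  # achieved RPO must be strictly below this


# Values copied from the class list above.
TIER_REQUIREMENTS = {
    1: TierRequirement(5, timedelta(hours=6), timedelta(minutes=15)),
    2: TierRequirement(3, timedelta(hours=24), timedelta(minutes=120)),
    3: TierRequirement(2, timedelta(days=2), timedelta(days=7)),
}


def check_system(system_class: int, dr_level: int,
                 rto: timedelta, rpo: timedelta) -> List[str]:
    """Return a list of unmet requirements; an empty list means the system complies."""
    req = TIER_REQUIREMENTS[system_class]
    problems = []
    if dr_level < req.min_dr_level:
        problems.append(f"DR level {dr_level} is below the minimum {req.min_dr_level}")
    if rto >= req.rto_limit:
        problems.append(f"RTO {rto} is not below {req.rto_limit}")
    if rpo >= req.rpo_limit:
        problems.append(f"RPO {rpo} is not below {req.rpo_limit}")
    return problems


if __name__ == "__main__":
    # A hypothetical Class 1 system measured in a DR drill.
    for line in check_system(1, 4, timedelta(hours=6), timedelta(minutes=5)):
        print(line)
```

In this example the system misses two requirements: its DR level is 4 rather than at least 5, and an RTO of exactly six hours does not satisfy "RTO < 6 hours".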
3. DR architecture design
Combining system class, DR level, and data‑center deployment yields reference architectures:
Class 1: "Same‑city active‑active + off‑site DR" or "off‑site multi‑active" to survive park‑level (data‑center campus) or city‑level disasters.
Class 2: "Same‑city hot‑standby" or "off‑site hot‑standby" for park‑ or city‑level events.
Class 3: "Same‑city cold‑standby" or "off‑site cold‑standby", or a single‑site deployment when no park‑level DR is required.
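The reference architectures above can be expressed as a lookup keyed by system class and filtered by the disaster scope a system must survive. This is a sketch under one stated assumption: the text does not say which option covers which scope, so the code assumes that any option with an off‑site component covers city‑level disasters and that same‑city options cover park‑level disasters only. The `Scope` enum and function name are hypothetical.

```python
from enum import IntEnum
from typing import List


class Scope(IntEnum):
    NONE = 0  # no park-level DR required
    PARK = 1  # survive loss of one data-center campus
    CITY = 2  # survive loss of a whole city


# (architecture name, largest disaster scope it is ASSUMED to cover)
REFERENCE_ARCHITECTURES = {
    1: [
        ("same-city active-active + off-site DR", Scope.CITY),
        ("off-site multi-active", Scope.CITY),
    ],
    2: [
        ("same-city hot-standby", Scope.PARK),
        ("off-site hot-standby", Scope.CITY),
    ],
    3: [
        ("same-city cold-standby", Scope.PARK),
        ("off-site cold-standby", Scope.CITY),
        ("single-site deployment", Scope.NONE),
    ],
}


def candidate_architectures(system_class: int, required_scope: Scope) -> List[str]:
    """List the reference architectures for a class that cover the required scope."""
    return [
        name
        for name, covered in REFERENCE_ARCHITECTURES[system_class]
        if covered >= required_scope
    ]


if __name__ == "__main__":
    print(candidate_architectures(2, Scope.CITY))
    print(candidate_architectures(3, Scope.NONE))
```

Keeping the scope as an ordered enum lets a single comparison express "covers at least this scope", so a Class 3 system with no park‑level requirement sees all three options.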
4 ICBC DR Architecture
4.1 Evolution of ICBC DR Architecture
1) Two‑site‑two‑center
After completing data centralization in 2002, ICBC built a thousand‑kilometer‑scale DR link between Shanghai and Beijing within one year, pioneering inter‑city DR in Chinese banking.
2) Two‑site‑three‑center
In 2009, ICBC proposed a "two‑site‑three‑center" strategy; by 2011 the technical route was finalized, and by 2014 the same‑city center in Jiading was built, achieving minute‑level failover and full active‑active deployment.
3) Cloud‑native distributed two‑site‑three‑center
Since 2015, ICBC has been moving from a mainframe‑centric architecture to an open‑platform, cloud‑native distributed architecture. By 2022 it had established a three‑tier high‑availability system (local, same‑city, off‑site) for the open‑platform core banking system, with HA comparable to that of the legacy mainframes. Local HA focuses on node isolation; same‑city HA provides park‑level switchover of MySQL clusters within one minute; off‑site HA implements a distributed DR switchover.
4.2 ICBC DR Architecture Management
1) Standards and guidelines
ICBC built an enterprise‑level standards system aligned with banking DR regulations, defining nine DR grades, best‑practice HA designs, and reference architectures to achieve fine‑grained business continuity management.
2) Digital management of architecture assets
Using an architecture‑asset control platform, ICBC models applications into "application‑subsystem‑logical node‑physical device" hierarchies, enabling online, visual DR architecture management and providing decision support for development and operation.
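To illustrate how such a hierarchy supports DR decisions, the sketch below models the four layers as plain data classes and flags logical nodes whose devices all sit at a single site. It is an illustration only and does not describe ICBC's platform: every class, field, and device name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PhysicalDevice:
    name: str
    site: str  # data center hosting the device


@dataclass
class LogicalNode:
    name: str
    devices: List[PhysicalDevice] = field(default_factory=list)


@dataclass
class Subsystem:
    name: str
    nodes: List[LogicalNode] = field(default_factory=list)


@dataclass
class Application:
    name: str
    subsystems: List[Subsystem] = field(default_factory=list)


def node_sites(app: Application) -> Dict[str, Set[str]]:
    """Map each 'subsystem/node' path to the set of sites hosting its devices."""
    result = {}
    for sub in app.subsystems:
        for node in sub.nodes:
            result[f"{sub.name}/{node.name}"] = {d.site for d in node.devices}
    return result


def single_site_nodes(app: Application) -> List[str]:
    """Return nodes deployed at fewer than two sites, i.e. with no cross-site copy."""
    return sorted(path for path, sites in node_sites(app).items() if len(sites) < 2)


if __name__ == "__main__":
    app = Application("payments", [
        Subsystem("ledger", [
            LogicalNode("db", [PhysicalDevice("srv-01", "site-A"),
                               PhysicalDevice("srv-02", "site-B")]),
            LogicalNode("batch", [PhysicalDevice("srv-03", "site-A")]),
        ]),
    ])
    print(single_site_nodes(app))
```

Walking the hierarchy top‑down like this is what makes a question such as "which parts of this application would be lost with one site?" answerable from the asset model rather than from manual review.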
3) Periodic inspections and maturity assessment
ICBC conducts regular enterprise‑standard inspections covering design, development, testing, and operation, and evaluates DR maturity through multi‑layer assessments of static and dynamic indicators across devices, systems, networks, and applications.
5 Conclusion
A robust DR architecture is the core guarantee of business continuity for commercial banks and a strong support for innovation. As banks migrate core services to ecosystem platforms and pursue autonomous infrastructure, DR architectures will continue to evolve. ICBC will keep sharing industry experience, participating in standards, and contributing design concepts and implementation practices.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.