
How Minsheng Bank Mastered Event Management for Seamless IT Operations

This article outlines Minsheng Bank’s evolution of event‑management practices—from early ITIL‑based coarse control to refined, automated processes—detailing the “double‑ten” recovery goal, multi‑dimensional tool support, KPI framework, and closed‑loop management that ensure rapid, reliable data‑center operations.

Efficient Ops

1. Introduction

In recent years, driven by economic transformation and the rapid rise of internet finance, the banking sector has faced significant external pressures.

Amid an increasingly complex environment, China Minsheng Bank set a reform and transformation goal built on three strategic positions: a bank for private enterprises, a technology‑finance bank, and a comprehensive‑services bank. Continuous adoption of new technologies and innovative business models has expanded the scale of its information systems and made the relationships among them increasingly complex.

Despite growing difficulties and risks, the bank remains committed to a customer‑centric approach and to delivering an exceptional customer experience.

Event management is a critical capability for the stable operation of a bank’s data center, requiring rapid restoration of services after a fault to minimize business impact and ensure continuity.

Minsheng Bank’s operations team pursues a “double‑ten” goal: locate a fault within 10 minutes and recover from it within 10 minutes. The team continually reviews its experience and refines its practice toward best‑practice event management.
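The “double‑ten” goal can be expressed as a simple check on an incident's timestamps. The following is a minimal sketch, not the bank's implementation: the `IncidentTimeline` type and its field names are hypothetical, and it assumes the recovery clock starts once the fault has been located, which the article does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Both limits come from the "double-ten" goal described in the article.
LOCATE_LIMIT = timedelta(minutes=10)
RECOVER_LIMIT = timedelta(minutes=10)


@dataclass
class IncidentTimeline:
    """Hypothetical record of the key moments in one incident."""
    detected_at: datetime   # fault first detected or reported
    located_at: datetime    # fault location identified
    recovered_at: datetime  # service restored


def meets_double_ten(t: IncidentTimeline) -> bool:
    """Return True if location and recovery each took 10 minutes or less.

    Assumption: recovery time is measured from the moment of location.
    """
    time_to_locate = t.located_at - t.detected_at
    time_to_recover = t.recovered_at - t.located_at
    return time_to_locate <= LOCATE_LIMIT and time_to_recover <= RECOVER_LIMIT
```

If an organization measures recovery from detection rather than from location, only the `time_to_recover` line would need to change.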

Since 2010, the bank’s event management has evolved through three stages: coarse‑grained, standardized, and refined management.

2. Development History

Figure 1 – Event Management Development Timeline

First Stage (2010‑2013): Coarse‑grained Management

Based on ITIL best practices and internal conditions, an event management process was established, and a centralized monitoring platform and IT operations system were initially built to manage production incidents.

Second Stage (2013‑2017): Standardized Management

Following ISO 20000 and ISO 27001, the data center rebuilt its IT service management and information security management systems, achieved certification under both standards, and strengthened tool support for event management.

Third Stage (2017‑Present): Refined Management

Lean thinking, automation, and intelligent platforms were introduced. Pre‑incident risk analysis and monitoring were strengthened, incident handling was standardized, and post‑incident reviews were institutionalized to continuously improve the process.

The current event‑management process features:

01 Problem‑oriented mindset: Continuously questioning whether existing processes are optimal, identifying and rectifying issues, and digging down to root causes.

02 Top‑down management and implementation: Leadership actively participates in drafting, revising, and supervising the event‑management system, ensuring accountability.

03 Unified “double‑ten” target: The reporting and handling steps are broken down, redundant steps are removed, and the resulting procedures are fixed in place so that a fault can be located within ten minutes and service recovered within ten minutes.

04 Multi‑dimensional tool support: Unified monitoring, transaction performance monitoring, log analysis, panoramic operations, and cloud‑map platforms enable rapid fault diagnosis and standardized automated recovery.

Figure 2 – Multi‑dimensional Tool Support

05 Routine incident review and continuous improvement: Incident managers organize detailed post‑mortems and weekly reviews, and track corrective actions through linked issue‑management processes.

06 Comprehensive KPI metrics: Eleven process indicators for handlers and four result indicators for incidents drive effective operation and ongoing optimization.

Figure 3 – Sample Event Management KPI Dashboard
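The article does not name the eleven process indicators or the four result indicators. As an illustration only, the sketch below aggregates three hypothetical result‑style metrics from a list of incidents: mean time to locate, mean time to recover, and the share of incidents that met the “double‑ten” limits. The function name, the input shape, and the choice of metrics are all assumptions.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize(incidents: List[Tuple[float, float]],
              limit_minutes: float = 10.0) -> Dict[str, float]:
    """Summarize incidents given as (minutes_to_locate, minutes_to_recover).

    These metrics are illustrative; they are not the bank's actual KPIs.
    """
    if not incidents:
        raise ValueError("no incidents to summarize")
    compliant = sum(
        1 for locate, recover in incidents
        if locate <= limit_minutes and recover <= limit_minutes
    )
    return {
        "count": len(incidents),
        "mean_minutes_to_locate": mean(locate for locate, _ in incidents),
        "mean_minutes_to_recover": mean(recover for _, recover in incidents),
        "double_ten_rate": compliant / len(incidents),
    }
```

Tracking a compliance rate alongside the means is a deliberate choice in this sketch: a few very long incidents can distort an average, while the rate shows how often the target was actually met.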

3. Management Practices

Effective operation relies on a solid management mechanism, closed‑loop processes, robust tool support, and clear responsibility for incident flow owners and managers.

1) Robust Management Mechanism

The bank’s incident‑management policy covers routine incidents, major emergency handling, availability management, monitoring, on‑call duties, and ECC management. Key points include:

01 Clear incident‑resolution objectives: Handling adheres to the “double‑ten” goal and prioritizes rapid restoration of service.

02 Defined role responsibilities: Coordination roles (on‑call manager, supervisor, decision maker) and handling roles (service desk, first‑line, second‑line, third‑line) are clearly delineated.

03 Standardized reporting and handling workflow: The reporting stage and the handling stage are each broken down into steps, redundant steps are removed, and the procedures are fixed in place for maximum efficiency.

After resolution, corrective actions are tracked via problem tickets, and incident severity is assessed based on impact dimensions such as system tier, transaction loss ratio, duration, time of day, customer complaints, and accounting effects.
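The severity assessment described above can be sketched as a score over the listed impact dimensions. The weights, thresholds, level names, and the convention that tier 1 is the most critical system are all hypothetical; the article lists the dimensions but not how they are combined.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Impact dimensions named in the article; field names are illustrative."""
    system_tier: int               # assumed: 1 = most critical system
    transaction_loss_ratio: float  # fraction of transactions lost, 0.0 to 1.0
    duration_minutes: float
    peak_hours: bool               # time of day: did it hit a busy period?
    customer_complaints: int
    accounting_affected: bool      # did it cause accounting errors?


def severity_score(i: Impact) -> int:
    """Add up hypothetical points for each impact dimension."""
    score = {1: 3, 2: 2}.get(i.system_tier, 1)
    if i.transaction_loss_ratio >= 0.5:
        score += 3
    elif i.transaction_loss_ratio >= 0.1:
        score += 2
    elif i.transaction_loss_ratio > 0:
        score += 1
    if i.duration_minutes > 30:
        score += 2
    elif i.duration_minutes > 10:
        score += 1
    score += 1 if i.peak_hours else 0
    score += 1 if i.customer_complaints > 0 else 0
    score += 2 if i.accounting_affected else 0
    return score


def severity_level(score: int) -> str:
    """Map a score to an assumed three-level scale."""
    if score >= 9:
        return "major"
    if score >= 5:
        return "significant"
    return "minor"
```

For example, a tier‑1 outage that loses 60% of transactions for 45 minutes during peak hours, draws complaints, and affects accounting scores 12 and is rated "major" under these assumed weights.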

2) Closed‑Loop Process

The incident‑management lifecycle includes pre‑incident, during‑incident, and post‑incident phases, illustrated below:

Figure 4 – Closed‑Loop Incident Management Process
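One way to picture the closed loop is as a small state machine in which the post‑incident phase cannot hand back to the pre‑incident phase until every corrective action is done. This is a sketch under that assumption; the `IncidentLoop` class and the gating rule are illustrative and are not taken from the figure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Phase(Enum):
    PRE_INCIDENT = "pre-incident"        # risk analysis and monitoring
    DURING_INCIDENT = "during-incident"  # reporting, location, recovery
    POST_INCIDENT = "post-incident"      # review and corrective actions


# The last transition closes the loop: lessons feed back into prevention.
NEXT_PHASE = {
    Phase.PRE_INCIDENT: Phase.DURING_INCIDENT,
    Phase.DURING_INCIDENT: Phase.POST_INCIDENT,
    Phase.POST_INCIDENT: Phase.PRE_INCIDENT,
}


@dataclass
class IncidentLoop:
    phase: Phase = Phase.PRE_INCIDENT
    open_actions: Set[str] = field(default_factory=set)

    def advance(self) -> Phase:
        """Move to the next phase; block closure while actions remain open."""
        if self.phase is Phase.POST_INCIDENT and self.open_actions:
            raise RuntimeError(
                f"corrective actions still open: {sorted(self.open_actions)}"
            )
        self.phase = NEXT_PHASE[self.phase]
        return self.phase
```

Blocking the final transition is what makes the loop "closed" in this sketch: an incident is not considered finished when service is restored, only when its follow‑up work is complete.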

3) Effective Tool Support

Given rising system complexity, the bank leverages unified monitoring, transaction performance monitoring, log platforms, panoramic operations, and cloud‑map systems for rapid fault location, while an automated operations platform provides standardized emergency actions.

Automation tools serve the different teams: application operations (health checks, service switches, restarts), DBAs (performance analysis, index creation), and others. All of them are built on the principles of standardization, automation, visualization, scenario orientation, and intelligence.
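Standardized emergency actions of this kind are often organized as a registry of named, pre‑approved operations with an audit trail. The sketch below assumes that design; the registry, the action names, and the placeholder bodies are hypothetical and do not describe the bank's automated operations platform.

```python
from typing import Callable, Dict, List, Tuple

# Registry of pre-approved emergency actions, keyed by name.
ACTIONS: Dict[str, Callable[[str], str]] = {}


def action(name: str):
    """Decorator that registers a function as a standardized action."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        ACTIONS[name] = fn
        return fn
    return register


@action("health_check")
def health_check(service: str) -> str:
    # Placeholder: a real action would probe the service.
    return f"{service}: health check requested"


@action("restart")
def restart(service: str) -> str:
    # Placeholder: a real action would call the platform's restart procedure.
    return f"{service}: restart requested"


def run(name: str, service: str,
        audit_log: List[Tuple[str, str, str]]) -> str:
    """Run a registered action and record it; reject unregistered names."""
    if name not in ACTIONS:
        raise KeyError(f"unknown action: {name}")
    result = ACTIONS[name](service)
    audit_log.append((name, service, result))
    return result
```

Restricting handlers to registered actions keeps emergency steps uniform across teams, and the audit log gives the post‑incident review a record of what was run.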

Figure 5 – Transaction Performance Monitoring System

Figure 6 – Panoramic Operations System

Figure 7 – Cloud‑Map System

4) Responsibility Assignment

The bank adopts an “incident flow owner and manager responsibility system.” Flow owners (team leads) coordinate overall incident management, while designated managers (technical experts) assist in process construction and KPI tracking, ensuring continuous improvement.

Through proactive operations, collaborative handling, and systematic reporting, Minsheng Bank consistently meets its incident‑management objectives, enhancing service quality and system availability.
