How Minsheng Bank Uses AIOps to Revolutionize Intelligent Operations
In this talk, the head of Minsheng Bank's intelligent operations platform shares the bank's journey of applying AIOps to tackle massive data, complex dependencies, and operational challenges, outlining the evolution of their technology stack, AI-driven processes, and practical use‑case scenarios.
1. Minsheng Bank Technology Development Trend
Reviewing the bank's technology evolution: it moved from a monolithic architecture around 2000 to an SOA-based core in 2012 and now runs on distributed microservices, launching the industry's first distributed core system for its direct-sale bank in 2017.
Application operations manage the full lifecycle of applications—deployment, admission assessment, change review, release, monitoring, and incident handling—acting as a "caretaker" for all application‑related tasks and tools.
The team adopts many Google SRE principles, emphasizing service‑quality ownership, problem discovery, and tool development driven by frontline engineers' needs.
2. Thoughts on Intelligent Operations
Data‑driven operations aim to let data speak and guide decisions, transforming raw metrics (e.g., CPU usage) into alerts, then into information, and finally into knowledge through analysis and experience.
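The first step of that pipeline, turning raw metrics into alerts, can be sketched in a few lines. The threshold and the "sustained breaches" policy below are illustrative assumptions, not values from the talk:

```python
# A minimal sketch of the metric-to-alert step: raw CPU samples become
# alert events once a (hypothetical) threshold policy is violated.
THRESHOLD = 0.90      # assumed alerting threshold, not from the talk
SUSTAINED = 3         # consecutive breaches required before alerting

def cpu_to_alerts(samples):
    """Turn a list of (timestamp, cpu_usage) pairs into alert events."""
    alerts, streak = [], 0
    for ts, usage in samples:
        streak = streak + 1 if usage > THRESHOLD else 0
        if streak == SUSTAINED:          # fire once per sustained breach
            alerts.append({"time": ts, "metric": "cpu", "value": usage})
    return alerts

samples = [(0, 0.5), (1, 0.95), (2, 0.97), (3, 0.99), (4, 0.4)]
print(cpu_to_alerts(samples))   # one alert, fired at t=3
```

Later stages (information, knowledge) layer correlation and accumulated experience on top of events like these.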
Intelligent operations (AIOps) is positioned as the next-generation approach to handling massive data volumes, complex relationships, and heavy dependence on human effort, focusing on efficiency improvement, quality assurance, and cost management.
Key limitations include AI’s reliance on historical patterns (no true knowledge creation), difficulty distinguishing correlation from causation, and challenges with incomplete or unstructured operational data.
Current AI techniques excel in well‑defined, labeled scenarios, while many operational tasks lack clear labels, requiring unsupervised methods.
3. Minsheng Bank’s Exploration and Practice
The architecture consists of three layers: the operation‑object layer (data‑center assets, databases, OS, storage), the intelligent operations platform (data, algorithms, compute), and the output layer that enhances quality, efficiency, and cost.
Data collection identified 28 data models; the platform initially ingests high‑quality, complete datasets and continuously improves data quality.
Fault handling follows a structured workflow: alarm reception → impact analysis → root‑cause identification → monitoring metric correlation → log analysis → AI‑assisted decision making.
Use‑case scenarios include:
Availability fault detection: moving beyond fixed‑threshold alerts to adaptive algorithms (e.g., 3‑sigma, isolation forest) that reduce false positives and adapt to holiday/weekend patterns.
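The 3-sigma variant of such adaptive alerting is simple enough to sketch: instead of a fixed threshold, each point is compared against the mean and standard deviation of a trailing window. The window size and test data below are assumptions for illustration:

```python
import statistics

def three_sigma_anomalies(series, window=20):
    """Flag points lying more than 3 standard deviations from the mean
    of a trailing window -- an adaptive alternative to fixed thresholds."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = statistics.mean(hist), statistics.pstdev(hist)
        if sigma > 0 and abs(series[i] - mu) > 3 * sigma:
            anomalies.append(i)
    return anomalies

# A stable series with one spike: only the spike is flagged.
series = [100.0 + (i % 3) for i in range(30)] + [250.0] + [100.0] * 5
print(three_sigma_anomalies(series))   # [30]
```

Because the baseline is recomputed from recent history, the same code tolerates weekday/weekend level shifts that would trip a fixed threshold; isolation forest serves the same goal for multi-dimensional data.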
Multi‑dimensional fault screening: extracting key dimensions from transaction logs and applying Monte‑Carlo tree search to quickly locate problematic services.
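The core idea of dimension screening can be shown with a brute-force scorer in place of the full Monte-Carlo tree search: rank each dimension value by the share of failed transactions it covers. The record schema and field names below are hypothetical:

```python
from collections import Counter

# Hypothetical flattened transaction records: each carries a few
# dimensions plus a success flag. Field names are illustrative only.
records = [
    {"service": "pay",   "host": "h1", "ok": False},
    {"service": "pay",   "host": "h2", "ok": False},
    {"service": "pay",   "host": "h1", "ok": False},
    {"service": "query", "host": "h1", "ok": True},
    {"service": "query", "host": "h2", "ok": True},
]

def screen(records, dims=("service", "host")):
    """Score every dimension value by the share of failures it covers --
    a brute-force stand-in for the MCTS search described in the talk."""
    failures = [r for r in records if not r["ok"]]
    scores = Counter()
    for r in failures:
        for d in dims:
            scores[(d, r[d])] += 1
    total = len(failures)
    return [(k, n / total) for k, n in scores.most_common()]

print(screen(records))   # ('service', 'pay') covers 100% of failures
```

MCTS earns its keep when the number of dimension combinations makes exhaustive scoring like this infeasible.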
Fault propagation analysis: using a “god‑view” to pinpoint the failing module within long call chains, saving investigation effort.
Monitoring metric triage: scoring metric anomalies with non‑parametric tests and ranking modules by severity.
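One common non-parametric choice for such scoring is a Mann-Whitney-style rank statistic comparing a metric's window before the incident with the window after it; a plain-Python version is below. The module names and latency values are invented for illustration:

```python
def mann_whitney_u(before, after):
    """Plain-Python normalised Mann-Whitney U: how strongly the 'after'
    window ranks above the 'before' window (0.5 = no shift, 1.0 = max)."""
    u = 0.0
    for a in after:
        for b in before:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u / (len(before) * len(after))

# Hypothetical per-module latency windows around an incident.
modules = {
    "gateway":  ([10, 11, 9, 10], [10, 9, 11, 10]),
    "database": ([10, 11, 9, 10], [40, 42, 38, 41]),
}
ranked = sorted(modules, key=lambda m: mann_whitney_u(*modules[m]), reverse=True)
print(ranked)   # database first: its shift score is 1.0
```

Because the test is rank-based, it needs no assumption that the metric is normally distributed, which suits heterogeneous operational metrics.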
Log analysis: clustering similar logs, creating templates, converting frequent patterns into time‑series for anomaly detection; a real‑world example showed rapid identification of a file‑system issue.
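A minimal sketch of the templating step: collapse variable fields so structurally identical lines cluster, then count occurrences per template (the counts per time bucket then form an ordinary time series for anomaly detection). The sample log lines and regexes are assumptions, not the bank's actual format:

```python
import re
from collections import Counter

def template(line):
    """Collapse variable fields (hex ids, numbers) into placeholders so
    structurally identical log lines fall under one template."""
    line = re.sub(r"0x[0-9a-f]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", line)

logs = [
    "write failed on /dev/sda1 errno 28",
    "write failed on /dev/sda2 errno 28",
    "connection from 10.0.0.7 accepted",
    "write failed on /dev/sda1 errno 28",
]
counts = Counter(template(l) for l in logs)
print(counts.most_common(1))   # the 'write failed' template dominates
```

A sudden surge in one template's time series, as in the file-system example from the talk, stands out even when each individual line looks routine.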
The overall conclusion emphasizes that intelligent operations are not a panacea; they must be accurately positioned, abstracting repeatable manual steps for automation while preserving human expertise for high‑level decisions, and that data quality is a prerequisite for any AI‑driven solution.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany readers throughout their operations careers.