How a Banking Card Organization Built a Scalable Cloud Operations Platform
This article details the evolution from manual, standardized operations to an automated, intelligent cloud operations platform for a banking card organization, describing its motivations, core features, key scenarios, technical architecture, scheduling algorithms, data visualization, and real‑world outcomes.
Preface
Operations teams typically progress from standardization to tool‑based automation, and finally to intelligent platforms that consolidate tools, capture expertise, and enable self‑service.
Overview of the Banking Card Organization Cloud Operations Platform
The IT system was initially built on ITIL processes for change, incident, problem, and service management. Five years ago it evolved into a cloud‑native operations platform with an IaaS virtualization layer and a CMDB for unified data management.
Three primary needs drove the redesign:
Rapidly increasing hardware nodes required an efficient, adaptable automation platform to reduce repetitive work.
Capturing operators’ knowledge into a reusable, intelligent scenario library to improve quality and efficiency.
Embedding intelligence into traditionally manual workflow decisions to shift from human judgment to machine‑assisted analysis.
The resulting platform operates in a cloud environment and addresses these needs.
Key Product Features
The platform provides four major capabilities:
Unified resource scheduling across OpenStack, database, container, storage, network, and security services, exposing a single API for self‑service operations.
Comprehensive automation for data collection, application installation, configuration, updates, analysis, scaling, backup, and recovery.
Multi‑dimensional visualizations offering role‑specific views (network, system, monitoring, reporting) and customizable dashboards.
High‑performance handling of tens of thousands of nodes concurrently.
Platform Construction Scenarios
Core modules include execution, data collection, and integration with other processes, forming an "Operations OS" that manages automated topology, custom reports, and the full lifecycle from deployment to decommissioning.
Scenario 1: Lifecycle Management
Traditional manual deployments involve lengthy hand‑offs and potential errors. The platform digitizes parameter transmission and automates deployment, allowing users to select components and resources, with administrators confirming allocations and the system executing standardized, policy‑compliant installations.
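The digitized parameter flow described above can be sketched as a structured request that is validated against a standardized catalogue before an administrator approves it. The field names, component catalogue, and quota limits below are illustrative assumptions, not the platform's real schema.

```python
from dataclasses import dataclass

# Hypothetical catalogue of standardized, policy-compliant components.
CATALOGUE = {"nginx", "jboss", "mysql"}

@dataclass
class DeployRequest:
    """A digitized deployment request replacing manual hand-offs."""
    applicant: str
    component: str
    cpu_cores: int
    mem_gb: int
    approved: bool = False  # set by an administrator, not the requester

    def validate(self) -> None:
        """Reject requests outside the catalogue or self-service quotas."""
        if self.component not in CATALOGUE:
            raise ValueError(f"unknown component: {self.component}")
        if self.cpu_cores > 16 or self.mem_gb > 64:
            raise ValueError("request exceeds self-service quota")

req = DeployRequest("alice", "jboss", cpu_cores=4, mem_gb=8)
req.validate()       # passes standardization checks
req.approved = True  # administrator confirms the resource allocation
print(req.component, req.approved)
```

Once approved, such a request would be handed to the execution layer, so parameters are transmitted as data rather than re-typed by hand at each step.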
Scenario 2: Runtime Environment Management
Manages CPU, memory, IP, ports, access relationships, scheduled tasks, backup policies, and auto‑start services, replacing spreadsheet‑based tracking with automated configuration.
Scenario 3: Continuous Deployment Management
Standardizes version delivery, environment‑specific configuration libraries, multi‑node installation orchestration, unified alerting, and automated rollback procedures.
Scenario 4: Runtime Environment Maintenance
Integrates common operational tools for application restart, health checks, isolation, recovery, physical server testing, OpenStack integration, network device health, and periodic security inspections.
Scenario 5: Application Portrait
Aggregates architectural, version, parameter, maintenance, capacity, and high‑availability data to generate a comprehensive application profile and maturity assessment.
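A maturity assessment of this kind can be sketched as a weighted aggregate over the assessed dimensions. The dimension names, weights, and 0-100 scale below are assumptions for illustration only; the article does not specify the actual scoring model.

```python
# Hypothetical weighting of application-portrait dimensions (sums to 1.0).
WEIGHTS = {
    "architecture": 0.25,
    "high_availability": 0.30,
    "capacity": 0.20,
    "maintenance": 0.25,
}

def maturity_score(scores: dict) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    return round(sum(WEIGHTS[d] * scores.get(d, 0.0) for d in WEIGHTS), 1)

profile = {"architecture": 80, "high_availability": 90,
           "capacity": 70, "maintenance": 60}
print(maturity_score(profile))  # 76.0
```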
Technical Solution
Hardware status is collected via SNMP; virtual resources are managed through OpenStack APIs; a custom scheduler controls Linux and application operations.
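SNMP collection ultimately produces `OID = TYPE: value` records that a collector must parse before loading them into the CMDB. A minimal parsing sketch, assuming net-snmp-style `snmpget` output (the OID and sample line are illustrative, not captured data):

```python
import re

# Matches one net-snmp style output line: "OID = TYPE: value".
LINE_RE = re.compile(r"^(?P<oid>\S+)\s*=\s*(?P<type>\w+):\s*(?P<value>.+)$")

def parse_snmp_line(line: str) -> dict:
    """Split one `OID = TYPE: value` line into its three parts."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unparseable SNMP line: {line!r}")
    return m.groupdict()

sample = "IF-MIB::ifOperStatus.1 = INTEGER: up(1)"
print(parse_snmp_line(sample))
```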
The platform is container‑first, with front‑end services (Apache, HAProxy, Keepalived) and back‑end components (JBoss, RabbitMQ, Ansible, Zookeeper). Data stores include MySQL, Redis, and Ceph, plus a security module for high‑risk operation checks.
Business Flow Technology
Operations requests are packaged as messages, placed on a queue, and processed by the scheduler, which assigns tasks to Ansible nodes that execute via SSH and return results asynchronously.
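The request-to-result flow above can be sketched with a local queue standing in for RabbitMQ and a stub executor standing in for an Ansible run over SSH; the message fields and result format are illustrative assumptions.

```python
import json
import queue
import threading

request_queue: "queue.Queue" = queue.Queue()
results = {}

def submit(request_id: str, action: str, host: str) -> None:
    """Package an operations request as a message and enqueue it."""
    request_queue.put(json.dumps(
        {"id": request_id, "action": action, "host": host}))

def worker() -> None:
    """Scheduler/worker loop: pull messages, execute, store results."""
    while True:
        msg = request_queue.get()
        if msg is None:  # shutdown sentinel
            break
        task = json.loads(msg)
        # Stub for an Ansible run; a real worker would SSH to task["host"]
        # and return the playbook result asynchronously.
        results[task["id"]] = f"{task['action']} on {task['host']}: ok"
        request_queue.task_done()

t = threading.Thread(target=worker)
t.start()
submit("req-1", "restart", "10.1.0.5")
request_queue.join()       # wait for asynchronous completion
request_queue.put(None)    # stop the worker
t.join()
print(results["req-1"])    # restart on 10.1.0.5: ok
```

Because results come back asynchronously, callers correlate them by request ID rather than blocking on each execution.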
Scheduler Algorithm and Distributed Ansible Architecture
The scheduler tags messages by IP‑derived region, splits workloads per tag, and distributes them to Ansible workers. Each task carries a unique ID, enabling concurrent asynchronous execution.
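A minimal sketch of this tag-and-split step, assuming regions can be derived from the first two octets of a host's IP (the prefix-to-region map is invented for illustration, not the organization's real addressing plan):

```python
import uuid
from collections import defaultdict

# Hypothetical mapping from IP prefix to region tag.
REGION_BY_PREFIX = {
    "10.1": "dc-east",
    "10.2": "dc-west",
}

def region_tag(ip: str) -> str:
    """Derive a region tag from the first two octets of an IP address."""
    prefix = ".".join(ip.split(".")[:2])
    return REGION_BY_PREFIX.get(prefix, "default")

def schedule(hosts):
    """Split a workload into per-region batches, one unique task ID each."""
    batches = defaultdict(list)
    for ip in hosts:
        batches[region_tag(ip)].append(ip)
    # Each batch is dispatched to an Ansible worker in that region and
    # executes concurrently, tracked by its own task ID.
    return {str(uuid.uuid4()): {"region": tag, "hosts": ips}
            for tag, ips in batches.items()}

tasks = schedule(["10.1.0.5", "10.1.0.6", "10.2.3.7"])
for task_id, task in tasks.items():
    print(task_id, task["region"], len(task["hosts"]))
```

The unique task ID is what lets results from many concurrent regional batches be collected asynchronously without ambiguity.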
Data Visualization
Collectors gather metrics, synchronizers pull data from external platforms, and a core database stores them. Threshold engines generate alerts, analysis functions produce performance reports, and visual dashboards present the results.
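The threshold engine's core check can be sketched as follows; the metric names and limits are illustrative, and in the platform described they would presumably be configured per host or service rather than hard-coded.

```python
# Hypothetical upper limits per metric (percent utilization).
THRESHOLDS = {
    "cpu_percent": 85.0,
    "mem_percent": 90.0,
    "disk_percent": 80.0,
}

def evaluate(host: str, metrics: dict) -> list:
    """Return one alert per metric that exceeds its configured threshold."""
    alerts = []
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is not None and value > limit:
            alerts.append({"host": host, "metric": name,
                           "value": value, "limit": limit})
    return alerts

print(evaluate("10.1.0.5", {"cpu_percent": 92.3, "mem_percent": 40.0}))
```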
Platform Achievements
The platform now manages thousands of virtual servers; provides data center and rack views; auto‑generates topology from SNMP, OpenStack, and Ansible data; supports fine‑grained permission management and customizable data synchronization; and offers self‑service management of backups, startup items, and scheduled tasks.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany readers throughout their operations careers.