
How a Banking Card Organization Built a Scalable Cloud Operations Platform

This article details the evolution from manual, standardized operations to an automated, intelligent cloud operations platform for a banking card organization, describing its motivations, core features, key scenarios, technical architecture, scheduling algorithms, data visualization, and real‑world outcomes.


Preface

Operations teams typically progress from standardization to tool‑based automation, and finally to intelligent platforms that consolidate tools, capture expertise, and enable self‑service.

Overview of the Banking Card Organization Cloud Operations Platform

The IT system was initially built on ITIL processes for change, incident, problem, and service management. Five years ago it evolved into a cloud‑native operations platform with an IaaS virtualization layer and a CMDB for unified data management.

Three primary needs drove the redesign:

Rapidly increasing hardware nodes required an efficient, adaptable automation platform to reduce repetitive work.

Capturing operators’ knowledge into a reusable, intelligent scenario library to improve quality and efficiency.

Embedding intelligence into traditionally manual workflow decisions to shift from human judgment to machine‑assisted analysis.

The resulting platform operates in a cloud environment and addresses these needs.

Key Product Features

The platform provides four major capabilities:

Unified resource scheduling across OpenStack, database, container, storage, network, and security services, exposing a single API for self‑service operations.

Comprehensive automation for data collection, application installation, configuration, updates, analysis, scaling, backup, and recovery.

Multi‑dimensional visualizations offering role‑specific views (network, system, monitoring, reporting) and customizable dashboards.

High‑performance handling of tens of thousands of nodes concurrently.

Platform Construction Scenarios

Core modules include execution, data collection, and integration with other processes, forming an "Operations OS" that manages automated topology, custom reports, and the full lifecycle from deployment to decommissioning.

Scenario 1: Lifecycle Management

Traditional manual deployments involve lengthy hand‑offs and potential errors. The platform digitizes parameter transmission and automates deployment, allowing users to select components and resources, with administrators confirming allocations and the system executing standardized, policy‑compliant installations.
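The digitized flow above can be sketched in Python. This is a minimal illustration, not the platform's actual code: the `DeploymentRequest` fields, the policy checks, and the confirmation flag are all hypothetical stand-ins for whatever the real parameter schema looks like.

```python
from dataclasses import dataclass

# Hypothetical sketch of a digitized deployment request: the user selects
# components and resources, an administrator confirms the allocation, and
# only then does the platform execute a standardized installation.
@dataclass
class DeploymentRequest:
    app_name: str
    components: list          # e.g. ["jboss", "mysql"]
    cpu_cores: int
    memory_gb: int
    confirmed: bool = False   # set by an administrator, not the requester

    def validate(self):
        """Reject requests that violate platform policy before execution."""
        errors = []
        if not self.components:
            errors.append("at least one component is required")
        if self.cpu_cores < 1 or self.memory_gb < 1:
            errors.append("cpu/memory must be positive")
        return errors

def execute(request: DeploymentRequest) -> str:
    """Run the standardized install only for confirmed, valid requests."""
    errors = request.validate()
    if errors:
        return "rejected: " + "; ".join(errors)
    if not request.confirmed:
        return "pending administrator confirmation"
    return f"deploying {', '.join(request.components)} for {request.app_name}"
```

Because the request is structured data rather than an email or a spreadsheet row, every hand-off step can be validated and audited automatically.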

Scenario 2: Runtime Environment Management

Manages CPU, memory, IP, ports, access relationships, scheduled tasks, backup policies, and auto‑start services, replacing spreadsheet‑based tracking with automated configuration.
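One concrete payoff of replacing spreadsheets with structured records is automatic conflict detection. The sketch below is illustrative (the record fields are assumptions): it flags two applications that claim the same IP and port, a check that is error-prone when done by eye in a spreadsheet.

```python
# Hypothetical sketch: runtime-environment records kept as structured data
# instead of spreadsheets, so conflicts (two apps claiming the same
# IP:port pair) can be detected automatically.
def find_port_conflicts(records):
    """records: list of dicts with 'app', 'ip', 'port'. Returns conflicts."""
    seen = {}
    conflicts = []
    for rec in records:
        key = (rec["ip"], rec["port"])
        if key in seen and seen[key] != rec["app"]:
            conflicts.append((seen[key], rec["app"], key))
        else:
            seen[key] = rec["app"]
    return conflicts
```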

Scenario 3: Continuous Deployment Management

Standardizes version delivery, environment‑specific configuration libraries, multi‑node installation orchestration, unified alerting, and automated rollback procedures.
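The rollout-plus-rollback orchestration can be sketched as follows. This is an assumed shape, not the platform's implementation: the `deploy`, `health_check`, and `rollback` callables are injected so that real actions (Ansible playbooks, alert hooks) can plug in behind the same control flow.

```python
# Hypothetical sketch of orchestrated multi-node rollout with automated
# rollback: deploy node by node, health-check each, and roll back every
# touched node if any check fails.
def rolling_deploy(nodes, deploy, health_check, rollback):
    done = []
    for node in nodes:
        deploy(node)
        done.append(node)
        if not health_check(node):
            for n in reversed(done):   # undo in reverse order of deployment
                rollback(n)
            return ("rolled_back", done)
    return ("deployed", done)
```

Keeping the rollback path in the same orchestration as the rollout means a failed health check never depends on a human remembering the undo procedure.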

Scenario 4: Runtime Environment Maintenance

Integrates common operational tools for application restart, health checks, isolation, recovery, physical server testing, OpenStack integration, network device health, and periodic security inspections.

Scenario 5: Application Portrait

Aggregates architectural, version, parameter, maintenance, capacity, and high‑availability data to generate a comprehensive application profile and maturity assessment.
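A maturity assessment of this kind typically reduces to a weighted combination of per-dimension scores. The dimensions and weights below are illustrative assumptions, not the organization's actual scoring model:

```python
# Hypothetical sketch of a maturity assessment: weight each profile
# dimension and combine the per-dimension scores into one figure.
WEIGHTS = {"architecture": 0.3, "capacity": 0.2,
           "high_availability": 0.3, "maintenance": 0.2}

def maturity_score(profile):
    """profile maps dimension name -> score in [0, 100]; missing -> 0."""
    return round(sum(WEIGHTS[d] * profile.get(d, 0) for d in WEIGHTS), 1)
```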

Technical Solution

Hardware status is collected via SNMP; virtual resources are managed through OpenStack APIs; a custom scheduler controls Linux and application operations.
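The collection layer can be pictured as one loop over two fetchers. In this sketch both fetchers are injected as plain callables (an assumption made so the loop stays self-contained); in production they would wrap an SNMP library and the OpenStack APIs respectively.

```python
# Hypothetical sketch of the collection loop: hardware status via an SNMP
# fetcher, virtual resources via an OpenStack fetcher, merged per host.
def collect(hosts, snmp_fetch, openstack_fetch):
    inventory = {}
    for host in hosts:
        inventory[host] = {
            "hardware": snmp_fetch(host),      # e.g. fan, temp, PSU status
            "virtual": openstack_fetch(host),  # e.g. VMs on this hypervisor
        }
    return inventory
```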

The platform is container‑first, with front‑end services (Apache, HAProxy, Keepalived) and back‑end components (JBoss, RabbitMQ, Ansible, Zookeeper). Data stores include MySQL, Redis, and Ceph, plus a security module for high‑risk operation checks.

Business Flow Technology

Operations requests are packaged as messages, placed on a queue, and processed by the scheduler, which assigns tasks to Ansible nodes that execute via SSH and return results asynchronously.
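The message-packaging step can be sketched with Python's standard `queue` and `uuid` modules. The message fields here are assumptions; the real platform uses RabbitMQ rather than an in-process queue, but the shape of the flow is the same: package, enqueue, return an ID for asynchronous result retrieval.

```python
import queue
import uuid

# Hypothetical sketch of the business flow: each operations request is
# packaged as a message with a unique task ID, queued, and later picked
# up by the scheduler for an Ansible node to execute asynchronously.
task_queue = queue.Queue()

def submit(operation, target_ip):
    task_id = str(uuid.uuid4())
    task_queue.put({"id": task_id, "op": operation, "target": target_ip})
    return task_id   # caller polls for the result by ID, asynchronously

def next_task():
    """Called by the scheduler to pull the next pending request."""
    return task_queue.get_nowait()
```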

Scheduler Algorithm and Distributed Ansible Architecture

The scheduler tags messages by IP‑derived region, splits workloads per tag, and distributes them to Ansible workers. Each task carries a unique ID, enabling concurrent asynchronous execution.
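The tag-split-distribute step can be sketched as below. The choice of the /16 prefix as the IP-derived region tag is an assumption for illustration; the article does not specify how the tag is derived.

```python
from collections import defaultdict

# Hypothetical sketch of the scheduler's distribution step: derive a
# region tag from each target IP, group tasks by tag, and hand each
# group to that region's Ansible worker.
def distribute(tasks, workers_by_region):
    buckets = defaultdict(list)
    for task in tasks:
        region = ".".join(task["target"].split(".")[:2])  # IP-derived tag
        buckets[region].append(task)
    assignments = defaultdict(list)
    for region, group in buckets.items():
        worker = workers_by_region.get(region, "default-worker")
        assignments[worker].extend(group)
    return dict(assignments)
```

Splitting by region tag keeps each Ansible worker's SSH traffic local to its own network segment, which is what makes concurrent execution across tens of thousands of nodes tractable.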

Data Visualization

Collectors gather metrics, synchronizers pull data from external platforms, and a core database stores them. Threshold engines generate alerts, analysis functions produce performance reports, and visual dashboards present the results.
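The threshold engine reduces to a comparison of collected metrics against per-metric limits. The metric names and limits below are illustrative assumptions:

```python
# Hypothetical sketch of the threshold engine: compare collected metrics
# against per-metric limits and emit an alert line for each breach.
THRESHOLDS = {"cpu_pct": 90, "mem_pct": 85, "disk_pct": 80}

def evaluate(host, metrics):
    """metrics maps metric name -> observed value for one host."""
    return [f"{host}: {name}={value} exceeds {THRESHOLDS[name]}"
            for name, value in metrics.items()
            if name in THRESHOLDS and value > THRESHOLDS[name]]
```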

Platform Achievements

The platform now manages thousands of virtual servers, provides data center and rack views, auto‑generates topology from SNMP, OpenStack, and Ansible data, supports fine‑grained permission management and customizable data synchronization, and offers self‑service management of backups, startup items, and scheduled tasks.

Tags: Monitoring, Automation, platform architecture, cloud operations, service orchestration, operations management
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
