How to Build a Visualized Distributed Ops Platform for Cloud Environments
This article details the design and implementation of a visualized, automated operations platform that integrates inspection, job scheduling, configuration management with SaltStack, data lifecycle automation, and real‑time big‑data analytics to improve efficiency, reliability, and agility of cloud‑based IT services.
1. Automated Inspection
From 2011 the team ran health checks by hand, using Word documents, Excel sheets and ad-hoc shell scripts. This was inefficient and covered only basic host, database and middleware status. In December 2014 an automated inspection tool was introduced. It consolidated the existing scripts behind a visual interface that runs scheduled checks, generates reports, and covers hosts, databases, middleware and applications. In 2015 business-level checks were added, bringing coverage to ten categories and 28 items and enabling continuous monitoring of critical services.
The inspection tool classifies scripts, allows customisation of inspection items, visualises all actions, and automatically sends reports via email or SMS.
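The article does not show the tool's internals. As a rough sketch of the idea, assuming each inspection item is a small callable that returns a pass/fail flag and a detail string, a runner and a plain-text report could look like the following. The names `InspectionItem`, `run_inspection` and `render_report` are hypothetical and not taken from the original platform.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class InspectionItem:
    category: str                          # e.g. "host", "database", "middleware"
    name: str                              # human-readable item name
    check: Callable[[], Tuple[bool, str]]  # returns (passed, detail)


def run_inspection(items: List[InspectionItem]) -> List[Tuple[str, str, bool, str]]:
    """Run every item; a check that raises is recorded as a failure."""
    results = []
    for item in items:
        try:
            passed, detail = item.check()
        except Exception as exc:
            passed, detail = False, f"check raised {exc!r}"
        results.append((item.category, item.name, passed, detail))
    return results


def render_report(results: List[Tuple[str, str, bool, str]]) -> str:
    """Group results by category and append a one-line summary."""
    lines = []
    for category in sorted({r[0] for r in results}):
        lines.append(f"[{category}]")
        for cat, name, passed, detail in results:
            if cat == category:
                status = "OK" if passed else "FAIL"
                lines.append(f"  {status:4} {name}: {detail}")
    failed = sum(1 for r in results if not r[2])
    lines.append(f"{len(results)} items checked, {failed} failed")
    return "\n".join(lines)
```

The text returned by `render_report` is what a scheduler would then hand to the email or SMS sender. Delivery is left out here because the article does not describe the transport.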
2. Automated Jobs
Routine operations such as host, middleware and database maintenance are repetitive and error‑prone. The automation platform turns script‑heavy tasks into click‑through scenarios with parameter input, dramatically reducing manual effort.
Typical workflow:
Administrator configures tasks in the task‑configuration page, assigning them to categories.
Configured tasks appear in the task view; the user selects a task, chooses target devices, inputs parameters, and clicks “Execute”.
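One way to picture the workflow above is a task template that the administrator defines once and that users fill in at execution time. The sketch below is only an illustration under that assumption. `TaskTemplate` and `build_commands` are hypothetical names, and the example command is made up rather than taken from the platform.

```python
import shlex
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaskTemplate:
    name: str
    category: str
    command: str                                  # str.format placeholders, e.g. "{path}"
    parameters: List[str] = field(default_factory=list)


def build_commands(task: TaskTemplate, devices: List[str],
                   values: Dict[str, object]) -> Dict[str, str]:
    """Validate the user's input and render one command per target device."""
    missing = [p for p in task.parameters if p not in values]
    if missing:
        raise ValueError(f"missing parameters: {', '.join(missing)}")
    # Quote every value so user input cannot break out of the shell command.
    quoted = {key: shlex.quote(str(value)) for key, value in values.items()}
    rendered = task.command.format(**quoted)
    return {device: rendered for device in devices}


# Hypothetical task: delete files older than N days under a directory.
clean_tmp = TaskTemplate(
    name="clean-old-files",
    category="host",
    command="find {path} -mtime +{days} -delete",
    parameters=["path", "days"],
)
plan = build_commands(clean_tmp, ["app01", "app02"], {"path": "/tmp/app logs", "days": 7})
```

Rendering the command before dispatch also gives the UI something concrete to show the user for confirmation before "Execute" takes effect.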
3. Automated Visualization Deployment
To cope with the rapid growth in machine count after the cloud migration, the team built a web visualisation layer on top of SaltStack, which ships without a built-in web UI. SaltStack provides centralized configuration management, remote execution and monitoring for thousands of servers.
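A web layer of this kind has to turn Salt's command output into something it can display. The article does not say how the team did this. A minimal sketch, assuming the backend captures the output of `salt '*' test.ping --out=json --static` as a single JSON object keyed by minion ID, might be the following. The function name and the sample minion IDs are hypothetical.

```python
import json
from typing import Dict, List


def summarize_ping(salt_json: str) -> Dict[str, List[str]]:
    """Split minions into reachable and unreachable based on test.ping results."""
    results = json.loads(salt_json)
    # test.ping returns True for a responding minion; anything else
    # (False, an error string) is treated here as "down".
    up = sorted(m for m, reply in results.items() if reply is True)
    down = sorted(m for m, reply in results.items() if reply is not True)
    return {"up": up, "down": down}


# Hypothetical captured output; the minion IDs are made up for illustration.
sample = '{"redhat-db1": true, "redhat-app1": true, "suse-app1": false}'
status = summarize_ping(sample)
```

The same pattern (run a Salt command with JSON output, parse, reshape for the page) applies to other execution modules. The exact shape of the reply for unresponsive minions can vary between Salt versions, which is why anything other than `True` is counted as down here.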
Architecture:
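Salt is installed from source; the master runs on a RHEL 6.5 host, and minions are installed on all managed nodes. Authentication is key-based: each minion presents a public key that the master must accept before it will manage that node. On each minion, /etc/salt/minion sets the master's address (master) and the minion ID (id), while master-side settings live in /etc/salt/master. Configuration files follow YAML key: value syntax.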
cat /etc/salt/master.d/nodegroup.conf
nodegroups:
  redhatDatabase: 'redhat-db'
  redhatAPP: 'redhat-app'
  suseAPP: 'suse-app'
  suseDatabase: 'suse-db'
Grains are used to query node information:
salt 'redhat-db1' grains.ls # list grain categories
salt 'redhat-db1' grains.items # show all grain data
salt 'redhat-db1' grains.item osrelease # specific grain
4. Automated Data Management
In cloud environments the number of databases and the volume of data (potentially exabytes) make manual data management infeasible. The team designed an automated data‑lifecycle platform that defines policies for data retention, migration, and cleanup, and executes them without human intervention.
Policy management: define lifecycle, migration, and cleanup policies; schedule jobs.
Data migration: automatically move data between storage tiers according to policies.
Data cleanup: configurable automatic or manual execution based on data importance.
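The three functions above can be read as a policy object plus a decision rule applied to each dataset. The sketch below assumes a policy based only on data age, with cleanup either automatic or held for manual approval. The field names and thresholds are hypothetical, since the article does not describe the platform's actual policy model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LifecyclePolicy:
    migrate_after_days: int     # move to a cheaper storage tier at this age
    cleanup_after_days: int     # data is eligible for deletion at this age
    auto_cleanup: bool = True   # False: important data waits for manual approval


def decide_action(created: date, today: date, policy: LifecyclePolicy) -> str:
    """Return "keep", "migrate", "cleanup" or "await_approval" for one dataset."""
    age_days = (today - created).days
    if age_days >= policy.cleanup_after_days:
        return "cleanup" if policy.auto_cleanup else "await_approval"
    if age_days >= policy.migrate_after_days:
        return "migrate"
    return "keep"


# Hypothetical policies: logs are cleaned automatically, billing data is not.
log_policy = LifecyclePolicy(migrate_after_days=30, cleanup_after_days=180)
billing_policy = LifecyclePolicy(migrate_after_days=90, cleanup_after_days=365,
                                 auto_cleanup=False)
```

A scheduled job would evaluate `decide_action` for every dataset and queue the resulting migrations and cleanups, which keeps the policy definition separate from the code that moves or deletes data.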
5. Analysis – Real‑Time Big Data for Operations
The team built a pipeline using Flume, Kafka, Spark Streaming and Redis to collect, transport, process and store logs from hosts, middleware and applications. Data is ingested in real time, indexed in Elasticsearch, and persisted in MySQL for reporting.
Key analysis scenarios include:
Mapping user actions to business steps to identify performance bottlenecks across network, client, and server layers.
Understanding user behaviour for UI optimisation and targeted advertising.
Detecting intrusion attempts by correlating business‑level operations with network traffic.
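To make the first scenario concrete, the sketch below shows the kind of aggregation involved: group latency samples by user action and layer, then report the slowest layer for each action. It is plain Python standing in for what would be a Spark Streaming job in the pipeline described above. The event shape `(action, layer, latency_ms)` and the sample values are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, float]  # (user action, layer, latency in ms)


def slowest_layer_per_action(events: Iterable[Event]) -> Dict[str, Tuple[str, float]]:
    """For each action, return the layer with the highest mean latency."""
    samples = defaultdict(list)
    for action, layer, latency_ms in events:
        samples[(action, layer)].append(latency_ms)

    averages: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (action, layer), values in samples.items():
        averages[action][layer] = mean(values)

    return {
        action: max(layers.items(), key=lambda item: item[1])
        for action, layers in averages.items()
    }


# Made-up events for two business steps.
events = [
    ("login", "network", 20), ("login", "server", 120),
    ("login", "server", 80), ("login", "client", 15),
    ("pay", "network", 300), ("pay", "server", 90),
]
bottlenecks = slowest_layer_per_action(events)
```

In a streaming setting the same grouping would run over a time window rather than a finished list, so the result tracks the current bottleneck instead of a historical average.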
6. Service‑Oriented Cloud Ops Architecture
Moving from ITIL‑centric processes to a service‑oriented model, the team reorganised operations into “service operation” and “resource operation” dimensions, built a resource‑pool management platform, and exposed infrastructure capabilities as services. This reduces the need for specialised expertise and enables a small team (3‑4 engineers) to manage thousands of servers.
Future work includes consolidating the disparate automation tools into a unified platform for coordinated operation.
dbaplus Community
