
How I Cut 80% of Ops Time with an Automated Service Management System

This article details a complete automated operations framework that replaces manual service restarts, log cleaning, and deployment tasks with health‑checks, systemd units, Kubernetes probes, monitoring scripts, fault‑diagnosis tools, auto‑scaling policies, and Ansible playbooks, saving roughly 80% of repetitive work and dramatically improving reliability.

Background and Pain Points

Manual operations in a typical production environment consume more than 70% of engineers' time and introduce high risk of human error. Common tasks include:

Service restart: 30-40% of daily work, often required due to resource leaks or crashes.

Log cleanup: 15-20% of work, needed when disks fill up.

Monitoring review: 20-25% of time spent checking dashboards.

Fault investigation: 15-20% of effort to diagnose alerts.

Deployment: 10-15% of time for manual release steps.

These repetitive tasks lead to long on-call shifts (12+ hour days), an average fault-recovery time of 15 minutes, and frequent midnight wake-ups.

Automation Benefits

Time savings: restart reduced from 5 min to 30 s; fault recovery from 20 min to 1 min; log management fully automated; scaling from 2 h to 5 min.

Reliability boost: human error rate drops from ~15% to <1%; success rate of automatic recovery rises above 99%.

Business continuity: service availability improves from 99.5% to 99.95%; MTTR drops from 15 min to 30 s; MTBF extends from 1 week to 1 month (see the downtime-budget calculation below).

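To put the availability figures in context, the downtime budget implied by each target is easy to compute (a quick sanity check with bc, assuming a 30-day month):

echo "(1 - 0.995)  * 30*24*60" | bc -l    # 99.5%  allows ~216 minutes of downtime per month
echo "(1 - 0.9995) * 30*24*60" | bc -l    # 99.95% allows ~21.6 minutes per month
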
Architecture Overview

        ┌───────────────────────────────────┐
        │     Monitoring & Alert Center     │
        │       (Prometheus / Zabbix)       │
        └─────────────────┬─────────────────┘
                          │ Metric collection
                          ↓
        ┌───────────────────────────────────┐
        │     Automated Decision Engine     │
        │  - Health-check evaluation        │
        │  - Fault diagnosis                │
        │  - Self-healing policy selection  │
        └─────────────────┬─────────────────┘
                          │ Trigger execution
                          ↓
┌────────────────────────────────────────────────────────┐
│               Automation Execution Layer               │
├───────────────┬────────────┬─────────────┬─────────────┤
│ Auto-restart  │ Auto-scale │ Auto-deploy │ Auto-backup │
│ (systemd/K8s) │ (HPA/VPA)  │ (CI/CD)     │ (Scripts)   │
└───────────────┴────────────┴─────────────┴─────────────┘
                          │
                          ↓
┌────────────────────────────────────────────────────────┐
│                Log Audit & Notification                │
│  - Operation logs                                      │
│  - DingTalk / WeChat alerts                            │
│  - Grafana visualisation                               │
└────────────────────────────────────────────────────────┘

Service Self‑Healing

Kubernetes health‑check configuration – Liveness, readiness and startup probes are defined in the deployment manifest. The probes call HTTP endpoints /health/liveness, /health/readiness and /health/startup respectively, with appropriate initialDelaySeconds, periodSeconds and failure thresholds.

# deployment-with-health-check.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-application
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: myregistry.com/web-app:v1.2.0
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health/liveness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/readiness
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        env:
        - name: JAVA_OPTS
          value: "-Xmx400m -Xms400m"
      restartPolicy: Always
      terminationGracePeriodSeconds: 30

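One way to verify the probes behave as intended after applying the manifest (illustrative kubectl commands; they assume the current context points at the production cluster):

kubectl apply -f deployment-with-health-check.yaml
kubectl -n production rollout status deployment/web-application
kubectl -n production get pods -l app=web-app -w                      # RESTARTS climbs when the liveness probe fails
kubectl -n production describe pod -l app=web-app | grep -E -iA1 'liveness|readiness|startup'
kubectl -n production get events --field-selector reason=Unhealthy    # probe failures surface as Unhealthy events
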
Systemd service auto‑restart – For legacy VMs the following unit ensures the Java process is always restarted, limits resources, and logs to the journal.

# /etc/systemd/system/web-app.service
[Unit]
Description=Web Application Service
After=network.target mysql.service redis.service
Wants=mysql.service redis.service

[Service]
Type=simple
User=webapp
Group=webapp
WorkingDirectory=/opt/webapp
ExecStart=/usr/bin/java -jar /opt/webapp/app.jar --spring.profiles.active=production
ExecStartPre=/opt/scripts/pre-start-check.sh
ExecStop=/bin/kill -SIGTERM $MAINPID
Restart=always
RestartSec=10s
StartLimitInterval=300s
StartLimitBurst=5
StartLimitAction=reboot
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=30s
TimeoutStartSec=60s
LimitNOFILE=65535
LimitNPROC=4096
MemoryLimit=2G
CPUQuota=200%
NoNewPrivileges=true
PrivateTmp=true
StandardOutput=journal
StandardError=journal
SyslogIdentifier=webapp

[Install]
WantedBy=multi-user.target
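
A quick way to exercise the Restart=always behaviour once the unit is installed (a sketch to run on the VM hosting the service):

sudo systemctl daemon-reload
sudo systemctl enable --now web-app

# Simulate a crash and confirm systemd brings the process back after RestartSec=10s
sudo pkill -9 -f 'java.*app.jar'
sleep 15 && systemctl is-active web-app

# Follow the service output in the journal
journalctl -u web-app -f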

Pre-start health-check script – validates port availability, disk space, dependency reachability, and configuration syntax before the service starts (a sketch of these checks follows the skeleton below).

#!/bin/bash
# /opt/scripts/pre-start-check.sh
set -e
LOG_FILE="/var/log/webapp/pre-start-check.log"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"; }
# Example checks (port, disk, MySQL, Redis, config file) …
log "All pre‑start checks passed"
exit 0
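
The elided checks might look like the sketch below; the port, dependency endpoints, and config path are assumptions taken from the configuration shown earlier, not the original script.

#!/bin/bash
# Illustrative pre-start checks (assumed details: port 8080, MySQL/Redis on localhost, config under /opt/webapp)
set -e

# 1. The application port must be free
if ss -tln | grep -q ':8080 '; then
  echo "Port 8080 already in use" >&2; exit 1
fi

# 2. At least 1 GB free on the application volume
avail_kb=$(df --output=avail /opt/webapp | tail -1 | tr -d ' ')
[ "$avail_kb" -ge 1048576 ] || { echo "Less than 1 GB free on /opt/webapp" >&2; exit 1; }

# 3. Dependencies reachable (hosts/ports assumed)
timeout 3 bash -c '</dev/tcp/127.0.0.1/3306' || { echo "MySQL unreachable" >&2; exit 1; }
timeout 3 bash -c '</dev/tcp/127.0.0.1/6379' || { echo "Redis unreachable" >&2; exit 1; }

# 4. Configuration file present and readable
[ -r /opt/webapp/config/application.yml ] || { echo "Missing application.yml" >&2; exit 1; }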

Custom Monitoring Script (Service Guard)

A Bash daemon monitors the process, performs HTTP health checks, limits restart frequency, and sends alerts via a configurable webhook (e.g., WeChat or DingTalk).

#!/bin/bash
SERVICE_NAME="web-app"
PROCESS_PATTERN="java.*app.jar"
START_COMMAND="/opt/scripts/start-webapp.sh"
STOP_COMMAND="/opt/scripts/stop-webapp.sh"
HEALTH_CHECK_URL="http://localhost:8080/health"
CHECK_INTERVAL=30
MAX_RESTART_PER_HOUR=5
ALERT_WEBHOOK="https://example.com/webhook"
# Core functions: log, send_alert, is_process_running, http_health_check, check_restart_limit, stop_service, start_service, restart_service
# Main loop runs forever, invoking deep_health_check and restarting when needed.
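
The omitted functions and loop are summarised in the illustrative sketch below; it assumes the configuration variables defined above, and the counter file and exact control flow are hypothetical rather than the original implementation.

# Illustrative main loop with restart-rate limiting
RESTART_LOG="/var/run/web-app-restarts.log"    # hypothetical timestamp file, one epoch second per restart

check_restart_limit() {
  # Allow a restart only if fewer than MAX_RESTART_PER_HOUR restarts happened in the last hour
  local cutoff restarts
  cutoff=$(date -d '1 hour ago' +%s)
  restarts=$(awk -v c="$cutoff" '$1 + 0 >= c + 0' "$RESTART_LOG" 2>/dev/null | wc -l)
  [ "$restarts" -lt "$MAX_RESTART_PER_HOUR" ]
}

while true; do
  if ! pgrep -f "$PROCESS_PATTERN" >/dev/null; then
    reason="process not running"
  elif ! curl -sf --max-time 5 "$HEALTH_CHECK_URL" >/dev/null; then
    reason="health check failed"
  else
    sleep "$CHECK_INTERVAL"
    continue
  fi

  if check_restart_limit; then
    date +%s >> "$RESTART_LOG"
    "$STOP_COMMAND" || true
    sleep 5
    "$START_COMMAND"
    curl -s -X POST "$ALERT_WEBHOOK" -H 'Content-Type: application/json' \
      -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"$SERVICE_NAME restarted: $reason\"}}"
  else
    curl -s -X POST "$ALERT_WEBHOOK" -H 'Content-Type: application/json' \
      -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"$SERVICE_NAME restart limit reached, manual intervention needed\"}}"
    sleep 600
  fi
done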

Automated Fault Diagnosis

A Python 3 script collects system state, process information, resource usage, network status, dependency health, JVM metrics (if applicable), and relevant logs. The data is stored as JSON and a human‑readable text report under /var/log/diagnostics/<service_name>.

#!/usr/bin/env python3
"""Service fault automatic diagnosis script"""
import os, json, subprocess, datetime
class ServiceDiagnostics:
    def __init__(self, service_name, process_pattern, log_paths):
        self.service_name = service_name
        self.process_pattern = process_pattern
        self.log_paths = log_paths
        self.report_dir = f"/var/log/diagnostics/{service_name}"
        os.makedirs(self.report_dir, exist_ok=True)
    def run_command(self, cmd, timeout=30):
        try:
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
            return result.stdout if result.returncode == 0 else result.stderr
        except Exception as e:
            return f"Command failed: {e}"
    # Methods: check_process_status, check_system_resources, check_network_status, collect_logs, check_dependencies, analyze_jvm, generate_report
    # generate_report() writes JSON and a formatted text file.

def main():
    diag = ServiceDiagnostics(
        service_name="web-app",
        process_pattern="java.*app.jar",
        log_paths=["/opt/webapp/logs/app.log", "/opt/webapp/logs/error.log"]
    )
    diag.generate_report()
if __name__ == "__main__":
    main()
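
A hedged usage sketch, assuming the script is deployed to the path used by the Ansible playbook later in this article and that the JSON report has a timestamped file name:

# Run a diagnosis manually and inspect the newest report
sudo python3 /opt/ops-scripts/service_diagnostics.py
ls -lt /var/log/diagnostics/web-app/ | head
jq . "$(ls -t /var/log/diagnostics/web-app/*.json | head -1)"    # pretty-print the latest JSON report (requires jq)

In practice the service guard can call this script whenever it triggers a restart, so a snapshot of the failure state is preserved before recovery.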

Auto‑Scaling Solutions

Horizontal Pod Autoscaler (HPA) – Scales the deployment between 3 and 20 replicas based on CPU, memory, QPS and P99 latency custom metrics.

# hpa-configuration.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 5
        periodSeconds: 60
      selectPolicy: Max
  metrics:
  - type: Resource
    resource:
      name: cpu
    target:
      type: Utilization
      averageUtilization: 70
  - type: Resource
    resource:
      name: memory
    target:
      type: Utilization
      averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  - type: Pods
    pods:
      metric:
        name: http_request_duration_p99
      target:
        type: AverageValue
        averageValue: "500m"
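
Once applied, the HPA's decisions can be observed with standard kubectl commands (a brief sketch; the two custom metrics only resolve after the Prometheus adapter described next is deployed):

kubectl apply -f hpa-configuration.yaml
kubectl -n production get hpa web-app-hpa --watch
kubectl -n production describe hpa web-app-hpa    # shows current vs. target values for all four metrics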

Prometheus‑Adapter custom metrics – Exposes QPS and latency as per‑second and P99 metrics for the HPA.

# prometheus-adapter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total{namespace="production",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
    - seriesQuery: 'http_request_duration_seconds_bucket{namespace="production",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_seconds_bucket$"
        as: "${1}_p99"
      metricsQuery: 'histogram_quantile(0.99, sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (le, <<.GroupBy>>))'

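A quick way to confirm the adapter is serving the derived metrics before the HPA depends on them (illustrative commands, assuming the custom metrics API is registered and jq is installed):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r '.resources[].name'
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .
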
Vertical Pod Autoscaler (VPA) – Adjusts CPU and memory requests for the deployment within defined bounds.

# vpa-configuration.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: web-app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 2Gi
      controlledResources:
      - cpu
      - memory
      controlledValues: RequestsAndLimits
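
Recommendations can be inspected once the VPA components are installed (sketch below). One caveat worth noting: running VPA in "Auto" mode next to an HPA that also scales on CPU and memory can make the two controllers fight, so a common compromise is to keep VPA in recommendation-only mode ("Off") or to drop the CPU/memory targets from the HPA.

kubectl apply -f vpa-configuration.yaml
kubectl -n production describe vpa web-app-vpa    # the Recommendation section lists target / lower / upper bounds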

Complete Ansible Playbook for Deployment

The playbook provisions script directories, installs required tools, creates a monitoring user, deploys the monitoring and diagnostic scripts, configures a systemd service for the monitor, sets up log‑cleanup and disk‑monitor cron jobs, and ensures the monitor service is started and enabled.

# automated-ops-playbook.yaml
---
- name: Deploy automated operations system
  hosts: all
  become: yes
  vars:
    ops_scripts_dir: /opt/ops-scripts
    monitor_user: opsmonitor
    log_dir: /var/log/auto-ops
  tasks:
    - name: Create script directories
      file:
        path: "{{ item }}"
        state: directory
        mode: '0755'
      loop:
        - "{{ ops_scripts_dir }}"
        - "{{ log_dir }}"
        - /var/log/diagnostics
    - name: Install required tools
      yum:
        name:
          - python3
          - python3-pip
          - sysstat
          - nethogs
          - dstat
          - bc
        state: present
    - name: Create monitoring user
      user:
        name: "{{ monitor_user }}"
        shell: /bin/bash
        createhome: yes
        system: yes
    - name: Deploy service monitor script
      template:
        src: service_monitor.sh.j2
        dest: "{{ ops_scripts_dir }}/service_monitor.sh"
        mode: '0755'
        owner: "{{ monitor_user }}"
    - name: Deploy diagnostic script
      copy:
        src: service_diagnostics.py
        dest: "{{ ops_scripts_dir }}/service_diagnostics.py"
        mode: '0755'
        owner: "{{ monitor_user }}"
    - name: Create systemd unit for monitor
      template:
        src: service-monitor.service.j2
        dest: /etc/systemd/system/service-monitor.service
        mode: '0644'
      notify: reload systemd
    - name: Deploy log cleanup script
      copy:
        dest: "{{ ops_scripts_dir }}/log_cleanup.sh"
        mode: '0755'
        content: |
          #!/bin/bash
          find /var/log -name "*.log" -type f -mtime +7 -delete
          find {{ log_dir }} -name "*.log" -type f -mtime +7 -delete
    - name: Schedule log cleanup (2:00 AM daily)
      cron:
        name: "Log cleanup"
        minute: "0"
        hour: "2"
        job: "{{ ops_scripts_dir }}/log_cleanup.sh"
        user: root
    - name: Deploy disk monitor script
      copy:
        dest: "{{ ops_scripts_dir }}/disk_monitor.sh"
        mode: '0755'
        content: |
          #!/bin/bash
          THRESHOLD=85
          WEBHOOK="https://example.com/webhook"
          while read -r line; do
            USAGE=$(echo "$line" | awk '{print $5}' | sed 's/%//')
            MOUNT=$(echo "$line" | awk '{print $6}')
            if [ "$USAGE" -gt "$THRESHOLD" ]; then
              curl -s -X POST "$WEBHOOK" -H 'Content-Type: application/json' \
                -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"Disk alert\nHost: $(hostname)\nMount: $MOUNT\nUsage: ${USAGE}%\"}}"
              if [ "$MOUNT" == "/" ]; then
                find /tmp -type f -atime +7 -delete
                find /var/tmp -type f -atime +7 -delete
              fi
            fi
          done < <(df -h | grep -vE '^Filesystem|tmpfs|cdrom')
    - name: Schedule disk monitor (every 30 minutes)
      cron:
        name: "Disk space monitor"
        minute: "*/30"
        job: "{{ ops_scripts_dir }}/disk_monitor.sh"
        user: root
    - name: Start and enable monitor service
      systemd:
        name: service-monitor
        state: started
        enabled: yes
        daemon_reload: yes
  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes
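
A typical rollout of the playbook might look like this (the inventory path and host name are placeholders):

# Dry run with a diff of every change, then apply to one host before the full fleet
ansible-playbook -i inventory/production.ini automated-ops-playbook.yaml --check --diff
ansible-playbook -i inventory/production.ini automated-ops-playbook.yaml --limit web01
ansible-playbook -i inventory/production.ini automated-ops-playbook.yaml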

Case Study – E‑commerce Platform Automation

Environment: 80 cloud VMs, 20+ micro‑services on Docker/Kubernetes, MySQL master‑slave, Redis cluster, operated by a three‑person team.

Implementation was divided into four phases:

Service self-healing (weeks 1-2): migrated services to Kubernetes, added liveness/readiness probes, configured systemd auto-restart for legacy VMs.

Auto-scaling (weeks 3-4): deployed HPA/VPA, performed load testing and tuned thresholds.

Fault diagnosis (weeks 5-6): rolled out the monitoring daemon and diagnostic script, integrated alerts.

Log & backup automation (weeks 7-8): scheduled log rotation, daily DB backups, and weekly restore validation.

Key results after automation:

Service restart time: 5 min → 30 s (90% saving)
Fault recovery time: 20 min → 1 min (95% saving)
Log management: fully automated (100% saving)
Scaling operation: 2 h → 3 min (≈97% saving)
Backup operation: fully automated (100% saving)
Daily ops workload: 12 h → 2.5 h (≈79% reduction)
Person‑hours saved: ~627 h/month
ROI achieved within half a month

Conclusion and Outlook

Core Takeaways

Start automation with low‑risk, high‑impact tasks (restart, log cleanup, health checks).

Maintain solid monitoring as the foundation for any automated decision.

Retain manual fallback paths for critical failures.

Iterate gradually: implement, test in staging, then roll out to production.

Future Directions

AIOps: machine-learning based anomaly detection, automated root-cause analysis and predictive maintenance.

GitOps: declarative infrastructure and configuration stored in Git, continuous delivery via pull-request workflows.

Serverless runtimes: eliminate operational overhead for auxiliary tasks such as log processing or alert aggregation.

Chaos engineering: inject controlled failures to validate self-healing mechanisms and improve system resilience.

Advice for Operations Engineers

Embrace automation – treat it as a productivity multiplier, not a threat.

Continuously learn container, Kubernetes and cloud‑native technologies.

Develop strong scripting skills (Shell, Python) to build reliable tooling.

Adopt a systems‑thinking mindset; design end‑to‑end automated flows.

Let data drive optimisation – use metrics to refine thresholds and policies.

Tags: Monitoring · Automation · Ops · systemd