How I Cut 80% of Ops Time with an Automated Service Management System
This article details a complete automated operations framework that replaces manual service restarts, log cleaning, and deployment tasks with health checks, systemd units, Kubernetes probes, monitoring scripts, fault‑diagnosis tools, auto‑scaling policies, and Ansible playbooks, saving roughly 80% of repetitive work and dramatically improving reliability.
Background and Pain Points
Manual operations in a typical production environment consume more than 70% of engineers' time and introduce high risk of human error. Common tasks include:
Service restart: 30‑40% of daily work, often required due to resource leaks or crashes.
Log cleanup: 15‑20% of work, needed when disks fill up.
Monitoring review: 20‑25% of time spent checking dashboards.
Fault investigation: 15‑20% of effort to diagnose alerts.
Deployment: 10‑15% of time for manual release steps.
These repetitive tasks lead to long on‑call shifts (12+ hour days), an average fault‑recovery time of 15 minutes, and frequent midnight wake‑ups.
Automation Benefits
Time savings: restarts reduced from 5 min to 30 s; fault recovery from 20 min to 1 min; log management fully automated; scaling from 2 h to 5 min.
Reliability boost: human error rate drops from ~15% to <1%; success rate of automatic recovery rises above 99%.
Business continuity: service availability improves from 99.5% to 99.95%; MTTR drops from 15 min to 30 s; MTBF extends from 1 week to 1 month.
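To put the availability figures in perspective, a quick back‑of‑the‑envelope calculation (a minimal Python sketch; the function name is illustrative) converts an availability percentage into a monthly downtime budget:

```python
def monthly_downtime_minutes(availability_pct, minutes_per_month=30 * 24 * 60):
    """Downtime allowed per 30-day month at a given availability level."""
    return (1 - availability_pct / 100) * minutes_per_month

# 99.5% availability allows ~216 min of downtime per month;
# 99.95% shrinks that budget to ~21.6 min.
print(round(monthly_downtime_minutes(99.5), 1))
print(round(monthly_downtime_minutes(99.95), 1))
```

In other words, moving from 99.5% to 99.95% cuts the tolerable monthly downtime by a factor of ten, which is only realistic when recovery is automated.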
Architecture Overview
┌─────────────────────────────┐
│  Monitoring & Alert Center  │
│    (Prometheus / Zabbix)    │
└──────────────┬──────────────┘
               │ Metric collection
               ↓
┌─────────────────────────────────────────────┐
│          Automated Decision Engine          │
│  - Health‑check evaluation                  │
│  - Fault diagnosis                          │
│  - Self‑healing policy selection            │
└──────────────┬──────────────────────────────┘
               │ Trigger execution
               ↓
┌───────────────────────────────────────────────────────┐
│               Automation Execution Layer              │
├──────────────┬────────────┬─────────────┬─────────────┤
│ Auto‑restart │ Auto‑scale │ Auto‑deploy │ Auto‑backup │
│ (Systemd/K8s)│ (HPA/VPA)  │ (CI/CD)     │ (Scripts)   │
└──────────────┴────────────┴─────────────┴─────────────┘
               │
               ↓
┌───────────────────────────────────────────────────────┐
│                Log Audit & Notification               │
│  - Operation logs                                     │
│  - DingTalk / WeChat alerts                           │
│  - Grafana visualisation                              │
└───────────────────────────────────────────────────────┘

Service Self‑Healing
Kubernetes health‑check configuration – Liveness, readiness and startup probes are defined in the deployment manifest. The probes call HTTP endpoints /health/liveness, /health/readiness and /health/startup respectively, with appropriate initialDelaySeconds, periodSeconds and failure thresholds.
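On the application side, each probe only needs an HTTP endpoint that answers 200 when the corresponding condition holds. A minimal sketch using only the Python standard library (the `ready` flag is a hypothetical placeholder; a real service would check its actual dependencies):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    ready = False  # set to True once caches are warm and connections are up

    def do_GET(self):
        if self.path == "/health/liveness":
            self._respond(200, b"alive")      # the process can serve requests
        elif self.path == "/health/readiness":
            ok = type(self).ready
            self._respond(200 if ok else 503, b"ready" if ok else b"not ready")
        elif self.path == "/health/startup":
            self._respond(200, b"started")    # one-time initialisation finished
        else:
            self._respond(404, b"not found")

    def _respond(self, code, body):
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep probe traffic out of the access log

# To serve: HTTPServer(("", 8080), HealthHandler).serve_forever()
```

The key design point is that readiness can flip to 503 without the container being killed (it is simply removed from the Service endpoints), while a failing liveness probe triggers a restart.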
# deployment-with-health-check.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-application
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: myregistry.com/web-app:v1.2.0
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 30
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          env:
            - name: JAVA_OPTS
              value: "-Xmx400m -Xms400m"
      restartPolicy: Always
      terminationGracePeriodSeconds: 30

Systemd service auto‑restart – For legacy VMs, the following unit ensures the Java process is always restarted, limits its resources, and logs to the journal.
# /etc/systemd/system/web-app.service
[Unit]
Description=Web Application Service
After=network.target mysql.service redis.service
Wants=mysql.service redis.service
[Service]
Type=simple
User=webapp
Group=webapp
WorkingDirectory=/opt/webapp
ExecStart=/usr/bin/java -jar /opt/webapp/app.jar --spring.profiles.active=production
ExecStartPre=/opt/scripts/pre-start-check.sh
ExecStop=/bin/kill -SIGTERM $MAINPID
Restart=always
RestartSec=10s
StartLimitInterval=300s
StartLimitBurst=5
StartLimitAction=reboot
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=30s
TimeoutStartSec=60s
LimitNOFILE=65535
LimitNPROC=4096
MemoryLimit=2G
CPUQuota=200%
NoNewPrivileges=true
PrivateTmp=true
StandardOutput=journal
StandardError=journal
SyslogIdentifier=webapp
[Install]
WantedBy=multi-user.target

Pre‑start health‑check script – Validates port availability, disk space, dependency reachability, and configuration syntax before the service starts.
#!/bin/bash
# /opt/scripts/pre-start-check.sh
set -e
LOG_FILE="/var/log/webapp/pre-start-check.log"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"; }
# Example checks (port, disk, MySQL, Redis, config file) …
# The service port must be free before starting
if ss -ltn | grep -q ':8080 '; then log "Port 8080 is already in use"; exit 1; fi
# At least 1 GB must be free on the application volume
AVAIL_KB=$(df --output=avail /opt/webapp | tail -1)
[ "$AVAIL_KB" -ge 1048576 ] || { log "Insufficient disk space on /opt/webapp"; exit 1; }
log "All pre-start checks passed"
exit 0

Custom Monitoring Script (Service Guard)
A Bash daemon monitors the process, performs HTTP health checks, limits restart frequency, and sends alerts via a configurable webhook (e.g., WeChat or DingTalk).
#!/bin/bash
SERVICE_NAME="web-app"
PROCESS_PATTERN="java.*app.jar"
START_COMMAND="/opt/scripts/start-webapp.sh"
STOP_COMMAND="/opt/scripts/stop-webapp.sh"
HEALTH_CHECK_URL="http://localhost:8080/health"
CHECK_INTERVAL=30
MAX_RESTART_PER_HOUR=5
ALERT_WEBHOOK="https://example.com/webhook"
# Core functions: log, send_alert, is_process_running, http_health_check,
# check_restart_limit, stop_service, start_service, restart_service

# Main loop: every CHECK_INTERVAL seconds, verify the process is alive and
# the health endpoint answers; restart when either check fails, subject to
# the hourly restart limit.
while true; do
    if ! is_process_running || ! http_health_check; then
        if check_restart_limit; then
            restart_service
        else
            send_alert "Restart limit reached for $SERVICE_NAME, manual intervention required"
        fi
    fi
    sleep "$CHECK_INTERVAL"
done

Automated Fault Diagnosis
A Python 3 script collects system state, process information, resource usage, network status, dependency health, JVM metrics (if applicable), and relevant logs. The data is stored as JSON and a human‑readable text report under /var/log/diagnostics/<service_name>.
#!/usr/bin/env python3
"""Service fault automatic diagnosis script"""
import datetime
import json
import os
import subprocess


class ServiceDiagnostics:
    def __init__(self, service_name, process_pattern, log_paths):
        self.service_name = service_name
        self.process_pattern = process_pattern
        self.log_paths = log_paths
        self.report_dir = f"/var/log/diagnostics/{service_name}"
        os.makedirs(self.report_dir, exist_ok=True)

    def run_command(self, cmd, timeout=30):
        try:
            result = subprocess.run(cmd, shell=True, capture_output=True,
                                    text=True, timeout=timeout)
            return result.stdout if result.returncode == 0 else result.stderr
        except Exception as e:
            return f"Command failed: {e}"

    # Further methods (elided here): check_process_status, check_system_resources,
    # check_network_status, collect_logs, check_dependencies, analyze_jvm

    def generate_report(self):
        # Minimal version: collect a few key checks, then write both a JSON
        # file and a human-readable text report.
        report = {
            "service": self.service_name,
            "timestamp": datetime.datetime.now().isoformat(),
            "process": self.run_command(f"pgrep -af '{self.process_pattern}'"),
            "resources": self.run_command("free -m && df -h && uptime"),
        }
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        json_path = os.path.join(self.report_dir, f"report-{stamp}.json")
        with open(json_path, "w") as f:
            json.dump(report, f, indent=2)
        with open(json_path.replace(".json", ".txt"), "w") as f:
            for key, value in report.items():
                f.write(f"=== {key} ===\n{value}\n")
        return json_path


def main():
    diag = ServiceDiagnostics(
        service_name="web-app",
        process_pattern="java.*app.jar",
        log_paths=["/opt/webapp/logs/app.log", "/opt/webapp/logs/error.log"]
    )
    diag.generate_report()


if __name__ == "__main__":
    main()

Auto‑Scaling Solutions
Horizontal Pod Autoscaler (HPA) – Scales the deployment between 3 and 20 replicas based on CPU, memory, QPS and P99 latency custom metrics.
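The scaling decision itself follows the documented HPA algorithm: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the min/max bounds. A small sketch (function name is illustrative):

```python
import math

def desired_replicas(current, metric_value, metric_target,
                     min_replicas=3, max_replicas=20):
    """HPA rule: ceil(currentReplicas * currentMetric / targetMetric), clamped."""
    desired = math.ceil(current * metric_value / metric_target)
    return max(min_replicas, min(max_replicas, desired))

# 5 pods at 90% CPU against a 70% target -> ceil(5 * 90 / 70) = 7 pods
print(desired_replicas(5, 90, 70))
```

With several metrics configured, the controller evaluates this rule for each one and takes the largest result, which is why the CPU, memory, QPS, and latency targets below can all drive a scale‑up.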
# hpa-configuration.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
    - type: Pods
      pods:
        metric:
          name: http_request_duration_p99
        target:
          type: AverageValue
          averageValue: "500m"

Prometheus‑Adapter custom metrics – Exposes QPS and latency as per‑second and P99 metrics for the HPA.
# prometheus-adapter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace="production",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
      - seriesQuery: 'http_request_duration_seconds{namespace="production",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_seconds"
          as: "${1}_p99"
        metricsQuery: 'histogram_quantile(0.99, sum(rate(<<.Series>>_bucket{<<.LabelMatchers>>}[1m])) by (le, <<.GroupBy>>))'

Vertical Pod Autoscaler (VPA) – Adjusts CPU and memory requests for the deployment within defined bounds.
# vpa-configuration.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: web-app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2000m
          memory: 2Gi
        controlledResources:
          - cpu
          - memory
        controlledValues: RequestsAndLimits

Complete Ansible Playbook for Deployment
The playbook provisions script directories, installs required tools, creates a monitoring user, deploys the monitoring and diagnostic scripts, configures a systemd service for the monitor, sets up log‑cleanup and disk‑monitor cron jobs, and ensures the monitor service is started and enabled.
# automated-ops-playbook.yaml
---
- name: Deploy automated operations system
  hosts: all
  become: yes
  vars:
    ops_scripts_dir: /opt/ops-scripts
    monitor_user: opsmonitor
    log_dir: /var/log/auto-ops
  tasks:
    - name: Create script directories
      file:
        path: "{{ item }}"
        state: directory
        mode: '0755'
      loop:
        - "{{ ops_scripts_dir }}"
        - "{{ log_dir }}"
        - /var/log/diagnostics

    - name: Install required tools
      yum:
        name:
          - python3
          - python3-pip
          - sysstat
          - nethogs
          - dstat
          - bc
        state: present

    - name: Create monitoring user
      user:
        name: "{{ monitor_user }}"
        shell: /bin/bash
        createhome: yes
        system: yes

    - name: Deploy service monitor script
      template:
        src: service_monitor.sh.j2
        dest: "{{ ops_scripts_dir }}/service_monitor.sh"
        mode: '0755'
        owner: "{{ monitor_user }}"

    - name: Deploy diagnostic script
      copy:
        src: service_diagnostics.py
        dest: "{{ ops_scripts_dir }}/service_diagnostics.py"
        mode: '0755'
        owner: "{{ monitor_user }}"

    - name: Create systemd unit for monitor
      template:
        src: service-monitor.service.j2
        dest: /etc/systemd/system/service-monitor.service
        mode: '0644'
      notify: reload systemd

    - name: Deploy log cleanup script
      copy:
        dest: "{{ ops_scripts_dir }}/log_cleanup.sh"
        mode: '0755'
        content: |
          #!/bin/bash
          find /var/log -name "*.log" -type f -mtime +7 -delete
          find {{ log_dir }} -name "*.log" -type f -mtime +7 -delete

    - name: Schedule log cleanup (2:00 AM daily)
      cron:
        name: "Log cleanup"
        minute: "0"
        hour: "2"
        job: "{{ ops_scripts_dir }}/log_cleanup.sh"
        user: root

    - name: Deploy disk monitor script
      copy:
        dest: "{{ ops_scripts_dir }}/disk_monitor.sh"
        mode: '0755'
        content: |
          #!/bin/bash
          THRESHOLD=85
          WEBHOOK="https://example.com/webhook"
          while read line; do
            USAGE=$(echo "$line" | awk '{print $5}' | sed 's/%//')
            MOUNT=$(echo "$line" | awk '{print $6}')
            if [ "$USAGE" -gt "$THRESHOLD" ]; then
              curl -s -X POST "$WEBHOOK" -H 'Content-Type: application/json' \
                -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"Disk alert\nHost: $(hostname)\nMount: $MOUNT\nUsage: ${USAGE}%\"}}"
              if [ "$MOUNT" == "/" ]; then
                find /tmp -type f -atime +7 -delete
                find /var/tmp -type f -atime +7 -delete
              fi
            fi
          done < <(df -h | grep -vE '^Filesystem|tmpfs|cdrom')

    - name: Schedule disk monitor (every 30 minutes)
      cron:
        name: "Disk space monitor"
        minute: "*/30"
        job: "{{ ops_scripts_dir }}/disk_monitor.sh"
        user: root

    - name: Start and enable monitor service
      systemd:
        name: service-monitor
        state: started
        enabled: yes
        daemon_reload: yes

  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes

Case Study – E‑commerce Platform Automation
Environment: 80 cloud VMs, 20+ micro‑services on Docker/Kubernetes, MySQL master‑slave, Redis cluster, operated by a three‑person team.
Implementation was divided into four phases:
Service self‑healing (weeks 1‑2): migrated services to Kubernetes, added liveness/readiness probes, configured Systemd auto‑restart for legacy VMs.
Auto‑scaling (weeks 3‑4): deployed HPA/VPA, performed load testing and tuned thresholds.
Fault diagnosis (weeks 5‑6): rolled out the monitoring daemon and diagnostic script, integrated alerts.
Log & backup automation (weeks 7‑8): scheduled log rotation, daily DB backups, and weekly restore validation.
Key results after automation:
Service restart time: 5 min → 30 s (90% saving)
Fault recovery time: 20 min → 1 min (95% saving)
Log management: fully automated (100% saving)
Scaling operation: 2 h → 3 min (≈97% saving)
Backup operation: fully automated (100% saving)
Daily ops workload: 12 h → 2.5 h (≈79% reduction)
Person‑hours saved: ~627 h/month
ROI achieved within half a month

Conclusion and Outlook
Core Takeaways
Start automation with low‑risk, high‑impact tasks (restart, log cleanup, health checks).
Maintain solid monitoring as the foundation for any automated decision.
Retain manual fallback paths for critical failures.
Iterate gradually: implement, test in staging, then roll out to production.
Future Directions
AIOps: machine‑learning‑based anomaly detection, automated root‑cause analysis, and predictive maintenance.
GitOps: declarative infrastructure and configuration stored in Git, continuous delivery via pull‑request workflows.
Serverless runtimes: eliminate operational overhead for auxiliary tasks such as log processing or alert aggregation.
Chaos engineering: inject controlled failures to validate self‑healing mechanisms and improve system resilience.
Advice for Operations Engineers
Embrace automation – treat it as a productivity multiplier, not a threat.
Continuously learn container, Kubernetes and cloud‑native technologies.
Develop strong scripting skills (Shell, Python) to build reliable tooling.
Adopt a systems‑thinking mindset; design end‑to‑end automated flows.
Let data drive optimisation – use metrics to refine thresholds and policies.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.