
How Automated Ops Cut Service Restarts by 80% and Save Hours Daily

Discover a comprehensive automated operations framework that eliminates manual service restarts, reduces repetitive tasks by 80%, accelerates fault recovery from minutes to seconds, and boosts reliability through health checks, Kubernetes self‑healing, Systemd scripts, monitoring, and scalable deployment strategies.

MaGe Linux Operations

Introduction

Two years ago the author was routinely woken up at 3 am to manually restart failing services, spending most of the day on repetitive restarts, log cleanup, and backups. After implementing a full automation system, 99% of routine operations are now handled automatically, reducing daily work from 12 hours to under 3 hours and cutting fault‑recovery time from an average of 15 minutes to 30 seconds.

Manual Ops Pain Points

Core Pain Points

Service restarts: 30‑40% of time

Log cleanup: 15‑20% of time

Monitoring checks: 20‑25% of time

Fault diagnosis: 15‑20% of time

Deployments: 10‑15% of time

More than 70% of work is repetitive, low‑value tasks. Human error risk is high, with 60% of production incidents caused by manual mistakes, especially manual restarts. Response speed is slow: a typical incident chain takes 10‑30 minutes, which is unacceptable for high‑availability services.

Automation Benefits

Time Savings: Service restart from 5 minutes to 30 seconds; fault recovery from 15 minutes to 1 minute; log cleanup automated.

Reliability: Human error rate drops from 15% to <1%; fault‑recovery success rises from 85% to >99%.

Business Continuity: Service availability improves from 99.5% to 99.95%; MTTR from 15 minutes to 30 seconds; MTBF from 1 week to 1 month.
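
To put the availability figures in concrete terms, the following sketch converts an availability percentage into a downtime budget. The 30-day period and the function name `downtime_minutes` are assumptions made for illustration; they do not come from the original article.

```python
def downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the minutes of downtime allowed in the period at this availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare the before and after availability levels quoted in the text.
    for pct in (99.5, 99.95):
        print(f"{pct}% -> {downtime_minutes(pct):.1f} min per 30 days")
```

Under these assumptions, 99.5% allows roughly 216 minutes of downtime per 30 days, while 99.95% allows roughly 21.6 minutes.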

Automation Architecture

                ┌─────────────────────────────┐
                │  Monitoring & Alert Center  │
                │     (Prometheus/Zabbix)     │
                └──────────────┬──────────────┘
                               │ Metrics collection
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                  Automated Decision Engine                  │
│  - Health check evaluation                                  │
│  - Fault diagnosis analysis                                 │
│  - Self‑healing strategy selection                          │
└──────────────────────────────┬──────────────────────────────┘
                               │ Trigger execution
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                  Automated Execution Layer                  │
├───────────────┬────────────┬─────────────┬──────────────────┤
│ Auto‑restart  │ Auto‑scale │ Auto‑deploy │ Auto‑backup      │
│ (Systemd/K8s) │ (HPA/VPA)  │ (CI/CD)     │ (Script/Ansible) │
└───────────────┴────────────┴─────────────┴──────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                  Log Audit & Notification                   │
│  - Operation logs storage                                   │
│  - DingTalk/WeChat alerts                                   │
│  - Grafana visualization                                    │
└─────────────────────────────────────────────────────────────┘
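
The decision engine in the diagram can be pictured as a small rule table that maps a health snapshot to one action in the execution layer. The sketch below is a minimal illustration only: the `HealthSnapshot` fields, the action names, and the rule order are hypothetical, and the thresholds (85% disk, 70% CPU) are borrowed from the disk monitor and HPA settings that appear later in the article.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    """Hypothetical view of one service as reported by the monitoring layer."""
    process_alive: bool
    http_ok: bool
    cpu_pct: float
    disk_pct: float


def select_action(s: HealthSnapshot) -> str:
    """Pick one self-healing action; rules are checked from most to least urgent."""
    if not s.process_alive or not s.http_ok:
        return "restart"        # handled by Systemd or Kubernetes
    if s.disk_pct > 85:
        return "cleanup_logs"   # handled by a script or Ansible job
    if s.cpu_pct > 70:
        return "scale_out"      # handled by the autoscaler
    return "none"


if __name__ == "__main__":
    print(select_action(HealthSnapshot(True, False, 20.0, 40.0)))
```

Checking the most urgent condition first keeps the engine from scaling out a service that actually needs a restart.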

Self‑Healing Mechanism (Kubernetes)

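Define health probes in the pod spec so that Kubernetes can act on failures automatically: a failing liveness probe restarts the container, a failing readiness probe removes the pod from Service endpoints until it recovers, and the startup probe holds off the other two until the application has finished starting.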

# deployment-with-health-check.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-application
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: myregistry.com/web-app:v1.2.0
        ports:
        - containerPort: 8080
          name: http
        livenessProbe:
          httpGet:
            path: /health/liveness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/readiness
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 30
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        env:
        - name: JAVA_OPTS
          value: "-Xmx400m -Xms400m"
      # Pod-level fields: these belong under spec.template.spec, not under the container.
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
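
As a rough guide to what the probe settings in this manifest mean in wall-clock time, the sketch below multiplies `periodSeconds` by `failureThreshold`. This is an approximation that ignores `timeoutSeconds` and scheduling jitter. The `Probe` class and `failure_window` helper are hypothetical names, and the readiness threshold of 3 is the Kubernetes default, since the manifest does not set one.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    period: int                 # periodSeconds
    failure_threshold: int = 3  # failureThreshold (Kubernetes default is 3)


def failure_window(p: Probe) -> int:
    """Approximate seconds of consecutive failures before the probe trips."""
    return p.period * p.failure_threshold


liveness = Probe(period=10, failure_threshold=3)
readiness = Probe(period=5)
startup = Probe(period=5, failure_threshold=30)

if __name__ == "__main__":
    print("liveness restart after ~", failure_window(liveness), "s of failures")
    print("readiness removal after ~", failure_window(readiness), "s of failures")
    print("startup budget ~", failure_window(startup), "s")
```

With these values a hung container is restarted after about 30 seconds of failed liveness checks, and the application has about 150 seconds to start before the startup probe gives up.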

Spring Boot health endpoints can be used for the probes:

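// HealthCheckController.java
import java.sql.Connection;
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.sql.DataSource;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/health")
public class HealthCheckController {
    @Autowired private DataSource dataSource;
    @Autowired private StringRedisTemplate redisTemplate;
    // Flipped once Spring publishes ApplicationReadyEvent; backs the startup probe.
    private final AtomicBoolean started = new AtomicBoolean(false);

    @EventListener(ApplicationReadyEvent.class)
    public void onApplicationReady() { started.set(true); }

    // Liveness: only proves that the process can answer HTTP requests.
    @GetMapping("/liveness")
    public ResponseEntity<Map<String,String>> liveness(){
        Map<String,String> result = new HashMap<>();
        result.put("status","UP");
        result.put("timestamp", LocalDateTime.now().toString());
        return ResponseEntity.ok(result);
    }

    // Readiness: returns 503 while the database or Redis is unreachable.
    @GetMapping("/readiness")
    public ResponseEntity<Map<String,Object>> readiness(){
        Map<String,Object> result = new HashMap<>();
        boolean isReady = true;
        // try-with-resources returns the connection to the pool even on failure.
        try (Connection c = dataSource.getConnection()) { result.put("database","UP"); }
        catch(Exception e){ result.put("database","DOWN"); isReady = false; }
        try{ redisTemplate.opsForValue().get("health_check"); result.put("redis","UP"); }
        catch(Exception e){ result.put("redis","DOWN"); isReady = false; }
        result.put("status", isReady?"UP":"DOWN");
        return isReady? ResponseEntity.ok(result) : ResponseEntity.status(503).body(result);
    }

    // Startup: returns 503 until the application context has finished starting.
    @GetMapping("/startup")
    public ResponseEntity<Map<String,String>> startup(){
        if(started.get()){
            return ResponseEntity.ok(Map.of("status","UP"));
        }
        return ResponseEntity.status(503).body(Map.of("status","DOWN"));
    }
}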

Systemd Auto‑Restart for Legacy Services

# /etc/systemd/system/web-app.service
[Unit]
Description=Web Application Service
After=network.target mysql.service redis.service
Wants=mysql.service redis.service
# Start-rate limiting is configured in [Unit] on current systemd releases:
# give up after 5 start attempts within 300 seconds.
StartLimitIntervalSec=300s
StartLimitBurst=5
# Caution: this reboots the whole host once the start limit is hit.
StartLimitAction=reboot

[Service]
Type=simple
User=webapp
Group=webapp
WorkingDirectory=/opt/webapp
ExecStart=/usr/bin/java -jar /opt/webapp/app.jar --spring.profiles.active=production
# Restart on any exit, 10 seconds after the process stops.
Restart=always
RestartSec=10s
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=30s
TimeoutStartSec=60s
LimitNOFILE=65535
LimitNPROC=4096
# MemoryMax= supersedes the deprecated MemoryLimit= directive.
MemoryMax=2G
CPUQuota=200%
NoNewPrivileges=true
PrivateTmp=true
StandardOutput=journal
StandardError=journal
SyslogIdentifier=webapp

[Install]
WantedBy=multi-user.target

Custom Monitoring Script (Bash)

#!/bin/bash
# /opt/scripts/service_monitor.sh
SERVICE_NAME="web-app"
PROCESS_PATTERN="java.*app.jar"
CHECK_INTERVAL=30
RESTART_DELAY=10
MAX_RESTART_PER_HOUR=5
LOG_FILE="/var/log/service-monitor.log"
ALERT_WEBHOOK="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
ALERT_ENABLED=true
# ... (functions for logging, health checks, restart logic) ...
main(){
  echo "========== Service Monitor Started =========="
  while true; do
    if ! deep_health_check; then
      restart_service
    fi
    sleep "$CHECK_INTERVAL"
  done
}
main "$@"
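
The script declares `MAX_RESTART_PER_HOUR=5`, but the restart logic itself is elided. One common way to enforce such a cap is a sliding window of recent restart timestamps. The sketch below illustrates that idea in Python; the `RestartLimiter` class is hypothetical and is not the author's implementation.

```python
import time
from collections import deque
from typing import Optional


class RestartLimiter:
    """Sliding-window cap on restarts, for example 5 per hour."""

    def __init__(self, max_restarts: int = 5, window_seconds: float = 3600.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._stamps = deque()  # timestamps of restarts still inside the window

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and record the restart if the cap has not been reached."""
        now = time.monotonic() if now is None else now
        # Drop restarts that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False
        self._stamps.append(now)
        return True


if __name__ == "__main__":
    limiter = RestartLimiter()
    print([limiter.allow(now=float(t)) for t in range(6)])
```

When `allow` returns False, a monitor would typically stop restarting and raise an alert instead, so that a crash loop gets human attention.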

Automatic Fault Diagnosis (Python)

#!/usr/bin/env python3
"""Service diagnostics script that collects process info, system resources, network status, logs, and JVM metrics, then writes a JSON and human‑readable report."""
import os, json, subprocess, datetime
class ServiceDiagnostics:
    def __init__(self, service_name, process_pattern, log_paths):
        self.service_name = service_name
        self.process_pattern = process_pattern
        self.log_paths = log_paths
        self.report_dir = f"/var/log/diagnostics/{service_name}"
        self.timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        os.makedirs(self.report_dir, exist_ok=True)
    def run_command(self, cmd, timeout=30):
        try:
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
            return result.stdout if result.returncode==0 else result.stderr
        except Exception as e:
            return f"Command failed: {e}"
    # ... (methods for process status, system resources, network, logs, dependencies, JVM analysis) ...
    def generate_report(self):
        report = {
            "service_name": self.service_name,
            "timestamp": datetime.datetime.now().isoformat(),
            "hostname": os.uname().nodename,
            "diagnostics": {}
        }
        report["diagnostics"]["process_status"] = self.check_process_status()
        report["diagnostics"]["system_resources"] = self.check_system_resources()
        report["diagnostics"]["network_status"] = self.check_network_status()
        report["diagnostics"]["dependencies"] = self.check_dependencies()
        report["diagnostics"]["logs"] = self.collect_logs()
        pid = report["diagnostics"]["process_status"].get("pid")
        if pid:
            report["diagnostics"]["jvm_analysis"] = self.analyze_jvm(pid)
        json_path = f"{self.report_dir}/diagnosis_{self.timestamp}.json"
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(report, f, indent=2, ensure_ascii=False)
        print(f"Diagnosis complete! JSON report: {json_path}")
        return report
if __name__ == "__main__":
    diag = ServiceDiagnostics("web-app", "java.*app.jar", ["/opt/webapp/logs/app.log","/opt/webapp/logs/error.log"])
    diag.generate_report()
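
The individual check methods are elided in the listing above. As an illustration of the kind of data such a method might return, here is a stand-alone sketch that gathers load average and disk usage with the standard library only. The function name mirrors `check_system_resources`, but the body and the returned keys are assumptions, not the author's code; `os.getloadavg` is available on Unix-like systems only.

```python
import os
import shutil


def check_system_resources(path: str = "/") -> dict:
    """Collect CPU count, load average, and disk usage for one mount point."""
    load1, load5, load15 = os.getloadavg()  # Unix only
    usage = shutil.disk_usage(path)
    return {
        "cpu_count": os.cpu_count(),
        "load_avg": {"1m": load1, "5m": load5, "15m": load15},
        "disk": {
            "path": path,
            "total_gb": round(usage.total / 1024**3, 2),
            "used_pct": round(usage.used / usage.total * 100, 1),
        },
    }


if __name__ == "__main__":
    import json
    print(json.dumps(check_system_resources(), indent=2))
```

Returning plain dictionaries keeps the result directly serializable by the `json.dump` call in `generate_report`.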

Automatic Scaling (Kubernetes HPA & VPA)

# hpa-configuration.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 5
        periodSeconds: 60
      selectPolicy: Max
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  - type: Pods
    pods:
      metric:
        name: http_request_duration_p99
      target:
        type: AverageValue
        averageValue: "500m"
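---
# vpa-configuration.yaml
# Caution: the VPA project advises against combining VPA with an HPA that
# scales the same workload on CPU or memory. If both are deployed, consider
# updateMode: "Off" (recommendations only) or driving the HPA from custom
# metrics alone.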
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: web-app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 2Gi
      controlledResources:
      - cpu
      - memory
      controlledValues: RequestsAndLimits
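
To see how the HPA settings translate into replica counts, the sketch below applies the scaling rule from the Kubernetes documentation, desired = ceil(current × metric / target), with the default 10% tolerance and the min/max bounds from the manifest. It is a simplification: the real controller evaluates every metric and takes the largest result, and it also accounts for unready pods and stabilization windows. The helper names are hypothetical.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 3, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Single-metric HPA rule, clamped to the configured replica bounds."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


def scale_up_cap(current: int) -> int:
    """Most replicas reachable in one 60 s period under the scaleUp policies:
    selectPolicy Max picks the larger of +100% and +5 pods."""
    return current + max(current, 5)


if __name__ == "__main__":
    # 3 replicas at 90% average CPU against a 70% target.
    print(desired_replicas(3, 90, 70), scale_up_cap(3))
```

For example, 3 replicas at 90% average CPU against the 70% target yield 4 desired replicas, which is within the scale-up cap of 8 for that period.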

Ansible Playbook for Deployment

# automated-ops-playbook.yaml
---
- name: Deploy automated operations system
  hosts: all
  become: yes
  vars:
    ops_scripts_dir: /opt/ops-scripts
    monitor_user: opsmonitor
    log_dir: /var/log/auto-ops
  tasks:
    # Assumption: the monitor account may not exist yet; later tasks set it as file owner.
    - name: Create monitor user
      user:
        name: "{{ monitor_user }}"
        system: yes
        shell: /sbin/nologin
    - name: Create script directories
      file:
        path: "{{ item }}"
        state: directory
        mode: '0755'
      loop:
        - "{{ ops_scripts_dir }}"
        - "{{ log_dir }}"
        - /var/log/diagnostics
    - name: Install required tools
      yum:
        name:
          - python3
          - python3-pip
          - sysstat
          - nethogs
          - dstat
          - bc
        state: present
    - name: Deploy service monitor script
      template:
        src: service_monitor.sh.j2
        dest: "{{ ops_scripts_dir }}/service_monitor.sh"
        mode: '0755'
        owner: "{{ monitor_user }}"
    - name: Deploy diagnostics script
      copy:
        src: service_diagnostics.py
        dest: "{{ ops_scripts_dir }}/service_diagnostics.py"
        mode: '0755'
        owner: "{{ monitor_user }}"
    - name: Create systemd unit for monitor
      template:
        src: service-monitor.service.j2
        dest: /etc/systemd/system/service-monitor.service
        mode: '0644'
      notify: reload systemd
    - name: Deploy log cleanup script
      copy:
        dest: "{{ ops_scripts_dir }}/log_cleanup.sh"
        mode: '0755'
        content: |
          #!/bin/bash
          # Clean logs older than 7 days
          find /var/log -name "*.log" -type f -mtime +7 -delete
          find {{ log_dir }} -name "*.log" -type f -mtime +7 -delete
          # Truncate large logs
          find /opt -name "*.log" -type f -size +1G -exec truncate -s 0 {} \;
          echo "[$(date)] Log cleanup done" >> {{ log_dir }}/cleanup.log
    - name: Schedule log cleanup (2 am daily)
      cron:
        name: "Log cleanup"
        minute: "0"
        hour: "2"
        job: "{{ ops_scripts_dir }}/log_cleanup.sh"
        user: root
    - name: Schedule daily health check
      cron:
        name: "Daily health check"
        minute: "0"
        hour: "8"
        job: "{{ ops_scripts_dir }}/daily_health_check.sh"
        user: root
    - name: Deploy disk monitor script
      copy:
        dest: "{{ ops_scripts_dir }}/disk_monitor.sh"
        mode: '0755'
        content: |
          #!/bin/bash
          THRESHOLD=85
          WEBHOOK="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY"
          while read -r line; do
            USAGE=$(echo "$line" | awk '{print $5}' | sed 's/%//')
            MOUNT=$(echo "$line" | awk '{print $6}')
            if [ "$USAGE" -gt "$THRESHOLD" ]; then
              curl -s -X POST "$WEBHOOK" -H 'Content-Type: application/json' -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"Disk alert\nHost: $(hostname)\nMount: $MOUNT\nUsage: ${USAGE}%\"}}"
              if [ "$MOUNT" == "/" ]; then
                find /tmp -type f -atime +7 -delete
                find /var/tmp -type f -atime +7 -delete
              fi
            fi
          done < <(df -hP | grep -vE '^Filesystem|tmpfs|cdrom')
    - name: Schedule disk monitor (every 30 min)
      cron:
        name: "Disk monitor"
        minute: "*/30"
        job: "{{ ops_scripts_dir }}/disk_monitor.sh"
        user: root
    - name: Enable and start monitor service
      systemd:
        name: service-monitor
        state: started
        enabled: yes
        daemon_reload: yes
  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes
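
Because the cleanup script deletes files with `find ... -mtime +7 -delete`, it can help to preview what would be removed before scheduling it. The sketch below is a dry-run approximation in Python. The function name `stale_logs` is hypothetical, and the age rule follows `find`, which drops fractional days, so `+7` matches files at least 8 full days old.

```python
import os
import time
from pathlib import Path


def stale_logs(root: str, days: int = 7, now=None) -> list:
    """List *.log files under root that `find -mtime +days` would match."""
    now = time.time() if now is None else now
    matches = []
    for path in Path(root).rglob("*.log"):
        # Mirror `-type f`: skip directories and symbolic links.
        if path.is_symlink() or not path.is_file():
            continue
        age_days = int((now - path.stat().st_mtime) // 86400)
        if age_days > days:
            matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    # /var/log/auto-ops is the log_dir value from the playbook above.
    for p in stale_logs("/var/log/auto-ops"):
        print("would delete:", p)
```

Running the preview against a directory first makes it easier to confirm that the pattern does not catch logs that still need to be retained.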

Case Study: E‑commerce Platform Transformation

Before automation the team performed 15‑20 manual restarts per day, responded to incidents in ~20 minutes, and spent up to 12 hours daily on ops tasks. After a phased rollout (self‑healing, autoscaling, diagnostics, log/backup automation) the daily workload dropped to ~2.5 hours, availability rose to 99.95%, and MTTR fell to 1 minute.

Key Lessons

Implement in stages; avoid a big‑bang change.

Establish robust monitoring before automation.

Thoroughly test in a staging environment.

Retain manual fallback mechanisms.

Document every automated flow and configuration.

Future Trends

AIOps: Machine‑learning‑driven anomaly detection, root‑cause analysis, and predictive maintenance.

GitOps: Treat infrastructure, configuration, and policies as code stored in Git.

Serverless Operations: Leverage FaaS to hand server maintenance over to the platform.

Chaos Engineering: Inject failures to validate self‑healing capabilities.

Advice for Ops Engineers

Embrace automation to free time for higher‑value work.

Continuously learn container, Kubernetes, and cloud‑native technologies.

Develop solid scripting skills (Shell, Python).

Adopt a system‑thinking mindset when designing solutions.

Base decisions on data from monitoring and observability platforms.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
