
Why Do Your Services Disappear After Reboot? Master systemd Auto‑Start and Chaos Testing

This guide reveals why critical services often fail to start after a server reboot, presents essential systemd unit file parameters, provides ready‑to‑copy configurations for Nginx, Java, and Flask, outlines a four‑step troubleshooting workflow, and introduces a lightweight chaos‑engineering playbook to verify auto‑start resilience.

Xiao Liu Lab

Problem Overview

At 3 a.m., a server reboot triggered a wave of alerts: the web service was unavailable, database connections failed, and scheduled jobs did not run. In a survey of over 100 production servers, 68 % lacked at least one critical service configured for auto‑start, 42 % of operators had experienced P1 incidents because of this, and the average MTTR was 27 minutes. The root causes were the absence of a standardized systemd configuration and the lack of any reboot validation in a safe environment.

Core systemd Configuration

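A typical service unit file consists of three sections: [Unit], [Service], and [Install]. The five parameters below have the most influence on whether a service starts reliably at boot: After belongs in [Unit]; Type, Restart, and TimeoutStartSec belong in [Service]; WantedBy belongs in [Install].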

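Type : tells systemd how to decide that the service has started (e.g., simple or notify). Common error: a Java service that runs in the foreground is declared as forking, so systemd waits for a parent process to exit that never does and marks the start as failed when the timeout expires.

Restart : sets the restart policy (on-failure or always). Common error: omitted, so the default (no) applies and a crashed process is never restarted.

After : declares start order only (e.g., network.target); it does not pull the other unit in. Common error: missing, so the service can start before the network stack is up. A service that needs a configured, routable network should combine After=network-online.target with Wants=network-online.target.

WantedBy : names the target that pulls the unit in at boot (usually multi-user.target); it takes effect when the unit is enabled with systemctl enable. Common error: omitted, so the unit cannot be enabled and never starts on boot.

TimeoutStartSec : maximum start‑up time (default 90 s, often raised to 300 s for heavy applications). Common error: the default is left in place and systemd kills a slow‑starting service before it is ready.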
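The five parameters above can be checked mechanically before a unit file is deployed. The following is a minimal sketch, not a full linter: the function name `missing_directives` is hypothetical, and it only tests whether an uncommented `Key=` line exists. A missing directive is a prompt to look closer rather than proof of a bug (Type, for example, falls back to simple when omitted).

```shell
#!/bin/bash
# Report which of the five directives discussed above are absent from a unit file.
# Prints the missing directive names, one per line.
# Returns 0 only when all five are present.
missing_directives() {
  local unit_file="$1"
  local key missing=0
  for key in Type Restart After WantedBy TimeoutStartSec; do
    # A directive counts as present when an uncommented "Key=" line exists.
    if ! grep -Eq "^[[:space:]]*${key}=" "$unit_file"; then
      echo "$key"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible invocation would be `missing_directives /etc/systemd/system/myapp.service`, where the path is only an example.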

Four‑Step Troubleshooting Guide

Confirm the service is enabled :

systemctl is-enabled nginx.service
# enabled (correct) or disabled (not enabled)
sudo systemctl enable nginx.service

Check status and logs :

systemctl status nginx.service
journalctl -u nginx.service -n 100 -f
journalctl -u nginx.service --since "2025-12-05 02:30:00" --until "2025-12-05 03:00:00"

Manually simulate the start sequence :

sudo -u appuser /opt/app/bin/start.sh
ss -tulnp | grep ':8080'
# -o picks the oldest matching PID, so pstree receives exactly one argument
pstree -p -s "$(pgrep -o -f "myapp")"

Validate dependencies :

systemctl list-dependencies nginx.service --all
systemctl status network-online.target
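The dependency step can be narrowed to one frequent mistake: a unit that wants network-online.target but is not ordered after it, or the reverse. The sketch below reads the effective properties with `systemctl show -p ... --value`; the function name `waits_for_network_online` is hypothetical, and the check covers only this one target.

```shell
#!/bin/bash
# Succeeds when the unit both pulls in network-online.target (Wants=/Requires=)
# and is ordered after it (After=). Prints a one-line verdict either way.
waits_for_network_online() {
  local unit="$1" after pulled
  after="$(systemctl show -p After --value "$unit")"
  pulled="$(systemctl show -p Wants --value "$unit") $(systemctl show -p Requires --value "$unit")"
  # Pad with spaces so only whole unit names match.
  if [[ " $after " != *" network-online.target "* ]]; then
    echo "$unit: After= lacks network-online.target"
    return 1
  fi
  if [[ " $pulled " != *" network-online.target "* ]]; then
    echo "$unit: nothing pulls in network-online.target"
    return 1
  fi
  echo "$unit: ordered after network-online.target"
}
```

Run as `waits_for_network_online nginx.service`; a non-zero exit status points at the ordering or the pull-in dependency as the next thing to inspect.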

Practical Service Unit Templates

Nginx

[Unit]
Description=The NGINX HTTP and reverse proxy server
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target

[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t
ExecStart=/usr/sbin/nginx
ExecReload=/usr/sbin/nginx -s reload
ExecStop=/bin/kill -s QUIT $MAINPID
Restart=on-failure
RestartSec=5
TimeoutStartSec=300

[Install]
WantedBy=multi-user.target

Java Application

[Unit]
Description=My Java Application
After=network.target mysql.service redis.service
Requires=mysql.service redis.service

[Service]
Type=simple
User=javaapp
Group=javaapp
WorkingDirectory=/opt/myapp
Environment="JAVA_OPTS=-Xms512m -Xmx2048m -Dspring.profiles.active=prod"
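# start.sh must keep the JVM in the foreground (for example via exec):
# with Type=simple, systemd treats the ExecStart process itself as the service.
ExecStart=/opt/myapp/bin/start.sh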
ExecStop=/opt/myapp/bin/stop.sh
Restart=always
RestartSec=10
TimeoutStartSec=600
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
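With Type=simple, systemd tracks the process started by ExecStart, so start.sh must not background the JVM. The following is one possible shape for such a launcher, offered as a sketch under assumptions: the function name `start_app`, the `JAVA_BIN` and `APP_JAR` variables, and the default jar path are all hypothetical, while `JAVA_OPTS` is the variable set by the unit's Environment= line.

```shell
#!/bin/bash
# Foreground launcher sketch for a Type=simple unit.
# JAVA_BIN and APP_JAR are hypothetical knobs; JAVA_OPTS comes from the unit file.
start_app() {
  local java_bin="${JAVA_BIN:-java}"
  local app_jar="${APP_JAR:-/opt/myapp/app.jar}"
  local -a opts
  # Split JAVA_OPTS on whitespace so each flag becomes its own argument.
  read -r -a opts <<< "${JAVA_OPTS:-}"
  # exec replaces the shell with the JVM, so systemd sees the JVM as the main PID
  # and Restart= reacts to the JVM exiting rather than to the wrapper script.
  exec "$java_bin" "${opts[@]}" -jar "$app_jar"
}
```

In a real start.sh the last line would simply call `start_app`. A script that instead ends with `java ... &` exits immediately, and systemd then considers the service stopped.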

Python Flask

[Unit]
Description=Flask Web Application
After=network.target

[Service]
Type=notify
User=flaskapp
Group=flaskapp
WorkingDirectory=/opt/flaskapp
ExecStart=/opt/flaskapp/venv/bin/gunicorn -w 4 -b 0.0.0.0:5000 app:app
Restart=on-failure
RestartSec=3
TimeoutStartSec=120
Environment="FLASK_ENV=production"
NotifyAccess=all

[Install]
WantedBy=multi-user.target
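A template only takes effect once it is installed, loaded, and enabled. The steps can be sketched as a small helper; the function name `install_unit` and the `UNIT_DIR` override are hypothetical, and the default directory assumes locally managed units live in /etc/systemd/system.

```shell
#!/bin/bash
# Install a unit file, reload systemd, and enable the unit for boot.
# UNIT_DIR can be overridden to rehearse against a scratch directory.
install_unit() {
  local src="$1"
  local unit_dir="${UNIT_DIR:-/etc/systemd/system}"
  local name
  name="$(basename "$src")"
  # Unit files should be world-readable but writable only by the owner.
  install -m 0644 "$src" "$unit_dir/$name" || return 1
  # systemd only notices new or changed unit files after a daemon-reload.
  systemctl daemon-reload || return 1
  # enable creates the symlink named by WantedBy=; --now also starts the unit.
  systemctl enable --now "$name"
}
```

Typical use would be `sudo bash -c 'source install_unit.sh; install_unit ./myapp.service'`, with the file names adjusted to the local layout.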

Chaos Engineering: Verifying Auto‑Start

Running a service in a test environment does not guarantee it will survive an unexpected reboot. A lightweight chaos‑drill validates the whole restart path.

Why Run Chaos Experiments?

"You believe it can auto‑start" ≠ "It actually does auto‑start" – a hard‑earned lesson from an operations director.

Goal: trigger a controlled reboot, then automatically verify that all critical services come up within a defined window.

Drill Plan

T+0 : Notify stakeholders.

T+5 min : Execute systemctl reboot.

T+10 min : Verify service status, API response, DB connectivity.

T+15 min : Generate a report with MTTR and any failures.

Sample Scripts

safe_reboot.sh – prepares the environment, records pre‑check status, and reboots the host.

#!/bin/bash
# Record the pre-reboot state of each critical service, then reboot the host.
CRITICAL_SERVICES=("nginx" "mysql" "myapp" "redis")
LOG_FILE="/var/log/reboot_drill_$(date +%Y%m%d_%H%M%S).log"

echo "[$(date)] Starting chaos drill: safe reboot" | tee -a "$LOG_FILE"
for service in "${CRITICAL_SERVICES[@]}"; do
  echo "[$(date)] Pre-check $service: $(systemctl is-active "$service")" | tee -a "$LOG_FILE"
done

echo "[$(date)] Executing reboot" | tee -a "$LOG_FILE"
sudo systemctl reboot

post_reboot_check.sh – runs after the system comes back, waits for network, then checks each service with a timeout.

#!/bin/bash
# Post-reboot verification: wait for the network, then poll each critical service.
LOG_FILE="/var/log/reboot_drill_$(date +%Y%m%d_%H%M%S).log"
CRITICAL_SERVICES=("nginx" "mysql" "myapp" "redis")
MAX_WAIT_TIME=300     # per-service limit, in seconds
CHECK_INTERVAL=10
ALL_SERVICES_UP=true  # set to false as soon as any service fails to start

echo "[$(date)] Post-reboot verification start" | tee -a "$LOG_FILE"
while ! ping -c 1 baidu.com &>/dev/null; do sleep 5; done

echo "[$(date)] Network ready" | tee -a "$LOG_FILE"
for service in "${CRITICAL_SERVICES[@]}"; do
  echo "[$(date)] Checking $service" | tee -a "$LOG_FILE"
  elapsed=0
  while [ "$elapsed" -lt "$MAX_WAIT_TIME" ]; do
    if systemctl is-active --quiet "$service"; then
      echo "[$(date)] $service started successfully" | tee -a "$LOG_FILE"
      break
    fi
    sleep "$CHECK_INTERVAL"
    elapsed=$((elapsed + CHECK_INTERVAL))
    echo "[$(date)] $service still starting... ($elapsed/$MAX_WAIT_TIME)" | tee -a "$LOG_FILE"
  done
  if ! systemctl is-active --quiet "$service"; then
    echo "[$(date)] $service failed to start" | tee -a "$LOG_FILE"
    ALL_SERVICES_UP=false
  fi
done

# SECONDS is a bash builtin that counts seconds since this script started.
if [ "$ALL_SERVICES_UP" = "true" ]; then
  echo "[$(date)] [SUCCESS] All services active $SECONDS seconds after the check began" | tee -a "$LOG_FILE"
else
  echo "[$(date)] [FAILED] Some services did not auto-start" | tee -a "$LOG_FILE"
fi
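Something has to launch post_reboot_check.sh once the host is back. One possible approach, sketched here under assumptions, is a oneshot systemd unit that is enabled before the drill. The function name `write_postcheck_unit`, the unit name reboot-drill-check.service, and the default script path are all hypothetical.

```shell
#!/bin/bash
# Write a oneshot unit that runs the post-reboot check once per boot.
# $1: directory for the unit file, $2: path to the check script (both optional).
write_postcheck_unit() {
  local unit_dir="${1:-/etc/systemd/system}"
  local script_path="${2:-/usr/local/bin/post_reboot_check.sh}"
  cat > "$unit_dir/reboot-drill-check.service" <<EOF
[Unit]
Description=Post-reboot drill verification
Wants=network-online.target
After=network-online.target

[Service]
# oneshot: systemd waits for the script to exit and records its exit status.
Type=oneshot
ExecStart=$script_path

[Install]
WantedBy=multi-user.target
EOF
}
```

After writing the file, `systemctl daemon-reload` followed by `systemctl enable reboot-drill-check.service` would arm it for the next boot; disabling it after the drill keeps it from running on every later reboot.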

Best Practices for Service Auto‑Start

Standardization : create reusable .service templates per application type and store them in configuration management.

Version Control : keep unit files in a Git repository so every change is tracked.

Checklist (run before release):

WantedBy=multi-user.target set?

All dependencies declared in After/Requires?

TimeoutStartSec reasonable for the workload?

Restart policy configured?

Correct user/group permissions?

Automation : schedule a daily script that verifies systemctl is-enabled for all critical services and sends alerts if any are disabled.

Chaos Regularization : run the reboot drill monthly for core services and quarterly for less critical ones; track success rate and MTTR.
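The daily check described under Automation could look roughly like this. It is a sketch, not a finished monitoring job: the function name `audit_enabled` is hypothetical, alert delivery is left to whatever wraps the script, and it assumes that only the state `enabled` is acceptable for a critical service (states such as `enabled-runtime` do not survive a reboot).

```shell
#!/bin/bash
# Print every service whose enablement state is anything other than "enabled".
# Returns non-zero when at least one such service is found, so a cron job or
# timer wrapper can alert on the exit code.
audit_enabled() {
  local service state failed=0
  for service in "$@"; do
    # is-enabled exits non-zero for disabled or unknown units; keep going anyway.
    state="$(systemctl is-enabled "$service" 2>/dev/null)" || true
    if [ "$state" != "enabled" ]; then
      echo "$service: ${state:-unknown}"
      failed=1
    fi
  done
  return "$failed"
}
```

For the services used in the drill scripts this would be called as `audit_enabled nginx mysql myapp redis`, with any output forwarded to the alerting channel.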

Conclusion

By applying a disciplined systemd configuration, following a systematic troubleshooting workflow, and regularly exercising a lightweight chaos‑engineering drill, teams can turn the “service disappears after reboot” nightmare into a predictable, quickly recoverable event, moving from mere uptime promises to true self‑healing resilience.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
