
How to Patch 500 Servers Overnight with Ansible: A Complete Unattended Automation Guide

Discover a step‑by‑step, fully automated Ansible workflow that patches 500 servers overnight, covering environment setup, inventory design, playbook configuration, rolling updates, health checks, rollback handling, performance tuning, and real‑world case studies for emergency patches and kernel upgrades.

MaGe Linux Operations

Overview

Quarterly security patching is a major pain point for operations teams. The traditional SSH‑jump‑host + manual script approach caused three sleepless nights for a 500‑machine fleet, with patch‑dependency conflicts and network saturation. To solve this, the team rebuilt the process with Ansible, achieving fully unattended patching from 22:00 to 06:00 across all servers.

Technical Features

Idempotency : Ansible’s yum/apt modules are inherently idempotent, allowing safe retries after interruptions.

Rolling Updates : The serial parameter limits concurrency to avoid overloading services.

Automatic Rollback : Health checks stop further batches on failure, leaving the current state for manual intervention.

Detailed Logging : Every host’s result is logged for audit and troubleshooting.
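
The serial keyword behind these rolling updates accepts several forms (all standard Ansible play keywords) – a sketch, choosing one form per play:

```yaml
# Alternative forms of the play-level "serial" keyword (pick one):
serial: 50              # fixed batch: 50 hosts at a time
serial: "20%"           # percentage of the targeted hosts per batch
serial: [1, 10, "20%"]  # escalating batches: canary first, then wider waves
```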

Applicable Scenarios

Quarterly or monthly bulk security patching within a maintenance window.

Urgent 0‑day vulnerability remediation (e.g., Log4j).

Kernel upgrades that require ordered reboots and service verification.

Environment Requirements

Ansible control node : CentOS 7+ or Ubuntu 18.04+, preferably on a dedicated jump host.

Ansible version : 2.9+ (2.12+ recommended for performance improvements).

Python : 3.6+ installed on target machines.

Target servers : CentOS 7/8, Ubuntu 18.04/20.04/22.04 – mixed fleets can share one playbook via ansible_os_family conditionals (as the playbook below does) or be split into separate plays.

Network : SSH reachable from the control node; SSH key authentication is advised.

Detailed Steps

1. Preparation

System Check

# Check Ansible version
ansible --version

# Test connectivity to a sample of hosts (inventory names include the
# domain, so --limit needs wildcards)
ansible -i inventory/hosts.ini webservers -m ping --limit 'web-001*,web-002*,web-003*'

# Verify disk space for patch packages (escape $5 so the local shell
# does not expand it before it reaches awk)
ansible -i inventory/hosts.ini all -m shell -a "df -h / | tail -1 | awk '{print \$5}'" --limit 'web-001*'

Install Dependencies

# CentOS/RHEL
sudo yum install -y epel-release
sudo yum install -y ansible

# Ubuntu/Debian
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install -y ansible

# Optional collections for richer modules
ansible-galaxy collection install ansible.posix
ansible-galaxy collection install community.general

Configure Ansible

# Project directory
mkdir -p ~/ansible-patching/{inventory,group_vars,roles,logs}
cd ~/ansible-patching

# ansible.cfg
cat > ansible.cfg <<'EOF'
[defaults]
inventory = ./inventory/hosts.ini
remote_user = ops
private_key_file = ~/.ssh/ops_key
host_key_checking = False
timeout = 30
forks = 20
log_path = ./logs/ansible.log
callback_whitelist = profile_tasks  ; renamed "callbacks_enabled" in Ansible 2.11+

[privilege_escalation]
become = True
become_method = sudo
become_user = root

[ssh_connection]
pipelining = True
control_path = /tmp/ansible-%h-%p-%r
EOF

2. Core Configuration

Host Inventory

# inventory/hosts.ini
[webservers]
web-[001:100].prod.internal

[appservers]
app-[001:150].prod.internal

[dbservers]
db-[001:050].prod.internal

[cacheservers]
redis-[001:030].prod.internal
memcache-[001:020].prod.internal

[dc1]
web-[001:050].prod.internal
app-[001:075].prod.internal

[dc2]
web-[051:100].prod.internal
app-[076:150].prod.internal

[canary]
web-001.prod.internal
app-001.prod.internal
redis-001.prod.internal

[batch1]
web-[002:020].prod.internal
app-[002:030].prod.internal

[batch2]
web-[021:050].prod.internal
app-[031:075].prod.internal
# ... additional batches as needed

Note: The inventory is designed around three dimensions – service type, data‑center, and batch – enabling flexible targeting such as “only update this rack” or “only web tier”.
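
A practical consequence of this design: Ansible host patterns can intersect (`:&`) and exclude (`:!`) groups directly on the command line, so the dimensions combine without defining extra groups – a sketch using the group names above:

```shell
# Ansible host patterns combine groups: ":&" intersects, ":!" excludes,
# so the three inventory dimensions compose without extra groups.
# Group names below are the ones defined in inventory/hosts.ini.
PATTERN_WEB_DC1='webservers:&dc1'   # web tier, but only hosts in dc1
PATTERN_NO_DB='all:!dbservers'      # entire fleet except databases

# Preview the selection before patching (if ansible is installed):
if command -v ansible >/dev/null 2>&1; then
  ansible -i inventory/hosts.ini "$PATTERN_WEB_DC1" --list-hosts || true
fi
echo "pattern: $PATTERN_WEB_DC1"
```

The same patterns work as the `target_hosts` value passed to the playbook with `-e`.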

Variable Configuration

# group_vars/all.yml
---
patching:
  allow_reboot: true
  reboot_delay: 30
  reboot_timeout: 600
  health_check_delay: 60
  exclude_packages:
    - kernel*
    - docker*
  security_only: true

notification:
  webhook_url: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
  enabled: true
# group_vars/dbservers.yml
---
patching:
  allow_reboot: false  # databases should not auto‑reboot
  exclude_packages:
    - kernel*
    - mysql*
    - mariadb*

Parameter explanations:

allow_reboot : Whether the host may be rebooted automatically after a patch.

reboot_timeout : Maximum time to wait for a reboot to finish; increase for slow hardware.

exclude_packages : Packages that must never be upgraded.

security_only : Install only security updates, not full upgrades.

Core Playbook

# playbooks/patching.yml
---
- name: System patch update
  hosts: "{{ target_hosts | default('all') }}"
  gather_facts: yes
  serial: "{{ batch_size | default(50) }}"
  max_fail_percentage: 10

  pre_tasks:
    - name: Send start notification
      uri:
        url: "{{ notification.webhook_url }}"
        method: POST
        body_format: json
        body:
          msgtype: "text"
          text:
            content: "Patch update started: {{ inventory_hostname }} (batch {{ ansible_play_batch }})"
      delegate_to: localhost
      when: notification.enabled | default(false)
      run_once: yes
      ignore_errors: yes

  tasks:
    - name: Check disk space (>1GB)
      assert:
        that:
          - ansible_mounts | selectattr('mount','equalto','/') | map(attribute='size_available') | first | int > 1073741824
        fail_msg: "Root partition has less than 1GB free, skipping host"
      tags: precheck

    - name: Record package list before update (RHEL)
      shell: rpm -qa --queryformat '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > /tmp/packages_before_{{ ansible_date_time.date }}.txt
      when: ansible_os_family == "RedHat"
      tags: precheck

    - name: Update YUM cache
      yum:
        update_cache: yes
      when: ansible_os_family == "RedHat"

    - name: Install security patches (RHEL)
      yum:
        name: '*'
        state: latest
        security: "{{ patching.security_only | default(true) }}"
        exclude: "{{ patching.exclude_packages | default([]) }}"
      register: yum_result
      when: ansible_os_family == "RedHat"

    - name: Update APT cache (Debian)
      apt:
        update_cache: yes
        cache_valid_time: 3600
      when: ansible_os_family == "Debian"

    # The apt module has no security-only switch, so "upgrade: dist"
    # applies all pending updates; for security-only behaviour on
    # Debian/Ubuntu, run unattended-upgrade instead.
    - name: Install updates (Debian)
      apt:
        upgrade: dist
        update_cache: yes
      register: apt_result
      when: ansible_os_family == "Debian"

    - name: Detect reboot requirement (Debian)
      stat:
        path: /var/run/reboot-required
      register: reboot_required_file
      when: ansible_os_family == "Debian"

    - name: Detect reboot requirement (RHEL)
      command: needs-restarting -r
      register: needs_restarting
      failed_when: false
      changed_when: false
      when: ansible_os_family == "RedHat"

    - name: Set reboot flag
      set_fact:
        needs_reboot: >-
          {{ (ansible_os_family == "Debian" and reboot_required_file.stat.exists | default(false)) or
             (ansible_os_family == "RedHat" and needs_restarting.rc == 1) }}

    - name: Reboot server if needed
      reboot:
        reboot_timeout: "{{ patching.reboot_timeout | default(600) }}"
        pre_reboot_delay: "{{ patching.reboot_delay | default(30) }}"
        post_reboot_delay: 30
        msg: "Ansible patch reboot"
      # set_fact with ">-" stores a string, hence the explicit bool cast
      when:
        - needs_reboot | bool
        - patching.allow_reboot | default(true)

    - name: Record package list after update (RHEL)
      shell: |
        rpm -qa --queryformat '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > /tmp/packages_after_{{ ansible_date_time.date }}.txt
        diff /tmp/packages_before_{{ ansible_date_time.date }}.txt /tmp/packages_after_{{ ansible_date_time.date }}.txt > /tmp/packages_diff_{{ ansible_date_time.date }}.txt || true
      when: ansible_os_family == "RedHat"

    - name: Health check
      uri:
        url: "http://localhost:{{ health_check_port | default(8080) }}/health"
        status_code: 200
        timeout: 30
      register: health_check
      retries: 3
      delay: 10
      until: health_check.status == 200
      when: health_check_port is defined
      ignore_errors: yes

    - name: Send completion notification
      uri:
        url: "{{ notification.webhook_url }}"
        method: POST
        body_format: json
        body:
          msgtype: "text"
          text:
            content: "Patch update completed: {{ inventory_hostname }} - Status: {{ 'SUCCESS' if not (health_check.failed | default(false)) else 'FAILED' }}"
      delegate_to: localhost
      when: notification.enabled | default(false)
      ignore_errors: yes
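
The play above tolerates up to max_fail_percentage failed hosts per batch. If any single failure should halt the entire run instead (for example during a risky change window), any_errors_fatal is the stricter standard alternative – a sketch of the play header only:

```yaml
# Stricter variant of the play header: one failed host aborts the whole
# run immediately, instead of tolerating up to 10% failures per batch.
- name: System patch update (strict)
  hosts: "{{ target_hosts | default('all') }}"
  gather_facts: yes
  serial: "{{ batch_size | default(50) }}"
  any_errors_fatal: true
```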

3. Execution and Validation

Canary Run

# Test on canary hosts first
ansible-playbook playbooks/patching.yml -e "target_hosts=canary" -e "batch_size=1" --check

# If check passes, run the canary batch for real
ansible-playbook playbooks/patching.yml -e "target_hosts=canary" -e "batch_size=1"

# Then roll out the remaining batches
ansible-playbook playbooks/patching.yml -e "target_hosts=batch1" -e "batch_size=10"
ansible-playbook playbooks/patching.yml -e "target_hosts=batch2" -e "batch_size=20"
# Or run everything at once; serial controls concurrency
ansible-playbook playbooks/patching.yml -e "target_hosts=all" -e "batch_size=30"

Verification

# Follow live logs
tail -f logs/ansible.log

# Summarise results
ansible -i inventory/hosts.ini all -m shell -a "cat /tmp/packages_diff_*.txt 2>/dev/null | head -20" --limit 'web-001'

# Check service status (example nginx)
ansible -i inventory/hosts.ini webservers -m shell -a "systemctl is-active nginx"
# Expected output: each host returns "active"
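
To get a one-line pass/fail summary after a run, the PLAY RECAP lines at the end of ansible.log can be counted with standard tools – a small sketch (the recap format shown is Ansible's default; adjust the patterns if you use a custom callback):

```shell
# Count total and failed/unreachable hosts from Ansible PLAY RECAP lines,
# which look like:
#   web-001 : ok=12 changed=3 unreachable=0 failed=0 skipped=1
summarize_recap() {
  local log="$1"
  local total failed
  total=$(grep -cE 'ok=[0-9]+' "$log")
  failed=$(grep -cE 'unreachable=[1-9]|failed=[1-9]' "$log")
  echo "hosts=${total} failed=${failed}"
}

# Example: summarize_recap logs/ansible.log
```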

Complete Configuration Example

Unattended Scheduling Script

#!/bin/bash
# /opt/ansible-patching/run_patching.sh – unattended patch scheduler
set -e
WORK_DIR="/opt/ansible-patching"
LOG_DIR="${WORK_DIR}/logs"
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="${LOG_DIR}/patching_${DATE}.log"
cd ${WORK_DIR}

send_notification() {
  local message="$1"
  local webhook_url="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
  curl -s -X POST ${webhook_url} \
    -H 'Content-Type: application/json' \
    -d "{\"msgtype\": \"text\", \"text\": {\"content\": \"${message}\"}}" > /dev/null 2>&1 || true
}

run_batch() {
  local batch_name="$1"
  local batch_size="$2"
  echo "[$(date)] Starting batch: ${batch_name}" | tee -a ${LOG_FILE}
  send_notification "[Patch] Starting batch: ${batch_name}"
  if ansible-playbook playbooks/patching.yml -e "target_hosts=${batch_name}" -e "batch_size=${batch_size}" >> ${LOG_FILE} 2>&1; then
    echo "[$(date)] Batch ${batch_name} succeeded" | tee -a ${LOG_FILE}
    send_notification "[Patch] Batch ${batch_name} succeeded"
    return 0
  else
    echo "[$(date)] Batch ${batch_name} failed" | tee -a ${LOG_FILE}
    send_notification "[Patch] Batch ${batch_name} failed, aborting"
    return 1
  fi
}

main() {
  send_notification "[Patch] Starting full run, expected 6‑8h"
  if ! run_batch "canary" 1; then
    send_notification "[Patch] Canary failed, aborting"
    exit 1
  fi
  echo "[$(date)] Waiting 10 minutes to observe canary..." | tee -a ${LOG_FILE}
  sleep 600
  local batches=("batch1:10" "batch2:20" "batch3:30" "batch4:50" "batch5:50")
  for batch_cfg in "${batches[@]}"; do
    batch_name="${batch_cfg%%:*}"
    batch_size="${batch_cfg##*:}"
    if ! run_batch "${batch_name}" "${batch_size}"; then
      send_notification "[Patch] Batch ${batch_name} failed, aborting"
      exit 1
    fi
    echo "[$(date)] Waiting 5 minutes before next batch..." | tee -a ${LOG_FILE}
    sleep 300
  done
  send_notification "[Patch] All batches completed – verify services"
}

main "$@"

Cron Scheduling

# /etc/cron.d/ansible-patching
# Goal: 22:00 on the first Saturday of each month. cron treats day-of-month
# and day-of-week as OR when both are restricted, so "0 22 1-7 * 6" would
# also fire every day 1-7; guard the weekday in the command instead
# (% must be escaped in crontab files).
0 22 1-7 * * root [ "$(date +\%u)" -eq 6 ] && /opt/ansible-patching/run_patching.sh

Real‑World Cases

Emergency Vulnerability Fix

Scenario: A critical glibc 0‑day was disclosed; the security team required the patch to be applied across the entire fleet before the end of the day.

# playbooks/emergency_patch.yml
---
- name: Emergency patch
  hosts: "{{ target_hosts | default('all') }}"
  gather_facts: yes
  serial: 100  # high concurrency for emergencies
  tasks:
    - name: Install specific packages
      yum:
        name: "{{ emergency_packages }}"
        state: latest
      when: ansible_os_family == "RedHat"

    - name: Verify installed versions
      shell: "rpm -q {{ item }}"
      loop: "{{ emergency_packages }}"
      register: version_check
      when: ansible_os_family == "RedHat"

    - name: Show versions
      debug:
        msg: "{{ version_check.results | map(attribute='stdout') | list }}"
      when: ansible_os_family == "RedHat"
# Execute
ansible-playbook playbooks/emergency_patch.yml \
  -e "target_hosts=all" \
  -e '{"emergency_packages": ["glibc", "glibc-common"]}' \
  --forks 100

Result: All 500 servers patched in 45 minutes.

Kernel Upgrade with Rolling Reboot

Scenario: A kernel bug required an upgrade without service interruption.

# playbooks/kernel_upgrade.yml
---
- name: Kernel upgrade
  hosts: "{{ target_hosts }}"
  gather_facts: yes
  serial: 1  # one host at a time
  tasks:
    - name: Drain from load balancer
      uri:
        url: "http://{{ lb_api }}/api/v1/upstream/{{ inventory_hostname }}/down"
        method: POST
      delegate_to: localhost
      when: lb_api is defined

    - name: Wait for connections to drain
      wait_for:
        timeout: 60

    - name: Upgrade kernel
      yum:
        name: kernel
        state: latest
      register: kernel_update

    - name: Reboot if kernel changed
      reboot:
        reboot_timeout: 900
        msg: "Ansible kernel upgrade reboot"
      when: kernel_update.changed

    - name: Verify kernel version
      shell: uname -r
      register: kernel_version

    - name: Re‑add to load balancer
      uri:
        url: "http://{{ lb_api }}/api/v1/upstream/{{ inventory_hostname }}/up"
        method: POST
      delegate_to: localhost
      when: lb_api is defined

    - name: Wait for service to recover
      wait_for:
        timeout: 30

Best Practices & Caveats

Performance Optimisation

Enable SSH pipelining to reduce connection overhead:

# ansible.cfg
[ssh_connection]
pipelining = True

Adjust forks based on control‑node resources and network bandwidth (e.g., 50‑100 for a 4‑core node, lower if patch packages are large).
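
There is no universal formula, but a starting point (a heuristic, not an Ansible default) is roughly 15 forks per control-node core, capped at the batch size, since forks beyond the number of hosts in a batch sit idle:

```shell
# Rule-of-thumb sizing for "forks" (a heuristic assumption, tune to your
# network and package sizes): ~15 forks per core, but never more than
# the number of hosts patched concurrently (the serial batch size).
suggest_forks() {
  local cores="$1" batch_size="$2"
  local f=$(( cores * 15 ))
  if [ "$f" -gt "$batch_size" ]; then
    f="$batch_size"
  fi
  echo "$f"
}

suggest_forks 4 100   # 4-core control node, 100-host batches: prints 60
```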

Use Mitogen for reported 2‑5× speed gains (verify first that your ansible-core version is supported – Mitogen compatibility often lags recent releases):

pip install mitogen
# ansible.cfg additions
[defaults]
strategy_plugins = /path/to/mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear

Security Hardening

Use Ansible Vault to encrypt sensitive variables such as webhook keys and sudo passwords:

# ansible-vault create group_vars/all/vault.yml
vault_webhook_key: "your-secret-key"
vault_ssh_password: "your-password"

Restrict sudo to only required commands:

# /etc/sudoers.d/ops
ops ALL=(root) NOPASSWD: /usr/bin/yum, /usr/bin/apt, /usr/sbin/reboot, /usr/bin/systemctl

Audit logging – ensure log_path is enabled and retained.

High‑Availability

Deploy multiple Ansible control nodes to avoid a single point of failure.

Consider AWX/Tower for UI, RBAC, and execution history.

Version‑control all playbooks in Git.

Common Pitfalls

SSH timeout : Verify SSH daemon is running and network routes are open.

Sudo password errors : Use NOPASSWD or store credentials in Vault.

Yum lock conflicts : Ensure no other package manager runs; kill stray processes if needed.

Reboot failures : Increase reboot_timeout and check system logs.

Package exclusion mistakes : Review exclude_packages to avoid unintentionally skipping critical updates.
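
For the yum lock case, a quick way to tell a live lock from a stale one is to test whether the PID in yum's pidfile still exists – a sketch (pidfile path per yum's convention on CentOS/RHEL; only remove a lock you have confirmed stale):

```shell
# Classify yum's lock: absent, held by a live process, or stale.
# /var/run/yum.pid is yum's conventional pidfile on CentOS/RHEL.
yum_lock_state() {
  local pidfile="${1:-/var/run/yum.pid}"
  if [ ! -f "$pidfile" ]; then
    echo "no-lock"
    return
  fi
  local pid
  pid=$(cat "$pidfile")
  if kill -0 "$pid" 2>/dev/null; then
    echo "held-by-$pid"    # live process: wait for it, do not delete the lock
  else
    echo "stale"           # safe to remove the pidfile and retry
  fi
}
```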

Monitoring

Key metrics to watch during a patch run:

Batch success rate – aim for 100 %, alert if < 95 %.

Per‑host execution time – 3‑10 min typical; > 30 min indicates a problem.

Control‑node CPU – keep below 50 %; alert > 80 %.

Network bandwidth – keep usage under 70 %; alert > 90 %.

# Example Prometheus rule for batch failures
groups:
- name: ansible-patching
  rules:
  - alert: PatchingBatchFailed
    expr: ansible_batch_failed_hosts > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Patch batch failure"
      description: "Batch {{ $labels.batch }} has {{ $value }} failed hosts"

Backup & Recovery

Before patching, snapshot VMs or take volume snapshots (vCenter, AWS, etc.) so you can roll back if a batch fails.

# Snapshot script (pseudo‑code)
#!/bin/bash
HOSTS_FILE=$1
DATE=$(date +%Y%m%d)
while read host; do
  echo "Creating snapshot for ${host}..."
  # VMware example
  # govc vm.snapshot.create -vm="${host}" "pre-patching-${DATE}"
  # AWS example
  # aws ec2 create-snapshot --volume-id $(get_volume_id ${host}) --description "pre-patching-${DATE}"
done < ${HOSTS_FILE}

Recovery steps:

Stop affected services:

ansible -i inventory/hosts.ini target_host -m service -a "name=nginx state=stopped"

Restore the previously taken snapshot via your cloud/virtualisation platform.

Validate system health and data integrity.

Start services again:

ansible -i inventory/hosts.ini target_host -m service -a "name=nginx state=started"

Conclusion

The described Ansible‑based workflow turns a multi‑day, error‑prone manual patching process into a reliable, fully unattended nightly job for hundreds of servers. By leveraging idempotent modules, rolling updates, health checks, and detailed logging, teams gain predictability, faster remediation, and clear audit trails while retaining the flexibility to handle emergencies and kernel upgrades.

Tags: Automation, DevOps, Patch management, Ansible, Rolling Update, Server maintenance
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
