How to Patch 500 Servers Overnight with Ansible: A Complete Unattended Automation Guide
Discover a step‑by‑step, fully automated Ansible workflow that patches 500 servers overnight, covering environment setup, inventory design, playbook configuration, rolling updates, health checks, rollback handling, performance tuning, and real‑world case studies for emergency patches and kernel upgrades.
Overview
Quarterly security patching is a major pain point for operations teams. The traditional approach of a jump host plus hand-run SSH scripts once cost this team three sleepless nights for a 500-machine fleet, between patch-dependency conflicts and network saturation. The team therefore rebuilt the process with Ansible and now patches all 500 servers fully unattended between 22:00 and 06:00.
Technical Features
Idempotency: Ansible's yum/apt modules are inherently idempotent, so an interrupted run can be safely retried.
Rolling updates: The serial keyword limits per-batch concurrency so the service tier is never fully drained.
Fail-safe halting: A failed health check stops the remaining batches, leaving the failed hosts in their current state for manual intervention (true rollback relies on the pre-patch snapshots covered later).
Detailed logging: Every host's result is logged for audit and troubleshooting.
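Two of these guarantees, rolling updates and fail-safe halting, come from a pair of play-level keywords; a minimal sketch (the host group, batch size, and threshold here are placeholders, not recommendations):

```yaml
# Sketch: rolling batches with a fail-safe halt.
- hosts: webservers
  serial: 25                # patch 25 hosts per batch
  max_fail_percentage: 10   # abort the remaining batches if >10% of a batch fails
  tasks:
    - name: Example idempotent patch step
      ansible.builtin.yum:
        name: '*'
        state: latest
        security: yes
```

With serial set, Ansible finishes each batch before starting the next; once max_fail_percentage is exceeded in any batch, the play ends and the untouched batches stay on their current patch level.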
Applicable Scenarios
Quarterly or monthly bulk security patching within a maintenance window.
Urgent 0‑day vulnerability remediation (e.g., Log4j).
Kernel upgrades that require ordered reboots and service verification.
Environment Requirements
Ansible control node: CentOS 7+ or Ubuntu 18.04+, preferably a dedicated jump host.
Ansible version: 2.9+ (2.12+ recommended for its performance improvements).
Python: 3.6+ installed on target machines.
Target servers: CentOS 7/8 and Ubuntu 18/20/22; the playbooks below branch on ansible_os_family to handle the mixed fleet.
Network: SSH reachable from the control node; SSH key authentication is advised.
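If a wrapper script should enforce the version floor before running anything, a sort -V comparison works on any GNU system (version_ge is a hypothetical helper, and the version-extraction grep is an assumption about ansible --version output):

```shell
#!/bin/bash
# version_ge A B: succeeds when version A >= version B (uses GNU sort -V).
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Pull the version number out of `ansible --version`, e.g. "ansible [core 2.14.1]".
installed="$(ansible --version 2>/dev/null | head -n1 | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1)" || true
if version_ge "${installed:-0}" "2.9"; then
    echo "Ansible ${installed} meets the 2.9+ requirement"
else
    echo "Ansible too old or missing (found: ${installed:-none})" >&2
fi
```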
Detailed Steps
1. Preparation
System Check
# Check Ansible version
ansible --version
# Test connectivity to a sample of hosts
ansible -i inventory/hosts.ini webservers -m ping --limit 'web-001,web-002,web-003'
# Verify disk space for patch packages (note: \$5 keeps the local shell from expanding $5)
ansible -i inventory/hosts.ini all -m shell -a "df -h / | tail -1 | awk '{print \$5}'" --limit 'web-001'

Install Dependencies
# CentOS/RHEL
sudo yum install -y epel-release
sudo yum install -y ansible
# Ubuntu/Debian
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install -y ansible
# Optional collections for richer modules
ansible-galaxy collection install ansible.posix
ansible-galaxy collection install community.general

Configure Ansible
# Project directory
mkdir -p ~/ansible-patching/{inventory,group_vars,roles,logs}
cd ~/ansible-patching
# ansible.cfg
cat > ansible.cfg <<'EOF'
[defaults]
inventory = ./inventory/hosts.ini
remote_user = ops
private_key_file = ~/.ssh/ops_key
host_key_checking = False
timeout = 30
forks = 20
log_path = ./logs/ansible.log
callback_whitelist = profile_tasks
[privilege_escalation]
become = True
become_method = sudo
become_user = root
[ssh_connection]
pipelining = True
control_path = /tmp/ansible-%h-%p-%r
EOF

2. Core Configuration
Host Inventory
# inventory/hosts.ini
[webservers]
web-[001:100].prod.internal
[appservers]
app-[001:150].prod.internal
[dbservers]
db-[001:050].prod.internal
[cacheservers]
redis-[001:030].prod.internal
memcache-[001:020].prod.internal
[dc1]
web-[001:050].prod.internal
app-[001:075].prod.internal
[dc2]
web-[051:100].prod.internal
app-[076:150].prod.internal
[canary]
web-001.prod.internal
app-001.prod.internal
redis-001.prod.internal
[batch1]
web-[002:020].prod.internal
app-[002:030].prod.internal
[batch2]
web-[021:050].prod.internal
app-[031:075].prod.internal
# ... additional batches as needed

Note: The inventory is designed around three dimensions – service type, data center, and batch – enabling flexible targeting such as "only this data center" or "only the web tier".
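As a quick sanity check of what a range pattern like web-[001:100] will expand to, ansible-inventory -i inventory/hosts.ini --graph is authoritative; away from the control node, GNU seq can mimic the zero-padded numbering (expand_hosts is a convenience sketch, not Ansible's own expansion code):

```shell
#!/bin/bash
# expand_hosts PREFIX START END SUFFIX: print zero-padded hostnames, one per line.
expand_hosts() {
    local prefix="$1" start="$2" end="$3" suffix="$4"
    seq -f "${prefix}%03g${suffix}" "$start" "$end"
}

expand_hosts "web-" 1 3 ".prod.internal"
# Count the full web tier: prints 100.
expand_hosts "web-" 1 100 ".prod.internal" | wc -l
```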
Variable Configuration
# group_vars/all.yml
---
patching:
  allow_reboot: true
  reboot_delay: 30
  reboot_timeout: 600
  health_check_delay: 60
  exclude_packages:
    - kernel*
    - docker*
  security_only: true
notification:
  webhook_url: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
  enabled: true

# group_vars/dbservers.yml
---
patching:
  allow_reboot: false  # databases should not auto-reboot
  exclude_packages:
    - kernel*
    - mysql*
    - mariadb*

Parameter explanations:
allow_reboot: whether the host may be rebooted automatically after a patch.
reboot_timeout: maximum time to wait for a reboot to finish; increase for slow hardware.
exclude_packages: packages that must never be upgraded.
security_only: install only security updates, not full upgrades.
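One caveat on these group files: with Ansible's default hash_behaviour = replace, a patching dict defined in group_vars/dbservers.yml replaces the whole patching dict from all.yml rather than merging into it, so keys like security_only and reboot_timeout silently fall back to the | default(...) values in the playbook. To keep behaviour explicit, restate inherited keys in the child group (or deliberately use the combine filter):

```yaml
# group_vars/dbservers.yml – restate inherited keys, since dicts replace rather than merge
---
patching:
  allow_reboot: false    # databases should not auto-reboot
  reboot_timeout: 600    # restated from all.yml
  security_only: true    # restated from all.yml
  exclude_packages:
    - kernel*
    - mysql*
    - mariadb*
```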
Core Playbook
# playbooks/patching.yml
---
- name: System patch update
  hosts: "{{ target_hosts | default('all') }}"
  gather_facts: yes
  serial: "{{ batch_size | default(50) }}"
  max_fail_percentage: 10
  pre_tasks:
    - name: Send start notification
      uri:
        url: "{{ notification.webhook_url }}"
        method: POST
        body_format: json
        body:
          msgtype: "text"
          text:
            content: "Patch update started: {{ inventory_hostname }} (batch {{ ansible_play_batch }})"
      delegate_to: localhost
      when: notification.enabled | default(false)
      run_once: yes
      ignore_errors: yes
  tasks:
    - name: Check disk space (>1GB)
      assert:
        that:
          - ansible_mounts | selectattr('mount', 'equalto', '/') | map(attribute='size_available') | first | int > 1073741824
        fail_msg: "Root partition has less than 1GB free, skipping host"
      tags: precheck
    - name: Record package list before update (RHEL)
      shell: rpm -qa --queryformat '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > /tmp/packages_before_{{ ansible_date_time.date }}.txt
      when: ansible_os_family == "RedHat"
      tags: precheck
    - name: Update YUM cache
      yum:
        update_cache: yes
      when: ansible_os_family == "RedHat"
    - name: Install security patches (RHEL)
      yum:
        name: '*'
        state: latest
        security: "{{ patching.security_only | default(true) }}"
        exclude: "{{ patching.exclude_packages | default([]) }}"
      register: yum_result
      when: ansible_os_family == "RedHat"
    - name: Update APT cache (Debian)
      apt:
        update_cache: yes
        cache_valid_time: 3600
      when: ansible_os_family == "Debian"
    - name: Install security patches (Debian)
      apt:
        upgrade: dist
        update_cache: yes
      register: apt_result
      when: ansible_os_family == "Debian"
    - name: Detect reboot requirement (Debian)
      stat:
        path: /var/run/reboot-required
      register: reboot_required_file
      when: ansible_os_family == "Debian"
    - name: Detect reboot requirement (RHEL)
      command: needs-restarting -r
      register: needs_restarting
      failed_when: false
      changed_when: false
      when: ansible_os_family == "RedHat"
    - name: Set reboot flag
      set_fact:
        needs_reboot: >-
          {{ (ansible_os_family == "Debian" and reboot_required_file.stat.exists | default(false)) or
             (ansible_os_family == "RedHat" and needs_restarting.rc == 1) }}
    - name: Reboot server if needed
      reboot:
        reboot_timeout: "{{ patching.reboot_timeout | default(600) }}"
        pre_reboot_delay: "{{ patching.reboot_delay | default(30) }}"
        post_reboot_delay: 30
        msg: "Ansible patch reboot"
      when:
        - needs_reboot | bool
        - patching.allow_reboot | default(true)
    - name: Record package list after update (RHEL)
      shell: |
        rpm -qa --queryformat '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > /tmp/packages_after_{{ ansible_date_time.date }}.txt
        diff /tmp/packages_before_{{ ansible_date_time.date }}.txt /tmp/packages_after_{{ ansible_date_time.date }}.txt > /tmp/packages_diff_{{ ansible_date_time.date }}.txt || true
      when: ansible_os_family == "RedHat"
    - name: Health check
      uri:
        url: "http://localhost:{{ health_check_port | default(8080) }}/health"
        status_code: 200
        timeout: 30
      register: health_check
      retries: 3
      delay: 10
      until: health_check.status == 200
      when: health_check_port is defined
      ignore_errors: yes
    - name: Send completion notification
      uri:
        url: "{{ notification.webhook_url }}"
        method: POST
        body_format: json
        body:
          msgtype: "text"
          text:
            content: "Patch update completed: {{ inventory_hostname }} - Status: {{ 'SUCCESS' if not (health_check.failed | default(false)) else 'FAILED' }}"
      delegate_to: localhost
      when: notification.enabled | default(false)
      ignore_errors: yes

3. Execution and Validation
Canary Run
# Test on canary hosts first
ansible-playbook playbooks/patching.yml -e "target_hosts=canary" -e "batch_size=1" --check
# If check passes, run the canary batch for real
ansible-playbook playbooks/patching.yml -e "target_hosts=canary" -e "batch_size=1"
# Then roll out the remaining batches
ansible-playbook playbooks/patching.yml -e "target_hosts=batch1" -e "batch_size=10"
ansible-playbook playbooks/patching.yml -e "target_hosts=batch2" -e "batch_size=20"
# Or run everything at once; serial controls concurrency
ansible-playbook playbooks/patching.yml -e "target_hosts=all" -e "batch_size=30"

Verification
# Follow live logs
tail -f logs/ansible.log
# Summarise results
ansible -i inventory/hosts.ini all -m shell -a "cat /tmp/packages_diff_*.txt 2>/dev/null | head -20" --limit 'web-001'
# Check service status (example nginx)
ansible -i inventory/hosts.ini webservers -m shell -a "systemctl is-active nginx"
# Expected output: each host returns "active"

Complete Configuration Example
Unattended Scheduling Script
#!/bin/bash
# /opt/ansible-patching/run_patching.sh – unattended patch scheduler
set -e
WORK_DIR="/opt/ansible-patching"
LOG_DIR="${WORK_DIR}/logs"
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="${LOG_DIR}/patching_${DATE}.log"
cd ${WORK_DIR}
send_notification() {
local message="$1"
local webhook_url="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
curl -s -X POST ${webhook_url} \
-H 'Content-Type: application/json' \
-d "{\"msgtype\": \"text\", \"text\": {\"content\": \"${message}\"}}" > /dev/null 2>&1 || true
}
run_batch() {
local batch_name="$1"
local batch_size="$2"
echo "[$(date)] Starting batch: ${batch_name}" | tee -a ${LOG_FILE}
send_notification "[Patch] Starting batch: ${batch_name}"
if ansible-playbook playbooks/patching.yml -e "target_hosts=${batch_name}" -e "batch_size=${batch_size}" >> ${LOG_FILE} 2>&1; then
echo "[$(date)] Batch ${batch_name} succeeded" | tee -a ${LOG_FILE}
send_notification "[Patch] Batch ${batch_name} succeeded"
return 0
else
echo "[$(date)] Batch ${batch_name} failed" | tee -a ${LOG_FILE}
send_notification "[Patch] Batch ${batch_name} failed, aborting"
return 1
fi
}
main() {
send_notification "[Patch] Starting full run, expected 6‑8h"
if ! run_batch "canary" 1; then
send_notification "[Patch] Canary failed, aborting"
exit 1
fi
echo "[$(date)] Waiting 10 minutes to observe canary..." | tee -a ${LOG_FILE}
sleep 600
local batches=("batch1:10" "batch2:20" "batch3:30" "batch4:50" "batch5:50")
for batch_cfg in "${batches[@]}"; do
batch_name="${batch_cfg%%:*}"
batch_size="${batch_cfg##*:}"
if ! run_batch "${batch_name}" "${batch_size}"; then
send_notification "[Patch] Batch ${batch_name} failed, aborting"
exit 1
fi
echo "[$(date)] Waiting 5 minutes before next batch..." | tee -a ${LOG_FILE}
sleep 300
done
send_notification "[Patch] All batches completed – verify services"
}
main "$@"

Cron Scheduling
# /etc/cron.d/ansible-patching
# Run on the first Saturday of each month at 22:00.
# Caution: when both day-of-month and day-of-week are restricted, cron fires when
# EITHER matches, so "0 22 1-7 * 6" would also run on the 1st-7th of every month.
# Restrict the weekday inside the command instead ("%" must be escaped in crontabs):
0 22 1-7 * * root [ "$(date +\%u)" -eq 6 ] && /opt/ansible-patching/run_patching.sh

Real-World Cases
Emergency Vulnerability Fix
Scenario: A critical glibc 0‑day was disclosed; the security team required the patch to be applied across the entire fleet before the end of the day.
# playbooks/emergency_patch.yml
---
- name: Emergency patch
  hosts: "{{ target_hosts | default('all') }}"
  gather_facts: yes
  serial: 100  # high concurrency for emergencies
  tasks:
    - name: Install specific packages
      yum:
        name: "{{ emergency_packages }}"
        state: latest
      when: ansible_os_family == "RedHat"
    - name: Verify installed versions
      shell: "rpm -q {{ item }}"
      loop: "{{ emergency_packages }}"
      register: version_check
      when: ansible_os_family == "RedHat"
    - name: Show versions
      debug:
        msg: "{{ version_check.results | map(attribute='stdout') | list }}"
      when: ansible_os_family == "RedHat"

# Execute
ansible-playbook playbooks/emergency_patch.yml \
  -e "target_hosts=all" \
  -e '{"emergency_packages": ["glibc", "glibc-common"]}' \
  --forks 100

Result: All 500 servers patched in 45 minutes.
Kernel Upgrade with Rolling Reboot
Scenario: A kernel bug required an upgrade without service interruption.
# playbooks/kernel_upgrade.yml
---
- name: Kernel upgrade
  hosts: "{{ target_hosts }}"
  gather_facts: yes
  serial: 1  # one host at a time
  tasks:
    - name: Drain from load balancer
      uri:
        url: "http://{{ lb_api }}/api/v1/upstream/{{ inventory_hostname }}/down"
        method: POST
      delegate_to: localhost
      when: lb_api is defined
    - name: Wait for connections to drain
      wait_for:
        timeout: 60
    - name: Upgrade kernel
      yum:
        name: kernel
        state: latest
      register: kernel_update
    - name: Reboot if kernel changed
      reboot:
        reboot_timeout: 900
        msg: "Ansible kernel upgrade reboot"
      when: kernel_update.changed
    - name: Verify kernel version
      shell: uname -r
      register: kernel_version
    - name: Re-add to load balancer
      uri:
        url: "http://{{ lb_api }}/api/v1/upstream/{{ inventory_hostname }}/up"
        method: POST
      delegate_to: localhost
      when: lb_api is defined
    - name: Wait for service to recover
      wait_for:
        timeout: 30

Best Practices & Caveats
Performance Optimisation
Enable SSH pipelining to reduce connection overhead:
# ansible.cfg
[ssh_connection]
pipelining = True

Adjust forks based on control-node resources and network bandwidth (e.g., 50-100 for a 4-core node; lower it when patch packages are large).
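To turn that advice into a default, a wrapper can derive a starting forks value from the control node's CPU count and clamp it to a sane range (suggest_forks, the 12x multiplier, and the bounds are all illustrative choices, not Ansible recommendations):

```shell
#!/bin/bash
# suggest_forks [cores]: propose a forks value of ~12 per core, clamped to [20, 100].
# Hypothetical helper -- tune the multiplier and bounds for your own environment.
suggest_forks() {
    local cores="${1:-$(nproc)}"
    local forks=$(( cores * 12 ))
    if (( forks < 20 )); then forks=20; fi
    if (( forks > 100 )); then forks=100; fi
    echo "$forks"
}

suggest_forks 4   # 4-core control node -> prints 48
```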
Use Mitogen for reported 2-5× speed gains (check that the Mitogen release supports your Ansible version first):
pip install mitogen
# ansible.cfg additions
[defaults]
strategy_plugins = /path/to/mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear

Security Hardening
Use Ansible Vault to encrypt sensitive values such as webhook keys and sudo passwords:
# ansible-vault create group_vars/all/vault.yml
vault_webhook_key: "your-secret-key"
vault_ssh_password: "your-password"

Restrict sudo to only the required commands:
# /etc/sudoers.d/ops
ops ALL=(root) NOPASSWD: /usr/bin/yum, /usr/bin/apt, /usr/sbin/reboot, /usr/bin/systemctl

Audit logging – ensure log_path is enabled and logs are retained.
High‑Availability
Deploy multiple Ansible control nodes to avoid a single point of failure.
Consider AWX/Tower for UI, RBAC, and execution history.
Version‑control all playbooks in Git.
Common Pitfalls
SSH timeout: Verify the SSH daemon is running and network routes are open.
Sudo password errors: Use NOPASSWD or store credentials in Vault.
Yum lock conflicts: Ensure no other package manager is running; kill stray processes if needed.
Reboot failures: Increase reboot_timeout and check system logs.
Package exclusion mistakes: Review exclude_packages to avoid unintentionally skipping critical updates.
Monitoring
Key metrics to watch during a patch run:
Batch success rate – aim for 100 %, alert if < 95 %.
Per‑host execution time – 3‑10 min typical; > 30 min indicates a problem.
Control‑node CPU – keep below 50 %; alert > 80 %.
Network bandwidth – keep usage under 70 %; alert > 90 %.
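The batch success rate does not need a separate tool: it can be computed straight from the PLAY RECAP block that Ansible appends to each run's log. A small awk sketch (recap_success_rate is a hypothetical helper; the sample lines follow Ansible's default recap format):

```shell
#!/bin/bash
# recap_success_rate: read PLAY RECAP lines on stdin; hosts with failed=0 and
# unreachable=0 count as clean. Prints the success percentage with one decimal.
recap_success_rate() {
    awk '
        /unreachable=/ && /failed=/ {
            total++
            ok_host = 1
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(failed|unreachable)=/) {
                    split($i, kv, "=")
                    if (kv[2] + 0 > 0) ok_host = 0
                }
            }
            ok += ok_host
        }
        END { if (total > 0) printf "%.1f\n", ok * 100 / total }
    '
}

# Two clean hosts out of four -> prints "50.0".
recap_success_rate <<'EOF'
web-001 : ok=12 changed=3 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
web-002 : ok=12 changed=3 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
web-003 : ok=4  changed=1 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
web-004 : ok=0  changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
EOF
```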
# Example Prometheus rule for batch failures
# (assumes a custom exporter exposes ansible_batch_failed_hosts)
groups:
  - name: ansible-patching
    rules:
      - alert: PatchingBatchFailed
        expr: ansible_batch_failed_hosts > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Patch batch failure"
          description: "Batch {{ $labels.batch }} has {{ $value }} failed hosts"

Backup & Recovery
Before patching, snapshot VMs or take volume snapshots (vCenter, AWS, etc.) so you can roll back if a batch fails.
# Snapshot script (pseudo‑code)
#!/bin/bash
HOSTS_FILE=$1
DATE=$(date +%Y%m%d)
while read host; do
echo "Creating snapshot for ${host}..."
# VMware example
# govc vm.snapshot.create -vm="${host}" "pre-patching-${DATE}"
# AWS example
# aws ec2 create-snapshot --volume-id $(get_volume_id ${host}) --description "pre-patching-${DATE}"
done < ${HOSTS_FILE}

Recovery steps:
Stop affected services:
ansible -i inventory/hosts.ini target_host -m service -a "name=nginx state=stopped"
Restore the previously taken snapshot via your cloud/virtualisation platform.
Validate system health and data integrity.
Start services again:
ansible -i inventory/hosts.ini target_host -m service -a "name=nginx state=started"

Conclusion
The described Ansible‑based workflow turns a multi‑day, error‑prone manual patching process into a reliable, fully unattended nightly job for hundreds of servers. By leveraging idempotent modules, rolling updates, health checks, and detailed logging, teams gain predictability, faster remediation, and clear audit trails while retaining the flexibility to handle emergencies and kernel upgrades.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.