Mastering Enterprise CI/CD with Ansible: A Complete Hands‑On Guide
This guide shows how to build an enterprise‑grade CI/CD automation platform with Ansible, covering core principles, environment setup, dynamic inventory, modular playbooks, GitLab integration, blue‑green deployments, Vault security, custom module development, real‑world case studies, performance tuning, error handling, monitoring, and testing with Molecule.
Overview
Ansible provides an agentless, idempotent, and declarative automation framework that can be used to build enterprise‑grade CI/CD pipelines and infrastructure management solutions.
Core Principles
Agentless Architecture
# Ansible connects via SSH
ansible all -m ping -i inventory.ini
# No additional software required on target hosts

Idempotency
# Example: ensure Nginx is installed and started
- name: Ensure nginx is installed and started
  systemd:
    name: nginx
    state: started
    enabled: yes
# Re‑running the playbook yields the same state

Declarative YAML Syntax
# Simple playbook fragment
- hosts: webservers
  tasks:
    - name: Install nginx
      package:
        name: nginx
        state: present

Infrastructure Setup
Control‑Node Installation
# CentOS/RHEL
sudo yum install epel-release
sudo yum install ansible
# Ubuntu/Debian
sudo apt update
sudo apt install ansible
# Latest version via pip (the ansible package already pulls in ansible-core)
pip3 install ansible
# Verify
ansible --version

ansible.cfg Performance Tuning
[defaults]
forks = 50
host_key_checking = False
pipelining = True
gathering = smart
fact_caching = memory
fact_caching_timeout = 86400
log_path = /var/log/ansible.log

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null
control_path = /tmp/ansible-ssh-%%h-%%p-%%r

Dynamic Inventory (Python Example)
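Ansible runs an inventory script with `--list` and parses its stdout as JSON: top‑level group names map to host lists, and `_meta.hostvars` carries per‑host variables. The shape the script below emits can be sanity‑checked offline; this sketch uses made‑up CMDB records (hostnames and IPs are hypothetical):

```python
import json

# Hypothetical CMDB records, standing in for the HTTP response the script fetches
hosts_data = [
    {"hostname": "web01", "role": "webservers", "ip_address": "10.0.0.11"},
    {"hostname": "db01", "role": "databases", "ip_address": "10.0.0.21"},
]

# Build the JSON document Ansible expects from `--list`
inventory = {
    "_meta": {"hostvars": {}},
    "webservers": {"hosts": []},
    "databases": {"hosts": []},
}
for host in hosts_data:
    inventory[host["role"]]["hosts"].append(host["hostname"])
    inventory["_meta"]["hostvars"][host["hostname"]] = {
        "ansible_host": host["ip_address"],
    }

print(json.dumps(inventory, indent=2))
```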
#!/usr/bin/env python3
# inventory/dynamic_inventory.py
import json
from argparse import ArgumentParser

import requests

class DynamicInventory:
    def __init__(self):
        self.inventory = {}
        self.read_cli_args()
        if self.args.list:
            self.inventory = self.get_inventory()
        elif self.args.host:
            self.inventory = self.get_host_info(self.args.host)
        print(json.dumps(self.inventory))

    def get_inventory(self):
        try:
            response = requests.get('http://cmdb.company.com/api/hosts')
            hosts_data = response.json()
            inventory = {
                '_meta': {'hostvars': {}},
                'webservers': {'hosts': []},
                'databases': {'hosts': []},
                'loadbalancers': {'hosts': []},
            }
            for host in hosts_data:
                group = host['role']
                if group in inventory:
                    inventory[group]['hosts'].append(host['hostname'])
                    inventory['_meta']['hostvars'][host['hostname']] = {
                        'ansible_host': host['ip_address'],
                        'environment': host['environment'],
                        'datacenter': host['datacenter'],
                    }
            return inventory
        except Exception:
            # Fail safe: an empty inventory keeps ansible-playbook from crashing
            return {'_meta': {'hostvars': {}}}

    def get_host_info(self, hostname):
        # Per-host variables are already served via _meta in --list
        return {}

    def read_cli_args(self):
        parser = ArgumentParser()
        parser.add_argument('--list', action='store_true')
        parser.add_argument('--host', action='store')
        self.args = parser.parse_args()

if __name__ == '__main__':
    DynamicInventory()

Playbook Architecture
Directory Layout
ansible-infrastructure/
├── inventories/
│   ├── production/
│   │   ├── hosts.yml
│   │   └── group_vars/
│   ├── staging/
│   └── development/
├── roles/
│   ├── common/
│   ├── nginx/
│   ├── mysql/
│   └── monitoring/
├── playbooks/
│   ├── site.yml
│   ├── webservers.yml
│   └── databases.yml
├── group_vars/
├── host_vars/
└── ansible.cfg

Site Playbook (Orchestrates All Roles)
# playbooks/site.yml
---
- name: Common system configuration
  hosts: all
  become: yes
  roles:
    - common
    - security
    - monitoring-agent

- name: Web server configuration
  hosts: webservers
  become: yes
  roles:
    - nginx
    - php-fpm
    - ssl-certificates

- name: Database server configuration
  hosts: databases
  become: yes
  roles:
    - mysql
    - backup
    - performance-tuning

- name: Load balancer configuration
  hosts: loadbalancers
  become: yes
  roles:
    - haproxy
    - keepalived

NGINX Role – Tasks
# roles/nginx/tasks/main.yml
---
- name: Install nginx
  package:
    name: nginx
    state: present
  notify: restart nginx

- name: Create configuration directories
  file:
    path: "{{ item }}"
    state: directory
    owner: root
    group: root
    mode: '0755'
  loop:
    - /etc/nginx/sites-available
    - /etc/nginx/sites-enabled
    - /var/log/nginx

- name: Deploy main nginx.conf
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    backup: yes
  notify: reload nginx
  tags: config

- name: Deploy virtual hosts
  template:
    src: vhost.conf.j2
    dest: "/etc/nginx/sites-available/{{ item.name }}"
  loop: "{{ nginx_vhosts }}"
  notify: reload nginx
  tags: vhosts

- name: Enable virtual hosts
  file:
    src: "/etc/nginx/sites-available/{{ item.name }}"
    dest: "/etc/nginx/sites-enabled/{{ item.name }}"
    state: link
  loop: "{{ nginx_vhosts }}"
  when: item.enabled | default(true)
  notify: reload nginx

- name: Ensure nginx service is running
  systemd:
    name: nginx
    state: started
    enabled: yes

NGINX Role – Default Variables
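One note before the defaults: the tasks above notify `restart nginx` and `reload nginx`, which the role must define in a handlers file that is not shown in the original. A minimal sketch of `roles/nginx/handlers/main.yml` (assuming systemd-managed hosts):

```yaml
# roles/nginx/handlers/main.yml (sketch, not from the original article)
---
- name: restart nginx
  systemd:
    name: nginx
    state: restarted

- name: reload nginx
  systemd:
    name: nginx
    state: reloaded
```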
# roles/nginx/defaults/main.yml
---
nginx_user: www-data
nginx_worker_processes: auto
nginx_worker_connections: 1024
nginx_keepalive_timeout: 65
nginx_client_max_body_size: 64m

nginx_vhosts:
  - name: default
    listen: 80
    server_name: _
    root: /var/www/html
    index: "index.html index.htm"
    enabled: true

nginx_performance:
  sendfile: "on"
  tcp_nopush: "on"
  tcp_nodelay: "on"
  gzip: "on"
  gzip_vary: "on"
  gzip_comp_level: 6

CI/CD Integration
GitLab CI Pipeline
# .gitlab-ci.yml
stages:
  - validate
  - test
  - deploy-staging
  - deploy-production

variables:
  ANSIBLE_HOST_KEY_CHECKING: "False"
  ANSIBLE_FORCE_COLOR: "True"

validate-playbooks:
  stage: validate
  image: ansible/ansible-runner:latest
  script:
    - ansible-playbook --syntax-check playbooks/site.yml
    - ansible-lint playbooks/site.yml
  only:
    - merge_requests
    - master

test-roles:
  stage: test
  image: ansible/ansible-runner:latest
  script:
    - molecule test
  only:
    - merge_requests

deploy-staging:
  stage: deploy-staging
  image: ansible/ansible-runner:latest
  script:
    - ansible-playbook -i inventories/staging playbooks/site.yml --check --diff
    - ansible-playbook -i inventories/staging playbooks/site.yml
  only:
    - master

deploy-production:
  stage: deploy-production
  image: ansible/ansible-runner:latest
  script:
    - ansible-playbook -i inventories/production playbooks/site.yml --check --diff
    - ansible-playbook -i inventories/production playbooks/site.yml
  when: manual
  only:
    - master

Blue‑Green Deployment Playbook
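The playbook below alternates releases between /opt/app/blue and /opt/app/green. Its color‑flip expression can be mirrored in plain Python to make the invariant explicit (a sketch for illustration, not part of the playbook):

```python
def next_color(current: str) -> str:
    """Flip the active deployment color, mirroring the Jinja expression
    "{{ 'green' if current_color == 'blue' else 'blue' }}"."""
    return "green" if current == "blue" else "blue"

assert next_color("blue") == "green"
assert next_color("green") == "blue"
# Two flips return to the start, so the two install trees strictly alternate
assert next_color(next_color("blue")) == "blue"
```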
# playbooks/blue-green-deploy.yml
---
- name: Blue‑Green deployment
  hosts: webservers
  serial: "{{ batch_size | default(1) }}"
  vars:
    current_color: "{{ ansible_local.deployment.color | default('blue') }}"
    new_color: "{{ 'green' if current_color == 'blue' else 'blue' }}"
  tasks:
    - name: Determine deployment path
      set_fact:
        deploy_path: "/opt/app/{{ new_color }}"

    - name: Create new version directory
      file:
        path: "{{ deploy_path }}"
        state: directory

    - name: Deploy new package
      unarchive:
        src: "{{ app_package_url }}"
        dest: "{{ deploy_path }}"
        remote_src: yes

    - name: Render configuration
      template:
        src: app.conf.j2
        dest: "{{ deploy_path }}/config/app.conf"

    - name: Health check new version
      uri:
        url: "http://{{ ansible_host }}:{{ app_port }}/health"
        method: GET
        timeout: 30
      register: health_check
      retries: 5
      delay: 10
      until: health_check.status == 200

    - name: Update load‑balancer upstream
      template:
        src: nginx-upstream.j2
        dest: /etc/nginx/conf.d/upstream.conf
      delegate_to: "{{ groups['loadbalancers'] | first }}"
      notify: reload nginx

    - name: Record deployment state
      copy:
        content: |
          [deployment]
          color={{ new_color }}
          version={{ app_version }}
          timestamp={{ ansible_date_time.epoch }}
        dest: /etc/ansible/facts.d/deployment.fact

Advanced Features
Vault for Sensitive Data
# Create an encrypted vault file
ansible-vault create group_vars/production/vault.yml
# Edit the vault file
ansible-vault edit group_vars/production/vault.yml
# Encrypt an existing file
ansible-vault encrypt inventories/production/secrets.yml
# Use vault variables in a playbook
ansible-playbook -i inventories/production playbooks/site.yml --ask-vault-pass

Typical decrypted content (for illustration):
# group_vars/production/vault.yml (after decryption)
vault_mysql_root_password: "SuperSecretPassword123!"
vault_api_key: "sk-1234567890abcdef"
vault_ssl_private_key: |
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQC7...
-----END PRIVATE KEY-----

Custom Ansible Module (Service Health Check)
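Custom modules dropped into a playbook‑adjacent `library/` directory are callable like built‑ins. Once the module below is in place, a task might invoke it as follows (URL and parameter values are illustrative):

```yaml
- name: Wait for the app to report healthy
  service_check:
    url: "http://{{ ansible_host }}:8080/health"
    timeout: 10
    retries: 5
```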
#!/usr/bin/python3
# library/service_check.py
# Note: requires the `requests` package on the managed host
import time

import requests
from ansible.module_utils.basic import AnsibleModule

def check_service_health(url, timeout=30, retries=3, expected_status=200):
    """Check service health with retries"""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == expected_status:
                return True, f"Service is healthy (status: {response.status_code})"
        except requests.exceptions.RequestException as e:
            if attempt == retries - 1:
                return False, f"Service check failed: {e}"
            time.sleep(5)
    return False, "Service health check failed after all retries"

def main():
    module = AnsibleModule(
        argument_spec=dict(
            url=dict(type='str', required=True),
            timeout=dict(type='int', default=30),
            retries=dict(type='int', default=3),
            expected_status=dict(type='int', default=200),
        ),
        supports_check_mode=True,
    )
    url = module.params['url']
    timeout = module.params['timeout']
    retries = module.params['retries']
    expected_status = module.params['expected_status']
    if module.check_mode:
        module.exit_json(changed=False, msg="Check mode – would check service health")
    is_healthy, message = check_service_health(url, timeout, retries, expected_status)
    if is_healthy:
        module.exit_json(changed=False, msg=message, status="healthy")
    else:
        module.fail_json(msg=message, status="unhealthy")

if __name__ == '__main__':
    main()

Case Studies
Large‑Scale Internet Company
Background: 3,000+ servers across web, database, cache, and messaging layers required unified automation.
Solution Architecture:
Layered environment definition (production, staging, development) with region and security level metadata.
# environments configuration
environments:
  - name: production
    regions: [us-west-1, us-east-1, eu-west-1]
    security_level: high
  - name: staging
    regions: [us-west-1]
    security_level: medium
  - name: development
    regions: [us-west-1]
    security_level: low

Service discovery via a custom Consul inventory plugin.
# plugins/inventory/consul_inventory.py
import consul

class ConsulInventory:
    def __init__(self):
        self.consul = consul.Consul()
        self.inventory = {'_meta': {'hostvars': {}}}

    def get_inventory(self):
        services = self.consul.catalog.services()[1]
        for service_name in services:
            nodes = self.consul.catalog.service(service_name)[1]
            if service_name not in self.inventory:
                self.inventory[service_name] = {'hosts': []}
            for node in nodes:
                hostname = node['Node']
                self.inventory[service_name]['hosts'].append(hostname)
                self.inventory['_meta']['hostvars'][hostname] = {
                    'ansible_host': node['Address'],
                    'service_port': node['ServicePort'],
                    'datacenter': node['Datacenter'],
                }
        return self.inventory

Rolling‑update microservice deployment with pre‑ and post‑tasks for load‑balancer registration.
# playbooks/microservice-deploy.yml
---
- name: Microservice deployment
  hosts: "{{ service_name }}"
  serial: "{{ rolling_update_batch_size | default('25%') }}"
  max_fail_percentage: 10
  pre_tasks:
    - name: Remove node from LB
      uri:
        url: "http://{{ lb_host }}/api/v1/upstream/{{ service_name }}/remove"
        method: POST
        body_format: json
        body:
          server: "{{ ansible_host }}:{{ service_port }}"
      delegate_to: localhost
  tasks:
    - name: Stop old service
      systemd:
        name: "{{ service_name }}"
        state: stopped

    - name: Backup current version
      archive:
        path: "/opt/{{ service_name }}"
        dest: "/backup/{{ service_name }}-{{ ansible_date_time.epoch }}.tar.gz"

    - name: Deploy new version
      unarchive:
        src: "{{ artifact_url }}"
        dest: "/opt/{{ service_name }}"
        remote_src: yes
        owner: "{{ service_user }}"
        group: "{{ service_group }}"

    - name: Update configuration
      template:
        src: "{{ service_name }}.conf.j2"
        dest: "/opt/{{ service_name }}/config/app.conf"
      notify: "restart {{ service_name }}"

    - name: Start service
      systemd:
        name: "{{ service_name }}"
        state: started
        enabled: yes

    - name: Health check
      uri:
        url: "http://{{ ansible_host }}:{{ service_port }}/health"
      register: health_result
      retries: 10
      delay: 30
      until: health_result.status == 200
  post_tasks:
    - name: Re‑add node to LB
      uri:
        url: "http://{{ lb_host }}/api/v1/upstream/{{ service_name }}/add"
        method: POST
        body_format: json
        body:
          server: "{{ ansible_host }}:{{ service_port }}"
      delegate_to: localhost

Results:
Deployment time reduced from 2 hours to 15 minutes.
Success rate increased from 85 % to 99.5 %.
Operational labor cost cut by 60 %.
System availability rose to 99.99 %.
Financial Industry Compliance Automation
Background: A bank needed to meet PCI‑DSS, SOX, and related standards through automated checks and remediation.
Solution:
Security baseline enforcement (SSH hardening, firewall rules, disabling unnecessary services).
# roles/security-compliance/tasks/main.yml
---
- name: Enforce SSH configuration
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: "{{ item.regexp }}"
    line: "{{ item.line }}"
    state: present
  loop:
    - { regexp: '^Protocol', line: 'Protocol 2' }
    - { regexp: '^PermitRootLogin', line: 'PermitRootLogin no' }
    - { regexp: '^PasswordAuthentication', line: 'PasswordAuthentication no' }
    - { regexp: '^ClientAliveInterval', line: 'ClientAliveInterval 300' }
  notify: restart sshd

- name: Configure firewall services
  firewalld:
    service: "{{ item }}"
    permanent: yes
    state: enabled
    immediate: yes
  loop:
    - ssh
    - https

- name: Disable unnecessary services
  systemd:
    name: "{{ item }}"
    state: stopped
    enabled: no
  loop:
    - telnet
    - rsh
    - rlogin
  ignore_errors: yes

Compliance report generation (system facts, password policy, user accounts) rendered to HTML.
# playbooks/compliance-report.yml
---
- name: Generate compliance report
  hosts: all
  gather_facts: yes
  tasks:
    - name: Collect system information
      setup:
        gather_subset:
          - hardware
          - network
          - services

    - name: Check password policy
      shell: |
        grep -E '^PASS_MAX_DAYS|^PASS_MIN_DAYS|^PASS_WARN_AGE' /etc/login.defs
      register: password_policy

    - name: List regular user accounts
      shell: |
        awk -F: '($3 >= 1000) {print $1}' /etc/passwd
      register: user_accounts

    - name: Render compliance HTML report
      template:
        src: compliance-report.j2
        dest: "/tmp/compliance-report-{{ ansible_hostname }}.html"
      delegate_to: localhost

Results:
Compliance check time reduced from one week to two hours.
Remediation time decreased by 80 %.
Audit pass rate reached 100 %.
Significant reduction in compliance risk and potential fines.
Best Practices
Performance Optimization
# ansible.cfg (high‑concurrency)
[defaults]
forks = 100
host_key_checking = False
gathering = smart
fact_caching = memory
fact_caching_timeout = 86400
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
control_path_dir = /tmp/.ansible-cp
pipelining = True

Use asynchronous tasks for long‑running operations and poll for completion:
# Asynchronous backup example
- name: Run backup script asynchronously
  shell: /opt/backup/backup-database.sh
  async: 3600
  poll: 0
  register: backup_job

- name: Check backup status
  async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: backup_result
  until: backup_result.finished
  retries: 60
  delay: 60

Error Handling & Rollback
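Ansible's block/rescue/always sections behave like try/except/finally. The control flow of the rollback playbook below can be sketched in Python (task names are illustrative):

```python
def deploy(deploy_ok: bool) -> list:
    """Trace the block/rescue/always control flow of a rollback deployment."""
    trace = []
    try:
        trace.append("snapshot")      # block: snapshot current version
        if not deploy_ok:
            raise RuntimeError("deploy failed")
        trace.append("deploy")        # block: deploy new version + verify
    except RuntimeError:
        trace.append("rollback")      # rescue: restore snapshot, send alert
    finally:
        trace.append("cleanup")       # always: remove temporary files
    return trace

assert deploy(True) == ["snapshot", "deploy", "cleanup"]
assert deploy(False) == ["snapshot", "rollback", "cleanup"]
```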
# Deployment with rollback block
- name: Deploy application with rollback
  block:
    - name: Snapshot current version
      shell: cp -r /opt/app /opt/app.backup.{{ ansible_date_time.epoch }}

    - name: Deploy new version
      unarchive:
        src: "{{ app_package }}"
        dest: /opt/app
        remote_src: yes

    - name: Verify deployment
      uri:
        url: "http://localhost:8080/health"
        status_code: 200
      retries: 5
      delay: 10
  rescue:
    - name: Roll back to previous version
      shell: |
        rm -rf /opt/app
        mv /opt/app.backup.{{ ansible_date_time.epoch }} /opt/app
        systemctl restart app

    - name: Send alert email
      mail:
        to: [email protected]
        subject: "Deployment Failed on {{ inventory_hostname }}"
        body: "Deployment failed and rolled back automatically"
  always:
    - name: Clean temporary files
      file:
        path: "/tmp/deployment-{{ ansible_date_time.epoch }}"
        state: absent

Monitoring & Logging Integration
# roles/monitoring/tasks/main.yml
---
- name: Install node_exporter
  package:
    name: node_exporter
    state: present

- name: Deploy Prometheus service file
  template:
    src: node_exporter.service.j2
    dest: /etc/systemd/system/node_exporter.service
  notify: restart node_exporter

- name: Push deployment metrics to Prometheus Pushgateway
  uri:
    url: "{{ prometheus_pushgateway_url }}"
    method: POST
    body: |
      ansible_deployment_total{job="ansible",instance="{{ inventory_hostname }}"} 1
      ansible_deployment_timestamp{job="ansible",instance="{{ inventory_hostname }}"} {{ ansible_date_time.epoch }}

Test‑Driven Infrastructure with Molecule
# molecule/default/molecule.yml
---
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: instance
    image: centos:8
    pre_build_image: true
provisioner:
  name: ansible
  playbooks:
    converge: converge.yml
    verify: verify.yml
verifier:
  name: ansible

# molecule/default/verify.yml
---
- name: Verify nginx installation
  hosts: all
  tasks:
    - name: Check nginx package
      package:
        name: nginx
        state: present
      check_mode: yes
      register: nginx_installed

    - name: Verify nginx service
      systemd:
        name: nginx
        state: started
      check_mode: yes
      register: nginx_running

    - name: Test website response
      uri:
        url: http://localhost:80
        return_content: yes
      register: website_response

    - name: Assertions
      assert:
        that:
          - nginx_installed is not changed
          - nginx_running is not changed
          - website_response.status == 200

Conclusion
In the case studies above, Ansible‑based automation delivered roughly 5‑10× faster deployments, cut human error by more than 90 %, reduced operational costs by 50‑70 %, and pushed system availability above 99.9 %.
Future directions include deeper AIOps integration, enhanced cloud‑native support, expanded security automation, and edge‑computing management. Organizations should adopt standardized automation pipelines, invest in observability, prioritize security/compliance automation, and foster a DevOps culture to stay competitive in digital transformation.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
