
Mastering Enterprise CI/CD with Ansible: A Complete Hands‑On Guide

This comprehensive guide explains how to build an enterprise‑grade CI/CD automation platform with Ansible, covering its evolution, core principles, environment setup, dynamic inventory, modular playbooks, GitLab integration, blue‑green deployments, Vault security, custom module development, real‑world case studies, performance tuning, error handling, monitoring, and testing with Molecule.

Raymond Ops

Overview

Ansible provides an agentless, idempotent, and declarative automation framework that can be used to build enterprise‑grade CI/CD pipelines and infrastructure management solutions.

Core Principles

Agentless Architecture

# Ansible connects via SSH
ansible all -m ping -i inventory.ini
# No additional software required on target hosts

Idempotency

# Example: ensure nginx is started and enabled
- name: Ensure nginx is started and enabled
  systemd:
    name: nginx
    state: started
    enabled: yes
# Re‑running the playbook yields the same state
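The same convergence idea can be sketched in plain Python: an idempotent operation inspects current state and acts only when it differs from the desired state, so repeated runs are safe. The `ensure_line` helper below is purely illustrative, not part of Ansible.

```python
def ensure_line(lines, wanted):
    """Idempotent 'ensure present': append the line only if missing.

    Re-running converges to the same state and reports changed=False
    after the first run, mirroring how an Ansible task behaves.
    """
    if wanted in lines:
        return lines, False          # already converged, no change
    return lines + [wanted], True    # converge and report a change

config = ["PermitRootLogin no"]
config, changed1 = ensure_line(config, "PasswordAuthentication no")
config, changed2 = ensure_line(config, "PasswordAuthentication no")
print(changed1, changed2)  # → True False
```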

Declarative YAML Syntax

# Simple playbook fragment
- hosts: webservers
  tasks:
    - name: Install nginx
      package:
        name: nginx
        state: present

Infrastructure Setup

Control‑Node Installation

# CentOS/RHEL
sudo yum install epel-release
sudo yum install ansible

# Ubuntu/Debian
sudo apt update
sudo apt install ansible

# Latest version via pip
pip3 install ansible

# Verify
ansible --version

ansible.cfg Performance Tuning

[defaults]
forks = 50
host_key_checking = False  # disable only in trusted lab/CI environments
gathering = smart
fact_caching = memory
fact_caching_timeout = 86400
log_path = /var/log/ansible.log

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null
pipelining = True
control_path = /tmp/ansible-ssh-%%h-%%p-%%r

Dynamic Inventory (Python Example)

#!/usr/bin/env python3
# inventory/dynamic_inventory.py
import json, requests
from argparse import ArgumentParser

class DynamicInventory:
    def __init__(self):
        self.inventory = {}
        self.read_cli_args()
        if self.args.list:
            self.inventory = self.get_inventory()
        elif self.args.host:
            self.inventory = self.get_host_info(self.args.host)
        print(json.dumps(self.inventory))

    def get_inventory(self):
        try:
            response = requests.get('http://cmdb.company.com/api/hosts', timeout=10)
            response.raise_for_status()
            hosts_data = response.json()
            inventory = {
                '_meta': {'hostvars': {}},
                'webservers': {'hosts': []},
                'databases': {'hosts': []},
                'loadbalancers': {'hosts': []}
            }
            for host in hosts_data:
                group = host['role']
                if group in inventory:
                    inventory[group]['hosts'].append(host['hostname'])
                    inventory['_meta']['hostvars'][host['hostname']] = {
                        'ansible_host': host['ip_address'],
                        'environment': host['environment'],
                        'datacenter': host['datacenter']
                    }
            return inventory
        except Exception:
            return {'_meta': {'hostvars': {}}}

    def get_host_info(self, hostname):
        return {}

    def read_cli_args(self):
        parser = ArgumentParser()
        parser.add_argument('--list', action='store_true')
        parser.add_argument('--host', action='store')
        self.args = parser.parse_args()

if __name__ == '__main__':
    DynamicInventory()
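Ansible invokes the script with `--list` and expects JSON in a fixed shape: one key per group containing a `hosts` list, plus a `_meta.hostvars` map. A minimal sketch of validating that contract before printing (the `validate_inventory` helper and hostnames are illustrative):

```python
import json

# Minimal example of the JSON shape Ansible expects from `--list`
inventory = {
    "_meta": {"hostvars": {"web01": {"ansible_host": "10.0.0.11"}}},
    "webservers": {"hosts": ["web01"]},
}

def validate_inventory(inv):
    """Sanity-check the dynamic-inventory contract before emitting it."""
    assert "_meta" in inv and "hostvars" in inv["_meta"]
    for group, data in inv.items():
        if group == "_meta":
            continue
        assert isinstance(data.get("hosts", []), list)
    return json.dumps(inv)

output = validate_inventory(inventory)
print(output)
```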

Playbook Architecture

Directory Layout

ansible-infrastructure/
├── inventories/
│   ├── production/
│   │   ├── hosts.yml
│   │   └── group_vars/
│   ├── staging/
│   └── development/
├── roles/
│   ├── common/
│   ├── nginx/
│   ├── mysql/
│   └── monitoring/
├── playbooks/
│   ├── site.yml
│   ├── webservers.yml
│   └── databases.yml
├── group_vars/
├── host_vars/
└── ansible.cfg

Site Playbook (Orchestrates All Roles)

# playbooks/site.yml
---
- name: Common system configuration
  hosts: all
  become: yes
  roles:
    - common
    - security
    - monitoring-agent

- name: Web server configuration
  hosts: webservers
  become: yes
  roles:
    - nginx
    - php-fpm
    - ssl-certificates

- name: Database server configuration
  hosts: databases
  become: yes
  roles:
    - mysql
    - backup
    - performance-tuning

- name: Load balancer configuration
  hosts: loadbalancers
  become: yes
  roles:
    - haproxy
    - keepalived

NGINX Role – Tasks

# roles/nginx/tasks/main.yml
---
- name: Install nginx
  package:
    name: nginx
    state: present
  notify: restart nginx

- name: Create configuration directories
  file:
    path: "{{ item }}"
    state: directory
    owner: root
    group: root
    mode: '0755'
  loop:
    - /etc/nginx/sites-available
    - /etc/nginx/sites-enabled
    - /var/log/nginx

- name: Deploy main nginx.conf
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    backup: yes
  notify: reload nginx
  tags: config

- name: Deploy virtual hosts
  template:
    src: vhost.conf.j2
    dest: "/etc/nginx/sites-available/{{ item.name }}"
  loop: "{{ nginx_vhosts }}"
  notify: reload nginx
  tags: vhosts

- name: Enable virtual hosts
  file:
    src: "/etc/nginx/sites-available/{{ item.name }}"
    dest: "/etc/nginx/sites-enabled/{{ item.name }}"
    state: link
  loop: "{{ nginx_vhosts }}"
  when: item.enabled | default(true)
  notify: reload nginx

- name: Ensure nginx service is running
  systemd:
    name: nginx
    state: started
    enabled: yes

NGINX Role – Default Variables

# roles/nginx/defaults/main.yml
---
nginx_user: www-data
nginx_worker_processes: auto
nginx_worker_connections: 1024
nginx_keepalive_timeout: 65
nginx_client_max_body_size: 64m

nginx_vhosts:
  - name: default
    listen: 80
    server_name: _
    root: /var/www/html
    index: "index.html index.htm"
    enabled: true

nginx_performance:
  sendfile: "on"
  tcp_nopush: "on"
  tcp_nodelay: "on"
  gzip: "on"
  gzip_vary: "on"
  gzip_comp_level: 6

CI/CD Integration

GitLab CI Pipeline

# .gitlab-ci.yml
stages:
  - validate
  - test
  - deploy-staging
  - deploy-production

variables:
  ANSIBLE_HOST_KEY_CHECKING: "False"
  ANSIBLE_FORCE_COLOR: "True"

validate-playbooks:
  stage: validate
  image: ansible/ansible-runner:latest
  script:
    - ansible-playbook --syntax-check playbooks/site.yml
    - ansible-lint playbooks/site.yml
  only:
    - merge_requests
    - master

test-roles:
  stage: test
  image: ansible/ansible-runner:latest
  script:
    - molecule test
  only:
    - merge_requests

deploy-staging:
  stage: deploy-staging
  image: ansible/ansible-runner:latest
  script:
    - ansible-playbook -i inventories/staging playbooks/site.yml --check --diff
    - ansible-playbook -i inventories/staging playbooks/site.yml
  only:
    - master

deploy-production:
  stage: deploy-production
  image: ansible/ansible-runner:latest
  script:
    - ansible-playbook -i inventories/production playbooks/site.yml --check --diff
    - ansible-playbook -i inventories/production playbooks/site.yml
  when: manual
  only:
    - master

Blue‑Green Deployment Playbook

# playbooks/blue-green-deploy.yml
---
- name: Blue‑Green deployment
  hosts: webservers
  serial: "{{ batch_size | default(1) }}"
  vars:
    current_color: "{{ ansible_local.deployment.deployment.color | default('blue') }}"
    new_color: "{{ 'green' if current_color == 'blue' else 'blue' }}"
  tasks:
    - name: Determine deployment path
      set_fact:
        deploy_path: "/opt/app/{{ new_color }}"

    - name: Create new version directory
      file:
        path: "{{ deploy_path }}"
        state: directory

    - name: Deploy new package
      unarchive:
        src: "{{ app_package_url }}"
        dest: "{{ deploy_path }}"
        remote_src: yes

    - name: Render configuration
      template:
        src: app.conf.j2
        dest: "{{ deploy_path }}/config/app.conf"

    - name: Health check new version
      uri:
        url: "http://{{ ansible_host }}:{{ app_port }}/health"
        method: GET
        timeout: 30
      register: health_check
      until: health_check.status == 200
      retries: 5
      delay: 10

    - name: Update load‑balancer upstream
      template:
        src: nginx-upstream.j2
        dest: /etc/nginx/conf.d/upstream.conf
      delegate_to: "{{ item }}"
      loop: "{{ groups['loadbalancers'] }}"
      notify: reload nginx

    - name: Ensure custom facts directory exists
      file:
        path: /etc/ansible/facts.d
        state: directory
        mode: '0755'

    - name: Record deployment state
      copy:
        content: |
          [deployment]
          color={{ new_color }}
          version={{ app_version }}
          timestamp={{ ansible_date_time.epoch }}
        dest: /etc/ansible/facts.d/deployment.fact
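The color‑flip logic the playbook expresses in Jinja2 can be sketched in plain Python to make the alternation explicit (hypothetical helpers, for illustration only):

```python
def next_color(current):
    """Flip the active deployment slot, mirroring the playbook's
    new_color expression: blue -> green, anything else -> blue."""
    return "green" if current == "blue" else "blue"

def deploy_path(color, base="/opt/app"):
    """Directory the new release is unpacked into before traffic switches."""
    return f"{base}/{color}"

print(next_color("blue"), deploy_path(next_color("blue")))  # → green /opt/app/green
```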

Advanced Features

Vault for Sensitive Data

# Create an encrypted vault file
ansible-vault create group_vars/production/vault.yml

# Edit the vault file
ansible-vault edit group_vars/production/vault.yml

# Encrypt an existing file
ansible-vault encrypt inventories/production/secrets.yml

# Use vault variables in a playbook
ansible-playbook -i inventories/production playbooks/site.yml --ask-vault-pass

Typical decrypted content (for illustration):

# group_vars/production/vault.yml (after decryption)
vault_mysql_root_password: "SuperSecretPassword123!"
vault_api_key: "sk-1234567890abcdef"
vault_ssl_private_key: |
  -----BEGIN PRIVATE KEY-----
  MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQC7...
  -----END PRIVATE KEY-----
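A common convention is to keep only `vault_`‑prefixed secrets in the encrypted file and map them to ordinary variable names in an unencrypted vars file, so roles and playbooks never reference vault variables directly. A sketch of that indirection (file path and variable names follow the examples above):

```yaml
# group_vars/production/vars.yml (unencrypted, committed)
mysql_root_password: "{{ vault_mysql_root_password }}"
api_key: "{{ vault_api_key }}"
ssl_private_key: "{{ vault_ssl_private_key }}"
```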

Custom Ansible Module (Service Health Check)

# library/service_check.py
#!/usr/bin/python3
from ansible.module_utils.basic import AnsibleModule
import requests, time

def check_service_health(url, timeout=30, retries=3, expected_status=200):
    """Check service health with retries"""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == expected_status:
                return True, f"Service is healthy (status: {response.status_code})"
        except requests.exceptions.RequestException as e:
            if attempt == retries - 1:
                return False, f"Service check failed: {e}"
        if attempt < retries - 1:
            time.sleep(5)
    return False, "Service health check failed after all retries"

def main():
    module = AnsibleModule(
        argument_spec=dict(
            url=dict(type='str', required=True),
            timeout=dict(type='int', default=30),
            retries=dict(type='int', default=3),
            expected_status=dict(type='int', default=200)
        ),
        supports_check_mode=True
    )
    url = module.params['url']
    timeout = module.params['timeout']
    retries = module.params['retries']
    expected_status = module.params['expected_status']
    if module.check_mode:
        module.exit_json(changed=False, msg="Check mode – would check service health")
    is_healthy, message = check_service_health(url, timeout, retries, expected_status)
    if is_healthy:
        module.exit_json(changed=False, msg=message, status="healthy")
    else:
        module.fail_json(msg=message, status="unhealthy")

if __name__ == '__main__':
    main()
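With the file dropped into a `library/` directory next to the playbook, the module can be called like any built‑in. A sketch of a task using it (URL and values are illustrative):

```yaml
- name: Verify the application endpoint
  service_check:
    url: "http://{{ ansible_host }}:8080/health"
    timeout: 15
    retries: 5
  register: health
```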

Case Studies

Large‑Scale Internet Company

Background: 3,000+ servers across web, database, cache, and messaging layers required unified automation.

Solution Architecture:

Layered environment definition (production, staging, development) with region and security level metadata.

# environments configuration
environments:
  - name: production
    regions: [us-west-1, us-east-1, eu-west-1]
    security_level: high
  - name: staging
    regions: [us-west-1]
    security_level: medium
  - name: development
    regions: [us-west-1]
    security_level: low

Service discovery via a custom Consul inventory plugin.

# plugins/inventory/consul_inventory.py
import consul  # python-consul client

class ConsulInventory:
    def __init__(self):
        self.consul = consul.Consul()
        self.inventory = {'_meta': {'hostvars': {}}}
    def get_inventory(self):
        services = self.consul.catalog.services()[1]
        for service_name in services:
            nodes = self.consul.catalog.service(service_name)[1]
            if service_name not in self.inventory:
                self.inventory[service_name] = {'hosts': []}
            for node in nodes:
                hostname = node['Node']
                self.inventory[service_name]['hosts'].append(hostname)
                self.inventory['_meta']['hostvars'][hostname] = {
                    'ansible_host': node['Address'],
                    'service_port': node['ServicePort'],
                    'datacenter': node['Datacenter']
                }
        return self.inventory

Rolling‑update microservice deployment with pre‑ and post‑tasks for load‑balancer registration.

# playbooks/microservice-deploy.yml
---
- name: Microservice deployment
  hosts: "{{ service_name }}"
  serial: "{{ rolling_update_batch_size | default('25%') }}"
  max_fail_percentage: 10
  pre_tasks:
    - name: Remove node from LB
      uri:
        url: "http://{{ lb_host }}/api/v1/upstream/{{ service_name }}/remove"
        method: POST
        body_format: json
        body:
          server: "{{ ansible_host }}:{{ service_port }}"
      delegate_to: localhost
  tasks:
    - name: Stop old service
      systemd:
        name: "{{ service_name }}"
        state: stopped
    - name: Backup current version
      archive:
        path: "/opt/{{ service_name }}"
        dest: "/backup/{{ service_name }}-{{ ansible_date_time.epoch }}.tar.gz"
    - name: Deploy new version
      unarchive:
        src: "{{ artifact_url }}"
        dest: "/opt/{{ service_name }}"
        remote_src: yes
        owner: "{{ service_user }}"
        group: "{{ service_group }}"
    - name: Update configuration
      template:
        src: "{{ service_name }}.conf.j2"
        dest: "/opt/{{ service_name }}/config/app.conf"
      notify: restart {{ service_name }}
    - name: Start service
      systemd:
        name: "{{ service_name }}"
        state: started
        enabled: yes
    - name: Health check
      uri:
        url: "http://{{ ansible_host }}:{{ service_port }}/health"
      register: health_result
      until: health_result.status == 200
      retries: 10
      delay: 30
  post_tasks:
    - name: Re‑add node to LB
      uri:
        url: "http://{{ lb_host }}/api/v1/upstream/{{ service_name }}/add"
        method: POST
        body_format: json
        body:
          server: "{{ ansible_host }}:{{ service_port }}"
      delegate_to: localhost

Results:

Deployment time reduced from 2 hours to 15 minutes.

Success rate increased from 85% to 99.5%.

Operational labor cost cut by 60%.

System availability rose to 99.99%.

Financial Industry Compliance Automation

Background: A bank needed to meet PCI‑DSS, SOX, and related standards through automated checks and remediation.

Solution:

Security baseline enforcement (SSH hardening, firewall rules, disabling unnecessary services).

# roles/security-compliance/tasks/main.yml
---
- name: Enforce SSH configuration
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: "{{ item.regexp }}"
    line: "{{ item.line }}"
    state: present
  loop:
    - { regexp: '^Protocol', line: 'Protocol 2' }
    - { regexp: '^PermitRootLogin', line: 'PermitRootLogin no' }
    - { regexp: '^PasswordAuthentication', line: 'PasswordAuthentication no' }
    - { regexp: '^ClientAliveInterval', line: 'ClientAliveInterval 300' }
  notify: restart sshd

- name: Configure firewall services
  firewalld:
    service: "{{ item }}"
    permanent: yes
    state: enabled
    immediate: yes
  loop:
    - ssh
    - https

- name: Disable unnecessary services
  systemd:
    name: "{{ item }}"
    state: stopped
    enabled: no
  loop:
    - telnet
    - rsh
    - rlogin
  ignore_errors: yes

Compliance report generation (system facts, password policy, user accounts) rendered to HTML.

# playbooks/compliance-report.yml
---
- name: Generate compliance report
  hosts: all
  gather_facts: yes
  tasks:
    - name: Collect system information
      setup:
        gather_subset:
          - hardware
          - network
          - services
    - name: Check password policy
      shell: |
        grep -E '^PASS_MAX_DAYS|^PASS_MIN_DAYS|^PASS_WARN_AGE' /etc/login.defs
      register: password_policy
    - name: List regular user accounts
      shell: |
        awk -F: '($3 >= 1000) {print $1}' /etc/passwd
      register: user_accounts
    - name: Render compliance HTML report
      template:
        src: compliance-report.j2
        dest: "/tmp/compliance-report-{{ ansible_hostname }}.html"
      delegate_to: localhost
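Before rendering the report template, the registered `password_policy` output can be turned into structured data. A minimal parser sketch for the `grep` output of /etc/login.defs (the helper name is illustrative):

```python
def parse_login_defs(text):
    """Parse grep output of /etc/login.defs password-policy lines
    into a dict, e.g. {'PASS_MAX_DAYS': 90, ...}."""
    policy = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("PASS_"):
            policy[parts[0]] = int(parts[1])
    return policy

sample = "PASS_MAX_DAYS 90\nPASS_MIN_DAYS 7\nPASS_WARN_AGE 14"
print(parse_login_defs(sample))
```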

Results:

Compliance check time reduced from one week to two hours.

Remediation time decreased by 80%.

Audit pass rate reached 100%.

Significant reduction in compliance risk and potential fines.

Best Practices

Performance Optimization

# ansible.cfg (high‑concurrency)
[defaults]
forks = 100
host_key_checking = False
gathering = smart
fact_caching = memory
fact_caching_timeout = 86400

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
control_path_dir = /tmp/.ansible-cp
pipelining = True

Use asynchronous tasks for long‑running operations and poll for completion:

# Asynchronous backup example
- name: Run backup script asynchronously
  shell: /opt/backup/backup-database.sh
  async: 3600
  poll: 0
  register: backup_job

- name: Check backup status
  async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: backup_result
  until: backup_result.finished
  retries: 60
  delay: 60

Error Handling & Rollback

# Deployment with rollback block
- name: Deploy application with rollback
  block:
    - name: Snapshot current version
      shell: cp -r /opt/app /opt/app.backup.{{ ansible_date_time.epoch }}
    - name: Deploy new version
      unarchive:
        src: "{{ app_package }}"
        dest: /opt/app
        remote_src: yes
    - name: Verify deployment
      uri:
        url: "http://localhost:8080/health"
        status_code: 200
      register: deploy_health
      until: deploy_health.status == 200
      retries: 5
      delay: 10
  rescue:
    - name: Roll back to previous version
      shell: |
        rm -rf /opt/app
        mv /opt/app.backup.{{ ansible_date_time.epoch }} /opt/app
        systemctl restart app
    - name: Send alert email
      mail:
        to: [email protected]
        subject: "Deployment Failed on {{ inventory_hostname }}"
        body: "Deployment failed and rolled back automatically"
  always:
    - name: Clean temporary files
      file:
        path: "/tmp/deployment-{{ ansible_date_time.epoch }}"
        state: absent

Monitoring & Logging Integration

# roles/monitoring/tasks/main.yml
---
- name: Install node_exporter
  package:
    name: node_exporter
    state: present

- name: Deploy Prometheus service file
  template:
    src: node_exporter.service.j2
    dest: /etc/systemd/system/node_exporter.service
  notify: restart node_exporter

- name: Push deployment metrics to Prometheus Pushgateway
  uri:
    url: "{{ prometheus_pushgateway_url }}"
    method: POST
    body: |
      ansible_deployment_total{job="ansible",instance="{{ inventory_hostname }}"} 1
      ansible_deployment_timestamp{job="ansible",instance="{{ inventory_hostname }}"} {{ ansible_date_time.epoch }}
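The body pushed to the gateway is plain Prometheus exposition text. Building it programmatically keeps label formatting in one place; a sketch using the same metric names as the task above:

```python
def deployment_metrics(instance, epoch):
    """Render the two deployment metrics in Prometheus exposition
    format, matching the body of the uri task above."""
    labels = f'{{job="ansible",instance="{instance}"}}'
    return (
        f"ansible_deployment_total{labels} 1\n"
        f"ansible_deployment_timestamp{labels} {epoch}\n"
    )

body = deployment_metrics("web01", 1700000000)
print(body)
```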

Test‑Driven Infrastructure with Molecule

# molecule/default/molecule.yml
---
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: instance
    image: centos:8
    pre_build_image: true
provisioner:
  name: ansible
  playbooks:
    converge: converge.yml
    verify: verify.yml
verifier:
  name: ansible

# molecule/default/verify.yml
---
- name: Verify nginx installation
  hosts: all
  tasks:
    - name: Check nginx package
      package:
        name: nginx
        state: present
      check_mode: yes
      register: nginx_installed
    - name: Verify nginx service
      systemd:
        name: nginx
        state: started
      check_mode: yes
      register: nginx_running
    - name: Test website response
      uri:
        url: http://localhost:80
        return_content: yes
      register: website_response
    - name: Assertions
      assert:
        that:
          - nginx_installed is not changed
          - nginx_running is not changed
          - website_response.status == 200

Conclusion

In the cases described above, Ansible‑driven automation delivered 5‑10× faster deployments, reduced human error by more than 90%, cut operational costs by 50‑70%, and raised system availability above 99.9%.

Future directions include deeper AIOps integration, enhanced cloud‑native support, expanded security automation, and edge‑computing management. Organizations should adopt standardized automation pipelines, invest in observability, prioritize security/compliance automation, and foster a DevOps culture to stay competitive in digital transformation.

Tags: CI/CD, Configuration Management, Infrastructure as Code, Ansible, Vault, Custom Modules
Written by Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.