Mastering Ansible: From Core Concepts to Custom Modules and Plugins
Explore a comprehensive guide to advancing your Ansible skills, covering inventory formats, playbook structures, variable precedence, Jinja2 templating, custom lookup and callback plugins, dynamic inventory scripting, asynchronous execution, performance tuning with Mitogen, and enterprise‑grade platform design with AWX/Tower.
1. Overview: From Configuration Management to Automation Platform
Ansible remains a core tool for operations automation in 2026. Unlike Terraform's declarative infrastructure provisioning, Ansible focuses on operational automation—applying configuration changes, deploying applications, and executing batch operations on existing infrastructure.
Learning Ansible typically follows three stages: (1) mastering basic syntax to write Playbooks for common tasks; (2) understanding advanced features such as variables, templates, conditional execution, and loops to create reusable roles; (3) developing custom modules and plugins to extend Ansible to any automation scenario. This article concentrates on stages two and three.
2. Core Ansible Concepts Review: Inventory, Playbook, Role
2.1 Inventory Formats
Inventory defines the list of managed hosts and can be a static file (INI or YAML) or a dynamic script.
# hosts.ini - INI static inventory
[webservers]
web01.example.com ansible_host=10.112.0.11 ansible_port=22
web02.example.com ansible_host=10.112.0.12 ansible_port=22
web03.example.com ansible_host=10.112.0.13
[webservers:vars]
ansible_user=deploy
ansible_ssh_private_key_file=/home/deploy/.ssh/id_rsa
http_port=8080
[dbservers]
db01.example.com ansible_host=10.112.0.21
db02.example.com ansible_host=10.112.0.22
[production:children]
webservers
dbservers
[production:vars]
ansible_connection=ssh
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
# hosts.yaml - YAML inventory (supported since Ansible 2.4)
all:
  children:
    webservers:
      hosts:
        web01.example.com:
          ansible_host: 10.112.0.11
          ansible_port: 22
        web02.example.com:
          ansible_host: 10.112.0.12
          ansible_port: 22
      vars:
        http_port: 8080
    dbservers:
      hosts:
        db01.example.com:
          ansible_host: 10.112.0.21
        db02.example.com:
          ansible_host: 10.112.0.22
      vars:
        ansible_user: deploy
2.2 Playbook Basic Structure
---
- name: Deploy Nginx front-end service
  hosts: webservers
  become: yes
  gather_facts: yes
  vars:
    nginx_version: "1.26.2"
    nginx_port: 8080
  pre_tasks:
    # Ensure EPEL repository exists
    - name: Ensure EPEL repo
      ansible.builtin.yum:
        name: epel-release
        state: present
      when: ansible_facts['os_family'] == "RedHat"
  roles:
    - nginx
  tasks:
    - name: Deploy Nginx upstream configuration
      ansible.builtin.template:
        src: upstream.conf.j2
        dest: /etc/nginx/conf.d/upstream.conf
        mode: '0644'
      notify: restart nginx
    - name: Verify Nginx configuration syntax
      ansible.builtin.command:
        cmd: nginx -t
      changed_when: false   # read-only check; never reports "changed"
  handlers:
    - name: restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
        enabled: yes
2.3 Role Directory Structure
A Role is Ansible's unit of code reuse. A standard Role layout looks like:
roles/
  nginx/
    defaults/        # default variables (lowest priority)
      main.yml
    files/           # static files referenced directly
      nginx.conf
    handlers/        # handlers for notifications
      main.yml
    meta/            # role dependencies and metadata
      main.yml
    tasks/           # list of tasks
      main.yml
    templates/       # Jinja2 templates
      nginx.conf.j2
      upstream.conf.j2
    vars/            # high-priority variables
      main.yml
3. Deep Dive into Variables, Templates, and Conditional Execution
3.1 Variable Precedence and Scope
Variable precedence in Ansible (a simplified subset of the full 22-level list, from low to high) is:
1. command-line values (for example, -u REMOTE_USER)
2. role defaults (roles/role_name/defaults/main.yml)
3. inventory group variables (group_vars/all first, then more specific groups)
4. inventory host variables (host_vars/*)
5. host facts (gathered)
6. play vars / vars_prompt / vars_files
7. role vars (roles/role_name/vars/main.yml)
8. block vars and task vars
9. include_vars
10. set_fact / registered vars
11. role params and include params
12. extra vars (ansible-playbook -e key=value) – always highest
The key principle is that when the same variable name appears in multiple scopes, the value from the higher-priority scope wins. A common pitfall is defining a variable in the inventory and having it silently overridden by a vars block inside a Playbook.
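The layering behaves like a chain of dictionaries in which higher-priority layers shadow lower ones. A rough illustration in Python (the layer names are illustrative, not Ansible internals):

```python
from collections import ChainMap

# ChainMap searches its maps left to right, so the highest-priority layer goes first.
role_defaults = {"http_port": 80, "workers": 2}
inventory_vars = {"http_port": 8080}
play_vars = {"workers": 4}
extra_vars = {"http_port": 9090}  # ansible-playbook -e http_port=9090

resolved = ChainMap(extra_vars, play_vars, inventory_vars, role_defaults)
print(resolved["http_port"])  # 9090 -- extra vars always win
print(resolved["workers"])    # 4    -- play vars shadow role defaults
```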
3.2 Advanced Jinja2 Usage
Jinja2 is the templating engine used throughout Ansible, most visibly in the template module. Mastering filters, loops, conditionals, and expressions is essential for high-quality Playbooks.
{# nginx.conf.j2 #}
{% for upstream in upstream_servers %}
upstream {{ upstream.name }} {
    server {{ upstream.host }}:{{ upstream.port }};
}
{% endfor %}

server {
    listen {{ nginx_port }}{% if ssl_enabled %} ssl{% endif %};
    server_name {{ server_name | default('www.example.com') }};
    access_log /var/log/nginx/{{ ansible_hostname }}-access.log {{ log_format | default('combined') }};

    location / {
        root {{ document_root | default('/usr/share/nginx/html') }};
        index {{ default_index | join(' ') }};
    }

    {% if custom_error_pages is defined %}
    error_page 500 502 503 504 /50x.html;
    location = /50x.html { root {{ custom_error_pages }}; }
    {% endif %}
}

{# Filter a list of backends down to the enabled ones #}
{% set filtered_backends = backends | selectattr('enabled', 'equalto', true) | list %}
{# Prefix, sort, and emit a derived server_name directive #}
server_name {% for name in server_name_list | map('regex_replace', '^', 'www.') | sort %}{{ name }} {% endfor %};
Common Jinja2 filters include default(), map(), selectattr(), regex_replace(), ternary(), and json_query().
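For readers more fluent in Python than Jinja2, the two filter chains in the template above have direct pure-Python equivalents (illustrative only; inside a template you write the filter chain, not Python):

```python
import re

backends = [
    {"name": "app01", "enabled": True},
    {"name": "app02", "enabled": False},
    {"name": "app03", "enabled": True},
]

# selectattr('enabled', 'equalto', true) | list
filtered = [b for b in backends if b["enabled"]]

# map('regex_replace', '^', 'www.') | sort
names = ["example.com", "api.example.com"]
prefixed = sorted(re.sub(r"^", "www.", n) for n in names)

print([b["name"] for b in filtered])  # ['app01', 'app03']
print(prefixed)                       # ['www.api.example.com', 'www.example.com']
```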
4. Lookup Plugins and External Data Integration
4.1 Lookup Plugin Overview
Lookup plugins fetch data from external sources and inject it into a Playbook. Unlike facts, lookups can be used in any context.
tasks:
  # Read SSH public key from a file
  - name: Read SSH public key
    ansible.builtin.set_fact:
      ssh_key: "{{ lookup('ansible.builtin.file', '/home/deploy/.ssh/id_rsa.pub') }}"

  # Read AWS credentials from environment variables
  - name: Read AWS credentials
    ansible.builtin.debug:
      msg: "{{ lookup('ansible.builtin.env', 'AWS_ACCESS_KEY_ID') }}"

  # Retrieve database password from AWS Secrets Manager
  - name: Get DB password
    ansible.builtin.set_fact:
      db_password: "{{ lookup('amazon.aws.aws_secret', 'prod/db/password', region='cn-north-1') }}"

  # Retrieve secret from HashiCorp Vault
  - name: Get Vault key
    ansible.builtin.set_fact:
      api_key: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=prod/api/key') }}"

  # Read a property from an INI file
  - name: Read JDBC URL from properties file
    ansible.builtin.set_fact:
      jdbc_url: "{{ lookup('ansible.builtin.ini', 'jdbc.url', file='db.properties', section='database') }}"
4.2 Custom Lookup Plugin
A custom lookup can connect to any data source. Example: a CMDB lookup.
# lookup_plugins/cmdb.py
from ansible.plugins.lookup import LookupBase
from ansible.errors import AnsibleError
import requests


class LookupModule(LookupBase):
    def run(self, terms, variables=None, **kwargs):
        """Fetch server information from an internal CMDB system.
        Usage: {{ lookup('cmdb', 'web', url='https://cmdb.internal', env='production') }}"""
        cmdb_url = kwargs.get('url')
        env = kwargs.get('env')
        role = terms[0] if terms else None
        if not cmdb_url:
            raise AnsibleError("cmdb lookup requires a 'url' option")
        try:
            params = {'env': env, 'role': role}
            response = requests.get(f"{cmdb_url}/api/v1/servers", params=params, timeout=10)
            response.raise_for_status()
            return [response.json()]
        except requests.RequestException as e:
            raise AnsibleError(f"CMDB query failed: {e}")
5. Custom Module Development: From Python Script to Ansible Module
5.1 Why Create Custom Modules
Although Ansible ships with over 3000 built‑in and community modules, real‑world environments often require functionality that does not exist—integrating internal CMDBs, handling proprietary middleware, or encapsulating business logic. Custom modules make these repetitive tasks idempotent, reusable, and version‑controlled.
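The heart of a good module is an idempotence check: compare desired state to current state, and only act (and report changed) when they differ. A minimal sketch of the pattern, with illustrative names:

```python
def ensure_state(current: dict, desired: dict) -> dict:
    """Return an Ansible-style result: changed only if something had to move."""
    diff = {k: v for k, v in desired.items() if current.get(k) != v}
    if not diff:
        return {"changed": False, "msg": "already in desired state"}
    # ... apply `diff` to the real system here ...
    current.update(diff)
    return {"changed": True, "msg": f"updated {sorted(diff)}"}

system = {"version": "1.0", "port": 8080}
first = ensure_state(system, {"version": "2.0", "port": 8080})
second = ensure_state(system, {"version": "2.0", "port": 8080})  # re-run: no-op
print(first["changed"], second["changed"])  # True False
```

Running the same desired state twice yields changed=True then changed=False, which is exactly the behavior Ansible's own modules exhibit on repeated runs.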
5.2 Module Basic Structure
#!/usr/bin/python3
# library/my_deploy.py
# Custom deployment module: deploy an application package to a target host
from __future__ import absolute_import, division, print_function
__metaclass__ = type

from ansible.module_utils.basic import AnsibleModule
import os
import hashlib
import time


def get_file_md5(filepath):
    """Calculate the MD5 digest of a file"""
    md5 = hashlib.md5()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            md5.update(chunk)
    return md5.hexdigest()


def deploy_application(module):
    """Execute application deployment and return changed, msg, meta"""
    deploy_path = module.params['deploy_path']
    version = module.params['version']
    artifact_url = module.params['artifact_url']
    backup_enabled = module.params['backup']
    health_check_url = module.params['health_check_url']

    result = {'changed': False, 'msg': '', 'version': version, 'meta': {}}
    current_symlink = os.path.join(deploy_path, 'current')

    # Idempotence: skip if the requested version is already live
    if os.path.islink(current_symlink):
        current_version = os.readlink(current_symlink)
        if current_version == version:
            result['msg'] = f"Version {version} already deployed, skipping"
            result['meta']['skipped'] = True
            return result

    # Back up the current symlink before switching versions
    if os.path.lexists(current_symlink) and backup_enabled:
        backup_dir = os.path.join(deploy_path, 'backup')
        os.makedirs(backup_dir, exist_ok=True)
        old_version = os.readlink(current_symlink)
        backup_path = os.path.join(backup_dir, f"{old_version}_{int(time.time())}")
        os.rename(current_symlink, backup_path)
        result['meta']['backup'] = backup_path

    version_dir = os.path.join(deploy_path, 'versions', version)
    os.makedirs(version_dir, exist_ok=True)
    artifact_path = os.path.join(version_dir, 'artifact.tar.gz')
    if not os.path.exists(artifact_path):
        module.run_command(['curl', '-fSL', '-o', artifact_path, artifact_url], check_rc=True)

    rc, stdout, stderr = module.run_command(['tar', '-xzf', artifact_path, '-C', version_dir])
    if rc != 0:
        module.fail_json(msg=f"Artifact extraction failed: {stderr}", rc=rc)

    try:
        os.remove(current_symlink)
    except FileNotFoundError:
        pass
    os.symlink(version, current_symlink)

    # Health check with bounded retries; roll back on failure
    if health_check_url:
        max_retries = 10
        for _ in range(max_retries):
            rc, _, _ = module.run_command(['curl', '-sf', '--max-time', '5', health_check_url])
            if rc == 0:
                break
            time.sleep(2)
        else:
            backup = result['meta'].get('backup')
            if backup:
                module.run_command(['ln', '-sfn', backup, current_symlink])
            module.fail_json(msg=f"Health check failed: {health_check_url}, rolled back")

    result['changed'] = True
    result['msg'] = f"Successfully deployed version {version} to {deploy_path}"
    result['meta']['deploy_path'] = version_dir
    return result


def main():
    module = AnsibleModule(
        argument_spec={
            'deploy_path': {'type': 'str', 'required': True},      # deployment root path
            'version': {'type': 'str', 'required': True},          # target version
            'artifact_url': {'type': 'str', 'required': True},     # URL of the artifact
            'backup': {'type': 'bool', 'default': True},           # enable backup
            'health_check_url': {'type': 'str', 'default': None},  # health-check URL
        },
        supports_check_mode=True
    )
    if module.check_mode:
        module.exit_json(changed=True, msg="Check mode: deployment would run", check_mode=True)
    result = deploy_application(module)
    module.exit_json(**result)


if __name__ == '__main__':
    main()
5.3 Using the Custom Module in a Playbook
# deploy-app.yml
---
- name: Deploy order service
  hosts: appservers
  become: yes
  vars:
    deploy_root: /opt/app
    artifact_base_url: https://artifacts.example.com/order-service
  tasks:
    - name: Deploy order service
      my_deploy:
        deploy_path: "{{ deploy_root }}"
        version: "{{ app_version }}"
        artifact_url: "{{ artifact_base_url }}/{{ app_version }}/order-service.tar.gz"
        backup: yes
        health_check_url: "http://localhost:8080/health"
      register: deploy_result

    - name: Show deployment result
      ansible.builtin.debug:
        msg: |
          Deployment status: {{ deploy_result.changed }}
          Message: {{ deploy_result.msg }}
          Version: {{ deploy_result.version }}
5.4 Structured Return Values
Modules should use exit_json() for success and fail_json() for errors. A well‑designed return payload enables downstream Playbook logic to make decisions.
# Successful return
module.exit_json(
    changed=True,
    msg="Deployment complete",
    meta={
        "version": "v2.4.1",
        "elapsed_seconds": 45,
        "artifact_md5": "abc123def456",
        "deployed_files": 127,
        "skipped": False
    }
)

# Failure return
module.fail_json(
    msg="Unable to connect to artifact repository",
    error_code="ARTIFACT_DOWNLOAD_FAILED",
    details={"url": artifact_url, "rc": 22}
)
6. Dynamic Inventory and Enterprise-Scale Management
6.1 Dynamic Inventory Script
Static inventories do not scale for large cloud environments. A dynamic script can query cloud provider APIs to build the host list in real time.
#!/usr/bin/python3
# inventory/ec2_inventory.py
import argparse
import json
import sys

import boto3


def get_ec2_instances(region, filters=None):
    """Fetch running EC2 instances from AWS"""
    ec2 = boto3.client('ec2', region_name=region)
    instances = []
    paginator = ec2.get_paginator('describe_instances')
    for page in paginator.paginate(Filters=filters or []):
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                if instance['State']['Name'] == 'running':
                    instances.append(instance)
    return instances


def build_inventory(instances):
    """Construct the Ansible inventory structure"""
    inventory = {'_meta': {'hostvars': {}}}
    for instance in instances:
        tags = instance.get('Tags', [])
        group_name = 'tag_Role_Unknown'
        for tag in tags:
            if tag['Key'] == 'Role':
                group_name = f"tag_Role_{tag['Value']}"
                break
        inventory.setdefault(group_name, {'hosts': []})['hosts'].append(instance['PrivateDnsName'])
        inventory['_meta']['hostvars'][instance['PrivateDnsName']] = {
            'ansible_host': instance['PrivateIpAddress'],
            'ansible_user': 'ec2-user',
            'ec2_instance_id': instance['InstanceId'],
            'ec2_instance_type': instance['InstanceType'],
            'ec2_region': instance['Placement']['AvailabilityZone'][:-1],
            'ec2_vpc_id': instance['VpcId'],
            'security_groups': [sg['GroupId'] for sg in instance.get('SecurityGroups', [])]
        }
    return inventory


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--list', action='store_true')
    parser.add_argument('--host')
    parser.add_argument('--region', default='cn-north-1')
    args = parser.parse_args()

    filters = [
        {'Name': 'instance-state-name', 'Values': ['running']},
        {'Name': 'tag:AnsibleManaged', 'Values': ['true']}
    ]
    instances = get_ec2_instances(args.region, filters)
    inventory = build_inventory(instances)

    if args.list:
        print(json.dumps(inventory))
    elif args.host:
        print(json.dumps(inventory['_meta']['hostvars'].get(args.host, {})))
    else:
        print("Use --list or --host", file=sys.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main()
6.2 Inventory Caching
Repeated calls to cloud APIs add latency to every run. Enabling a cache reduces both run time and the number of API requests.
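The jsonfile caching used below boils down to "serve from a timestamped file until it is older than the timeout." A rough sketch of that logic (the file layout and names are illustrative, not the plugin's actual format):

```python
import json
import os
import tempfile
import time

def cached_inventory(cache_file, timeout, fetch):
    """Return cached data if fresh, otherwise call fetch() and refresh the cache."""
    if os.path.exists(cache_file) and time.time() - os.path.getmtime(cache_file) < timeout:
        with open(cache_file) as f:
            return json.load(f)          # cache hit: no API call
    data = fetch()                        # cache miss: hit the API
    with open(cache_file, 'w') as f:
        json.dump(data, f)
    return data

calls = []
def fake_api():
    calls.append(1)
    return {"webservers": {"hosts": ["web01"]}}

cache = os.path.join(tempfile.gettempdir(), "demo_inventory_cache.json")
if os.path.exists(cache):
    os.remove(cache)
first = cached_inventory(cache, 3600, fake_api)   # miss: calls the API
second = cached_inventory(cache, 3600, fake_api)  # hit: served from disk
print(len(calls))  # 1
```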
# ansible.cfg
[defaults]
inventory = inventory/ec2_inventory.py
inventory_ignore_extensions = .pyc,.pyo,.ini

# Fact caching
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400

[inventory]
# Inventory caching (honored by inventory plugins that support it)
cache = yes
cache_plugin = jsonfile
cache_connection = /tmp/ansible_inventory_cache
cache_timeout = 3600  # 1 hour
7. Callback Plugins and Event-Driven Automation
7.1 Callback Plugin Overview
Callbacks let you inject custom logic at key Ansible events, such as sending notifications, masking sensitive data, measuring execution time, or persisting results.
# callback_plugins/execution_stats.py
from __future__ import absolute_import, division, print_function
__metaclass__ = type

from ansible.plugins.callback import CallbackBase
from datetime import datetime
import json


class CallbackModule(CallbackBase):
    """Collect execution statistics for a Playbook"""
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'execution_stats'

    def __init__(self):
        super(CallbackModule, self).__init__()
        self.task_stats = {}
        self.host_stats = {}
        self.playbook_start = None
        self.playbook_name = None

    def v2_playbook_on_start(self, playbook):
        self.playbook_start = datetime.now()
        self.playbook_name = playbook.get_name()

    def v2_runner_on_ok(self, result, **kwargs):
        host = result._host.get_name()
        if host not in self.host_stats:
            self.host_stats[host] = {'ok': 0, 'changed': 0, 'failed': 0}
        if result._result.get('changed'):
            self.host_stats[host]['changed'] += 1
        else:
            self.host_stats[host]['ok'] += 1

    def v2_runner_on_failed(self, result, **kwargs):
        host = result._host.get_name()
        if host not in self.host_stats:
            self.host_stats[host] = {'ok': 0, 'changed': 0, 'failed': 0}
        self.host_stats[host]['failed'] += 1

    def v2_playbook_on_stats(self, stats):
        if not self.playbook_start:
            return
        elapsed = (datetime.now() - self.playbook_start).total_seconds()
        summary = {'playbook': self.playbook_name, 'elapsed_seconds': elapsed, 'hosts': {}}
        for host in sorted(stats.processed.keys()):
            data = stats.summarize(host)
            summary['hosts'][host] = {'ok': data['ok'], 'changed': data['changed'],
                                      'failed': data['failures'], 'unreachable': data['unreachable']}
        report_file = f"/tmp/ansible_stats_{int(self.playbook_start.timestamp())}.json"
        with open(report_file, 'w') as f:
            json.dump(summary, f, indent=2)
        self._display.display('\n' + '=' * 60)
        self._display.display(f"Playbook execution stats: {self.playbook_name}")
        self._display.display(f"Total time: {elapsed:.2f}s")
        self._display.display(f"Host stats: {json.dumps(self.host_stats, indent=2)}")
        self._display.display('=' * 60 + '\n')
7.2 DingTalk Notification Callback
Example of a callback that posts a markdown message to a DingTalk webhook after a Playbook finishes.
# callback_plugins/dingtalk.py
from __future__ import absolute_import, division, print_function
__metaclass__ = type

import os
from datetime import datetime

import requests
from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'notification'
    CALLBACK_NAME = 'dingtalk_notify'

    def __init__(self, *args, **kwargs):
        super(CallbackModule, self).__init__(*args, **kwargs)
        self.webhook_url = os.environ.get('DINGTALK_WEBHOOK_URL')
        self.enabled = bool(self.webhook_url)

    def send_message(self, msg):
        if not self.enabled:
            return
        payload = {
            'msgtype': 'markdown',
            'markdown': {'title': 'Ansible Task Notification', 'text': msg}
        }
        try:
            requests.post(self.webhook_url, json=payload, timeout=5)
        except Exception as e:
            self._display.warning(f"DingTalk notification failed: {e}")

    def v2_playbook_on_stats(self, stats):
        if not self.enabled:
            return
        results = {host: stats.summarize(host) for host in stats.processed.keys()}
        failed_hosts = [h for h, d in results.items() if d['failures'] > 0]
        title = "Ansible execution failed" if failed_hosts else "Ansible execution succeeded"
        msg = (f"### {title}\n"
               f"- Execution time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n"
               f"- Host count: {len(results)}\n")
        if failed_hosts:
            msg += f"- Failed hosts: {', '.join(failed_hosts)}\n"
        self.send_message(msg)
8. Performance Optimization: Async, Strategy, Cache
8.1 Asynchronous Execution and Polling
By default Ansible runs tasks synchronously, waiting for all hosts to finish before moving on. For long‑running tasks, this creates a bottleneck.
Using async and poll lets a task run in the background on each host.
tasks:
  # Long-running DB backup executed asynchronously
  # (mongodump via the command module; async works with any long-running task)
  - name: Execute full MongoDB backup
    ansible.builtin.command:
      cmd: mongodump --out /backup/mongo
    async: 3600      # max runtime in seconds
    poll: 0          # 0 = fire-and-forget; returns a job id immediately
    register: backup_jid

  # Package installation kicked off in the background as well
  - name: Install basic packages
    ansible.builtin.yum:
      name:
        - curl
        - wget
        - htop
        - sysstat
      state: present
    async: 600
    poll: 0
    register: pkg_jid

  # Wait for the asynchronous backup job to finish
  - name: Get backup job status
    ansible.builtin.async_status:
      jid: "{{ backup_jid.ansible_job_id }}"
    register: backup_result
    until: backup_result.finished
    retries: 120
    delay: 30
8.2 Strategy Modes
Ansible's default linear strategy runs a task on all hosts before proceeding. The free strategy lets each host run tasks independently, improving throughput.
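The throughput difference is easy to see with toy numbers: under linear, every task's wall time is that of the slowest host; under free, each host's total is just the sum of its own task times. A quick back-of-the-envelope calculation (the durations are made up):

```python
# Per-host duration (seconds) of three consecutive tasks; a different host
# is the straggler on each task.
durations = {
    "web01": [20, 5, 5],
    "web02": [5, 20, 5],
    "web03": [5, 5, 20],
}

# linear: tasks run in sequence, each gated on the slowest host
linear_total = sum(max(host[i] for host in durations.values())
                   for i in range(3))

# free: each host proceeds independently; wall time is the slowest host's own sum
free_total = max(sum(host) for host in durations.values())

print(linear_total, free_total)  # 60 30
```

When stragglers rotate between hosts like this, free strategy halves the wall time; when one host is slowest at everything, the two strategies converge.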
# ansible.cfg
[defaults]
# free strategy: hosts execute tasks at full speed independently
# strategy = free
# Increase parallelism
# forks = 100

[ssh_connection]
# Reduce per-task SSH round-trips by sending modules through one connection
# pipelining = True
8.3 Mitogen for Ansible – Speed Boost
Mitogen replaces Ansible's connection layer with persistent remote Python interpreters, reusing a single SSH connection per host and eliminating the overhead of establishing a new connection and re-uploading module code for every task.
# Install Mitogen
pip install mitogen

# Enable Mitogen in ansible.cfg
[defaults]
strategy_plugins = /usr/local/lib/python3.11/site-packages/ansible_mitogen/plugins/strategy
strategy = mitogen_linear
Benchmarks show a 5-10× speed increase for typical workloads (e.g., a 100-host apt update dropping from 180 s to 25 s).
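Most of that speedup comes from amortizing connection setup: with N tasks, stock Ansible pays the SSH setup cost N times per host, while a persistent connection pays it once. A rough model (all timings are illustrative, in milliseconds):

```python
def wall_time_ms(tasks, connect_ms, exec_ms, persistent):
    """Per-host wall time: connection setup once (persistent) or once per task."""
    setups = 1 if persistent else tasks
    return setups * connect_ms + tasks * exec_ms

# 50 tasks, 800 ms to establish SSH + copy the module, 200 ms of real work each
stock = wall_time_ms(50, 800, 200, persistent=False)
mitogen_like = wall_time_ms(50, 800, 200, persistent=True)
print(stock, mitogen_like, round(stock / mitogen_like, 1))  # 50000 10800 4.6
```

The more tasks a play contains relative to the per-task work, the larger the ratio grows, which matches the observation that short, chatty plays benefit most.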
9. Enterprise‑Grade Automation Platform Architecture
9.1 AWX/Tower Architecture
For team‑based operations, AWX (the open‑source upstream of Ansible Tower) provides a web UI, role‑based access, job scheduling, and execution environments.
Enterprise Ansible Platform:

+-------------------+
|     Git Repo      |   Playbooks, Roles, Inventory, Variables
+-------------------+   (webhook triggers an AWX/Tower project sync)
          │
          ▼
+-------------------+
|    AWX / Tower    |
| +---------------+ |
| | Job Templates | |
| | Workflows     | |
| | Schedules     | |
| +---------------+ |
| +---------------+ |
| | Projects      | |
| | Inventories   | |
| | Credentials   | |
| +---------------+ |
+-------------------+
          │
          ▼
Execution Nodes (SSH to target hosts)

Key components:
Projects: Git repositories containing Playbooks and Roles; AWX syncs them automatically via webhooks.
Workflows: chain multiple Job Templates with conditional branching and parallel execution.
Execution Environments: container images that bundle ansible-runner and the required collections, ensuring a consistent runtime.
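Job Templates can also be launched programmatically through AWX's REST API (a POST to /api/v2/job_templates/<id>/launch/). A minimal standard-library sketch that builds such a request; the host name, template id, and token are placeholders:

```python
import json
import urllib.request

def build_launch_request(awx_host, template_id, token, extra_vars=None):
    """Prepare (but do not send) a launch request for an AWX job template."""
    url = f"https://{awx_host}/api/v2/job_templates/{template_id}/launch/"
    body = json.dumps({"extra_vars": extra_vars or {}}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_launch_request("awx.example.com", 42, "REDACTED",
                           extra_vars={"app_version": "v2.4.1"})
print(req.full_url)
# To actually launch: urllib.request.urlopen(req)  (needs network + a valid token)
```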
9.2 Credential and Secret Management
Credentials store connection details. Best practice is to link AWX credentials to external secret stores such as HashiCorp Vault, avoiding plaintext passwords in the AWX database.
# credentials.yml (example for AWX on Kubernetes)
apiVersion: v1
kind: Secret
metadata:
  name: prod-db-credential
  namespace: awx
type: Opaque
stringData:
  username: dbadmin
  password: "Vault should manage this in production"
---
# Recommended: configure a Vault Credential in AWX and fetch passwords at runtime
10. Conclusion
This guide presented a complete advancement path for Ansible, from core syntax to custom module development and enterprise platform design. Key takeaways include:
Variable precedence pitfalls: misunderstanding variable scopes is one of the most common sources of newcomer errors. Remember that higher-priority scopes override lower ones.
Value of custom modules: encapsulating repetitive manual steps into idempotent modules reduces execution time from minutes to seconds, cuts manual-error rates, and produces auditable results.
Mitogen performance gains: reusing SSH connections commonly yields several-fold speedups for large-scale operations.
Asynchronous strategies : Using async + poll: 0 allows fast hosts to continue work while long‑running tasks execute in the background, dramatically improving overall throughput.
By treating automation knowledge as versioned, testable code, organizations achieve faster change delivery, higher consistency, and reliable rollback capabilities.
Ops Community
A leading IT operations community where professionals share and grow together.