How to Combine Terraform and Ansible for Seamless Multi‑Cloud Orchestration
This guide explains why single‑tool approaches fall short in modern IaC, compares Terraform's state management and multi‑cloud support with Ansible's configuration capabilities, and walks through a layered architecture with code samples, CI/CD integration, monitoring, cost‑saving measures, and security practices for enterprise‑grade deployments.
In the era of cloud‑native transformation, manual operations can no longer keep pace with business demands. The author, an experienced site reliability engineer, shares a comprehensive method that merges Terraform’s declarative infrastructure provisioning with Ansible’s imperative configuration management to achieve a robust, multi‑cloud orchestration workflow.
Pain Point Insight
Terraform excels at resource creation, state tracking, dependency resolution, and supports major cloud providers, but it struggles with complex configuration scripts and lacks fine‑grained configuration management. Ansible offers idempotent operations, a rich module ecosystem, and dynamic inventories, yet it does not manage infrastructure state.
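The natural seam between the two tools is data: Terraform knows the addresses of what it created, and Ansible needs them as inventory. As a minimal sketch of that handoff (the output names `web_ips` and `db_endpoints` are illustrative, not from any real module), the glue step reduces to rendering Terraform outputs into an INI inventory:

```python
# Hypothetical sketch: render Terraform outputs (already parsed into a
# dict, e.g. from `terraform output -json`) as an Ansible INI inventory.
def render_inventory(outputs: dict) -> str:
    # Group web hosts under [web_servers] with a default SSH user.
    lines = ["[web_servers]"]
    for ip in outputs.get("web_ips", []):
        lines.append(f"{ip} ansible_user=ec2-user")
    lines.append("")
    lines.append("[db_servers]")
    for host in outputs.get("db_endpoints", []):
        lines.append(host)
    return "\n".join(lines)

print(render_inventory({"web_ips": ["10.0.1.5"], "db_endpoints": ["db.internal"]}))
```

The sections below build this same pattern out with real Terraform resources and a dynamic inventory script.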
Architecture Design: Layered Decoupling
┌─────────────────────────────────────────┐
│ GitOps Workflow │
├─────────────────────────────────────────┤
│ Terraform Layer (Infrastructure) │
│ ├── Network (VPC/Subnet/Security Group)│
│ ├── Compute (EC2/ECS/Lambda) │
│ └── Storage (S3/RDS/ElastiCache) │
├─────────────────────────────────────────┤
│ Ansible Layer (Configuration) │
│ ├── System (users/permissions/services)│
│ ├── Application (containers/micro‑services)│
│ └── Operations (logging/alerts/backup) │
└─────────────────────────────────────────┘

Real‑World Multi‑Cloud Deployment Example
Step 1 – Terraform defines the base infrastructure
# main.tf – multi‑cloud definition
terraform {
  required_providers {
    aws      = { source = "hashicorp/aws", version = "~> 5.0" }
    alicloud = { source = "aliyun/alicloud", version = "~> 1.200" }
  }

  backend "s3" {
    bucket = "terraform-state-prod"
    key    = "ecommerce/infrastructure.tfstate"
    region = "us-west-2"
  }
}

module "aws_infrastructure" {
  source                   = "./modules/aws"
  vpc_cidr                 = "10.0.0.0/16"
  availability_zones       = ["us-west-2a", "us-west-2b", "us-west-2c"]
  enable_ansible_inventory = true
}

module "alicloud_infrastructure" {
  source                   = "./modules/alicloud"
  vpc_cidr                 = "172.16.0.0/16"
  zones                    = ["cn-hangzhou-g", "cn-hangzhou-h"]
  enable_ansible_inventory = true
}

# Render an Ansible inventory file from the module outputs
resource "local_file" "ansible_inventory" {
  content = templatefile("${path.module}/templates/inventory.tpl", {
    aws_instances = module.aws_infrastructure.instance_ips
    ali_instances = module.alicloud_infrastructure.instance_ips
    rds_endpoints = module.aws_infrastructure.rds_endpoints
  })
  filename = "../ansible/inventory/terraform.ini"
}

Step 2 – Ansible performs fine‑grained configuration
# site.yml – main playbook
---
- name: E‑commerce platform deployment
  hosts: localhost
  gather_facts: false
  vars:
    deployment_env: "{{ env | default('production') }}"
  tasks:
    - name: Prepare base environment
      include_tasks: tasks/infrastructure_check.yml
    - name: Deploy application services
      include_tasks: tasks/application_deploy.yml

Step 3 – CI/CD pipeline integration
# .github/workflows/deploy.yml
name: Multi‑Cloud Deployment Pipeline
on:
  push:
    branches: [main]
    paths: ['infrastructure/**', 'ansible/**']
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.5.0
      - name: Terraform Plan
        run: |
          cd infrastructure
          terraform init
          terraform plan -var-file="vars/${ENVIRONMENT}.tfvars"
      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: |
          cd infrastructure
          terraform apply -auto-approve -var-file="vars/${ENVIRONMENT}.tfvars"
  ansible:
    needs: terraform
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Execute Ansible Playbook
        run: |
          cd ansible
          ansible-playbook -i inventory/terraform.ini site.yml \
            --extra-vars "env=${ENVIRONMENT}" \
            --vault-password-file .vault_pass

Advanced Techniques
1. State Sharing via Terraform Outputs
# outputs.tf
output "ansible_vars" {
  value = {
    database_endpoint    = aws_rds_cluster.main.endpoint
    redis_cluster_config = aws_elasticache_replication_group.main.configuration_endpoint_address
    load_balancer_dns    = aws_lb.main.dns_name
    security_groups = {
      web = aws_security_group.web.id
      db  = aws_security_group.db.id
    }
  }
  sensitive = false
}

resource "local_file" "ansible_vars" {
  content = yamlencode({
    infrastructure = {
      cloud_provider = "aws"
      region         = var.aws_region
      environment    = var.environment
    }
    services = local.service_endpoints
    network = {
      vpc_id          = aws_vpc.main.id
      private_subnets = aws_subnet.private[*].id
      public_subnets  = aws_subnet.public[*].id
    }
  })
  filename = "../ansible/group_vars/all/terraform.yml"
}

2. Dynamic Inventory Script (Python)
# inventory/terraform_inventory.py
import json
import subprocess
import sys

def get_terraform_output():
    """Read the current Terraform outputs as a dict, or {} on failure."""
    try:
        result = subprocess.run(
            ['terraform', 'output', '-json'],
            capture_output=True, text=True, check=True,
            cwd='../infrastructure',
        )
        return json.loads(result.stdout)
    except (subprocess.CalledProcessError, json.JSONDecodeError, OSError) as e:
        print(f"Error getting terraform output: {e}", file=sys.stderr)
        return {}

def generate_inventory():
    tf_output = get_terraform_output()
    inventory = {
        '_meta': {'hostvars': {}},
        'all': {'children': ['aws', 'alicloud']},
        'aws': {
            'children': ['web_servers', 'db_servers'],
            'vars': {
                'ansible_ssh_common_args': '-o StrictHostKeyChecking=no',
                'cloud_provider': 'aws',
            },
        },
        'web_servers': {'hosts': []},
        'db_servers': {'hosts': []},
    }
    if 'instance_ips' in tf_output:
        for ip in tf_output['instance_ips']['value']:
            inventory['web_servers']['hosts'].append(ip)
            inventory['_meta']['hostvars'][ip] = {
                'ansible_host': ip,
                'ansible_user': 'ec2-user',
            }
    return inventory

if __name__ == '__main__':
    print(json.dumps(generate_inventory(), indent=2))

Monitoring & Observability Integration
# roles/monitoring/tasks/main.yml
- name: Deploy monitoring stack
  block:
    - name: Configure Prometheus
      template:
        src: prometheus.yml.j2
        dest: /etc/prometheus/prometheus.yml
      vars:
        terraform_targets: "{{ terraform_monitoring_targets }}"
      notify: restart prometheus

    - name: Deploy Grafana dashboards
      community.grafana.grafana_dashboard:
        grafana_url: "{{ grafana_endpoint }}"
        grafana_api_key: "{{ grafana_api_key }}"
        path: "dashboards/{{ item }}.json"  # the module imports dashboards from JSON files
      loop:
        - infrastructure-overview
        - application-metrics
        - multi-cloud-cost-analysis

    - name: Configure alert rules
      template:
        src: alert-rules.yml.j2
        dest: /etc/prometheus/rules/infrastructure.yml
      vars:
        notification_webhook: "{{ slack_webhook_url }}"

Cost‑Optimization Strategies
# modules/cost-optimization/main.tf
resource "aws_autoscaling_schedule" "scale_down" {
  scheduled_action_name  = "scale-down-evening"
  min_size               = 1
  max_size               = 2
  desired_capacity       = 1
  recurrence             = "0 18 * * MON-FRI"
  autoscaling_group_name = aws_autoscaling_group.web.name
}

resource "aws_autoscaling_schedule" "scale_up" {
  scheduled_action_name  = "scale-up-morning"
  min_size               = 2
  max_size               = 10
  desired_capacity       = 3
  recurrence             = "0 8 * * MON-FRI"
  autoscaling_group_name = aws_autoscaling_group.web.name
}

resource "aws_autoscaling_group" "web" {
  mixed_instances_policy {
    instances_distribution {
      on_demand_percentage     = 20
      spot_allocation_strategy = "diversified"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web.id
        version            = "$Latest"
      }
      override {
        instance_type     = "t3.medium"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "t3.large"
        weighted_capacity = "2"
      }
    }
  }
}

Security Best Practices
1. Key Management
# playbooks/security-hardening.yml
- name: Security hardening configuration
  hosts: all
  become: yes
  vars:
    vault_secrets: "{{ vault_aws_secrets }}"
  tasks:
    - name: Retrieve DB password from SSM Parameter Store
      set_fact:
        # aws_ssm is a lookup plugin; the parameter-store module only writes
        db_password: "{{ lookup('amazon.aws.aws_ssm', '/' ~ environment ~ '/database/password', region=aws_region) }}"
      no_log: true

    - name: Write secrets to Vault
      hashivault_write:
        mount_point: secret
        secret: "{{ app_name }}/{{ environment }}"
        data:
          database_url: "{{ vault_secrets.database_url }}"
          api_keys: "{{ vault_secrets.api_keys }}"

2. Network Security (Zero‑Trust)
# aws_security_group for web tier
resource "aws_security_group" "web_tier" {
  name_prefix = "web-tier-"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

Fault‑Handling Real‑World Case
During a production rollout, cross‑cloud data‑sync latency was observed. Using the Terraform‑generated inventory, the team ran diagnostic playbooks to collect system metrics, ping remote endpoints, and query PostgreSQL replication lag, then generated an HTML report for rapid root‑cause analysis.
# playbooks/troubleshooting.yml
- name: Production fault diagnosis
  hosts: all
  gather_facts: yes
  tasks:
    - name: Collect system facts
      setup:
        filter: "ansible_*"

    - name: Network connectivity check
      command: "ping -c 4 {{ item }}"
      loop: "{{ cross_region_endpoints }}"
      register: ping_results

    - name: Database replication lag test
      community.postgresql.postgresql_query:
        db: "{{ db_name }}"
        query: "SELECT application_name, client_addr, write_lag, flush_lag, replay_lag FROM pg_stat_replication"
      register: replication_lag

    - name: Generate diagnostic report
      template:
        src: diagnostic_report.j2
        dest: "/tmp/diagnostic-{{ ansible_date_time.epoch }}.html"
      delegate_to: localhost

Performance Tuning Secrets
Terraform Optimizations
# terraform.tf – provider pinning; tune concurrency at run time
# with `terraform apply -parallelism=20` (parallelism is a CLI flag, not config)
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}

data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "web" {
  # for_each (not count) lets each instance carry its own configuration;
  # a resource cannot use both count and for_each at once.
  for_each      = var.instance_configs
  ami           = data.aws_ami.amazon_linux.id
  instance_type = each.value.instance_type
  tags          = merge(var.default_tags, { Name = "web-${each.key}" })
}

Ansible Performance Settings
# ansible.cfg – increase forks and enable pipelining
[defaults]
forks = 50
host_key_checking = False
retry_files_enabled = False
fact_caching = redis
fact_caching_timeout = 3600
fact_caching_connection = localhost:6379:0
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=/tmp/ansible-ssh-%h-%p-%r
pipelining = True
control_path_dir = /tmp

Enterprise‑Level Best‑Practice Summary
Tool Selection: Use Terraform for the immutable infrastructure lifecycle and Ansible for mutable configuration and application deployment.
Code Organization: Keep separate infrastructure/ (Terraform) and ansible/ directories, with environment‑specific modules and inventories.
Versioning: Adopt semantic versioning for infrastructure modules, keep a separate state file per environment, and snapshot state before each change to enable one‑click rollback.
Monitoring & Alerting: Deploy Prometheus and Grafana via Ansible, monitor both resource metrics and application performance, and set cost‑anomaly alerts.
Security: Store secrets in Vault, enforce zero‑trust security groups, and rotate keys via automated playbooks.
CI/CD Integration: Trigger Terraform plan/apply and Ansible playbooks from GitHub Actions, passing environment variables and using encrypted vault passwords.
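The code-organization and versioning points above are easy to enforce mechanically. As a small, hypothetical sketch (the paths infrastructure/vars/&lt;env&gt;.tfvars, ansible/site.yml, and ansible/inventory/ follow the layout used in this guide), a pre-flight check can fail fast before any terraform or ansible-playbook run:

```python
# Hypothetical pre-flight check for the repository layout described above.
from pathlib import Path

def preflight(root: Path, env: str) -> list:
    """Return a list of problems; an empty list means the layout is ready."""
    problems = []
    if not (root / "infrastructure" / "vars" / f"{env}.tfvars").is_file():
        problems.append(f"missing infrastructure/vars/{env}.tfvars")
    if not (root / "ansible" / "site.yml").is_file():
        problems.append("missing ansible/site.yml")
    if not (root / "ansible" / "inventory").is_dir():
        problems.append("missing ansible/inventory/")
    return problems
```

Wiring such a check into the first step of the CI pipeline turns a confusing mid-deploy failure into an immediate, readable error.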
By treating infrastructure as code and coupling Terraform with Ansible, teams can shift from reactive firefighting to proactive architecture design, achieving repeatable, auditable, and scalable multi‑cloud deployments.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.