Unified Multi‑Cloud Management with Terraform: A One‑Stop Guide to Controlling Resources Across Clouds
This guide explains why many companies move from single‑cloud to multi‑cloud, outlines the technical pitfalls of managing resources across AWS, Alibaba Cloud, Azure and others, and provides a step‑by‑step Terraform workflow—including providers, state backends, modules, CI/CD integration, drift detection, policy as code, cost estimation and disaster‑recovery—to build a maintainable, secure multi‑cloud IaC solution.
Problem Background
Business expansion, regulatory data‑residency, single‑provider outages, heterogeneous workloads and acquisitions drive multi‑cloud adoption.
Each cloud has its own API/SDK, IAM model, resource naming and scopes (e.g., an AWS VPC and an Azure virtual network are modeled differently), making inventory, security auditing and change tracking difficult.
Terraform Core Concepts
Provider : plugin that talks to a specific cloud (e.g., aws, alicloud, azurerm, google, tencentcloud).
Resource : managed entity such as aws_instance or alicloud_vpc.
Data source : read‑only view of existing resources (e.g., aws_ami, alicloud_zones).
State : JSON file that maps each managed resource in the configuration to its real-world object and records its last known attributes.
Plan : diff between desired configuration and state.
Apply : executes the plan and updates the state.
Module : reusable collection of Terraform code.
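The concepts above can be seen together in a few lines of HCL. This is a minimal sketch, not a production configuration: the region, the AMI name pattern and the resource names are illustrative assumptions.

```hcl
# Provider: the plugin that talks to one cloud (region is an assumption).
provider "aws" {
  region = "ap-southeast-1"
}

# Data source: a read-only lookup; nothing is created.
# The name pattern is an assumed filter for Amazon Linux 2023 images.
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

# Resource: an entity Terraform creates and then tracks in state.
resource "aws_instance" "web" {
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.micro"
}

# Output: a value exposed to the caller or to other configurations.
output "web_instance_id" {
  value = aws_instance.web.id
}
```

Running terraform plan against this file shows one resource to add. The data source appears only as a read, and the instance ID is recorded in state after apply.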
Terraform Lifecycle
1. Write HCL (*.tf files)
2. terraform init # download providers/modules
3. terraform plan # calculate diff
4. terraform apply # push changes to clouds
5. terraform destroy # delete resources (use with caution)
6. terraform state # manipulate state file
7. terraform import # bring existing resources under management
Installation
# macOS
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
# Ubuntu / Debian
sudo apt-get update && sudo apt-get install -y gnupg software-properties-common
wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install terraform
# Verify
terraform version
Multi‑Cloud Challenges
Provider differences – each cloud defines its own resource model.
Identity & Access – AK/SK and IAM roles are not interchangeable.
State synchronization – state must track resources in every cloud, and one shared state file quickly becomes large and slow to plan.
Network connectivity – VPC peering, VPN, SD‑WAN need coordinated design.
Unified monitoring – metrics, logs and alerts must be aggregated.
Compliance – data residency, encryption and audit requirements vary.
Cost management – consolidated billing and chargeback across providers.
Practical Example 1 – First Multi‑Cloud Terraform Project
Project Structure
terraform-multi-cloud/
├── main.tf # entry point
├── versions.tf # provider & Terraform version constraints
├── variables.tf # input variables
├── outputs.tf # outputs
├── terraform.tfvars # variable values (not version‑controlled)
├── backend.tf # remote state backend
├── providers/
│ ├── aws.tf
│ ├── alicloud.tf
│ └── azurerm.tf
├── modules/
│ ├── vpc/
│ ├── ecs/
│ └── rds/
├── environments/
│ ├── dev/
│ ├── staging/
│ └── production/
└── .gitignore
versions.tf
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    alicloud = {
      source  = "aliyun/alicloud"
      version = "~> 1.200"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    tencentcloud = {
      source  = "tencentcloudstack/tencentcloud"
      version = "~> 1.80"
    }
  }
}
Provider Configuration
# providers/aws.tf
provider "aws" {
  region = "ap-southeast-1"

  # default_tags takes a nested "tags" map, not bare key/value pairs
  default_tags {
    tags = {
      Environment = "production"
      ManagedBy   = "terraform"
      Project     = "myapp"
    }
  }
}
# providers/alicloud.tf
provider "alicloud" {
  region  = "cn-hangzhou"
  profile = "default"
}
# providers/azurerm.tf
provider "azurerm" {
  features {}
  subscription_id = var.azure_subscription_id
}
# providers/google.tf
provider "google" {
  project = var.gcp_project_id
  region  = "asia-southeast1"
}
Risk tip: Do not hard-code AK/SK in code; use environment variables, CI secrets or a secret manager.
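When the same cloud is needed in more than one region, a second provider configuration with an alias is the usual approach. The sketch below assumes a hypothetical DR region (us-west-2), a hypothetical bucket name and the local ./modules/vpc module; only the alias mechanism itself is standard Terraform.

```hcl
# Additional AWS provider configuration for a second region.
# The alias name "dr" and the region are illustrative assumptions.
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# A resource selects the aliased configuration explicitly.
resource "aws_s3_bucket" "dr_copy" {
  provider = aws.dr
  bucket   = "myorg-dr-copy" # hypothetical bucket name
}

# A module receives it through the providers map.
module "dr_vpc" {
  source = "./modules/vpc"

  providers = {
    aws = aws.dr
  }

  vpc_cidr = "10.1.0.0/16"
  vpc_name = "dr"
}
```

Resources without a provider argument keep using the default (unaliased) configuration, so existing code is unaffected by adding the alias.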
Backend Configuration (S3 example)
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "multi-cloud/terraform.tfstate"
    region         = "ap-southeast-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:ap-southeast-1:111122223333:key/abcd-1234"
    dynamodb_table = "terraform-lock"
  }
}
Other clouds have analogous backends (OSS, COS, GCS, Azure Storage, Terraform Cloud, Consul, HTTP). The etcd backends were removed in Terraform 1.3, so they are not an option with the version pinned here. Choose the backend that matches the ecosystem and supports locking.
State Commands
# List resources in state
terraform state list
# Show a specific resource (quote the address so the shell does not expand the brackets)
terraform state show 'aws_instance.web[0]'
# Rename a resource in state
terraform state mv aws_instance.old aws_instance.new
# Remove a resource from state (does NOT destroy the cloud resource)
terraform state rm aws_instance.web
# Import an existing resource
terraform import aws_instance.web i-1234567890abcdef0
State lock risk: Concurrent applies on the same state cause conflicts. The S3 backend uses DynamoDB for locking; the OSS backend locks through a Tablestore table. If an apply fails mid-run, the lock may remain.
# Force unlock (high-risk: ensure no one else is applying)
terraform force-unlock <LOCK_ID>
Practical Example 2 – State Management (S3 + DynamoDB)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
  }
}
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  # A block with more than one argument must span multiple lines
  attribute {
    name = "LockID"
    type = "S"
  }
}
resource "aws_kms_key" "terraform_state" {
  description             = "KMS key for Terraform state"
  deletion_window_in_days = 30
}
Practical Example 3 – Modular Design
Basic Module Usage
module "vpc" {
  source   = "./modules/vpc"
  vpc_cidr = "10.0.0.0/16"
  vpc_name = "prod"
}
Registry Module with Version Pinning
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # >=5.0, <6.0
}
Git Reference Example
module "github_repo" {
  source = "git@github.com:myorg/terraform-modules.git//vpc?ref=v1.0.0"
}
Module Inputs & Outputs
# modules/vpc/variables.tf
variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR"
}
variable "vpc_name" {
  type        = string
  description = "VPC name"
}
# modules/vpc/outputs.tf
output "vpc_id" {
  value       = aws_vpc.this.id
  description = "VPC ID"
}
output "subnet_ids" {
  value       = aws_subnet.public[*].id
  description = "Public subnet IDs"
}
Combining Modules
module "vpc" {
  source = "./modules/vpc"
  # ...
}
module "web" {
  source     = "./modules/web"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.subnet_ids
  # ...
}
When a module contains resources from multiple clouds, the state can become large and applies slower. Recommended practice: split multi-cloud resources into separate modules per cloud or per business domain.
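Once resources are split into separate states per cloud or domain, one configuration often needs values from another. A common way to pass them is the terraform_remote_state data source; the sketch below assumes a hypothetical state key (aws/network/terraform.tfstate) whose root module exports vpc_id and subnet_ids outputs.

```hcl
# Read the outputs of a separately managed "network" state.
# Bucket and region mirror the backend example; the key is hypothetical.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "myorg-terraform-state"
    key    = "aws/network/terraform.tfstate"
    region = "ap-southeast-1"
  }
}

# Consume those outputs without managing the network in this state.
module "web" {
  source     = "./modules/web"
  vpc_id     = data.terraform_remote_state.network.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.network.outputs.subnet_ids
}
```

Only root-module outputs of the other state are visible this way, which keeps the interface between the two states explicit.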
Practical Example 4 – Multi‑Environment Management
Option A: Directory‑Based Layout
environments/
├── dev/
│   ├── main.tf
│   └── terraform.tfvars
├── staging/
│   ├── main.tf
│   └── terraform.tfvars
└── production/
    ├── main.tf
    └── terraform.tfvars
Each environment has its own state, plan and apply. Advantage: complete isolation. Disadvantage: many directory switches and repeated terraform init calls.
Option B: Workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new production
terraform workspace select production
terraform apply
Workspaces store each state in the same backend under a per-workspace prefix (for the S3 backend, env:/<workspace_name>/ by default). Advantage: no directory changes. Disadvantage: environment differences are less obvious, and large differences (different regions or clouds) are error-prone.
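One way to make per-workspace differences explicit is a settings map keyed by terraform.workspace. The instance sizes, counts and the ami_id variable below are illustrative assumptions.

```hcl
variable "ami_id" {
  type        = string
  description = "AMI to launch (supplied by the caller)"
}

locals {
  # One entry per workspace; values are illustrative.
  env_settings = {
    dev        = { instance_type = "t3.micro", instance_count = 1 }
    staging    = { instance_type = "t3.small", instance_count = 2 }
    production = { instance_type = "m5.large", instance_count = 4 }
  }

  # Indexing fails for any workspace without an entry (including "default"),
  # which stops a plan from running in an unexpected workspace.
  settings = local.env_settings[terraform.workspace]
}

resource "aws_instance" "web" {
  count         = local.settings.instance_count
  ami           = var.ami_id
  instance_type = local.settings.instance_type

  tags = {
    Environment = terraform.workspace
  }
}
```

Keeping all differences in one map makes them reviewable in a single place, but it does not help when environments need different providers or regions; the directory layout suits that case better.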
Option C: Terragrunt
Terragrunt wraps Terraform to provide DRY configurations, dependency management and automatic remote‑state handling.
# terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}
terraform {
  source = "../../modules//vpc"
}
inputs = {
  vpc_cidr = "10.0.0.0/16"
  vpc_name = "prod"
}
terragrunt init
terragrunt plan
terragrunt apply
Terragrunt adds a learning curve, but it pays off on large projects with many environments and modules.
Practical Example 5 – CI/CD Integration
GitHub Actions Workflow
name: Terraform
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: environments/production
    # Job-level credentials: terraform init also needs them to reach the S3 backend
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.6
      - name: Terraform Init
        run: terraform init
      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
      - name: Show Plan
        if: github.event_name == 'pull_request'
        run: terraform show -no-color tfplan
      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve tfplan
Risk tip: Applying to production from CI is high-risk; require manual approval, for example by setting environment: production on the job and configuring required reviewers for that environment in the repository settings.
GitLab CI
stages:
  - validate
  - plan
  - apply
default:
  image:
    name: hashicorp/terraform:1.6.6
    entrypoint: [""] # the image's default entrypoint is the terraform binary
tf:validate:
  stage: validate
  script:
    - cd environments/production
    - terraform init -backend=false
    - terraform validate
    - terraform fmt -check
tf:plan:
  stage: plan
  script:
    - cd environments/production
    - terraform init
    - terraform plan -out=tfplan
    - terraform show -no-color tfplan > plan.txt
  artifacts:
    paths:
      - environments/production/tfplan
      - environments/production/plan.txt
    expire_in: 1 day
  environment:
    name: production
  rules:
    - if: $CI_MERGE_REQUEST_ID
    - if: $CI_COMMIT_BRANCH == "main" # apply needs a plan from the same pipeline
tf:apply:
  stage: apply
  script:
    - cd environments/production
    - terraform init
    - terraform apply -auto-approve tfplan
  environment:
    name: production
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
Atlantis (PR‑driven workflow)
version: 3
projects:
  - name: production
    dir: environments/production
    workspace: production
    terraform_version: 1.6.6
    apply_requirements: [approved, mergeable]
    workflow: terraform-workflow
workflows:
  terraform-workflow:
    plan:
      steps: [init, plan]
    apply:
      steps: [apply]
Comment atlantis plan and atlantis apply on a PR; maintainers must approve before apply.
Practical Example 6 – Drift Detection
Method 1: Scheduled plan – run terraform plan -detailed-exitcode via cron; exit code 2 indicates drift.
Method 2: Driftctl – open‑source scanner that produces JSON drift reports.
Method 3: Commercial platforms – Spacelift, Env0 provide built‑in drift detection.
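Method 4: Refresh-only plan (Terraform 0.15.4+) – terraform plan -refresh-only reports differences between state and real infrastructure without proposing configuration changes; terraform apply -refresh-only then writes the accepted differences into state.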
Handling Detected Drift
True drift : run terraform apply to reconcile.
Cloud‑initiated change : update Terraform code to reflect the new state.
Deleted resource : remove it from state with terraform state rm.
Unexpected modification : manual review and decision.
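Some drift is expected, for example an attribute that an autoscaler or another controller changes at runtime. For those attributes a lifecycle ignore_changes rule keeps scheduled plans quiet. The sketch below assumes an Auto Scaling group whose desired capacity is adjusted outside Terraform; the variable names and sizes are illustrative.

```hcl
variable "subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "web" {
  name                = "web"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies change this value at runtime; do not report it as drift
    # and do not reset it on the next apply.
    ignore_changes = [desired_capacity]
  }
}
```

Keep the ignore list as narrow as possible: every ignored attribute is one that Terraform will no longer correct if someone changes it by mistake.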
Practical Example 7 – Common Pitfalls
Hard‑coded passwords
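# Bad
resource "aws_db_instance" "this" { password = "MyPassword123" }
# Good – use a variable marked sensitive
resource "aws_db_instance" "this" { password = var.db_password }
variable "db_password" {
  type      = string
  sensitive = true
}
Prefer environment variables (for example TF_VAR_db_password), CI secrets, or data sources such as aws_ssm_parameter. Note that sensitive = true only redacts CLI output; the value is still written to state, so the state itself must be encrypted and access-controlled.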
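Reading the password from a parameter store keeps it out of both the code and tfvars files. The sketch below assumes a SecureString parameter already exists under a hypothetical path; the database settings are illustrative.

```hcl
# Look up an existing SecureString parameter (the path is hypothetical).
data "aws_ssm_parameter" "db_password" {
  name            = "/myapp/production/db_password"
  with_decryption = true
}

resource "aws_db_instance" "this" {
  identifier                = "myapp-prod"
  engine                    = "mysql"
  instance_class            = "db.t3.medium"
  allocated_storage         = 20
  username                  = "app"
  password                  = data.aws_ssm_parameter.db_password.value
  final_snapshot_identifier = "myapp-prod-final"
}
```

The decrypted value still ends up in the Terraform state, so this approach complements state encryption rather than replacing it.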
Circular dependencies
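# Cycle: the security group needs the instance's IP, and the instance needs the security group
resource "aws_security_group" "web" {
  ingress {
    cidr_blocks = ["${aws_instance.bastion.private_ip}/32"]
  }
}
resource "aws_instance" "bastion" {
  vpc_security_group_ids = [aws_security_group.web.id]
}
Break the cycle by allowing a subnet CIDR block instead of referencing the instance's IP, or by moving the rule into a standalone security-group rule resource.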
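A cycle-free version of this pattern can allow the bastion's subnet rather than the bastion's own address. This is a sketch under assumptions: the variables, the SSH port and the instance type are illustrative and not taken from the original setup.

```hcl
variable "vpc_id" {
  type = string
}

variable "bastion_subnet_id" {
  type = string
}

variable "bastion_ami_id" {
  type = string
}

# The subnet exists independently of the instance, so no cycle is formed.
data "aws_subnet" "bastion" {
  id = var.bastion_subnet_id
}

resource "aws_security_group" "web" {
  name   = "web"
  vpc_id = var.vpc_id

  ingress {
    description = "SSH from the bastion subnet"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [data.aws_subnet.bastion.cidr_block]
  }
}

resource "aws_instance" "bastion" {
  ami                    = var.bastion_ami_id
  instance_type          = "t3.micro"
  subnet_id              = var.bastion_subnet_id
  vpc_security_group_ids = [aws_security_group.web.id]
}
```

The dependency graph is now linear (subnet lookup, then security group, then instance), at the cost of a slightly broader rule than a single /32 address.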
State lock deadlock
When a previous apply crashes, the lock may remain, causing Error acquiring the state lock. Resolve by confirming no active applies and running terraform force-unlock <LOCK_ID>. This operation is high‑risk.
Accidental terraform destroy
Never run terraform destroy in production without explicit approval and scoping (e.g., -target or -var to select environment). Use prevent_destroy on critical resources.
Provider version upgrade causing massive recreation
Upgrading hashicorp/aws from 5.0 to 5.5 recreated ~200 EC2 instances due to changed default values. Mitigate by testing upgrades in a staging environment, using lifecycle { ignore_changes = [...] } or prevent_destroy, and upgrading incrementally.
Cross‑region misconfiguration
Plan showed resources in ap-southeast-1 but they were created in us-east-1 because the CI job omitted the AWS_REGION variable. Fix by enforcing region variables in CI and validating them in plan output.
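One way to enforce the region is to give the variable no default and validate it, so a CI job that forgets to set it fails at plan time instead of silently using another region. The approved-region list below is an assumption.

```hcl
variable "aws_region" {
  type        = string
  description = "Region for all AWS resources; no default, so CI must set it."

  validation {
    # Illustrative allow-list; replace with the regions your organization uses.
    condition     = contains(["ap-southeast-1", "ap-northeast-1"], var.aws_region)
    error_message = "aws_region must be one of the approved regions."
  }
}

provider "aws" {
  region = var.aws_region
}

# Surface the region the provider actually resolved, so it shows up in plan output.
data "aws_region" "current" {}

output "effective_region" {
  value = data.aws_region.current.name
}
```

The job can set the value with TF_VAR_aws_region or -var. The effective_region output uses the name attribute as exposed by the 5.x AWS provider pinned in this guide.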
Practical Example 8 – Migrating Existing Resources
terraform import – single resource import, e.g., terraform import aws_instance.web i-1234567890abcdef0 or terraform import alicloud_instance.web i-bp1234567890abcdef0.
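Since Terraform 1.5 the same import can also be declared in configuration, which makes it reviewable in a pull request and visible in terraform plan before anything is written to state. The sketch below reuses the instance ID from the command above; the ami_id variable and instance type are placeholders that must match the real instance.

```hcl
variable "ami_id" {
  type = string
}

# Declarative import: planned and applied like any other change.
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}

# The target resource block must exist; its arguments should describe the
# instance as it currently is, otherwise the plan proposes changes.
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"
}
```

Alternatively, omit the resource block and run terraform plan -generate-config-out=generated.tf to have Terraform draft it, then review the generated file before applying.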
Terraformer – bulk import tool. Example:
terraformer import aws --resources=vpc,subnet,ec2_instance --regions=ap-southeast-1
A similar command works for Alibaba Cloud.
Moved block (Terraform 1.1+) – rename resources without destroy/import:
moved {
  from = aws_instance.old_name
  to   = aws_instance.new_name
}
Practical Example 9 – Cross‑Cloud Disaster Recovery
# Primary cluster on Alibaba Cloud
module "primary_aliyun" {
  source = "./modules/web"
  cloud  = "alicloud"
  # ...
}
# DR cluster on AWS
module "dr_aws" {
  source = "./modules/web"
  cloud  = "aws"
  # ...
}
# Data sync from OSS to S3: source and target buckets
resource "alicloud_oss_bucket" "dr_source" {
  bucket = "myorg-dr-source"
}
resource "aws_s3_bucket" "dr_target" {
  bucket = "myorg-dr-target"
}
# S3 replication (aws_s3_bucket_replication_configuration) only copies between
# versioned S3 buckets and cannot read from OSS, so the OSS -> S3 copy has to
# run as a separate scheduled sync job outside these two resources.
Key points: data synchronization, health-check + DNS failover, regular DR drills.
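The health-check and DNS failover part can be sketched with Route 53 failover records. This is one possible design, not the only one: it assumes DNS is hosted in Route 53, and the zone ID, record name, endpoints and /healthz path are hypothetical.

```hcl
variable "zone_id" {
  type = string
}

variable "primary_endpoint" {
  type        = string
  description = "Hostname of the primary entry point (e.g. a load balancer on Alibaba Cloud)"
}

variable "dr_endpoint" {
  type        = string
  description = "Hostname of the DR entry point on AWS"
}

# Probe the primary site over HTTPS.
resource "aws_route53_health_check" "primary" {
  fqdn              = var.primary_endpoint
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30
}

# Answer with the primary while its health check passes.
resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = [var.primary_endpoint]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# Fall back to the DR site when the primary is unhealthy.
resource "aws_route53_record" "dr" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = [var.dr_endpoint]
  set_identifier = "dr"

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

A short TTL limits how long clients keep resolving to the failed site, and a DR drill should include forcing the health check to fail to confirm the switch actually happens.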
Practical Example 10 – Policy as Code (OPA / Conftest)
# policy/s3_public.rego
# Evaluates the JSON plan (terraform show -json), whose resources are listed under resource_changes
package terraform.s3

import rego.v1

public_acls := {"public-read", "public-read-write"}

deny contains msg if {
    some rc in input.resource_changes
    rc.type == "aws_s3_bucket"
    rc.change.after.acl in public_acls
    msg := sprintf("S3 bucket '%s' cannot be %s", [rc.name, rc.change.after.acl])
}
# Validate in CI (the package is not "main", so all namespaces must be evaluated)
terraform show -json tfplan > plan.json
conftest test plan.json --policy policy/ --all-namespaces
Practical Example 11 – Cost Management
# Install Infracost
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh
# Generate cost report
infracost breakdown --path=environments/production --format=json --out-file=infracost.json
infracost output --path=infracost.json --format=table
Integrate the commands into CI (GitHub Actions example omitted for brevity) to display cost changes on PRs.
Practical Example 12 – Tagging Convention
locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    Project     = "myapp"
    CostCenter  = "engineering"
    Owner       = "ops-team"
  }
}
# Apply to providers and resources
provider "aws" {
  default_tags {
    tags = local.common_tags
  }
}
resource "alicloud_vpc" "this" {
  vpc_name = "prod"
  tags     = local.common_tags
}
Consistent tags simplify billing, permission management and automation.
Practical Example 13 – Monitoring & Alerts
AWS CloudWatch Alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-${var.environment}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "EC2 CPU > 80% for 4 minutes"
  alarm_actions       = [aws_sns_topic.alerts.arn]
  dimensions          = { AutoScalingGroupName = aws_autoscaling_group.web.name }
}
Alibaba Cloud CloudMonitor Alarm
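# CloudMonitor alarms are managed with alicloud_cms_alarm; verify attribute
# names against the provider version you pin.
resource "alicloud_cms_alarm" "cpu_alarm" {
  name    = "ecs-high-cpu"
  project = "acs_ecs_dashboard"
  metric  = "CPUUtilization"
  period  = 300

  # One dimension object per monitored instance, encoded as a JSON string
  metric_dimensions = jsonencode([
    for id in alicloud_instance.web[*].id : { instanceId = id }
  ])

  escalations_critical {
    statistics          = "Average"
    comparison_operator = ">"
    threshold           = 80
    times               = 3
  }

  contact_groups = [alicloud_cms_alarm_contact_group.default.alarm_contact_group_name]
}
Practical Example 14 – Failure Case Studies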
Case 1: State lock dead for 6 hours – an OOM-killed CI job left a DynamoDB lock behind. Resolved by confirming no active applies and running terraform force-unlock. Terraform does not expire locks automatically, so the follow-up was a CI job timeout plus an alert on long-held locks.
Case 2: Accidental production RDS deletion – terraform destroy run in test environment without switching variables. Mitigations: add prevent_destroy on critical DBs, separate IAM permissions, require manual approval, enable automated backups and point‑in‑time recovery.
Case 3: Provider upgrade caused massive recreation – Upgrading hashicorp/aws recreated 200 EC2 instances. Mitigations: test upgrades in staging, use lifecycle { ignore_changes = [...] } or prevent_destroy, upgrade providers incrementally (minor only).
Case 4: Cross‑region resource mismatch – CI omitted AWS_REGION, resources were created in the default us-east-1. Fix: enforce region variables in CI and validate them in plan output.
Risk Checklist
Never run terraform destroy in production without manual approval and explicit -target scoping.
Use terraform force-unlock only after confirming no other apply is running.
Test provider upgrades in a non‑prod environment before applying to production.
Use lifecycle { ignore_changes = [...] } for attributes that may change outside Terraform.
Store secrets in Vault/KMS/Parameter Store; never hard‑code them.
Grant Terraform a dedicated least‑privilege IAM user/role.
Encrypt state files (SSE‑KMS for S3/OSS, server‑side encryption for other backends).
Manage cross‑cloud networking via dedicated VPC peering, VPN or SD‑WAN configurations.
Mark sensitive variables with sensitive = true to hide values in logs.
Enable prevent_destroy on critical resources.
Version‑control state (S3 versioning) and back up to another region.
Use cloud‑native account management (AWS Organizations, Alibaba Resource Directory, GCP Folders) for multi‑account governance.
Best‑Practice Checklist
[ ] Layer project: modules/, environments/, providers/.
[ ] Use remote state backend with locking and encryption.
[ ] Define variables with variable blocks; mark secrets sensitive = true.
[ ] Adopt a consistent naming scheme: {project}-{env}-{role}-{instance}.
[ ] Apply unified tags (Environment, ManagedBy, Project, CostCenter, Owner).
[ ] Separate business modules from infrastructure modules.
[ ] Pin Terraform version and provider versions; pin module versions via version or ref.
[ ] Run terraform fmt and terraform validate on every change.
[ ] Use tflint for linting and checkov / tfsec for security scanning.
[ ] Add prevent_destroy to critical resources.
[ ] Schedule drift detection (cron plan or driftctl).
[ ] CI pipeline: plan → review → apply with manual approval for prod.
[ ] Keep secrets out of code; use Vault/KMS/Parameter Store.
[ ] Enforce Policy as Code with OPA/Conftest or Sentinel.
[ ] Integrate cost monitoring (Infracost or cloud billing APIs).
[ ] Document each module (README with inputs, outputs, purpose).
FAQ
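Q1: What changed in recent Terraform releases? moved blocks arrived in 1.1, import and check blocks in 1.5, the terraform test command in 1.6, and removed blocks in 1.7. Refresh-only plans date back to 0.15.4. Terraform does not expire state locks automatically; a stale lock still requires terraform force-unlock.
Q2: Is there a Terraform 0.x after 1.x? No. The last 0.x series was 0.15; 1.0 followed it, and releases have continued on the 1.x line (1.7+ at the time of writing).
Q3: What is OpenTofu? A community fork of Terraform created after the license change to BSL; it is MPL-2.0 licensed, hosted by the Linux Foundation, and was forked from the Terraform 1.5.x codebase.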
Q4: Terraform vs. Pulumi? Terraform uses declarative HCL and has the largest provider ecosystem. Pulumi uses general‑purpose languages (Python, Go, TypeScript) and is friendlier to developers but has a smaller ecosystem.
Q5: How to encrypt sensitive variables? Mark them sensitive = true, store values in encrypted backends (SSE‑KMS, Vault, Parameter Store), and avoid plain‑text in tfvars.
Q6: Can Terraform state be shared? Yes, via remote backends (S3, OSS, GCS, etc.) with locking; avoid concurrent applies.
Q7: How to migrate manually created resources? Use terraform import for single resources or terraformer for bulk import, then write the resource attributes into code.
Q8: How to manage resources not in state? Import them with terraform import and then codify their attributes.
Q9: Can Terraform run inside Kubernetes? Yes, via Terraform‑operator, Atlantis pod, Spacelift agent, etc.
Q10: How to speed up apply ? Use -target, increase -parallelism, split large states, or (cautiously) use -refresh=false.
Summary
Terraform provides a unified declarative language for multi‑cloud IaC, but successful adoption requires deep knowledge of each provider's resource model.
Remote, encrypted, locked state is essential for team collaboration and consistency.
Modular design keeps state size manageable and improves reuse.
CI/CD pipelines with plan‑review‑apply gating prevent accidental changes.
Regular drift detection ensures the real world matches the declared state.
Policy‑as‑code enforces security and compliance.
Cost monitoring across clouds helps control spend.
Never edit cloud resources manually; always make changes through Terraform.
MaGe Linux Operations
