Unified Multi‑Cloud Management with Terraform: A One‑Stop Guide to Controlling Resources Across Clouds

This guide explains why many companies move from single‑cloud to multi‑cloud, outlines the technical pitfalls of managing resources across AWS, Alibaba Cloud, Azure and others, and provides a step‑by‑step Terraform workflow—including providers, state backends, modules, CI/CD integration, drift detection, policy as code, cost estimation and disaster‑recovery—to build a maintainable, secure multi‑cloud IaC solution.

MaGe Linux Operations

Problem Background

Business expansion, regulatory data‑residency, single‑provider outages, heterogeneous workloads and acquisitions drive multi‑cloud adoption.

Each cloud has its own API/SDK, IAM model, resource naming and scoping rules (for example, a virtual network is a VPC on AWS and a VNet on Azure, and the two scope subnets differently), which makes inventory, security auditing and change tracking difficult.

Terraform Core Concepts

Provider : plugin that talks to a specific cloud (e.g., aws, alicloud, azurerm, google, tencentcloud).

Resource : managed entity such as aws_instance or alicloud_vpc.

Data source : read‑only view of existing resources (e.g., aws_ami, alicloud_zones).

State : JSON file that maps each managed resource in the configuration to its real-world object and records its last known attributes.

Plan : diff between the desired configuration and the current (refreshed) state.

Apply : executes the plan and updates the state.

Module : reusable collection of Terraform code.
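
As a minimal sketch of how these concepts fit together, the snippet below wires one provider, one data source, one resource and one output. The AMI filter, instance type and resource names are illustrative assumptions, not taken from a real project.

```hcl
# Provider: the plugin that talks to AWS.
provider "aws" {
  region = "ap-southeast-1"
}

# Data source: read-only lookup of the newest Ubuntu 22.04 image.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account ID

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

# Resource: an entity Terraform creates, updates and deletes.
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}

# Output: a value exposed after apply.
output "web_private_ip" {
  value = aws_instance.web.private_ip
}
```

After terraform apply, the instance is recorded in state under the address aws_instance.web, which is the same address the state commands later in this guide operate on.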

Terraform Lifecycle

1. Write HCL (*.tf files)
2. terraform init      # download providers/modules
3. terraform plan      # calculate diff
4. terraform apply     # push changes to clouds
5. terraform destroy   # delete resources (use with caution)
6. terraform state     # manipulate state file
7. terraform import    # bring existing resources under management

Installation

# macOS
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# Ubuntu / Debian
sudo apt-get update && sudo apt-get install -y gnupg software-properties-common
wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install terraform

# Verify
terraform version

Multi‑Cloud Challenges

Provider differences – each cloud defines its own resource model.

Identity & Access – AK/SK and IAM roles are not interchangeable.

State synchronization – state must stay consistent for resources in every cloud, whether it lives in one state file or is split per cloud.

Network connectivity – VPC peering, VPN, SD‑WAN need coordinated design.

Unified monitoring – metrics, logs and alerts must be aggregated.

Compliance – data residency, encryption and audit requirements vary.

Cost management – consolidated billing and chargeback across providers.

Practical Example 1 – First Multi‑Cloud Terraform Project

Project Structure

terraform-multi-cloud/
├── main.tf               # entry point
├── versions.tf           # provider & Terraform version constraints
├── variables.tf          # input variables
├── outputs.tf            # outputs
├── terraform.tfvars      # variable values (not version‑controlled)
├── backend.tf            # remote state backend
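├── providers/            # provider configs; Terraform loads only *.tf files in the root module directory, so link or copy these there
│   ├── aws.tf
│   ├── alicloud.tf
│   ├── azurerm.tf
│   └── google.tf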
├── modules/
│   ├── vpc/
│   ├── ecs/
│   └── rds/
├── environments/
│   ├── dev/
│   ├── staging/
│   └── production/
└── .gitignore

versions.tf

terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    alicloud = {
      source  = "aliyun/alicloud"
      version = "~> 1.200"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    tencentcloud = {
      source  = "tencentcloudstack/tencentcloud"
      version = "~> 1.80"
    }
  }
}

Provider Configuration

# providers/aws.tf
provider "aws" {
  region = "ap-southeast-1"

  default_tags {
    tags = {
      Environment = "production"
      ManagedBy   = "terraform"
      Project     = "myapp"
    }
  }
}

# providers/alicloud.tf
provider "alicloud" {
  region  = "cn-hangzhou"
  profile = "default"
}

# providers/azurerm.tf
provider "azurerm" {
  features {}
  subscription_id = var.azure_subscription_id
}

# providers/google.tf
provider "google" {
  project = var.gcp_project_id
  region  = "asia-southeast1"
}

Risk tip: Do not hard‑code AK/SK in code; use environment variables, CI secrets or a secret manager.
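
When one root module has to reach a second region or account, a provider alias is the usual mechanism. The sketch below assumes a DR region of us-west-2 and reuses the ./modules/vpc module from the project layout; the alias name dr, the CIDR and the bucket name are illustrative.

```hcl
# Second AWS provider configuration, distinguished by an alias.
provider "aws" {
  alias  = "dr"
  region = "us-west-2" # assumed DR region
}

# Pass the aliased configuration to a module explicitly.
module "vpc_dr" {
  source = "./modules/vpc"

  providers = {
    aws = aws.dr
  }

  vpc_cidr = "10.1.0.0/16"
  vpc_name = "prod-dr"
}

# A single resource can also select it directly.
resource "aws_s3_bucket" "dr_logs" {
  provider = aws.dr
  bucket   = "myorg-dr-logs" # hypothetical bucket name
}
```

Resources that do not name a provider keep using the default (unaliased) configuration.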

Backend Configuration (S3 example)

terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "multi-cloud/terraform.tfstate"
    region         = "ap-southeast-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:ap-southeast-1:111122223333:key/abcd-1234"
    dynamodb_table = "terraform-lock"
  }
}

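Other clouds have analogous backends: oss (Alibaba Cloud OSS), cos (Tencent Cloud COS), gcs, azurerm (Azure Storage), Terraform Cloud, consul and http. The etcd backends were removed in Terraform 1.3, so they are not available with the version constraint used here. Choose the backend that matches the ecosystem and supports locking.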
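
For comparison, a rough equivalent on Alibaba Cloud uses the oss backend, which stores state in an OSS bucket and takes its lock through a Tablestore table. The bucket, Tablestore instance and table names here are placeholders. A configuration can declare only one backend block, so this would replace the S3 example rather than sit beside it.

```hcl
terraform {
  backend "oss" {
    bucket  = "myorg-terraform-state" # placeholder OSS bucket
    prefix  = "multi-cloud"           # directory inside the bucket
    key     = "terraform.tfstate"
    region  = "cn-hangzhou"
    encrypt = true

    # Locking: a Tablestore instance endpoint and a table in it.
    tablestore_endpoint = "https://tf-lock.cn-hangzhou.ots.aliyuncs.com" # placeholder instance
    tablestore_table    = "terraform_lock"
  }
}
```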

State Commands

# List resources in state
terraform state list

# Show a specific resource
terraform state show 'aws_instance.web[0]'

# Rename a resource in state
terraform state mv aws_instance.old aws_instance.new

# Remove a resource from state (does NOT destroy the cloud resource)
terraform state rm aws_instance.web

# Import an existing resource
terraform import aws_instance.web i-1234567890abcdef0

State lock risk: Concurrent applies on the same state cause conflicts. The S3 backend uses a DynamoDB table for locking; the OSS backend uses a Tablestore table. If an apply is killed mid-run, the lock may remain.

# Force unlock (high‑risk – ensure no one else is applying)
terraform force-unlock <LOCK_ID>

Practical Example 2 – State Management (S3 + DynamoDB)

resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state"
  lifecycle { prevent_destroy = true }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  attribute {
    name = "LockID"
    type = "S"
  }
}

resource "aws_kms_key" "terraform_state" {
  description             = "KMS key for Terraform state"
  deletion_window_in_days = 30
}

Practical Example 3 – Modular Design

Basic Module Usage

module "vpc" {
  source   = "./modules/vpc"
  vpc_cidr = "10.0.0.0/16"
  vpc_name = "prod"
}

Registry Module with Version Pinning

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"  # >=5.0, <6.0
}

Git Reference Example

module "github_repo" {
  source = "git@github.com:myorg/terraform-modules.git//vpc?ref=v1.0.0"
}

Module Inputs & Outputs

# modules/vpc/variables.tf
variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR"
}
variable "vpc_name" {
  type        = string
  description = "VPC name"
}

# modules/vpc/outputs.tf
output "vpc_id" {
  value       = aws_vpc.this.id
  description = "VPC ID"
}
output "subnet_ids" {
  value       = aws_subnet.public[*].id
  description = "Public subnet IDs"
}

Combining Modules

module "vpc" {
  source = "./modules/vpc"
  # ...
}

module "web" {
  source     = "./modules/web"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.subnet_ids
  # ...
}

When one configuration holds resources from multiple clouds, the state can become large and applies slow down. Recommended practice: split multi-cloud resources into separate modules per cloud or per business domain.
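
One way to split state per cloud while still sharing values is the terraform_remote_state data source. The sketch assumes the AWS network stack keeps its state under a separate key in the same bucket and exports an output named vpc_id; both the key and the output name are assumptions.

```hcl
# Read the outputs of the separately managed AWS network state.
data "terraform_remote_state" "aws_network" {
  backend = "s3"

  config = {
    bucket = "myorg-terraform-state"
    key    = "aws/network/terraform.tfstate" # assumed state key
    region = "ap-southeast-1"
  }
}

# Only root-module outputs of the other state are visible here.
locals {
  aws_vpc_id = data.terraform_remote_state.aws_network.outputs.vpc_id
}
```

The consuming configuration needs read access to the other state object, so this works best when both states sit behind the same access policy.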

Practical Example 4 – Multi‑Environment Management

Option A: Directory‑Based Layout

environments/
├── dev/
│   ├── main.tf
│   └── terraform.tfvars
├── staging/
│   ├── main.tf
│   └── terraform.tfvars
└── production/
    ├── main.tf
    └── terraform.tfvars

Each environment has its own state, plan and apply. Advantage: complete isolation. Disadvantage: many directory switches and repeated terraform init calls.

Option B: Workspaces

terraform workspace new dev
terraform workspace new staging
terraform workspace new production
terraform workspace select production
terraform apply

Workspaces store their state in the same backend; with the S3 backend, non-default workspaces live under the prefix env:/<workspace_name>/. Advantage: no directory changes. Disadvantage: environment differences are less visible in the code, and large differences (different regions or clouds) are error-prone.
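
With workspaces, per-environment differences usually hang off the built-in terraform.workspace value. A minimal sketch, assuming the three workspace names above and illustrative instance sizes:

```hcl
locals {
  # Per-workspace settings; keys must match the workspace names.
  instance_type_by_workspace = {
    dev        = "t3.micro"
    staging    = "t3.small"
    production = "t3.large"
  }

  # Fall back to the smallest size for unknown workspaces such as "default".
  instance_type = lookup(local.instance_type_by_workspace, terraform.workspace, "t3.micro")
}
```

Keeping every difference in one map like this makes the environments easier to compare, which partly offsets the visibility problem noted above.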

Option C: Terragrunt

Terragrunt wraps Terraform to provide DRY configurations, dependency management and automatic remote‑state handling.

# terragrunt.hcl
include "root" { path = find_in_parent_folders() }
terraform { source = "../../modules//vpc" }
inputs = {
  vpc_cidr = "10.0.0.0/16"
  vpc_name = "prod"
}

# Run from the directory that contains terragrunt.hcl
terragrunt init
terragrunt plan
terragrunt apply

Terragrunt adds a learning curve but pays off on large projects with many environments and modules.

Practical Example 5 – CI/CD Integration

GitHub Actions Workflow

name: Terraform
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.6
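      - name: Terraform Init
        run: terraform init
        working-directory: environments/production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}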
      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: environments/production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Show Plan
        if: github.event_name == 'pull_request'
        run: terraform show -no-color tfplan
        working-directory: environments/production
      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve tfplan
        working-directory: environments/production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

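Risk tip: Applying to production from CI is high-risk; add a manual approval gate. In GitHub Actions, attach the job to an environment (environment: production) that has required reviewers configured; in GitLab CI, use when: manual.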

GitLab CI

stages:
  - validate
  - plan
  - apply

default:
  image:
    name: hashicorp/terraform:1.6.6
    entrypoint: [""]   # the image's default entrypoint is the terraform binary

tf:validate:
  stage: validate
  script:
    - cd environments/production
    - terraform init -backend=false
    - terraform validate
    - terraform fmt -check

tf:plan:
  stage: plan
  script:
    - cd environments/production
    - terraform init
    - terraform plan -out=tfplan
    - terraform show -no-color tfplan > plan.txt
  artifacts:
    paths:
      - environments/production/tfplan
      - environments/production/plan.txt
    expire_in: 1 day
  rules:
    - if: $CI_MERGE_REQUEST_ID
    - if: $CI_COMMIT_BRANCH == "main"   # apply needs a plan from the same pipeline

tf:apply:
  stage: apply
  script:
    - cd environments/production
    - terraform init
    - terraform apply -auto-approve tfplan
  environment:
    name: production
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual

Atlantis (PR‑driven workflow)

version: 3
projects:
  - name: production
    dir: environments/production
    workspace: production
    terraform_version: 1.6.6
    apply_requirements: [approved, mergeable]
    workflow: terraform-workflow
workflows:
  terraform-workflow:
    plan:
      steps: [init, plan]
    apply:
      steps: [apply]

Comment atlantis plan and atlantis apply on a PR; maintainers must approve before apply.

Practical Example 6 – Drift Detection

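Method 1: Scheduled plan – run terraform plan -detailed-exitcode via cron; exit code 0 means no changes, 1 means an error, and 2 means the plan contains changes, which on an unmodified configuration indicates drift.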

Method 2: Driftctl – open‑source scanner that produces JSON drift reports.

Method 3: Commercial platforms – Spacelift, Env0 provide built‑in drift detection.

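Method 4: Refresh-only plan (Terraform 0.15.4+) – terraform plan -refresh-only reports how real resources differ from state without proposing configuration changes; terraform apply -refresh-only then writes those differences into state.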

Handling Detected Drift

True drift : run terraform apply to reconcile.

Cloud‑initiated change : update Terraform code to reflect the new state.

Deleted resource : remove it from state with terraform state rm.

Unexpected modification : manual review and decision.
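
Two of these cases can also be handled in configuration rather than on the command line. The sketch below is illustrative: the autoscaling group, the two input variables and the aws_instance.legacy address are hypothetical, and the removed block needs Terraform 1.7 or later.

```hcl
variable "subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

# Accepted drift: an autoscaling policy changes desired_capacity at runtime,
# so Terraform is told not to reset it on every apply.
resource "aws_autoscaling_group" "web" {
  name                = "web"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

# Resource that should leave state without being destroyed.
# The matching resource block must already be deleted from the configuration.
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false
  }
}
```

Unlike terraform state rm, the removed block goes through plan and code review, so the change is visible in the PR.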

Practical Example 7 – Common Pitfalls

Hard‑coded passwords

# Bad
resource "aws_db_instance" "this" { password = "MyPassword123" }

# Good – use a variable marked sensitive
resource "aws_db_instance" "this" { password = var.db_password }

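variable "db_password" {
  type      = string
  sensitive = true
}

Prefer environment variables (for example TF_VAR_db_password), CI secrets, or data sources such as aws_ssm_parameter. Note that sensitive = true only redacts CLI output; the value is still written to state in plain text, which is one more reason to encrypt the backend.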
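
A minimal sketch of the data-source approach, assuming the password already exists as a SecureString in SSM Parameter Store; the parameter path, identifier and username are hypothetical.

```hcl
# Read the password from SSM Parameter Store (SecureString).
data "aws_ssm_parameter" "db_password" {
  name            = "/myapp/production/db_password" # hypothetical parameter path
  with_decryption = true
}

resource "aws_db_instance" "this" {
  identifier          = "myapp-production"
  engine              = "mysql"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "appadmin"
  password            = data.aws_ssm_parameter.db_password.value
  skip_final_snapshot = true # sketch only; keep final snapshots in production
}
```

The secret never appears in code or tfvars, but it is still stored in state, so state access needs the same protection as the parameter itself.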

Circular dependencies

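# Bad: the security group needs the instance's IP, and the instance needs the security group
resource "aws_security_group" "web" {
  ingress {
    cidr_blocks = ["${aws_instance.bastion.private_ip}/32"]
    # ...
  }
}

resource "aws_instance" "bastion" {
  vpc_security_group_ids = [aws_security_group.web.id]
  # ...
}

Break the cycle by allowing a subnet or VPC CIDR block instead of the instance's IP, or by defining the rule as a standalone aws_security_group_rule resource.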

State lock deadlock

When a previous apply crashes, the lock may remain, causing Error acquiring the state lock. Resolve by confirming no active applies and running terraform force-unlock <LOCK_ID>. This operation is high‑risk.

Accidental terraform destroy

Never run terraform destroy in production without explicit approval and scoping (e.g., -target or -var to select environment). Use prevent_destroy on critical resources.

Provider version upgrade causing massive recreation

Upgrading hashicorp/aws from 5.0 to 5.5 recreated ~200 EC2 instances due to changed default values. Mitigate by testing upgrades in a staging environment, using lifecycle { ignore_changes = [...] } or prevent_destroy, and upgrading incrementally.

Cross‑region misconfiguration

Plan showed resources in ap-southeast-1 but they were created in us-east-1 because the CI job omitted the AWS_REGION variable. Fix by enforcing region variables in CI and validating them in plan output.
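
One way to enforce this in the configuration itself is a region variable with no default plus a validation rule. This is a sketch under assumptions: the variable name aws_region, the allowed region list and the account ID are placeholders.

```hcl
# No default: with -input=false the run fails unless CI passes the region
# explicitly, for example through TF_VAR_aws_region.
variable "aws_region" {
  type        = string
  description = "Region for all AWS resources in this environment"

  validation {
    condition     = contains(["ap-southeast-1"], var.aws_region)
    error_message = "This environment may only be deployed to ap-southeast-1."
  }
}

provider "aws" {
  region              = var.aws_region
  allowed_account_ids = ["111122223333"] # placeholder account ID
}
```

Setting region on the provider from a validated variable means a missing AWS_REGION in the CI job can no longer silently redirect resources to another region.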

Practical Example 8 – Migrating Existing Resources

terraform import – single resource import, e.g., terraform import aws_instance.web i-1234567890abcdef0 or terraform import alicloud_instance.web i-bp1234567890abcdef0.

Terraformer – bulk import tool. Example:

terraformer import aws --resources=vpc,subnet,ec2_instance --regions=ap-southeast-1

and similar for Alibaba Cloud.

Moved block (Terraform 1.1+) – rename resources without destroy/import:

moved {
  from = aws_instance.old_name
  to   = aws_instance.new_name
}
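
Since Terraform 1.5, the single-resource import can also be declared in configuration. A minimal sketch, reusing the sample instance ID from the terraform import command above:

```hcl
# Terraform 1.5+: declare the import in configuration instead of running
# terraform import by hand.
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}
```

If no aws_instance.web block exists yet, running terraform plan -generate-config-out=generated.tf asks Terraform to draft one; the generated file should be reviewed before apply.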

Practical Example 9 – Cross‑Cloud Disaster Recovery

# Primary cluster on Alibaba Cloud
module "primary_aliyun" {
  source = "./modules/web"
  cloud  = "alicloud"
  # ...
}

# DR cluster on AWS
module "dr_aws" {
  source = "./modules/web"
  cloud  = "aws"
  # ...
}

# Buckets on both sides of the data sync
resource "alicloud_oss_bucket" "dr_source" {
  bucket = "myorg-dr-source"
}

resource "aws_s3_bucket" "dr_target" {
  bucket = "myorg-dr-target"
}

# Note: aws_s3_bucket_replication_configuration only replicates between S3
# buckets, so it cannot pull objects from OSS. The OSS -> S3 copy has to be
# run by a separate sync job (for example a scheduled rclone task).

Key points: data synchronization, health‑check + DNS failover, regular DR drills.
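
The health-check and DNS failover piece could be sketched with Route 53 as follows. The hostnames, health-check path and the zone_id variable are assumptions for illustration; the primary and DR endpoints are expected to exist already.

```hcl
variable "zone_id" {
  type        = string
  description = "Route 53 hosted zone ID (assumed to exist)"
}

# Probe the primary site over HTTPS.
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  type              = "HTTPS"
  port              = 443
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30
}

# Served while the health check passes.
resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["primary.example.com"]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# Served when the primary is unhealthy.
resource "aws_route53_record" "secondary" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["dr.example.com"]
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

A short TTL keeps the failover window small, at the cost of more DNS queries.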

Practical Example 10 – Policy as Code (OPA / Conftest)

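# policy/s3_public.rego
package terraform.s3

import rego.v1

# Evaluated against the JSON plan, which lists changes under resource_changes
deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_s3_bucket"
  rc.change.after.acl == "public-read"
  msg := sprintf("S3 bucket '%s' cannot be public-read", [rc.address])
}

deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_s3_bucket"
  rc.change.after.acl == "public-read-write"
  msg := sprintf("S3 bucket '%s' cannot be public-read-write", [rc.address])
}

# Validate in CI
terraform show -json tfplan > plan.json
conftest test plan.json --policy policy/ --namespace terraform.s3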

Practical Example 11 – Cost Management

# Install Infracost
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh

# Generate cost report
infracost breakdown --path=environments/production --format=json --out-file=infracost.json
infracost output --path=infracost.json --format=table

Integrate the commands into CI (GitHub Actions example omitted for brevity) to display cost changes on PRs.

Practical Example 12 – Tagging Convention

locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    Project     = "myapp"
    CostCenter  = "engineering"
    Owner       = "ops-team"
  }
}

# Apply to providers and resources
provider "aws" {
  default_tags {
    tags = local.common_tags
  }
}

resource "alicloud_vpc" "this" {
  vpc_name = "prod"
  tags     = local.common_tags
}

Consistent tags simplify billing, permission management and automation.

Practical Example 13 – Monitoring & Alerts

AWS CloudWatch Alarm

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-${var.environment}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "EC2 CPU > 80% for 4 minutes"
  alarm_actions       = [aws_sns_topic.alerts.arn]
  dimensions          = { AutoScalingGroupName = aws_autoscaling_group.web.name }
}

Alibaba Cloud CloudMonitor Alarm

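# alicloud_cms_alarm is the alicloud provider's CloudMonitor alarm resource
resource "alicloud_cms_alarm" "cpu_alarm" {
  name              = "high-cpu-${var.environment}"
  project           = "acs_ecs_dashboard"
  metric            = "CPUUtilization"
  metric_dimensions = jsonencode([for i in alicloud_instance.web : { instanceId = i.id }])
  period            = 300
  contact_groups    = [alicloud_cms_alarm_contact_group.default.alarm_contact_group_name]

  escalations_critical {
    statistics          = "Average"
    comparison_operator = ">"
    threshold           = 80
    times               = 3
  }
}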

Practical Example 14 – Failure Case Studies

Case 1: State lock stuck for 6 hours – an OOM-killed CI job left a DynamoDB lock behind. Resolved by confirming no apply was active and running terraform force-unlock. Follow-up: added a CI job timeout. Terraform does not expire stale locks automatically, so clearing them remains a manual step.

Case 2: Accidental production RDS deletion – terraform destroy run in test environment without switching variables. Mitigations: add prevent_destroy on critical DBs, separate IAM permissions, require manual approval, enable automated backups and point‑in‑time recovery.

Case 3: Provider upgrade caused massive recreation – Upgrading hashicorp/aws recreated 200 EC2 instances. Mitigations: test upgrades in staging, use lifecycle { ignore_changes = [...] } or prevent_destroy, upgrade providers incrementally (minor only).

Case 4: Cross‑region resource mismatch – CI omitted AWS_REGION, resources were created in the default us-east-1. Fix: enforce region variables in CI and validate them in plan output.

Risk Checklist

Never run terraform destroy in production without manual approval and explicit -target scoping.

Use terraform force-unlock only after confirming no other apply is running.

Test provider upgrades in a non‑prod environment before applying to production.

Use lifecycle { ignore_changes = [...] } for attributes that may change outside Terraform.

Store secrets in Vault/KMS/Parameter Store; never hard‑code them.

Grant Terraform a dedicated least‑privilege IAM user/role.

Encrypt state files (SSE‑KMS for S3/OSS, server‑side encryption for other backends).

Manage cross‑cloud networking via dedicated VPC peering, VPN or SD‑WAN configurations.

Mark sensitive variables with sensitive = true to hide values in logs.

Enable prevent_destroy on critical resources.

Version‑control state (S3 versioning) and back up to another region.

Use cloud‑native account management (AWS Organizations, Alibaba Resource Directory, GCP Folders) for multi‑account governance.

Best‑Practice Checklist

[ ] Layer project: modules/, environments/, providers/.

[ ] Use remote state backend with locking and encryption.

[ ] Define variables with variable blocks; mark secrets sensitive = true.

[ ] Adopt a consistent naming scheme: {project}-{env}-{role}-{instance}.

[ ] Apply unified tags (Environment, ManagedBy, Project, CostCenter, Owner).

[ ] Separate business modules from infrastructure modules.

[ ] Pin Terraform version and provider versions; pin module versions via version or ref.

[ ] Run terraform fmt and terraform validate on every change.

[ ] Use tflint for linting and checkov / tfsec for security scanning.

[ ] Add prevent_destroy to critical resources.

[ ] Schedule drift detection (cron plan or driftctl).

[ ] CI pipeline: plan → review → apply with manual approval for prod.

[ ] Keep secrets out of code; use Vault/KMS/Parameter Store.

[ ] Enforce Policy as Code with OPA/Conftest or Sentinel.

[ ] Integrate cost monitoring (Infracost or cloud billing APIs).

[ ] Document each module (README with inputs, outputs, purpose).

FAQ

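Q1: What changed in Terraform 1.5+? Terraform 1.5 added import blocks and check blocks, 1.6 added the terraform test framework, and 1.7 added removed blocks. moved blocks (1.1) and refresh-only plans (0.15.4) are older. No release expires state locks automatically.

Q2: Is there a Terraform 0.x after 1.x? No. The 0.x line ended with 0.15 and was followed by 1.0 in June 2021; every release since has been 1.x.

Q3: What is OpenTofu? A community fork of Terraform created after HashiCorp switched to the Business Source License (BUSL); it is MPL-licensed, hosted by the Linux Foundation, and compatible with Terraform 1.5 configurations.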

Q4: Terraform vs. Pulumi? Terraform uses declarative HCL and has the largest provider ecosystem. Pulumi uses general‑purpose languages (Python, Go, TypeScript) and is friendlier to developers but has a smaller ecosystem.

Q5: How to encrypt sensitive variables? Mark them sensitive = true, store values in encrypted backends (SSE‑KMS, Vault, Parameter Store), and avoid plain‑text in tfvars.

Q6: Can Terraform state be shared? Yes, via remote backends (S3, OSS, GCS, etc.) with locking; avoid concurrent applies.

Q7: How to migrate manually created resources? Use terraform import for single resources or terraformer for bulk import, then write the resource attributes into code.

Q8: How to manage resources not in state? Import them with terraform import and then codify their attributes.

Q9: Can Terraform run inside Kubernetes? Yes, via Terraform‑operator, Atlantis pod, Spacelift agent, etc.

Q10: How to speed up apply ? Use -target, increase -parallelism, split large states, or (cautiously) use -refresh=false.

Summary

Terraform provides a unified declarative language for multi‑cloud IaC, but successful adoption requires deep knowledge of each provider's resource model.

Remote, encrypted, locked state is essential for team collaboration and consistency.

Modular design keeps state size manageable and improves reuse.

CI/CD pipelines with plan‑review‑apply gating prevent accidental changes.

Regular drift detection ensures the real world matches the declared state.

Policy‑as‑code enforces security and compliance.

Cost monitoring across clouds helps control spend.

Never edit cloud resources manually; always make changes through Terraform.
