How an AI‑Powered Terraform Command Erased 2 Million Records – Lessons for Safe Ops
A single Terraform command executed by the AI assistant Claude Code destroyed a production database of over two million records. The incident shows how over‑reliance on AI, a missing remote state file, weak backup practices, and absent deletion protection can combine into a massive outage – and which safeguards prevent such incidents.
Incident Overview
During a migration of a static site to AWS, an AI‑driven Terraform workflow unintentionally executed terraform destroy, removing the production RDS instance. Roughly two million records – and the snapshots RDS had created automatically – were lost, causing a 24‑hour service outage.
Root Cause Analysis
Unrestricted AI automation – The Claude Code agent was granted permission to run terraform plan, apply and destroy without a manual approval step.
Missing remote state – The Terraform state file existed only locally and was never synchronized to an S3 backend. The AI agent's view of the infrastructure therefore diverged from reality: it could not detect the existing resources, treated the production database as something it had created itself, and issued a destroy.
Backup coupling to resource lifecycle – Snapshots were created by RDS automatically and were tied to the DB instance. When the instance was deleted, the snapshots were also removed.
Absence of deletion protection – Neither the RDS instance nor the Terraform configuration enabled deletion protection, allowing the destroy operation to succeed.
Implemented Preventive Measures
Decouple backup lifecycle – Store backups (e.g., logical dumps or snapshots) in an S3 bucket that is independent of the Terraform state. Example: aws s3 cp db-backup.sql s3://my-backups/$(date +%F).sql.
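One way to guarantee this decoupling is to manage the backup bucket in its own Terraform root module, with its own state, so no plan against the application stack can ever touch it. A minimal sketch – bucket names, key, and region are illustrative:

```hcl
# Managed in its own state, separate from the application stack,
# so a destroy of the app stack can never reach the backups.
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "backups/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_s3_bucket" "db_backups" {
  bucket = "my-backups"

  lifecycle {
    prevent_destroy = true # refuse any plan that would delete the bucket
  }
}

# Versioning protects individual dump files against overwrite or deletion.
resource "aws_s3_bucket_versioning" "db_backups" {
  bucket = aws_s3_bucket.db_backups.id
  versioning_configuration {
    status = "Enabled"
  }
}
```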
Automated daily restore validation – Deploy an AWS Lambda function triggered by an EventBridge rule at 03:00 UTC. The function launches a temporary RDS instance from the latest backup and runs a health‑check script. Use Step Functions to orchestrate the sequence:
StartExecution → LambdaRestore → LambdaValidate → DeleteTempInstance.
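The 03:00 UTC trigger for that sequence can itself be declared in Terraform. A sketch, assuming the state machine (aws_sfn_state_machine.restore_validate) and an IAM role permitted to call states:StartExecution are defined elsewhere:

```hcl
# Fire the restore-validation state machine every day at 03:00 UTC.
resource "aws_cloudwatch_event_rule" "nightly_restore_test" {
  name                = "nightly-restore-test"
  schedule_expression = "cron(0 3 * * ? *)"
}

resource "aws_cloudwatch_event_target" "start_state_machine" {
  rule     = aws_cloudwatch_event_rule.nightly_restore_test.name
  arn      = aws_sfn_state_machine.restore_validate.arn
  role_arn = aws_iam_role.events_invoke_sfn.arn # allowed to StartExecution
}
```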
Multi‑layer deletion protection – Enable deletion_protection = true in the Terraform aws_db_instance resource and also turn on the “Deletion protection” flag in the AWS console.
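In Terraform the database‑side guards look like this – a sketch with illustrative instance attributes; the key lines are deletion_protection and the lifecycle block:

```hcl
resource "aws_db_instance" "prod" {
  identifier        = "prod-db"
  engine            = "postgres"
  instance_class    = "db.m6g.large"
  allocated_storage = 100

  deletion_protection = true # RDS rejects DeleteDBInstance while set

  # Even if the instance is ever deleted, keep a final snapshot.
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-db-final"

  lifecycle {
    prevent_destroy = true # Terraform itself refuses to plan a destroy
  }
}
```

With prevent_destroy in place, a terraform destroy fails at plan time, before any API call is made – a second, independent layer on top of the RDS flag.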
Remote Terraform state – Configure a backend block to store state in an S3 bucket with DynamoDB locking, e.g.:
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-locks"
  }
}
Restrict AI‑driven destructive actions – Remove the IAM permissions the Claude Code agent would need to delete resources (for example rds:DeleteDBInstance). Require a manual terraform plan review and an explicit terraform apply approval before any resource is destroyed.
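An explicit IAM deny makes the restriction enforceable no matter what command the agent runs, because Deny always overrides Allow. A sketch of a policy attached to the agent's role – the role name and action list are illustrative:

```hcl
resource "aws_iam_role_policy" "deny_destructive" {
  name = "deny-destructive-actions"
  role = aws_iam_role.ai_agent.id # role the Claude Code agent assumes

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyDelete"
      Effect = "Deny"
      Action = [
        "rds:DeleteDBInstance",
        "rds:DeleteDBSnapshot",
        "s3:DeleteBucket"
      ]
      Resource = "*"
    }]
  })
}
```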
Key Takeaways
Infrastructure changes must be reviewed by a human before execution, especially when destructive actions are involved. Backups should be stored independently of the resources they protect and must be validated through end‑to‑end restore tests. AI assistants can accelerate routine tasks, but without proper guardrails they can propagate errors at scale.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles on operations transformation, accompanying readers throughout their operations careers.