How a Tiny Cost‑Saving Decision Wiped Out 2.5 Years of Data on AWS
A data‑science community founder used an AI coding assistant to migrate a static site to AWS. A shared Terraform state, a full terraform apply run without reviewing the plan, and an AI‑suggested terraform destroy combined to erase the entire production environment and two‑and‑a‑half years of user data — a case study in the dangers of over‑automation and the importance of robust backups and manual safeguards.
Background
Alexey Grigorev was migrating the AI Shipping Labs project from a static site hosted on GitHub Pages to AWS.
Planned migration
Static site → S3 bucket
DNS → Route 53
New Django version on a sub‑domain
Switch primary domain after stability verification
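The first step of the plan — pushing the static site into an S3 bucket — can be sketched with the AWS CLI. The helper function and bucket name below are hypothetical, not from the article:

```shell
# Sketch of the static-site step: sync the built site into an S3 bucket.
# deploy_site and the bucket name are illustrative assumptions.
deploy_site() {
  bucket="$1"
  # --delete removes objects from the bucket that no longer exist locally,
  # so the bucket mirrors the ./public build directory exactly
  aws s3 sync ./public "s3://$bucket" --delete
}
```

Route 53 would then point the domain at the bucket (or a CloudFront distribution in front of it), matching the DNS step above.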
Infrastructure decision
To reduce cost (~$5‑$10 per month), Alexey added the new project to the existing Terraform configuration that already managed DataTalks.Club. Both projects shared the same VPC, ECS cluster, load balancers, and bastion host.
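Sharing one configuration also means sharing one state file, so a mistake in either project can touch both. Keeping each project's state under its own backend key avoids that; a minimal sketch assuming an S3 backend, with hypothetical bucket and key names:

```hcl
terraform {
  backend "s3" {
    bucket = "tf-state-example"                    # hypothetical state bucket
    key    = "ai-shipping-labs/terraform.tfstate"  # one key per project
    region = "eu-central-1"
  }
}
```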
Execution error
During deployment on 26 Feb, Alexey let the AI coding assistant Claude Code run terraform apply without first reviewing terraform plan. Because the state file from his previous workstation was missing, Terraform assumed the environment was empty and began creating a full set of resources.
When Alexey paused the run and asked why so many resources were being created, Claude explained that the missing state file made Terraform treat the environment as empty.
Alexey continued, creating duplicate resources, then asked Claude to delete the duplicates. Claude suggested running terraform destroy, and Alexey approved. The destroy removed not only the newly created duplicates but the entire production infrastructure: the RDS instance, VPC, ECS cluster, load balancers, and bastion host.
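A blanket terraform destroy tears down everything in state. Duplicates can instead be removed one address at a time with `-target` (or detached from state without deletion via `terraform state rm`). A hypothetical helper; the resource address in the usage comment is illustrative, not from the incident:

```shell
# Sketch of scoped cleanup: destroy only the listed resource addresses,
# leaving the rest of the environment untouched. destroy_only is a
# hypothetical wrapper, not a Terraform feature.
destroy_only() {
  for addr in "$@"; do
    # Terraform still shows a plan and asks for confirmation per resource
    terraform destroy -target="$addr" || return 1
  done
}

# Usage (hypothetical address):
#   destroy_only aws_s3_bucket.duplicate_site
```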
Data loss and recovery
The platform stored roughly 1.9 million rows of course data. After the destroy, the daily RDS snapshots appeared to be gone as well. Alexey opened an AWS support ticket and upgraded to Business Support (priced at roughly 10 % of monthly AWS usage); AWS confirmed the visible snapshots were deleted but found a hidden snapshot retained internally.
Within 24 hours AWS restored the hidden snapshot. Alexey recreated the database from the snapshot with Terraform and verified that the courses_answer table still contained 1,943,200 records. The service was brought back online with all user data intact.
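The article says the database was recreated from the snapshot with Terraform. A minimal sketch of what that resource might look like — names and instance size are hypothetical, not from the incident — using the AWS provider's `snapshot_identifier` argument, with deletion protection enabled to block a repeat:

```hcl
# Hypothetical recreation of the database from the recovered snapshot.
resource "aws_db_instance" "main" {
  identifier                = "production-db"
  snapshot_identifier       = "support-restored-snapshot"  # snapshot restored by AWS
  instance_class            = "db.t3.medium"
  deletion_protection       = true                         # refuse future destroys
  skip_final_snapshot       = false
  final_snapshot_identifier = "production-db-final"
}
```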
Post‑mortem and recommendations
Limit AI automation: Disable Claude’s ability to write files or execute commands automatically. Use the AI only to generate terraform plan output, and review it manually before applying.
Isolate environments: Use separate Terraform workspaces and AWS accounts for each project to avoid state contamination.
Robust backup strategy: Implement multi‑layer backups independent of Terraform, such as S3‑based backups and RDS deletion protection.
Validate backups: Run automated daily restore tests to ensure snapshots are usable.
Human gate‑keeping: Require manual approval for destructive commands like terraform destroy.
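The human gate‑keeping point above can be sketched as a thin wrapper that refuses terraform destroy unless a human has set an explicit confirmation variable. safe_terraform is a hypothetical helper, not part of Terraform; it pairs naturally with the plan‑review rule (`terraform plan -out=tfplan`, review, then `terraform apply tfplan`):

```shell
# Sketch of a manual approval gate for destructive commands.
# safe_terraform is a hypothetical wrapper, not a Terraform feature.
safe_terraform() {
  if [ "$1" = "destroy" ] && [ "${CONFIRM_DESTROY:-}" != "yes" ]; then
    echo "refused: re-run with CONFIRM_DESTROY=yes to allow 'terraform destroy'" >&2
    return 1
  fi
  terraform "$@"   # everything else passes through unchanged
}
```

An AI assistant given only this wrapper cannot trigger a destroy on its own; a person must deliberately set `CONFIRM_DESTROY=yes`.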
The incident demonstrates that while AI‑assisted automation can increase efficiency, final responsibility for critical infrastructure changes must remain with humans.