Terraform State Management: Lessons from Production Incidents
Real lessons on Terraform state - setting up S3 backends, handling locks, recovering from disasters, and mistakes I've made along the way.
Terraform state is the source of truth for your infrastructure. It maps your configuration to real resources, tracks dependencies, and determines what changes are needed on each apply. Mismanaging state is one of the fastest ways to corrupt your infrastructure.
I learned this the hard way. Here's what I know now.
Why Remote State Matters
Early in my Terraform journey, I used local state files. It worked fine until a colleague and I ran terraform apply simultaneously. The resulting state corruption took hours to untangle, and we had to manually reconcile resources in the AWS console.
Remote state with locking solves this completely. For AWS, I use S3 with DynamoDB:
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}
The DynamoDB table provides locking - only one operation can run at a time. The S3 bucket stores state with versioning enabled for recovery.
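If you're adding this block to a project that has been using local state, terraform init will offer to copy the existing state into the bucket; the -migrate-state flag makes that intent explicit:
terraform init -migrate-state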
Setting Up the Backend
I create the backend infrastructure separately, usually manually or with a bootstrap script:
resource "aws_s3_bucket" "terraform_state" {
bucket = "mycompany-terraform-state"
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-state-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
Versioning is critical. It's saved me multiple times when state got corrupted or accidentally modified.
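Since state files can contain sensitive values, I also harden the bucket. A minimal sketch of the two resources I add alongside versioning - default encryption and a public access block:
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}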
Handling Lock Issues
When someone else is running Terraform, you'll see:
Error: Error acquiring the state lock

Lock Info:
  ID:        abc123def456
  Operation: OperationTypeApply
  Who:       jenkins@ci-runner-5
Usually, just wait for the other operation to complete. But if a CI job crashed mid-apply, you'll need to force unlock:
terraform force-unlock abc123def456
Be careful. I once force-unlocked while a colleague was still running an apply. The resulting state corruption required manual cleanup. Always verify no operation is running before force unlocking.
State Commands I Use Regularly
Listing resources:
terraform state list
terraform state list module.database
Viewing resource details:
terraform state show aws_instance.web
Moving resources (when refactoring):
terraform state mv aws_instance.web module.compute.aws_instance.web
Removing from state (when moving to manual management):
terraform state rm aws_instance.legacy
After any state manipulation, I immediately run terraform plan to verify no unintended changes.
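Put together, a typical refactoring move is back up, move, verify:
# Back up the current state before touching it
terraform state pull > backup.tfstate

# Move the resource to its new address
terraform state mv aws_instance.web module.compute.aws_instance.web

# Confirm the plan shows no unintended changes
terraform plan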
Importing Existing Resources
When taking over existing infrastructure, import brings resources under Terraform management:
terraform import aws_instance.web i-0abc123def456789
Then run terraform state show to see current attributes and update your configuration to match. The goal is terraform plan showing no changes.
Terraform 1.5+ has import blocks, which I prefer:
import {
  to = aws_instance.web
  id = "i-0abc123def456789"
}
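Import blocks also pair with config generation, added alongside them in 1.5 (still labeled experimental), which writes draft HCL for the resources being imported. I treat the output as a starting point to clean up, not finished configuration:
terraform plan -generate-config-out=generated.tf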
Recovering from Disasters
Corrupted state: S3 versioning saves you. List previous versions and restore:
aws s3api list-object-versions \
  --bucket mycompany-terraform-state \
  --prefix prod/infrastructure.tfstate

aws s3api get-object \
  --bucket mycompany-terraform-state \
  --key prod/infrastructure.tfstate \
  --version-id "abc123" \
  recovered.tfstate
terraform state push recovered.tfstate
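One caveat from experience: terraform state push refuses a file whose serial or lineage doesn't match the remote state. The -force flag overrides that check; I only use it after confirming the recovered file is really the version I want:
terraform state push -force recovered.tfstate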
Lost state without versioning: This is painful. You'll need to import every resource manually. I've done this once - it took a full day for a moderately complex environment. Enable versioning.
Stuck lock: Check DynamoDB directly:
aws dynamodb get-item \
  --table-name terraform-state-locks \
  --key '{"LockID": {"S": "bucket/key.tfstate"}}'
Handling State Drift
Drift happens when someone changes resources outside Terraform. I detect it during regular plans:
terraform plan
# Shows unexpected changes
Options:
- Accept the drift - Update your configuration to match reality
- Revert the drift - Apply to restore desired state
- Refresh only - When drift is in attributes you don't manage
terraform apply -refresh-only
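The same flag works with plan when I want to inspect drift without queueing any changes:
terraform plan -refresh-only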
Mistakes I've Made
Running apply without a plan review. I changed a security group rule that broke production traffic. Now I always review plans, even for "simple" changes.
Force unlocking during an active operation. Cost me hours of manual reconciliation. Always verify the operation is truly abandoned.
Not enabling versioning from day one. Lost state on an early project. Recovery was manual and painful. Now it's the first thing I configure.
Storing sensitive data in terraform.tfvars. Accidentally committed database passwords. Now I use SSM Parameter Store exclusively for secrets.
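For reference, reading a parameter from Terraform looks like the sketch below (the parameter path is illustrative). Note that a value read this way still ends up in state, just marked sensitive - so for real secrets I prefer having the application fetch it at runtime:
# Illustrative path; with_decryption defaults to true for SecureString parameters
data "aws_ssm_parameter" "db_password" {
  name = "/prod/db/password"
}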
Key Takeaways
- Always use remote state with locking - S3 + DynamoDB for AWS is the standard
- Enable versioning immediately - It's your disaster recovery lifeline
- Never force-unlock without verification - Confirm no operation is running
- Back up before state manipulation - terraform state pull > backup.tfstate
- Plan after every state operation - Verify no unexpected changes
- Keep secrets out of state files - Use SSM Parameter Store or Secrets Manager
Written by Bar Tsveker
Senior CloudOps Engineer specializing in AWS, Terraform, and infrastructure automation.
Thanks for reading! Have questions or feedback?