BIPI

Terraform Drift Causes More Outages Than Bad Code Does.

Cloud Security

The console edit, the emergency hotfix, the unmanaged resource. Drift between Terraform state and reality is the single most common cause of the cloud outages we work. The remediation is process, not tooling.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 22, 2026 · 7 min read

#terraform #infrastructure-as-code #cloud-operations #devops

Every cloud team that uses Terraform discovers, sooner or later, that the state file does not match reality. Someone fixed something in the console at 2am during an incident. Someone manually rotated a key. Someone created an SQS queue from the AWS CLI to test a thing and never cleaned up. Six months later the next Terraform apply tries to delete or modify the wrong resource, and the change pipeline that everyone trusted blows up production.

We have worked outages caused by Terraform drift more often than outages caused by bad Terraform code. The fix is not better tooling. It is process discipline.

Where drift comes from

  1. Console edits during incidents. The on-call engineer is fixing something at 3am. They are not opening a Terraform PR and waiting for review. They click the button.
  2. Emergency manual scaling. Someone bumped an autoscaling group max from 10 to 30 to handle a spike, and never updated the Terraform.
  3. Resources created outside Terraform from the start. A team spun up an RDS instance via the console for a quick experiment, and now it is part of production but has no .tf file.
  4. Terraform-managed resources that were modified by other AWS automations: AWS Config, AWS Backup, IAM Access Analyzer remediations.
  5. Provider version updates that change default behaviour. The AWS provider 5.x release changed several resource defaults relative to 4.x, causing apparent drift on resources that nobody had touched.

The bad solution: terraform refresh in CI

Some teams add terraform refresh to their CI pipeline, hoping that pulling current state will fix drift before each apply. This makes things worse. Refresh overwrites the state file with whatever currently exists in the cloud, so manual changes to attributes that the configuration does not explicitly set are silently absorbed into state, and the drift signal disappears before anyone has looked at it. It does nothing for resources created outside Terraform either: refresh never adopts them into state, so they stay invisible until they collide with something Terraform does manage.

Refresh hides drift. It does not fix it. The detection signal is what you actually want.
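If you want to see what has changed outside Terraform without rewriting state, a read-only plan gives you exactly that signal. A minimal sketch of the two commands involved; treat the apply as a deliberate, reviewed decision, never a CI step:

```sh
# Show the differences between the state file and real infrastructure,
# without modifying state and without proposing config-driven changes.
terraform plan -refresh-only

# Only after the drift has been investigated and you have decided to
# accept it into state:
terraform apply -refresh-only
```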

The right pattern: drift detection as a separate pipeline

The pattern that has worked for the teams we have helped untangle this is simple: run terraform plan on a schedule (every six hours, daily, whatever fits) against production state, and alert when the plan is non-empty. Crucially, the alert is treated as an incident, not as background noise (a sketch of the scheduled check follows the list below). Someone owns it, investigates, and does one of three things:

  • Adopts the unmanaged change into Terraform if it should be permanent.
  • Reverts the change if it should not exist.
  • Documents an explicit exception (with an expiry date) if it is a temporary state.

Drift that is not investigated within 24 hours becomes drift that is normalised forever. By the time you notice, the team has forgotten the change was made.
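A minimal sketch of the scheduled check, assuming a cron-style CI job with read-only cloud credentials; the workspace path and the alert webhook are illustrative placeholders, not part of any particular tool:

```sh
#!/usr/bin/env bash
set -euo pipefail

cd infrastructure/production   # illustrative path to the root module

terraform init -input=false

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = non-empty plan.
set +e
terraform plan -input=false -lock=false -detailed-exitcode -out=drift.tfplan
plan_exit=$?
set -e

if [ "$plan_exit" -eq 2 ]; then
  # Non-empty plan: page someone, attach the plan, open an incident.
  terraform show -no-color drift.tfplan > drift.txt
  curl -sS -X POST "$ALERT_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d '{"summary": "Terraform drift detected in production"}'
  exit 1
elif [ "$plan_exit" -eq 1 ]; then
  echo "terraform plan failed" >&2
  exit 1
fi
```

The important part is not the script; it is that exit code 2 routes into the same paging and ticketing path as any other incident, with a named owner and the 24-hour clock described above.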

Stop the bleed: console-edit prevention

The structural fix is to make console edits in production hard. Most cloud providers support read-only roles for day-to-day engineering access, with elevation to write access requiring an explicit, audited break-glass process. AWS Identity Center sessions, GCP just-in-time access, and Azure PIM all support this.

When write access is gated by an explicit elevation request, the friction encourages PRs. When it is one click in the console, drift is the path of least resistance.
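On AWS, the shape of this is two roles: a read-only role that engineers assume by default, and a short-lived break-glass role whose every assumption shows up in CloudTrail. A minimal Terraform sketch, assuming AWS; the role names, trust principal, and session durations are illustrative placeholders:

```hcl
data "aws_caller_identity" "current" {}

# Trust policy: who may assume these roles. In practice the principal is
# your identity provider or SSO permission set, not the account root.
locals {
  trust_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
    }]
  })
}

# Day-to-day role: read-only across the account.
resource "aws_iam_role" "engineer_readonly" {
  name                 = "engineer-readonly"   # illustrative name
  max_session_duration = 43200                 # 12 hours
  assume_role_policy   = local.trust_policy
}

resource "aws_iam_role_policy_attachment" "engineer_readonly" {
  role       = aws_iam_role.engineer_readonly.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# Break-glass role: write access, short sessions, assumed only through an
# explicit elevation request that is reviewed after the fact.
resource "aws_iam_role" "break_glass_admin" {
  name                 = "break-glass-admin"   # illustrative name
  max_session_duration = 3600                  # 1 hour
  assume_role_policy   = local.trust_policy
}

resource "aws_iam_role_policy_attachment" "break_glass_admin" {
  role       = aws_iam_role.break_glass_admin.name
  policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}
```

The exact policies matter less than the shape: gaining write access in production is an explicit, logged step rather than the default state of every session.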

What an actual incident looks like

A client's prod RDS instance had a parameter group manually changed during a midnight performance incident eight months earlier. Nobody updated the Terraform. When a routine apply ran for an unrelated change, the apply tried to revert the parameter group to what was in the .tf file, which required a database restart, which did not finish in time, which caused 23 minutes of downtime during business hours.

The 'fix' the team had been talking about for months was 'we should detect drift.' They had bought a tool. The tool generated reports nobody read. The actual fix was to tie drift to incident workflow: every drift detected goes to PagerDuty, every drift gets investigated within 24 hours.

Provider upgrades: the silent driver

Major Terraform provider version upgrades change resource schemas, default values, and computed-attribute behaviour. When you upgrade the AWS provider from 4.x to 5.x in a CI run that has not run in a few months, the plan output is enormous. Most of the changes are no-ops that the new provider simply calculates differently. A few are real semantic changes. The team usually approves the plan because they assume the diffs are cosmetic. Some of them are not.

Pin the provider version in every workspace. Upgrade explicitly, in a low-traffic period, with the changelog open and someone reading it. Do not bundle provider upgrades with feature changes.
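Pinning is a one-block change in each root module; a minimal sketch, with the constraint values as illustrative examples:

```hcl
terraform {
  required_version = ">= 1.5.0"   # illustrative; match what you actually run

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      # Pessimistic constraint: allows 5.x releases, never a silent jump
      # to 6.x as a side effect of `terraform init -upgrade`.
      version = "~> 5.0"
    }
  }
}
```

Commit the generated .terraform.lock.hcl alongside it so every run resolves the exact same provider build until you deliberately change it.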

Closing

Terraform drift is not a tooling problem. The tools exist. The detection works. What breaks is the process: nobody notices, nobody owns it, the drift accumulates, and the next routine change becomes an outage. Treat drift detection as you would treat a security alert. Someone gets paged, someone investigates, someone documents. The teams that do this have stable Terraform pipelines. The teams that do not are the ones with routine 2am incidents that started six months earlier, without anyone realising it at the time.

Read more field notes, explore our services, or get in touch at info@bipi.in.