Rolling out IMDSv2 without breaking production
Cloud Security
IMDSv2 closes the SSRF-to-credential-theft attack that has powered half the EC2 incidents of the last five years. The rollout breaks old SDKs, container images, and golden AMIs in unpredictable ways.
By Arjun Raghavan, Security & Systems Lead, BIPI · March 21, 2024 · 7 min read
The Capital One breach in 2019 was an SSRF that read EC2 instance metadata from inside a vulnerable web application and exfiltrated the temporary credentials of the instance role. IMDSv2 was AWS's response: instead of a simple GET to 169.254.169.254, the metadata service requires a session token first, obtained with an HTTP PUT, and the hop limit on the token response can be set so the token never travels beyond the instance itself. Configured correctly, IMDSv2 turns instance metadata SSRF from a one-shot credential theft into an unexploitable dead end.
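Run from a shell on the instance itself, the two-step flow looks like this (a sketch; the six-hour TTL is just an example value):

```shell
# Step 1: request a session token. Note the PUT, not GET -- this is what
# defeats simple SSRF, since most SSRF gadgets can only issue GETs.
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# Step 2: present the token on every subsequent metadata read.
# On a v2-enforced instance, requests without this header are rejected.
curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```

These commands only work on an EC2 instance, where 169.254.169.254 is the link-local metadata endpoint.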
Configuring it correctly across an existing fleet is where teams get stuck. Here is the migration plan we run.
Three things to set, in this order
- HttpTokens: required forces clients to use IMDSv2 session tokens.
- HttpPutResponseHopLimit: 1 prevents containerised apps on the host from reaching the metadata service if they should not need to.
- HttpEndpoint: enabled keeps the service available; setting it to disabled blocks all metadata access, including legitimate SDK calls, and breaks almost everything.
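Applied to a single instance with the AWS CLI, the three settings look like this (the instance ID is a placeholder):

```shell
# Require IMDSv2 tokens, pin the hop limit to 1, keep the endpoint up.
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-put-response-hop-limit 1 \
  --http-endpoint enabled
```

The change takes effect without a reboot, but long-running processes that cached v1 behaviour may need a restart.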
Phase one: detect what is still on v1
CloudWatch publishes a MetadataNoToken metric per instance; anything non-zero is an instance still making v1 calls. We export this to a dashboard partitioned by AWS account and EC2 tag, run it for two weeks, and produce a list of workloads that need attention. Most clients find 30-60% of instances are still calling v1, mostly because of old AMIs or old AWS SDK versions. The usual culprits:
- Instances launched from AMIs older than 2019 with bundled AWS CLI v1.16 or earlier
- Container images using boto3 < 1.12 or the JVM AWS SDK v1 < 1.11.678
- Older versions of the EC2 Instance Connect agent
- Self-hosted Prometheus exporters that use a custom HTTP client to fetch instance metadata
- Vendor agents shipped as appliances by network and observability vendors
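The per-instance query that the dashboard is built from can be sketched with the CLI (instance ID and time window are placeholders; fleet-wide, we aggregate the same metric by account and tag):

```shell
# Daily sum of IMDSv1 calls for one instance over a two-week window.
# Any non-zero datapoint means that instance made v1 calls that day.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name MetadataNoToken \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-03-01T00:00:00Z \
  --end-time 2024-03-14T00:00:00Z \
  --period 86400 \
  --statistics Sum
```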
Phase two: hop-limit-1 in the new launch templates
Before forcing tokens, set HttpPutResponseHopLimit to 1 in launch templates. This breaks containers that reach the metadata service through Docker's bridge network, which adds a network hop and therefore needs a hop limit of 2. You will find the broken workloads quickly because they fail at startup, not silently. Fix them by giving them an explicit role via IRSA on EKS or a sidecar that injects credentials, then apply the lower hop limit across the fleet. Doing this before the v2 enforcement separates two failure modes that are hard to debug together.
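A minimal sketch of the launch template change (the template ID is a placeholder):

```shell
# New template version that pins the PUT response hop limit to 1.
aws ec2 create-launch-template-version \
  --launch-template-id lt-0123456789abcdef0 \
  --source-version '$Latest' \
  --launch-template-data '{"MetadataOptions":{"HttpPutResponseHopLimit":1}}'

# Point the template's default at the new version so ASG refreshes
# and new launches pick it up.
aws ec2 modify-launch-template \
  --launch-template-id lt-0123456789abcdef0 \
  --default-version '$Latest'
```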
Phase three: enforce IMDSv2 by default on new launches
Set the account-level default to require IMDSv2 on new instance launches. This catches Auto Scaling group refreshes, scale-out events, and new launch templates without forcing existing instances to be replaced; existing instances stay on v1 until they are recycled. Combine this with an SCP that denies ec2:RunInstances when the ec2:MetadataHttpTokens condition key is not set to required, but keep the SCP in audit-only mode for the first month by tagging exceptions explicitly.
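Both pieces can be sketched as follows (region, account details, and the policy file name are placeholders; the account-level default is per-region):

```shell
# Account-level default for this region: new launches require IMDSv2.
aws ec2 modify-instance-metadata-defaults \
  --region eu-west-1 \
  --http-tokens required

# SCP fragment denying any launch that does not require IMDSv2,
# written out as a heredoc for illustration.
cat > deny-imdsv1-launch.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyIMDSv1Launch",
    "Effect": "Deny",
    "Action": "ec2:RunInstances",
    "Resource": "arn:aws:ec2:*:*:instance/*",
    "Condition": {
      "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
    }
  }]
}
EOF
```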
Phase four: existing fleet, by stack
Cycle production fleets stack by stack as part of normal patch cycles. Auto Scaling groups roll naturally; static instances need a manual modify-instance-metadata-options call followed by a restart of the SDK-using process to pick up the new behaviour. We schedule this with the platform team as part of the next quarterly maintenance window. Forcing it as a security project tends to create friction that the gradual approach avoids.
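For the static instances, a sketch of finding and converting stragglers (the filter name and the instance ID are assumptions to illustrate the shape of the commands):

```shell
# List instances that still allow IMDSv1 in this region.
aws ec2 describe-instances \
  --filters Name=metadata-options.http-tokens,Values=optional \
  --query 'Reservations[].Instances[].InstanceId' \
  --output text

# Then, per instance, require tokens -- and restart the SDK-using
# process afterwards so it picks up the new behaviour.
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required
```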
Detecting regressions after rollout
Once the fleet is on v2, MetadataNoToken should be zero everywhere. Any new non-zero reading means a new AMI, a new container image, or a new vendor agent has been introduced. Alert on it. We have caught two cases of vendor agent updates silently regressing to v1 calls; the vendor patched both within a week of being told.
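A per-instance alarm on the metric can be sketched like this (instance ID and SNS topic ARN are placeholders; since MetadataNoToken is dimensioned per instance, fleet-wide alerting needs a metric-math aggregation or one alarm per instance):

```shell
# Fire if this instance makes any v1 call in an hour.
aws cloudwatch put-metric-alarm \
  --alarm-name imdsv1-regression-i-0123456789abcdef0 \
  --namespace AWS/EC2 \
  --metric-name MetadataNoToken \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Sum \
  --period 3600 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:eu-west-1:111122223333:security-alerts
```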
IMDSv2 is one of the highest-leverage cloud security controls available. The migration is finite work and the security benefit is permanent. Treat it as a platform hygiene project, not a one-off security campaign, and it lands without drama.
Read more field notes, explore our services, or get in touch at info@bipi.in.