DDoS Incident Response: From First Page to Post-Mortem
Cybersecurity
DDoS runbook that survives contact with reality. Confirming it is actually DDoS, separating L3/4 from L7, engaging CDN and upstream providers, BGP blackhole as last resort, and the comms plan that keeps the business calm.
By Arjun Raghavan, Security & Systems Lead, BIPI · July 9, 2024 · 7 min read
DDoS is a peculiar incident because the operational pain is immediate and the forensic evidence is thin. There is rarely a malware sample or a compromised host to investigate. The work is about restoring service quickly, scoping the attack accurately, and producing a post-mortem that distinguishes between an actual adversary and a backend that fell over under modest load. The runbook below is the one I see hold up under real pressure.
Confirm it is actually DDoS
First and most embarrassing question: is this a DDoS, or is this your application failing at 200 requests per second because someone deployed a slow database query at 14:00? The signal that you have an attack and not a self-inflicted outage is asymmetry. Real attacks show source-IP entropy that does not match your normal user distribution, geographies you do not serve, ASN concentration in cloud or residential proxy networks, and request patterns that hit cheap-to-generate endpoints (login, search) rather than realistic user flows.
Pull the CDN or load-balancer logs for the last 15 minutes, group by source IP, source ASN, source country, and User-Agent. If 60% of your traffic is from three ASNs you have never heard of and every request is to /api/v1/login with a generated UA, it is DDoS. If it is uniform from your normal user base but volume is 4x normal, it might be a real product launch or a marketing push you were not told about.
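If you want something repeatable at 02:30, a throwaway script in this shape does the grouping. It is a minimal sketch assuming JSON-lines edge logs; the field names (client_ip, client_asn, client_country, user_agent, path) are placeholders for whatever your CDN or load balancer actually emits.

```python
# Triage sketch: summarise the last slice of edge logs by IP, ASN, country,
# path, and User-Agent. Field names are placeholders -- adjust to your logs.
import json
import sys
from collections import Counter

def summarise(log_lines, top_n=5):
    ips, asns, countries, uas, paths = (Counter() for _ in range(5))
    total = 0
    for line in log_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        total += 1
        ips[rec.get("client_ip")] += 1
        asns[rec.get("client_asn")] += 1
        countries[rec.get("client_country")] += 1
        uas[rec.get("user_agent")] += 1
        paths[rec.get("path")] += 1
    if total == 0:
        print("no parsable log lines")
        return
    for label, counter in [("ASN", asns), ("country", countries),
                           ("path", paths), ("user-agent", uas), ("IP", ips)]:
        print(f"\nTop {label}s ({total} requests):")
        for value, count in counter.most_common(top_n):
            print(f"  {count:>8}  ({count / total:5.1%})  {value}")

if __name__ == "__main__":
    summarise(sys.stdin)
```

Pipe the last 15 minutes of logs into it and look for the asymmetry described above: a handful of ASNs or a single path carrying most of the traffic is the attack signature; a flat distribution that just got bigger usually is not.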
L3/4 versus L7 changes everything
Layer 3/4 attacks (SYN floods, UDP amplification, reflection attacks) are about saturating your pipe or your kernel. The mitigations live at the network edge: ISP scrubbing, AWS Shield Advanced, Cloudflare Magic Transit, Akamai Prolexic. The application has no role in mitigating them. Layer 7 attacks (HTTP floods, Slowloris, application-aware bots) are about exhausting your application or database. The mitigations live at the CDN and the WAF: rate limits, bot management, JavaScript challenges, managed challenges, CAPTCHA, and JA4-based blocking.
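To make the L7 side concrete, here is a toy sliding-window limiter of the kind an edge rate-limit rule implements. It is illustrative only: the thresholds are made up, and in practice this logic runs inside your CDN or WAF keyed on more than the client IP (ASN, JA4 fingerprint, session signals).

```python
# Toy illustration of per-client rate limiting at the edge for L7 floods.
# Thresholds and keying on client IP alone are illustrative placeholders.
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)   # client key -> recent timestamps

    def allow(self, client_key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client_key]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False   # over the limit: challenge or block this request
        q.append(now)
        return True

# Example: at most 30 requests to the login endpoint per client per minute.
login_limiter = SlidingWindowLimiter(max_requests=30, window_seconds=60)
```

The point of the sketch is the shape of the decision, not the numbers: cheap-to-generate endpoints like login and search get tight per-client budgets at the edge so the flood never reaches your database.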
CDN and provider engagement
Most production sites in 2024 already sit behind Cloudflare, Akamai, Fastly, or CloudFront. Your first action is to call your provider's emergency line (you do have the number memorised, do you not?) and confirm they see the attack and are applying their default mitigations. Cloudflare's Under Attack mode, Akamai Kona Site Defender's adaptive rate controls, and AWS Shield Response Team engagement all exist for this moment. Use them.
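As one concrete example of "use them": flipping a Cloudflare zone into Under Attack mode can be scripted against the zone settings API rather than clicked through the dashboard. The sketch below assumes an API token with permission to edit zone settings; verify the endpoint against current Cloudflare documentation before relying on it in a runbook.

```python
# Hedged sketch: switch a Cloudflare zone's security level to "under_attack"
# via the zone settings API. ZONE_ID and the token are placeholders you supply.
import os
import requests

ZONE_ID = os.environ["CF_ZONE_ID"]        # placeholder: your zone ID
API_TOKEN = os.environ["CF_API_TOKEN"]    # placeholder: token with zone-settings edit rights

resp = requests.patch(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/settings/security_level",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"value": "under_attack"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```

Whatever provider you use, the same principle applies: the emergency action should be a script or a documented API call, not a hunt through a console you last opened a year ago.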
If you are not behind a CDN, getting behind one during an attack is harder than it sounds because DNS propagation matters. Pre-positioning a CDN with low-TTL records for emergency cutover is the cheapest insurance policy you can buy, and most teams skip it until the first incident.
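A small audit you can run today: check what TTLs your critical records actually carry, because a 24-hour TTL discovered mid-attack means a 24-hour cutover. The sketch below uses the third-party dnspython package (pip install dnspython), and the hostnames are placeholders.

```python
# Audit current record TTLs ahead of any emergency CDN cutover.
import dns.resolver

HOSTNAMES = ["www.example.com", "api.example.com"]   # placeholders: your records

for name in HOSTNAMES:
    for rtype in ("A", "AAAA", "CNAME"):
        try:
            answer = dns.resolver.resolve(name, rtype)
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            continue
        values = [r.to_text() for r in answer]
        print(f"{name} {rtype}: TTL {answer.rrset.ttl}s -> {values}")
```

Note this reports the TTL as seen by your resolver; confirm the configured value at your authoritative DNS provider before trusting it for cutover planning.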
BGP blackhole and RTBH
BGP-level blackhole routing (RTBH) is a last resort. You announce the targeted prefix to your upstream with a community tag that says discard everything to this address, and the attack traffic gets dropped before it reaches you. The cost is that you also drop legitimate traffic, so it is appropriate only when you are protecting a wider IP range and willing to sacrifice the targeted address. Most enterprises blackhole only the specific IPs of the attacked service, not entire blocks.
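Mechanically the announcement itself is small. The sketch below assumes an ExaBGP-style setup where a helper process writes announce commands to stdout for ExaBGP to originate; the prefix, next-hop, and the RFC 7999 BLACKHOLE community 65535:666 are placeholders, because your upstream dictates the exact community (and prefix length) it will honour, and the session and filters have to be agreed with them long before the incident.

```python
# Hedged sketch of triggering RTBH via an ExaBGP API process. ExaBGP reads
# announce/withdraw commands from this script's stdout. All values below are
# placeholders; coordinate the real community and prefix with your upstream.
import sys
import time

TARGET = "203.0.113.10/32"    # placeholder: the attacked address
NEXT_HOP = "192.0.2.1"        # placeholder: discard next-hop agreed with upstream
COMMUNITY = "[65535:666]"     # placeholder: RFC 7999 BLACKHOLE, or your upstream's value

sys.stdout.write(f"announce route {TARGET} next-hop {NEXT_HOP} community {COMMUNITY}\n")
sys.stdout.flush()

# Keep the process alive so ExaBGP holds the announcement until it is withdrawn.
while True:
    time.sleep(60)
```

The script is the easy part; the hard part is the pre-arranged agreement with your upstream about which communities they accept and from which prefixes, which is why RTBH belongs in the runbook as a pre-negotiated option, not an improvisation.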
The comms plan is the half no one practises
While engineering is mitigating, someone has to talk to customers, executives, and possibly press. Pre-write three statements: a status page update for early uncertainty, a customer email for confirmed DDoS lasting more than an hour, and an executive talking-points doc that explains what DDoS is and is not (it is not a breach, no data is at risk, the issue is availability). If you do not have these drafts written today, write them today. You will not write them well at 02:30.
Post-incident review that learns something
After the attack, ask three questions. What was the actual capacity ceiling we hit (Mbps, packets per second, requests per second)? Which mitigation worked and which did not? What architectural change would let us absorb 5x this attack without engaging emergency processes? The answers usually look like: enable Cloudflare bot management properly, add rate limits to /api/v1/login and /search at the edge, and pre-position a second CDN. None of those are exciting. All of them mean the next attack of the same size becomes a non-event.