BIPI

SOC Metrics That Don't Lie: MTTD, MTTR, and Coverage Quality

Cybersecurity

Most SOC dashboards report metrics that look impressive and mean nothing. MTTD and MTTR can be gamed in five minutes. Coverage percentages can be inflated with mapping tricks. A practical guide to metrics that survive scrutiny and reflect real capability.

By Arjun Raghavan, Security & Systems Lead, BIPI · September 29, 2023 · 10 min read

#soc-metrics #mttd #mttr #kpis #detection-engineering #reporting

Mean time to detect and mean time to respond are the most quoted SOC metrics in the industry. They are also among the easiest to manipulate. Close tickets faster and MTTR drops. Mark alerts as benign without investigation and MTTD looks instant. Both can improve by 50 percent in a quarter without a single capability change. A metric that can be gamed without improving the underlying system is not a metric, it is theater.

What MTTD Actually Means

Time to detect is the gap between when the malicious activity occurred and when the SOC issued an alert that was eventually correlated to it. The crucial word is eventually. An alert that fires within a minute on a related but ambiguous signal does not count if the analyst routed it to benign and the actual intrusion was caught three weeks later by a different rule. Four rules keep the measurement honest; a calculation sketch follows the list.

  • Measure from earliest evidence in telemetry, not from the SIEM ingestion timestamp
  • Measure to the alert that led to confirmed investigation, not to any alert on the same entity
  • Exclude alerts that were dispositioned as false positive even if later proven true, because the SOC did not actually detect the intrusion at that point
  • Report median and 90th percentile, not just mean, because outliers dominate the mean and hide systemic issues
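
A minimal sketch of how these rules turn into a number, assuming incident records that already carry an earliest-evidence timestamp from forensic review and the timestamp of the alert that led to the confirmed investigation. All field names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median, quantiles

@dataclass
class Incident:
    earliest_evidence: datetime   # first malicious activity visible in telemetry
    confirming_alert: datetime    # the alert that led to the confirmed investigation
    detected_by_soc: bool         # False if the confirming alert was first routed to benign

def mttd_stats(incidents):
    # Rule 3: alerts the SOC dispositioned as false positive do not count
    # as detections, even if they later proved true.
    detected = [i for i in incidents if i.detected_by_soc]
    hours = [
        (i.confirming_alert - i.earliest_evidence).total_seconds() / 3600
        for i in detected
    ]
    # Rule 4: median and 90th percentile, because outliers dominate the mean.
    return {
        "median_hours": round(median(hours), 1),
        "p90_hours": round(quantiles(hours, n=10)[-1], 1),  # 90th percentile
        "incidents_counted": len(hours),
    }
```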

What MTTR Actually Means

Time to respond should mean time from alert to containment of the threat actor, not time from alert to ticket closure. A ticket closed because the analyst marked it benign is not a response, it is a disposition. Real MTTR requires defining stages: alert to triage start, triage to investigation start, investigation to containment, containment to recovery. Each stage has its own median and 90th percentile.
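
A sketch of the stage breakdown as a calculation, assuming each incident record carries a timestamp for every stage boundary. The field names and dict shape are assumptions, not a prescribed schema:

```python
from statistics import median, quantiles

# Stage boundaries in order; each maps to a datetime on the incident record.
STAGES = ["alert", "triage_start", "investigation_start", "containment", "recovery"]

def stage_report(incidents):
    """Median and 90th percentile per stage, never one blended number."""
    report = {}
    for start, end in zip(STAGES, STAGES[1:]):
        hours = [
            (i[end] - i[start]).total_seconds() / 3600
            for i in incidents
        ]
        report[f"{start}->{end}"] = {
            "median_hours": round(median(hours), 1),
            "p90_hours": round(quantiles(hours, n=10)[-1], 1),
        }
    return report
```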

The Disposition Problem

Every metric that depends on analyst disposition is corruptible. An overworked analyst will dispose of alerts as benign to clear the queue. An under-measured team will dispose of alerts as duplicate to avoid investigation. The fix is sampling: a second analyst reviews 5 to 10 percent of dispositions weekly and rates them against the original call. The discrepancy rate becomes a metric in its own right.
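
The mechanic is simple enough to sketch. The 5 to 10 percent sample rate comes from the text; the record fields and verdict values are assumptions:

```python
import random

def sample_for_review(dispositions, rate=0.07, seed=None):
    """Pull a random 5-10 percent of the week's dispositions for peer review."""
    if not dispositions:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(dispositions) * rate))
    return rng.sample(dispositions, k)

def discrepancy_rate(reviews):
    """Fraction of sampled dispositions where the second analyst disagreed
    with the original call. This becomes a metric in its own right."""
    if not reviews:
        return 0.0
    disagreed = sum(
        1 for r in reviews if r["reviewer_verdict"] != r["original_verdict"]
    )
    return disagreed / len(reviews)
```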

Quality Metrics That Resist Gaming

  1. False positive rate per rule over a rolling 90 day window, with a publicly tracked target by rule (sketched in code after this list)
  2. Disposition discrepancy rate from peer review samples, reported per analyst and per shift
  3. Atomic test pass rate by ATT&CK technique, with a target of above 70 percent on prioritized techniques
  4. Time from rule deployment to first true positive, indicating how long unvalidated rules sit in production
  5. Mean dwell time of incidents in real intrusions, not lab tests, calculated from forensic timeline reconstruction
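
To make the first item concrete, a sketch of per-rule false positive rate over a rolling 90 day window, assuming each alert record carries a rule ID, a firing timestamp, and a final disposition (field names are illustrative):

```python
from collections import defaultdict
from datetime import timedelta

def fp_rate_per_rule(alerts, as_of, window_days=90):
    """False positive rate per rule over the trailing window.
    Each alert: {'rule_id': str, 'fired_at': datetime, 'disposition': str}."""
    cutoff = as_of - timedelta(days=window_days)
    counts = defaultdict(lambda: {"fp": 0, "total": 0})
    for alert in alerts:
        if alert["fired_at"] < cutoff:
            continue
        c = counts[alert["rule_id"]]
        c["total"] += 1
        if alert["disposition"] == "false_positive":
            c["fp"] += 1
    # Compare each rule's rate against its publicly tracked target.
    return {rule: c["fp"] / c["total"] for rule, c in counts.items()}
```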

Operational Health Metrics

Alongside detection and response metrics, track the operational metrics that predict whether the SOC can function next quarter: analyst attrition rate, alert volume per analyst per shift, percentage of shift time spent on triage versus hunting versus engineering, and on-call burden distribution. A SOC with great MTTR and 40 percent annual attrition is not a healthy SOC, it is a treadmill. Most of these reduce to simple arithmetic, as in the sketch below.
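
A sketch of that arithmetic, with all function names and inputs assumed:

```python
def alerts_per_analyst_shift(weekly_alerts, analysts_per_shift, shifts_per_week):
    """Average triage load landing on one analyst in one shift."""
    return weekly_alerts / (analysts_per_shift * shifts_per_week)

def annualized_attrition(leavers_last_12_months, average_headcount):
    """Attrition rate over the trailing year."""
    return leavers_last_12_months / average_headcount

def shift_time_split(hours_by_activity):
    """Percentage of time by activity, e.g.
    {'triage': 28, 'hunting': 4, 'engineering': 8} hours per week."""
    total = sum(hours_by_activity.values())
    return {k: round(100 * v / total, 1) for k, v in hours_by_activity.items()}
```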

Reporting to Leadership

  • One headline metric per quarter, framed as a trend rather than a snapshot (the skeleton after this list pins the shape down)
  • Three supporting metrics that explain the headline, including one operational health number
  • One story: an incident that illustrates a capability gain or gap, with the underlying metric change
  • An honest section on what is getting worse, because every program has a worsening trend somewhere
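
Pinning the structure down as data keeps every quarter's report the same shape. A skeleton; every value below is a placeholder, not a recommendation:

```python
quarterly_report = {
    "headline": {
        "metric": "median time to containment (hours)",
        "trend": {"Q1": 9.5, "Q2": 8.2, "Q3": 6.8},  # a trend, not a snapshot
    },
    "supporting": {
        "p90_time_to_containment_hours": 30.0,
        "disposition_discrepancy_rate": 0.06,
        "annualized_analyst_attrition": 0.18,  # the operational health number
    },
    "story": "One incident that illustrates a capability gain or gap, "
             "tied to the metric change behind it.",
    "getting_worse": "The honest section: name the worsening trend.",
}
```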

What Not to Report

Total alert count by itself is meaningless. A SOC with 10,000 weekly alerts could be excellent or terrible depending on signal quality. Total rules deployed is meaningless without coverage validation. Percentage of incidents closed within SLA is meaningless if the SLA is generous and the disposition discipline is loose. Pick metrics that hurt to game, not metrics that look good without effort.

Every SOC metric should answer a leadership question that matters. If a metric improves and the underlying capability has not changed, you are measuring effort, not outcome.

The Annual Recalibration

Once a year, look at the metrics that were prioritized 12 months ago and ask which ones drove real behavior change. Drop the ones that became theater. Add new ones for the capabilities you are now investing in. A metrics program that does not change with the SOC is a metrics program that the SOC has stopped taking seriously, even if leadership has not noticed yet.

Read more field notes, explore our services, or get in touch at info@bipi.in.