CI/CD Observability: Beyond Pass/Fail to the Metrics That Matter
Digital Engineering
Pipeline pass/fail tells you almost nothing useful. Time-to-merge, flaky test rate, p95 build duration, and deployment frequency are where the real signal lives. Here is what to measure and what to do with it.
By Arjun Raghavan, Security & Systems Lead, BIPI · July 25, 2024 · 7 min read
Your CI dashboard shows green. Your engineers are unhappy with the build. Both can be true. We audited a fintech client whose tracked metrics looked healthy: a 96 percent CI pass rate and 12 deploys a week. Their engineers described the build as 'hellish.' The metrics captured the pipeline's output, not the developers' experience of it.
The dashboards that match developer reality measure four things: time-to-merge, flaky test rate, build duration distribution, and deployment frequency. Pass/fail is the least informative number on the dashboard.
Time-to-merge is the developer experience metric
From PR open to PR merged. This is the wall-clock number that determines how fast your team actually ships, and it is almost never what teams measure. CI duration is part of it. Review queue is part of it. Re-runs after flaky failures are part of it. Merge conflicts forcing rebases are part of it.
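Measuring this takes nothing more than PR open and merge timestamps (available from any Git host's API). A minimal sketch, with hypothetical PR records standing in for real data:

```python
from datetime import datetime
from statistics import median

# Hypothetical PR records: (opened_at, merged_at) timestamps.
prs = [
    (datetime(2024, 7, 1, 9, 0), datetime(2024, 7, 1, 15, 30)),
    (datetime(2024, 7, 1, 10, 0), datetime(2024, 7, 3, 11, 0)),
    (datetime(2024, 7, 2, 14, 0), datetime(2024, 7, 2, 18, 45)),
]

def p50_time_to_merge_hours(prs):
    """Median wall-clock hours from PR open to PR merge."""
    durations = [(merged - opened).total_seconds() / 3600
                 for opened, merged in prs]
    return median(durations)

print(f"p50 time-to-merge: {p50_time_to_merge_hours(prs):.1f}h")
```

Because this is wall-clock time, it automatically folds in review latency, flaky re-runs, and rebase churn, not just CI duration.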
A rough benchmark we use: if your p50 time-to-merge is over a day, your team is shipping twice a week, not daily. The fix is rarely 'faster CI.' It is usually a review queue problem, a flaky test problem, or a merge conflict problem caused by long-lived branches.
Flaky test rate: the silent productivity killer
A 0.1 percent flake rate per test, on a CI suite with 200 independent tests, gives you an 18 percent chance of a failed CI run that has nothing to do with the PR. At 500 tests, it is 39 percent. Engineers re-run, the build is now 20 minutes longer, trust in CI degrades, and people start merging on yellow because 'it is probably flaky.'
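The compounding is plain independence math: with a per-test flake rate q and n independent tests, the chance of a spurious red run is 1 − (1 − q)^n. A sketch:

```python
def suite_flake_probability(per_test_flake_rate, n_tests):
    """Chance that at least one independently flaky test fails a CI run."""
    return 1 - (1 - per_test_flake_rate) ** n_tests

# A 0.1 percent per-test flake rate compounds quickly with suite size.
print(f"{suite_flake_probability(0.001, 200):.0%}")  # 18%
print(f"{suite_flake_probability(0.001, 500):.0%}")  # 39%
```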
The metric: percentage of CI runs where a re-run on the same commit produces a different result. We track this per test and per PR. The actions:
- Auto-quarantine tests with flake rate over 2 percent (BuildBuddy and Trunk both do this; CircleCI Test Insights too)
- Block merging tests with no owner assigned
- Weekly review of the top 10 flaky tests with named owners and a fix-by date
- Reject PRs that introduce new flaky tests (CI runs the new test 5 times before allowing merge)
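The detection logic behind the metric is straightforward: group runs by (test, commit) and flag any pair where the same commit produced both a pass and a fail. A minimal sketch with hypothetical run history (the test names and 2 percent threshold are illustrative):

```python
from collections import defaultdict

# Hypothetical run history: (test_name, commit_sha, passed) tuples.
runs = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),   # same commit, different result: flaky
    ("test_login", "def456", True),
    ("test_checkout", "abc123", True),
    ("test_checkout", "def456", True),
]

def flake_rates(runs):
    """Per-test fraction of commits where re-runs disagreed."""
    by_test_commit = defaultdict(set)
    for test, sha, passed in runs:
        by_test_commit[(test, sha)].add(passed)
    flaky, total = defaultdict(int), defaultdict(int)
    for (test, sha), results in by_test_commit.items():
        total[test] += 1
        if len(results) > 1:           # both pass and fail on one commit
            flaky[test] += 1
    return {t: flaky[t] / total[t] for t in total}

QUARANTINE_THRESHOLD = 0.02
rates = flake_rates(runs)
to_quarantine = [t for t, r in rates.items() if r > QUARANTINE_THRESHOLD]
print(rates, to_quarantine)
```

This is the same signal the commercial tools compute; rolling your own is viable if your CI already archives per-test results.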
Build duration: p95 matters more than average
Average build time is misleading. The metric that matches the developer experience is p95: the build time that 5 percent of builds exceed. If your average is 7 minutes and your p95 is 22 minutes, your team feels like CI takes 22 minutes.
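Computing p95 from raw build durations is a few lines; a nearest-rank sketch over hypothetical per-build minutes:

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value covering p% of the data."""
    ordered = sorted(values)
    k = max(0, -(-p * len(ordered) // 100) - 1)  # ceil(p/100 * n) - 1
    return ordered[int(k)]

# Hypothetical build durations in minutes for one week of CI runs.
builds = [5, 6, 6, 7, 7, 7, 8, 8, 9, 10, 11, 12, 14, 18, 22, 25, 31, 40, 6, 7]
print(f"mean: {sum(builds) / len(builds):.1f}m, p95: {percentile(builds, 95)}m")
```

Note how a handful of slow outliers leaves the mean looking fine while p95 tells the real story.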
What contributes to the p95 tail:
- Cache misses on dependency installation (npm install on a cold cache)
- Test parallelism collapsing on the slowest worker
- Docker image pulls on cold runners
- External service flakiness (npm registry, container registry, etc.)
- Large PRs that affect many test groups
Instrument it by emitting an OpenTelemetry span for each major phase of the pipeline: dependency install, build, test (parallelized), package, and deploy. The bottleneck is rarely where you think; profile it.
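The shape of that instrumentation is a timed context per phase. A minimal sketch using a stdlib timer as a stand-in for real OpenTelemetry spans (the phase names and sleeps are illustrative placeholders for actual pipeline steps):

```python
import time
from contextlib import contextmanager

phase_durations = {}

@contextmanager
def span(name):
    """Stand-in for an OpenTelemetry span: records wall-clock per phase."""
    start = time.monotonic()
    try:
        yield
    finally:
        phase_durations[name] = time.monotonic() - start

# Illustrative pipeline phases; replace the sleeps with real work.
with span("deps_install"):
    time.sleep(0.02)
with span("build"):
    time.sleep(0.01)
with span("test"):
    time.sleep(0.03)

slowest = max(phase_durations, key=phase_durations.get)
print(f"slowest phase: {slowest}")
```

With real OpenTelemetry, the same structure becomes `tracer.start_as_current_span(name)` and the spans land in whatever backend already holds your telemetry.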
Deployment frequency: the DORA metric that holds up
How often do you deploy to production? Daily is good. Multiple times a day is better. Weekly is a sign of friction somewhere upstream. The teams that ship multiple times a day have all four: short-lived branches, fast CI, a low flake rate, and a deployment process that is genuinely automated.
Companion metric: change failure rate, the percentage of deploys that require a rollback or hotfix. Healthy is under 15 percent. Above 30 percent and your testing is not catching things. Below 5 percent might mean you are over-testing and slowing yourself down.
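Both numbers fall out of a simple deploy log. A sketch over hypothetical records of one week's deploys:

```python
from datetime import date

# Hypothetical deploy log: (deploy_date, needed_rollback_or_hotfix).
deploys = [
    (date(2024, 7, 1), False),
    (date(2024, 7, 1), False),
    (date(2024, 7, 2), True),
    (date(2024, 7, 3), False),
    (date(2024, 7, 5), False),
]

def weekly_frequency(deploys, weeks=1):
    """Deploys per week over the observed window."""
    return len(deploys) / weeks

def change_failure_rate(deploys):
    """Fraction of deploys that needed a rollback or hotfix."""
    failures = sum(1 for _, failed in deploys if failed)
    return failures / len(deploys)

print(f"{weekly_frequency(deploys):.0f} deploys/week, "
      f"CFR {change_failure_rate(deploys):.0%}")
```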
Tools that are actually useful
What we have used on client engagements:
- Trunk: flaky test detection and quarantining, language-agnostic
- BuildBuddy: distributed build cache and remote execution for Bazel, plus build observability for other systems
- Datadog CI Visibility: pipeline-level traces, integrates with existing Datadog telemetry
- GitHub Actions usage metrics: built-in, basic, free
- Honeycomb pipeline tracing: best for teams already using Honeycomb
Pick one, instrument the four metrics that matter, and review them weekly. The choice of tool matters less than the discipline of looking at the numbers and acting on them. We have seen teams with $40K/year in CI tooling and no idea what their flaky test rate was. The dashboard you do not look at is worse than no dashboard.
CI is one of the most leveraged systems in your engineering org: every developer pays its cost, every day. Treat it as a product. Measure the right things. The 96 percent pass rate is meaningless if your time-to-merge is two days.
Read more field notes, explore our services, or get in touch at info@bipi.in.