
How to Catch Broken Releases Before Users Report Them

A practical guide to using error tracking, release tagging, and automated alerts to detect broken deploys within minutes — not after user complaints.

The Friday Deploy Disaster

You deploy at 4 PM on Friday. Everything looks fine. You go home. At 9 PM, your inbox has 47 user complaints about a broken checkout flow. The error has been happening for 5 hours, affecting every transaction. You missed it because your error tracking dashboard only shows an aggregate error count — the new errors were lost in the existing noise.

This scenario is preventable. Here's how.

Why Bad Deploys Slip Through

Most teams detect broken releases through one of these channels:

  1. User reports — slowest and most embarrassing
  2. Manual checking — "let me click around after deploying" (unreliable)
  3. CI/CD tests — catch code bugs, not runtime issues
  4. Error monitoring — should catch it, but often doesn't

Error monitoring fails to catch bad deploys when:

  • The new errors blend into existing error noise
  • Alert rules trigger on new error types but not on error rate spikes
  • There's no release tagging, so you can't filter by deploy version
  • The dashboard shows aggregate numbers, not per-release comparisons

Release Tagging: The Foundation

Release tagging attaches a version identifier to every error event. This lets you answer: "did this error exist before deploy v2.3.1?"

Bugsly.init({
  dsn: "YOUR_DSN",
  release: "my-app@2.3.1", // Tag with your version
  environment: "production",
});
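
In practice you rarely hard-code the version string. A common pattern is to inject it at deploy time from your CI environment — a minimal sketch, assuming GitHub Actions, which exposes the commit SHA as GITHUB_SHA (adjust the variable for your CI provider):

// Derive the release identifier from the CI environment instead of hard-coding it.
// GITHUB_SHA is GitHub Actions' built-in commit variable; other CIs use different names.
const sha = process.env.GITHUB_SHA;
const release = sha ? `my-app@${sha.slice(0, 7)}` : "my-app@dev";

Bugsly.init({
  dsn: "YOUR_DSN",
  release,
  environment: process.env.NODE_ENV || "production",
});

Now every error event carries the exact commit that produced it, which is what makes the per-release comparisons below possible.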

With release tags, you can:

  • Filter errors by release version
  • See which release introduced a new error
  • Compare error rates between releases
  • Identify regressions automatically

The 3-Layer Detection System

Layer 1: Error Rate Spike Alert

Set up an alert that fires when the error rate increases significantly:

Condition: Error rate > 200% of baseline
Window: 15 minutes
Action: Slack alert to #deploys channel

This catches the most common failure mode: a deploy introduces a bug that affects many requests. A 200% threshold filters out normal fluctuation while catching real problems.
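
Most error trackers can express this rule in their alerting UI. If yours can't, the logic is simple enough to run yourself on a schedule — a minimal sketch, assuming a hypothetical fetchErrorCount() helper that queries your tracker's API and a standard Slack incoming webhook:

// Compare the last 15 minutes of errors against a rolling baseline.
// fetchErrorCount() is a hypothetical helper that queries your error tracker's API.
async function checkErrorSpike() {
  const current = await fetchErrorCount({ windowMinutes: 15 });
  // Average 15-minute error count over the previous 6 hours (24 windows).
  const baseline = (await fetchErrorCount({ windowMinutes: 360 })) / 24;

  if (baseline > 0 && current > baseline * 2) { // > 200% of baseline
    await fetch(process.env.SLACK_WEBHOOK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        text: `Error rate spike: ${current} errors in 15 min (baseline ~${Math.round(baseline)})`,
      }),
    });
  }
}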

Layer 2: New Error Type Alert

Condition: New error type appears
Threshold: > 10 events in 30 minutes
Action: Slack alert

This catches new bugs from the deploy — errors that literally didn't exist before. The threshold of 10 events prevents alerting on one-off edge cases.
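
To make "new error type" concrete: an error group is new when its fingerprint wasn't present in the previous release. A sketch of that check, again using a hypothetical fetchErrorGroups() helper:

// An error group is "new" if its fingerprint wasn't seen in the previous release.
// fetchErrorGroups() is a hypothetical helper returning { fingerprint, count } objects.
async function findNewErrors(currentRelease, previousRelease) {
  const known = new Set(
    (await fetchErrorGroups({ release: previousRelease })).map((g) => g.fingerprint)
  );
  const current = await fetchErrorGroups({ release: currentRelease, windowMinutes: 30 });

  // Only surface errors that have crossed the noise threshold (10 events in 30 minutes).
  return current.filter((g) => !known.has(g.fingerprint) && g.count >= 10);
}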

Layer 3: Post-Deploy Health Check (Manual, 5 Minutes)

After every deploy, spend 5 minutes checking your error dashboard:

  1. Check the health indicator — is it still green?
  2. Filter by latest release — any new errors from this version?
  3. Check the event trend — any spike in the last 10 minutes?
  4. Run a smoke test — hit 3-5 critical endpoints manually

This takes 5 minutes and catches issues that automated alerts might miss (like a subtle performance degradation).
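
The smoke test in step 4 is worth scripting so it runs the same way after every deploy. A minimal sketch using Node's built-in fetch — the base URL and paths are placeholders for your own critical flows:

// Hit a handful of critical endpoints and exit non-zero if any of them fail.
// Replace BASE_URL and ENDPOINTS with your app's real critical flows.
const BASE_URL = "https://example.com";
const ENDPOINTS = ["/healthz", "/api/login", "/api/checkout"];

async function smokeTest() {
  for (const path of ENDPOINTS) {
    const res = await fetch(BASE_URL + path);
    if (!res.ok) {
      console.error(`FAIL ${path}: HTTP ${res.status}`);
      process.exitCode = 1;
    } else {
      console.log(`OK   ${path}`);
    }
  }
}

smokeTest();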

The Post-Deploy Checklist

Here's a practical checklist to use after every production deploy:

  • [ ] Deploy completed successfully (CI green)
  • [ ] Error dashboard shows no new critical errors (2-minute check)
  • [ ] Error rate is within normal range (no spike in last 10 minutes)
  • [ ] Critical user flows work (login, checkout, main feature — 3-minute smoke test)
  • [ ] Alert channel is quiet (no new alerts in 5 minutes post-deploy)

If any check fails: roll back immediately, investigate later.

The instinct is to diagnose the problem and push a fix. Resist it. Rolling back takes 2 minutes. Diagnosing and fixing might take 2 hours. Your users shouldn't wait.

The Culture Shift

The biggest change isn't technical — it's cultural. Deploy-and-forget has to become deploy-and-verify. This means:

  1. The person who deploys is responsible for the 5-minute health check — no exceptions
  2. Deploy windows matter — deploying at 4 PM Friday is inherently riskier than 10 AM Tuesday
  3. Rollback is not failure — catching a broken release quickly is a success, not a mistake

Tools That Help

Your error tracking tool should:

  • Support release tagging (most do)
  • Show error rate trends (not just totals)
  • Alert on rate spikes (not just new error types)
  • Have a health indicator you can check in 5 seconds

Bugsly's dashboard shows a green/yellow/red health badge and surfaces the most frequent unresolved errors. Combined with release tagging and spike alerts, you can detect most broken deploys within 5 minutes.

The Math

Average time to detect a broken release:

  • Via user complaints: 2-8 hours
  • Via periodic dashboard checking: 30-60 minutes
  • Via spike alerts + post-deploy check: 5-15 minutes

The difference between 5 minutes and 5 hours is the difference between affecting 100 users and affecting 10,000 users. The setup takes 15 minutes. The ROI is immediate.
