The Friday Deploy Disaster
You deploy at 4 PM on Friday. Everything looks fine. You go home. At 9 PM, your inbox has 47 user complaints about a broken checkout flow. The error has been happening for 5 hours, affecting every transaction. You missed it because your error tracking dashboard showed roughly the same total error count as always — the new errors were lost in the noise.
This scenario is preventable. Here's how.
Why Bad Deploys Slip Through
Most teams detect broken releases through one of these channels:
- User reports — slowest and most embarrassing
- Manual checking — "let me click around after deploying" (unreliable)
- CI/CD tests — catch code bugs, not runtime issues
- Error monitoring — should catch it, but often doesn't
Error monitoring fails to catch bad deploys when:
- The new errors blend into existing error noise
- Alert rules trigger on *new error types* but not on *error rate spikes*
- There's no release tagging, so you can't filter by deploy version
- The dashboard shows aggregate numbers, not per-release comparisons
Release Tagging: The Foundation
Release tagging attaches a version identifier to every error event. This lets you answer: "did this error exist before deploy v2.3.1?"
```js
Bugsly.init({
  dsn: "YOUR_DSN",
  release: "my-app@2.3.1", // Tag with your version
  environment: "production",
});
```

With release tags, you can:
- Filter errors by release version
- See which release introduced a new error
- Compare error rates between releases
- Identify regressions automatically
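In practice you rarely want to hard-code the version string as in the snippet above. A common pattern is to inject the release identifier at build time, for example from the git commit SHA. Here's a minimal sketch, assuming a BUGSLY_RELEASE environment variable set by your CI pipeline (the variable name is an example, not a Bugsly convention):

```js
// Release identifier injected at build/deploy time, e.g. in CI:
//   export BUGSLY_RELEASE="my-app@$(git rev-parse --short HEAD)"
// Falls back to a placeholder so local development still initializes.
const release = process.env.BUGSLY_RELEASE || "my-app@dev";

Bugsly.init({
  dsn: process.env.BUGSLY_DSN,
  release,
  environment: process.env.NODE_ENV || "production",
});
```

Setting the release in one place like this means every error event, on every deploy, is tagged without anyone having to remember to bump a string.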
The 3-Layer Detection System
Layer 1: Error Rate Spike Alert
Set up an alert that fires when the error rate increases significantly:
```
Condition: Error rate > 200% of baseline
Window: 15 minutes
Action: Slack alert to #deploys channel
```

This catches the most common failure mode: a deploy introduces a bug that affects many requests. A 200% threshold filters out normal fluctuation while catching real problems.
Layer 2: New Error Type Alert
```
Condition: New error type appears
Threshold: > 10 events in 30 minutes
Action: Slack alert
```

This catches new bugs from the deploy — errors that literally didn't exist before. The threshold of 10 events prevents alerting on one-off edge cases.
Layer 3: Post-Deploy Health Check (Manual, 5 Minutes)
After every deploy, spend 5 minutes checking your error dashboard:
- Check the health indicator — is it still green?
- Filter by latest release — any new errors from this version?
- Check the event trend — any spike in the last 10 minutes?
- Run a smoke test — hit 3-5 critical endpoints manually
This takes 5 minutes and catches issues that automated alerts might miss (like a subtle performance degradation).
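The smoke test in step 4 is worth scripting so it takes seconds rather than minutes. A minimal sketch — the endpoint list is illustrative, so swap in your own critical paths:

```js
// Hit a handful of critical endpoints and fail loudly on any non-2xx response.
const BASE_URL = process.env.BASE_URL || "https://app.example.com";
const CRITICAL_PATHS = ["/health", "/login", "/api/checkout/status"];

async function smokeTest() {
  for (const path of CRITICAL_PATHS) {
    const res = await fetch(`${BASE_URL}${path}`);
    console.log(`${path}: ${res.ok ? "OK" : `FAILED (${res.status})`}`);
    if (!res.ok) process.exitCode = 1; // non-zero exit so a pipeline step can gate on it
  }
}

smokeTest();
```

Wire it into your deploy pipeline as a post-deploy step, and a failing endpoint blocks the "all clear" instead of waiting for a human to notice.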
The Post-Deploy Checklist
Here's a practical checklist to use after every production deploy:
- [ ] Deploy completed successfully (CI green)
- [ ] Error dashboard shows no new critical errors (2-minute check)
- [ ] Error rate is within normal range (no spike in last 10 minutes)
- [ ] Critical user flows work (login, checkout, main feature — 3-minute smoke test)
- [ ] Alert channel is quiet (no new alerts in 5 minutes post-deploy)
If any check fails: roll back immediately, investigate later.
The instinct is to diagnose the problem and push a fix. Resist it. Rolling back takes 2 minutes. Diagnosing and fixing might take 2 hours. Your users shouldn't wait.
The Culture Shift
The biggest change isn't technical — it's cultural. Deploy-and-forget has to become deploy-and-verify. This means:
- The person who deploys is responsible for the 5-minute health check — no exceptions
- Deploy windows matter — deploying at 4 PM Friday is inherently riskier than 10 AM Tuesday
- Rollback is not failure — catching a broken release quickly is a success, not a mistake
Tools That Help
Your error tracking tool should:
- Support release tagging (most do)
- Show error rate trends (not just totals)
- Alert on rate spikes (not just new error types)
- Have a health indicator you can check in 5 seconds
Bugsly's dashboard shows a green/yellow/red health badge and surfaces the most frequent unresolved errors. Combined with release tagging and spike alerts, you can detect most broken deploys within 5 minutes.
The Math
Average time to detect a broken release:
- Via user complaints: 2-8 hours
- Via periodic dashboard checking: 30-60 minutes
- Via spike alerts + post-deploy check: 5-15 minutes
The difference between 5 minutes and 5 hours is the difference between affecting 100 users and affecting 10,000 users. The setup takes 15 minutes. The ROI is immediate.
Try Bugsly Free
AI-powered error tracking that explains your bugs. Set up in 2 minutes, free forever for small projects.