What Is On-Call for Software Engineers?
On-call means being available to respond to production incidents outside normal working hours. It's a shared responsibility in engineering teams that run production services.
How On-Call Works
A typical on-call setup:
- Rotation — engineers take turns being on-call (usually weekly)
- Alert — monitoring systems detect an issue and page the on-call engineer
- Acknowledge — the engineer acknowledges within a time window (usually 5-15 minutes)
- Respond — investigate, mitigate, and resolve the issue
- Document — write an incident report for the team
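The acknowledge step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 10-minute window (picked from the 5-15 minute range above) and a backup engineer who is paged once the window lapses. `Page`, `who_to_page`, and `ACK_WINDOW` are hypothetical names, not the API of any paging tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed value; the typical range is 5-15 minutes.
ACK_WINDOW = timedelta(minutes=10)


@dataclass
class Page:
    sent_at: datetime
    acked_at: Optional[datetime] = None


def who_to_page(page: Page, now: datetime, primary: str, backup: str) -> Optional[str]:
    """Return who should be notified next, or None once the page is acknowledged."""
    if page.acked_at is not None:
        return None
    if now - page.sent_at >= ACK_WINDOW:
        # Window lapsed without an acknowledgement: escalate to the backup.
        return backup
    return primary
```

Real paging services handle this escalation for you; the sketch only shows the rule they apply.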
Setting Up Rotations
- Weekly rotations work for most teams
- Follow-the-sun rotations suit global teams: each region covers on-call during its own business hours
- Primary and secondary — a backup engineer is paged if the primary doesn't respond
- Handoff meetings — outgoing engineer briefs incoming engineer on active issues
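A weekly rotation with a primary and a secondary can be generated rather than maintained by hand. The sketch below is one possible policy, not a standard: the secondary is simply the next person in the list, so each week's secondary becomes the following week's primary. `weekly_rotation` and the engineer names are made up for illustration.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Yield (week_start, primary, secondary) for each week of the schedule."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Assumed policy: the secondary is next in line to be primary.
        secondary = engineers[(week + 1) % len(engineers)]
        yield start + timedelta(weeks=week), primary, secondary


if __name__ == "__main__":
    team = ["ana", "ben", "chi", "dev"]  # hypothetical team
    for week_start, primary, secondary in weekly_rotation(team, date(2024, 1, 1), 4):
        print(week_start, primary, secondary)
```

With four engineers, each person is primary one week in four. Under this policy the outgoing primary hands off to someone who was already secondary during the previous week.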
Reducing Alert Fatigue
Alert fatigue is one of the biggest on-call problems. When everything alerts, nothing gets attention, and real incidents get lost in the noise.
Good alerting principles:
- Alert on symptoms, not causes — alert when users are affected, not when CPU is high
- Every alert must be actionable — if you can't do anything about it, it shouldn't page you
- Tune thresholds — start with sensitive thresholds and relax them as you learn what normal looks like
- Group related alerts — one incident, one page (not 50 alerts for the same issue)
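The "one incident, one page" principle can be sketched as a grouping step that runs before anyone is paged. The example assumes alerts for the same service that fire within five minutes of each other belong to one incident. The grouping key, the window, and the names `Alert` and `group_alerts` are all assumptions, and real alerting tools offer their own grouping rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    service: str
    fired_at: datetime


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Collapse alerts into incidents.

    An alert joins the open incident for its service if it fired within
    `window` of that incident's latest alert; otherwise it starts a new one.
    """
    incidents = []
    open_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        current = open_by_service.get(alert.service)
        if current and alert.fired_at - current[-1].fired_at <= window:
            current.append(alert)
        else:
            current = [alert]
            open_by_service[alert.service] = current
            incidents.append(current)
    return incidents
```

Paging once per returned incident, instead of once per alert, turns a burst of 50 alerts from one failing service into a single page.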
Incident Response Process
- Acknowledge the alert
- Assess severity — how many users are affected?
- Communicate — update status page, notify stakeholders
- Mitigate — restore service first, root cause later
- Resolve — permanent fix
- Postmortem — blameless review of what happened
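The severity question ("how many users are affected?") is easier to answer at 3 a.m. if the mapping is written down in advance. A minimal sketch, with made-up thresholds and labels that each team would need to replace with its own; `assess_severity` is a hypothetical helper:

```python
def assess_severity(affected_users: int, total_users: int) -> str:
    """Map the share of affected users to a severity label.

    The 25% and 5% cut-offs are illustrative assumptions, not a standard.
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    share = affected_users / total_users
    if share >= 0.25:
        return "SEV1"
    if share >= 0.05:
        return "SEV2"
    if share > 0:
        return "SEV3"
    return "none"
```

User share is only one possible input. Teams often also weigh which feature is broken and whether data is at risk.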
Compensation
Healthy on-call practices include:
- Extra pay or time off for on-call shifts
- On-call no more than one week in four per engineer
- Incidents during off-hours count as work time
Making On-Call Better
- Invest in observability — good monitoring means faster resolution
- Runbooks — documented procedures for common incidents
- Error tracking — tools like Bugsly surface new errors before they become incidents, reducing the number of pages you receive
- Blameless culture — incidents are learning opportunities, not blame games
On-call is a necessary part of running reliable services. Done well, it makes teams stronger and services more resilient.