What Is On-Call for Software Engineers?

On-call means being available to respond to production incidents outside normal working hours. It's a shared responsibility in engineering teams that run production services.

How On-Call Works

A typical on-call workflow has five steps:

Rotation — engineers take turns being on-call (usually weekly)
Alert — monitoring systems detect an issue and page the on-call engineer
Acknowledge — the engineer acknowledges within a time window (usually 5-15 minutes)
Respond — investigate, mitigate, and resolve the issue
Document — write an incident report for the team
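
The acknowledge step is essentially a deadline check, which can be sketched in a few lines of Python. This is a minimal illustration, not how any particular paging tool works: the function name needs_escalation and the 15-minute default (the upper end of the range mentioned above) are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_escalation(
    paged_at: datetime,
    acked_at: Optional[datetime],
    now: datetime,
    ack_window: timedelta = timedelta(minutes=15),  # assumed default
) -> bool:
    """Return True if the page went unacknowledged past the ack window."""
    deadline = paged_at + ack_window
    # An acknowledgement at or before the deadline settles the page.
    if acked_at is not None and acked_at <= deadline:
        return False
    # Otherwise escalate only once the deadline has actually passed.
    return now > deadline
```

A scheduler could call a check like this every minute for each open page and notify a backup engineer when it returns True.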

Setting Up Rotations

Weekly rotations work for most teams
Follow-the-sun rotations suit global teams: each region is on call during its own business hours
Primary and secondary — backup engineer if primary doesn't respond
Handoff meetings — outgoing engineer briefs incoming engineer on active issues
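
A weekly rotation with a primary and a secondary can be computed from a start date and an ordered list of engineers. The sketch below is one possible scheme, assuming that shifts change every seven days from rotation_start and that the secondary is simply next week's primary. The function name on_call_pair and the example names are hypothetical.

```python
from datetime import date
from typing import List, Tuple


def on_call_pair(
    engineers: List[str], rotation_start: date, day: date
) -> Tuple[str, str]:
    """Return (primary, secondary) for the weekly shift containing `day`."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    # Whole weeks since the rotation began decide whose turn it is.
    weeks_elapsed = (day - rotation_start).days // 7
    primary = engineers[weeks_elapsed % len(engineers)]
    # Assumption: the secondary is the engineer who is primary next week.
    secondary = engineers[(weeks_elapsed + 1) % len(engineers)]
    return primary, secondary


team = ["ana", "ben", "chi", "dev"]
start = date(2024, 1, 1)  # a Monday
print(on_call_pair(team, start, date(2024, 1, 10)))  # second week of the rotation
```

Making the secondary next week's primary means the backup engineer already knows about active issues by the time the handoff happens, though other pairings work just as well.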

Reducing Alert Fatigue

Alert fatigue is the biggest on-call problem. When everything alerts, nothing gets attention.

Good alerting principles:

Alert on symptoms, not causes — alert when users are affected, not when CPU is high
Every alert must be actionable — if you can't do anything about it, it shouldn't page you
Tune thresholds — start strict, so that alerts fire early, then loosen them as you learn what normal patterns look like
Group related alerts — one incident, one page (not 50 alerts for the same issue)
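
Two of these principles translate directly into code: paging on a user-facing symptom (the request error rate) rather than a cause, and collapsing related alerts into a single page. The following is a rough sketch under stated assumptions. The 5% error-rate threshold, the 100-request minimum, the grouping key ("service"), and both function names are illustrative choices, not recommendations from any monitoring product.

```python
from collections import defaultdict
from typing import Dict, List


def should_page(
    total_requests: int,
    failed_requests: int,
    max_error_rate: float = 0.05,  # assumed threshold
    min_requests: int = 100,       # assumed minimum sample size
) -> bool:
    """Page on a symptom: the share of user requests that failed."""
    if total_requests < min_requests:
        return False  # too little traffic to judge reliably
    return failed_requests / total_requests > max_error_rate


def group_alerts(alerts: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Collapse alerts into one page per service."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        grouped[alert["service"]].append(alert["name"])
    return dict(grouped)
```

With grouping like this, three alerts across two services produce two pages instead of three, and the threshold parameters give a single place to tune as normal patterns become clear.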

Incident Response Process

Acknowledge the alert
Assess severity — how many users are affected?
Communicate — update status page, notify stakeholders
Mitigate — restore service first, root cause later
Resolve — permanent fix
Postmortem — blameless review of what happened
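
The severity question in the second step can be made concrete by mapping the share of affected users to a severity level. The sketch below is only an example: the SEV1/SEV2/SEV3 labels and the 50% and 5% cut-offs are hypothetical, and each team would set its own.

```python
def assess_severity(affected_users: int, total_users: int) -> str:
    """Map the fraction of affected users to a severity label."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    fraction = affected_users / total_users
    if fraction >= 0.5:   # assumed cut-off: half of all users or more
        return "SEV1"
    if fraction >= 0.05:  # assumed cut-off: at least 5% of users
        return "SEV2"
    return "SEV3"
```

Agreeing on the cut-offs ahead of time keeps the on-call engineer from debating severity in the middle of an incident.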

Compensation

Healthy on-call practices include:

Extra pay or time off for on-call shifts
No more than one week in four on call for any engineer
Incidents during off-hours count as work time
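
The one-week-in-four limit implies a minimum team size, and the arithmetic is easy to check. The sketch below assumes an evenly shared rotation. The seats parameter (how many people are on call at once, for example 2 if a secondary shift counts as on-call time) and the function name are assumptions made for illustration.

```python
from fractions import Fraction


def meets_load_limit(
    team_size: int, seats: int = 1, limit: Fraction = Fraction(1, 4)
) -> bool:
    """Check that each engineer is on call at most `limit` of the time."""
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    # In an evenly shared rotation, each engineer covers seats / team_size
    # of all shifts.
    return Fraction(seats, team_size) <= limit
```

Under these assumptions a primary-only rotation needs at least four engineers, and a rotation where the secondary shift also counts needs at least eight.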

Making On-Call Better

Invest in observability — good monitoring means faster resolution
Runbooks — documented procedures for common incidents
Error tracking — tools like Bugsly surface new errors before they become incidents, reducing the number of pages you receive
Blameless culture — incidents are learning opportunities, not blame games

On-call is a necessary part of running reliable services. Done well, it makes teams stronger and services more resilient.