What Is On-Call for Software Engineers?

Understand on-call responsibilities for software engineers, including rotation setup, incident response, escalation policies, and reducing alert fatigue.

On-call means being available to respond to production incidents during an assigned shift, which often includes nights and weekends outside normal working hours. It is a responsibility shared across engineering teams that run production services.

How On-Call Works

A typical on-call setup:

  1. Rotation — engineers take turns being on-call (usually weekly)
  2. Alert — monitoring systems detect an issue and page the on-call engineer
  3. Acknowledge — the engineer acknowledges within a time window (usually 5-15 minutes)
  4. Respond — investigate, mitigate, and resolve the issue
  5. Document — write an incident report for the team
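The acknowledge step is the part of this flow that is easiest to express in code. The following is a minimal sketch, assuming a 15-minute window (the upper end of the range mentioned above); the function name `needs_escalation` and its parameters are hypothetical and do not come from any particular paging tool.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed value: the upper end of the 5-15 minute range.
ACK_WINDOW = timedelta(minutes=15)


def needs_escalation(
    paged_at: datetime,
    acked_at: Optional[datetime],
    now: datetime,
    ack_window: timedelta = ACK_WINDOW,
) -> bool:
    """Return True if the page is still unacknowledged after the window."""
    if acked_at is not None:
        # Someone has acknowledged, so there is nothing to escalate.
        return False
    return now - paged_at > ack_window
```

A paging loop could call this check periodically and notify a backup engineer whenever it returns `True`.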

Setting Up Rotations

  • Weekly rotations work for most teams
  • Follow-the-sun for global teams (each region covers on-call during its own business hours)
  • Primary and secondary — backup engineer if primary doesn't respond
  • Handoff meetings — outgoing engineer briefs incoming engineer on active issues
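A weekly rotation with a primary and a secondary can be computed from a list of names and a start date. This is only a sketch under stated assumptions: the function `on_call_pair` is hypothetical, the rotation changes every 7 days from `rotation_start`, and the secondary is simply the engineer who will be primary the following week.

```python
from datetime import date
from typing import Sequence, Tuple


def on_call_pair(
    engineers: Sequence[str], rotation_start: date, today: date
) -> Tuple[str, str]:
    """Return (primary, secondary) for the rotation week containing `today`."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    if today < rotation_start:
        raise ValueError("today is before the rotation start")
    # Whole weeks elapsed since the rotation began.
    week_index = (today - rotation_start).days // 7
    primary = engineers[week_index % len(engineers)]
    # Assumption: next week's primary acts as this week's backup.
    secondary = engineers[(week_index + 1) % len(engineers)]
    return primary, secondary


team = ["ana", "ben", "chen", "dee"]  # hypothetical names
print(on_call_pair(team, date(2024, 1, 1), date(2024, 1, 10)))
# prints ('ben', 'chen')
```

Making the secondary next week's primary is one possible design choice: the backup already has context on active issues when the handoff happens.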

Reducing Alert Fatigue

Alert fatigue is one of the most common on-call problems. When everything triggers an alert, no single alert gets the attention it needs.

Good alerting principles:

  • Alert on symptoms, not causes — alert when users are affected, not when CPU is high
  • Every alert must be actionable — if you can't do anything about it, it shouldn't page you
  • Tune thresholds — start strict and loosen as you understand normal patterns
  • Group related alerts — one incident, one page (not 50 alerts for the same issue)
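The "one incident, one page" principle can be illustrated with a small grouping step that runs before anything is sent to the on-call engineer. This is a minimal sketch, assuming alerts that share a service and a user-facing symptom belong to the same incident; the `Alert` fields and the `group_into_pages` function are hypothetical, and real alerting tools use their own grouping rules.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    service: str   # which service raised the alert
    symptom: str   # user-facing symptom, e.g. "high_error_rate"
    message: str   # free-form detail from the monitoring system


def group_into_pages(alerts: Iterable[Alert]) -> Dict[Tuple[str, str], List[Alert]]:
    """Group alerts so each (service, symptom) pair produces a single page."""
    grouped: Dict[Tuple[str, str], List[Alert]] = defaultdict(list)
    for alert in alerts:
        grouped[(alert.service, alert.symptom)].append(alert)
    return dict(grouped)


# 50 alerts from 50 hosts about the same symptom collapse into one page.
alerts = [Alert("checkout", "high_error_rate", f"host-{i}") for i in range(50)]
alerts.append(Alert("search", "high_latency", "host-0"))
pages = group_into_pages(alerts)
print(len(pages))  # prints 2
```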

Incident Response Process

  1. Acknowledge the alert
  2. Assess severity — how many users are affected?
  3. Communicate — update status page, notify stakeholders
  4. Mitigate — restore service first, root cause later
  5. Resolve — permanent fix
  6. Postmortem — blameless review of what happened
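The severity assessment in step 2 can be made repeatable by mapping the share of affected users to a severity level. The thresholds and the `SEV1` to `SEV4` labels below are illustrative assumptions, not values from this post; each team would choose its own.

```python
def assess_severity(affected_users: int, total_users: int) -> str:
    """Map the fraction of affected users to a severity label.

    Thresholds are assumed for illustration only.
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    fraction = affected_users / total_users
    if fraction >= 0.25:
        return "SEV1"  # large share of users affected
    if fraction >= 0.05:
        return "SEV2"  # significant but partial impact
    if fraction > 0:
        return "SEV3"  # small number of users affected
    return "SEV4"      # no user impact observed


print(assess_severity(300, 1000))  # prints SEV1
print(assess_severity(1, 1000))    # prints SEV3
```

Agreeing on a mapping like this ahead of time means the on-call engineer does not have to debate severity while the incident is in progress.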

Compensation

Healthy on-call practices include:

  • Extra pay or time off for on-call shifts
  • No more than 1 week in 4 on-call
  • Incidents during off-hours count as work time

Making On-Call Better

  • Invest in observability — good monitoring means faster resolution
  • Runbooks — documented procedures for common incidents
  • Error tracking — tools like Bugsly surface new errors before they become incidents, reducing the number of pages you receive
  • Blameless culture — incidents are learning opportunities, not blame games

On-call is a necessary part of running reliable services. Done well, it makes teams stronger and services more resilient.
