This week, I want to talk about being oncall. It’s been a topic of conversation at my day job, and I’ve spent a lot of time mulling over thoughts that I want to share. This piece only scratches the surface around oncall expectations and best practices, but I hope it gives you something to chew on.
What is oncall? It’s the organizational process in software engineering where you take a “shift” supporting your product. You’re the person who gets alerted when things break. Oncall rotations can be 24/7, business-hours only, or somewhere in between. When you’re oncall, you’re carrying a pager (not literally, there’s an app for that), and being oncall often comes with the expectation that you’ll be available and responsive to issues within a certain timeframe.
When I talk about going oncall, I’m talking about being oncall for a product engineering team. I’m not talking about site reliability engineering (SRE), and I can’t speak to what it’s like to be on an oncall rotation of that type. SRE is notoriously challenging, and there are plenty of other folks who can better speak to the experiences of those rotations.
Why do we go oncall? We go oncall because our software exists in a world where there’s an expectation that it functions around the clock. Most of us build software for people who are relying on it to work. Software isn’t perfect, and it breaks. When it breaks, we risk losing something: commonly data, trust, money, and/or customers, usually in a cascade (trust and then customers; data and then trust and then customers and then money; you get the picture). When it breaks in a critical way, somebody has to be a first responder to facilitate fixing the problem.
I’ve experienced a variety of oncall rotations. I’ve been on teams where I got paged a lot, and I’ve had oncall weeks where I didn’t get paged at all. I’ve lost sleep, had to cancel plans, and had weeks that truly sucked. But I’ve usually been on teams where the ethos is that oncall should only be as disruptive as it needs to be, which means I’ve taken plenty of time to invest in work that makes oncall suck less, and less often. There are two big ways I look for opportunities to improve the experience of being the one holding the pager.
Right-sizing Your Alerting
When we set up alerting, we aim for clear, actionable alerts, and for after-hours pages only when they’re truly necessary. This means you aren’t getting paged in the middle of the night for things that aren’t immediately urgent, but you are getting notified when things go very wrong. There are two main problems here, and they’re not mutually exclusive.
Problem #1: Too Much Noise
Noise, from an oncall perspective, is anything that doesn’t need to be acted on in the moment you get a notification for it. This covers everything from “our alerts are at the wrong level of severity, so informational logs are showing up as alerts” to “I’m getting woken up in the middle of the night for something that isn’t actionable or urgent”.
An important step towards making oncall not Burnout Central is to only pull the fire alarm when there’s a fire. The only things that ought to page your team in the middle of the night are the things that you need to look at in the middle of the night. This might include: the site is down, money didn’t move when it was supposed to, or your system is hemorrhaging important data. Things that need attention, but not middle-of-the-night attention, can and should wait for business hours. If you get paged at a level of urgency that doesn’t match the impact of that piece of the system breaking, pay it forward and adjust the severity of the alert so that future folks get notified only when they need to be.
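To make that concrete, here’s a minimal sketch in Python of the routing logic I mean: only critical alerts page a human right away, warnings wait for business hours, and informational stuff never touches the pager. The severity names, the 9-to-5 window, and the `page_oncall`/`open_ticket` stand-ins are all made up for illustration; your real alerting tool almost certainly does this with severity labels and routing rules rather than hand-rolled code.

```python
from datetime import datetime, time

# Hypothetical severity levels; the names and the business-hours window are
# illustrative, not from any particular alerting tool.
CRITICAL = "critical"   # wake someone up: site down, money not moving, data loss
WARNING = "warning"     # needs attention, but it can wait for business hours
INFO = "info"           # informational; should never page anyone

BUSINESS_HOURS = (time(9, 0), time(17, 0))


def is_business_hours(now: datetime) -> bool:
    """True if `now` falls inside the (assumed) weekday 9-5 window."""
    start, end = BUSINESS_HOURS
    return now.weekday() < 5 and start <= now.time() <= end


def route_alert(severity: str, message: str, now: datetime) -> str:
    """Decide what an alert should do based on severity and time of day.

    Only critical alerts page immediately; warnings wait for business hours;
    info-level events stay out of the pager entirely.
    """
    if severity == CRITICAL:
        return f"PAGE NOW: {message}"           # page_oncall(message) in a real system
    if severity == WARNING and is_business_hours(now):
        return f"NOTIFY (business hours): {message}"
    if severity == WARNING:
        return f"QUEUE FOR MORNING: {message}"  # open_ticket(message), surface it at 9am
    return f"LOG ONLY: {message}"


if __name__ == "__main__":
    print(route_alert(WARNING, "Background job queue is backing up", datetime(2024, 1, 10, 2, 30)))
    print(route_alert(CRITICAL, "Checkout is returning 500s", datetime(2024, 1, 10, 2, 30)))
```

The point of the sketch is the shape of the decision, not the code: every alert should have an explicit answer to “does this wake someone up, or does it wait?”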
Problem #2: Not Enough Information
If all of your software went down, how would you get notified about it? If your answer didn’t involve getting alerted and/or paged, you’re probably missing some robustness in your alerting. If part of your system goes down in a way that needs engineering intervention, the alert should fire no later than the point at which you’d want a first responder looking at it. If it’s a wake-me-up problem, it should page you ASAP. If it’s a first-thing-during-business-hours problem, it should page you during business hours.
If you’re on a team that’s missing alerting around critical systems, your first notification of a problem is likely going to come from a panicked human. It’s probably going to come well after the system first went down, which means you’ve now got an already-large-and-growing problem. If you had right-sized alerts on that system’s performance, you might have had the opportunity to be proactive and mitigate the issue before it became a big one.
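As a sketch of what the most basic version of this looks like, here’s a toy external health check in Python that pages after a few consecutive failures, so the first notification comes from a machine instead of a panicked human. The URL, the thresholds, and the `page_oncall` function are placeholders; in practice you’d reach for an off-the-shelf uptime monitor or your existing metrics stack rather than running a loop like this.

```python
import time
import urllib.request

# Placeholder values: the URL, thresholds, and page_oncall() are illustrative.
HEALTH_URL = "https://example.com/healthz"
CHECK_INTERVAL_SECONDS = 60
FAILURES_BEFORE_PAGING = 3  # tolerate a blip, page on a sustained outage


def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers connection errors, timeouts, and HTTP error statuses
        return False


def page_oncall(message: str) -> None:
    """Stand-in for whatever actually pages your rotation (PagerDuty, Opsgenie, ...)."""
    print(f"PAGE: {message}")


def watch() -> None:
    """Page after several consecutive failed checks, instead of waiting for a human to notice."""
    consecutive_failures = 0
    while True:
        if check_health(HEALTH_URL):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_PAGING:
                page_oncall(f"{HEALTH_URL} has failed {consecutive_failures} checks in a row")
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    watch()  # runs until interrupted; a real monitor would live in your observability stack
```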
Documentation / Be Kinder to Future You
Every time you get paged, you have an opportunity to make the world a little better for the next person who gets paged for that type of problem. You’re already responding to the page, so it’s the best time to write down what you’re doing, what you’re learning, and what you wish you had by way of metrics, information, and documentation.
Runbooks aren’t built in a day, so here are some ways to build them up gradually over time.
Level 1
Write down a couple of sentences about what’s wrong and what you did to investigate it, or solve it. They don’t have to be particularly good sentences - I’m the queen of stream-of-consciousness rambling in a public Slack channel. Something is better than nothing. Put those sentences somewhere public, where someone can find them by searching for the error message or the type of page.
Level 2
Link those sentences to the alert itself. Make them easier to find when you get paged. If you’re so inclined, put them all in one place. Now you have the beginnings of an oncall runbook! Add to it as you get different types of pages, or as you resolve different types of errors.
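If it helps to picture it, here’s a tiny Python sketch of the idea: keep the runbook link right next to the alert definition so it lands in the page itself. The alert names and URLs are invented, and most paging tools will let you attach this kind of link as an annotation without writing any code.

```python
# Hypothetical alert names and runbook URLs, purely for illustration.
RUNBOOKS = {
    "checkout-5xx-spike": "https://wiki.example.com/runbooks/checkout-5xx",
    "payment-job-stalled": "https://wiki.example.com/runbooks/payment-job-stalled",
}


def build_page(alert_name: str, message: str) -> str:
    """Format the page text, including the runbook link if we have one."""
    runbook = RUNBOOKS.get(alert_name)
    page = f"[{alert_name}] {message}"
    if runbook:
        page += f"\nRunbook: {runbook}"
    else:
        page += "\nNo runbook yet. If you figure this one out, please write it down!"
    return page


print(build_page("checkout-5xx-spike", "Error rate above 5% for 10 minutes"))
print(build_page("mystery-alert", "Something new and exciting is broken"))
```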
Level 3
Fix your documentation. Assume that every person who goes oncall has no context about what they’re looking at when they get paged, and that they’re getting woken up in the middle of the night. Make sure there are clear statements about what is going wrong, why it’s an issue, how to mitigate it, and how to check that you’ve successfully mitigated it. If you’re ever following a runbook that doesn’t clearly articulate the above, do your best to improve it, a little at a time.
What If We’re Completely Underwater?
If your team’s in a spot where every single oncall rotation is entirely terrible, the rotation is burning people out, and things aren’t getting any better, you need an intervention. Your team needs to invest significant time in the two things above before doing much more feature development work. There’s probably more to say here, but how to dig yourself out of a hole is a topic in and of itself.
What did I get wrong? What’s helped to make your team’s oncall rotations manageable? I always love to hear from you.
Also - I’m going to stop predicting next time’s topic because I’m 0 for 2. Whatever’s next will be a surprise!
Other things on my mind:
I know this has been a rough week in the Ruby community, and it’s a rough time of year. There’s help available: you can call or text 988 to connect with the Suicide & Crisis Lifeline if you need to talk to someone. Take care of each other.
Are you looking for some lovely, soothing, and good TV?