Triaging a Down System

This is a developing post.  Consider this v1.
Here are the four phases of an outage as I see them:
  • Phase 1:  The issue occurs and we detect it.
    • IDEALLY, an alert fires and we respond to it.  Something isn't reporting in.  Something is taking too long.  A queue is getting backed up.  Etc.  (A minimal sketch of such a detection check follows this phase's list.)
      • An email or text is automatically generated.
    • Not ideally:  a customer calls in and says the system isn't working.
      • A salesperson or executive gets a call from the customer.  That stakeholder then calls Engineering.
    • In either case, we start to respond.  What do we do?
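
    To make the "queue is getting backed up" case concrete, here is a minimal sketch of an automated detection check that generates the email mentioned above.  It is a sketch only: the queue name, the threshold, and the addresses are assumptions for illustration, not anyone's real setup.

        # Sketch of a Phase 1 detection check: alert when a queue is backed up.
        # Threshold, addresses, and SMTP host are illustrative placeholders.
        import smtplib
        from email.message import EmailMessage

        QUEUE_DEPTH_THRESHOLD = 1000  # hypothetical "backed up" limit

        def notify(subject: str, body: str) -> None:
            """Send the automatically generated email described above."""
            msg = EmailMessage()
            msg["Subject"] = subject
            msg["From"] = "alerts@example.com"
            msg["To"] = "oncall@example.com"
            msg.set_content(body)
            with smtplib.SMTP("smtp.example.com") as smtp:
                smtp.send_message(msg)

        def check_queue(queue_name: str, current_depth: int) -> None:
            """current_depth comes from your broker or metrics backend."""
            if current_depth > QUEUE_DEPTH_THRESHOLD:
                notify(
                    subject=f"[ALERT] {queue_name} backed up ({current_depth} messages)",
                    body=f"{queue_name} exceeded {QUEUE_DEPTH_THRESHOLD}.  Start the Phase 2 response.",
                )
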
  • Phase 2:  Response
    • IDEALLY, a tech-level staff member can go to a dashboard, see the issue, and then respond with pre-approved work instructions to fix it.

      Most organizations don't have this level of sophistication, or the issue is complicated and new.  So it becomes a crisis that needs Engineering team intervention.

    • Not ideally, a crisis call is started.
      • Priority? 
        • Return to Service (RTS).  We must get the system working again ASAP.
      • Who leads it? 
        • An Engineering or Operations Manager.  
      • How to involve and drive the supporting team(s)?
        • A rotation is needed.  You need to let folks know when they are on duty to answer a phone call.  (See the rotation sketch after this phase's list.)
      • What stakeholders should be notified, with what information, from whom?
        • Who:  Customer owners/stakeholders.
        • What information:  the exact time the incident started and the likely impact.
        • From:  the support organization or someone speaking for the lead of the crisis bridge. 
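
    Here is a minimal sketch of the rotation idea from above: given a roster and a starting week, it answers "who is on duty this week?"  The roster names and the epoch date are made-up examples.

        # Sketch of a weekly on-call rotation (Phase 2).
        # Roster and rotation epoch are illustrative assumptions.
        from datetime import date
        from typing import Optional

        ROSTER = ["alice", "bob", "carol", "dave"]   # hypothetical supporting team
        ROTATION_EPOCH = date(2021, 1, 4)            # a Monday chosen as week 0

        def on_duty(today: Optional[date] = None) -> str:
            """Return who answers the phone this week, rotating weekly through ROSTER."""
            today = today or date.today()
            weeks_elapsed = (today - ROTATION_EPOCH).days // 7
            return ROSTER[weeks_elapsed % len(ROSTER)]

        if __name__ == "__main__":
            print(f"On duty this week: {on_duty()}")
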
  • Phase 3:  Work the problem to return-to-service.
    • IDEALLY, the technician has already responded with the pre-approved return-to-service instructions and is following them, monitoring the system to ensure it is back in working order.
    • Not ideally, we are in a crisis call working the problem to resolution.
      • The mind-set:  return to service is all that matters!
        • For a future post, let's explore this.
      • Communicate: 
        • Good use of Slack/instant messaging.  Developers will be on this.
        • Good use of webshares (screen sharing).  Developers will be on this as well, to get extra eyes.  The leader may sometimes need to demand this.
      • Rules and guidelines on making a change in production.
        • As the Apollo 13 quote goes, "let's not make things worse by guessin'".  We have to communicate, and NO ONE makes a change in production without the approval of the crisis lead.

          A formal request for change needs to be submitted, and this ensures the crisis lead isn't alone in the approval chain.  (A sketch of such an approval gate follows this phase's list.)
      • Variations on wrapping up the call, watching, etc.
        • There are several ways the situation can wind down, be monitored, and be declared fully recovered; that's for another future post.
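
    As a sketch of the "no one changes production alone" rule, the gate below refuses to apply a change until the required approvals are recorded.  The ChangeRequest fields and the required roles are illustrative assumptions, not a specific change-management tool.

        # Sketch of a production change gate (Phase 3): the crisis lead must approve,
        # and a second approver keeps the lead from being alone in the chain.
        from dataclasses import dataclass, field

        REQUIRED_ROLES = {"crisis_lead", "change_manager"}

        @dataclass
        class ChangeRequest:
            description: str
            requested_by: str
            approvals: set = field(default_factory=set)   # roles that have signed off

        def approve(cr: ChangeRequest, approver_role: str) -> None:
            cr.approvals.add(approver_role)

        def apply_to_production(cr: ChangeRequest) -> None:
            missing = REQUIRED_ROLES - cr.approvals
            if missing:
                raise PermissionError(f"Change blocked; missing approvals: {sorted(missing)}")
            print(f"Applying change: {cr.description} (requested by {cr.requested_by})")

        if __name__ == "__main__":
            cr = ChangeRequest("Restart the order service with a larger pool", "dev_on_bridge")
            approve(cr, "crisis_lead")
            approve(cr, "change_manager")
            apply_to_production(cr)
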
  • Phase 4: After the incident
    • The outage reports, internal and external.
    • Following up with root cause analysis and corrective technical action.
    • Retrospective on organizational elements to make this run smoother.
      • Translate the actions taken into instructions that can be reused in the future if necessary.  (See the runbook sketch below.)
      • If the issue wasn't automatically detected, ask why.
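
    To close the loop on turning crisis-call actions into reusable instructions, here is a minimal sketch that records the steps taken as a small runbook file.  The step fields and the JSON format are assumptions for illustration.

        # Sketch of capturing return-to-service steps as a reusable runbook (Phase 4).
        # Field names and the output format are illustrative assumptions.
        import json
        from dataclasses import dataclass, asdict
        from typing import List

        @dataclass
        class RunbookStep:
            action: str          # what was done on the bridge
            verification: str    # how we confirmed it helped
            owner_role: str      # who may run it next time (e.g. technician)

        def save_runbook(title: str, steps: List[RunbookStep], path: str) -> None:
            """Write the steps as a pre-approved work instruction for the next incident."""
            with open(path, "w") as f:
                json.dump({"title": title, "steps": [asdict(s) for s in steps]}, f, indent=2)

        if __name__ == "__main__":
            steps = [
                RunbookStep("Restart the ingest worker", "Queue depth drops below threshold", "technician"),
                RunbookStep("Watch the dashboard for 30 minutes", "No new alerts fire", "technician"),
            ]
            save_runbook("Ingest queue backlog", steps, "runbook_ingest_backlog.json")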

Several more posts are needed on this topic.  Expect more refinement.
