This is a developing post. Consider this v1.
Here are the four phases of an outage as I see them:
- Phase 1: The issue occurs and we detect it.
- IDEALLY, an alert went off, and we are responding to it. Something isn't reporting in. Something is taking too long. A queue is getting backed up. Etc.
- an email or text is automatically generated.
- Not ideally: a customer calls in and says the system isn't working.
- a salesperson or executive gets a call from the customer. That stakeholder calls Engineering.
- In either case, we start to respond. What do we do?
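The three alert conditions named in Phase 1 (something isn't reporting in, something is taking too long, a queue is getting backed up) can be sketched as simple threshold checks. This is a minimal illustration, not a monitoring system: the `Snapshot` shape, the `detect` function, and every threshold value are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List

# Illustrative thresholds only (assumptions, not recommendations).
HEARTBEAT_TIMEOUT_S = 120   # "something isn't reporting in"
LATENCY_LIMIT_S = 2.0       # "something is taking too long"
QUEUE_DEPTH_LIMIT = 1000    # "a queue is getting backed up"


@dataclass
class Snapshot:
    """One reading of a monitored system (hypothetical shape)."""
    seconds_since_heartbeat: float
    p95_latency_s: float
    queue_depth: int


def detect(snapshot: Snapshot) -> List[str]:
    """Return one human-readable alert per breached condition."""
    alerts = []
    if snapshot.seconds_since_heartbeat > HEARTBEAT_TIMEOUT_S:
        alerts.append(f"no heartbeat for {snapshot.seconds_since_heartbeat:.0f}s")
    if snapshot.p95_latency_s > LATENCY_LIMIT_S:
        alerts.append(f"p95 latency {snapshot.p95_latency_s:.1f}s over limit")
    if snapshot.queue_depth > QUEUE_DEPTH_LIMIT:
        alerts.append(f"queue depth {snapshot.queue_depth} over limit")
    return alerts


if __name__ == "__main__":
    # Each alert string would become the body of the automatic email or text.
    for alert in detect(Snapshot(300, 0.4, 2500)):
        print("ALERT:", alert)
```

An empty list means nothing to page about. Anything else is what should reach a human before a customer does.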
- Phase 2: Response
- IDEALLY, a tech-level staff member can go to a dashboard, see the issue, and respond with pre-approved work instructions to fix it.
Most organizations don't have this level of sophistication. Or, the issue is complicated and new. So, it moves on to becoming a crisis that needs Engineering team intervention.
- Not ideally, a crisis call is started.
- Priority?
- Return to Service (RTS). We must get the system working again ASAP.
- Who leads it?
- An Engineering or Operations Manager.
- How to involve and drive the supporting team(s)?
- A rotation is needed. You need to let folks know when they are on-duty to answer a phone call.
- What stakeholders should be notified, with what information, from whom?
- Who: Customer owners/stakeholders.
- What information: the exact time the incident started and the likely impact.
- From: the support organization or someone speaking for the lead of the crisis bridge.
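Two of the Phase 2 items lend themselves to a small sketch: a rotation that tells folks when they are on duty, and a stakeholder notice carrying the start time and likely impact. The roster names, the weekly shift length, the rotation start date, and both function names are hypothetical, chosen only to make the example runnable.

```python
from datetime import date, datetime

# Hypothetical roster and schedule (assumptions for illustration).
ROSTER = ["alice", "bob", "carol"]
ROTATION_START = date(2024, 1, 1)  # first day of the first shift
SHIFT_DAYS = 7                     # assumed weekly hand-off


def on_call(day: date) -> str:
    """Return who answers the phone on the given day."""
    if day < ROTATION_START:
        raise ValueError("date is before the rotation began")
    shift_index = (day - ROTATION_START).days // SHIFT_DAYS
    return ROSTER[shift_index % len(ROSTER)]


def stakeholder_notice(started_at: datetime, impact: str, sender: str) -> str:
    """Format the minimum a stakeholder needs: when, what impact, from whom."""
    return (
        f"Incident started: {started_at.isoformat()}\n"
        f"Likely impact: {impact}\n"
        f"From: {sender}"
    )


if __name__ == "__main__":
    print("On call today:", on_call(date.today()))
    print(stakeholder_notice(datetime(2024, 1, 9, 14, 5),
                             "order processing delayed",
                             "Support, on behalf of the crisis lead"))
```

Publishing the result of `on_call` ahead of time is the point: nobody should learn they are on duty from the phone ringing.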
- Phase 3: Work the problem to return-to-service.
- IDEALLY, the technician has already applied the pre-approved return-to-service instructions and is now following them to monitor the system and confirm it is back in working order.
- Not ideally, we are in a crisis call working the problem to resolution.
- The mind-set: return to service is all that matters!
- For a future post, let's explore this.
- Communicate:
- good use of Slack/instant messaging. Developers will be on this.
- good use of webshares. Developers will be on this also to get extra eyes. The leader may sometimes need to demand this.
- Rules and guidelines on making a change in production.
- Apollo 13 quote, "let's not make things worse by guessin'". We have to communicate, and NO ONE makes a change in production without the approval of the crisis lead.
A formal request for change needs to be submitted, and this ensures the crisis lead isn't alone in the approval chain.
- Variations on wrapping up the call, watching, etc.
- For another future post, there are several ways the situation can end up and be monitored and fully recovered.
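The production-change rule from Phase 3 (no change without the crisis lead's approval, and the lead is not the only approver) could be enforced with a small gate like the one below. This is a sketch under stated assumptions: the `ChangeRequest` class, the `may_apply` function, the two-approver minimum, and the ban on self-approval are my illustrative choices, not an established process.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ChangeRequest:
    """A formal request for a production change (hypothetical shape)."""
    summary: str
    requested_by: str
    approvals: Set[str] = field(default_factory=set)

    def approve(self, approver: str) -> None:
        # Assumption: requesters may not approve their own change.
        if approver == self.requested_by:
            raise ValueError("requesters cannot approve their own change")
        self.approvals.add(approver)


def may_apply(change: ChangeRequest, crisis_lead: str,
              min_approvals: int = 2) -> bool:
    """Allow the change only if the crisis lead approved AND the lead
    is not alone in the approval chain (min_approvals is an assumption)."""
    return (crisis_lead in change.approvals
            and len(change.approvals) >= min_approvals)


if __name__ == "__main__":
    cr = ChangeRequest("restart the queue worker", requested_by="dev1")
    cr.approve("crisis_lead")
    print(may_apply(cr, "crisis_lead"))   # lead alone: not enough
    cr.approve("ops_manager")
    print(may_apply(cr, "crisis_lead"))   # lead plus one more approver
```

The second approver is what keeps the crisis lead from carrying the decision alone, which is the reason the formal request exists.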
- Phase 4: After the incident
- The outage reports, internal and external.
- Following up with root cause analysis and corrective technical action.
- Retrospective on organizational elements to make this run smoother.
- Translate the actions into instructions that can be reused in the future if necessary.
- If the issue wasn't automatically detected, ask why.
Several more posts are needed on this topic. Expect more refinement.