Triaging a Down System

This is a developing post.  Consider this v1.
Here are the four phases of an outage as I see them:
  • Phase 1:  The issue occurs and we detect it.
    • IDEALLY, an alert fires and we respond to it.  Something isn't reporting in.  Something is taking too long.  A queue is getting backed up.  Etc.  (A minimal sketch of such a detection check follows this phase's list.)
      • An email or text is automatically generated.
    • Not ideally:  a customer calls in and says the system isn't working.
      • A salesperson or executive gets a call from the customer.  That stakeholder then calls Engineering.
    • In either case, we start to respond.  What do we do?
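
    To make the "queue is getting backed up" case concrete, here is a minimal sketch of an automated detection check that generates the email mentioned above.  It is a sketch only: the queue name, the threshold, and the addresses are assumptions for illustration, not anyone's real setup.

        # Sketch of a Phase 1 detection check: alert when a queue is backed up.
        # Threshold, addresses, and SMTP host are illustrative placeholders.
        import smtplib
        from email.message import EmailMessage

        QUEUE_DEPTH_THRESHOLD = 1000  # hypothetical "backed up" limit

        def notify(subject: str, body: str) -> None:
            """Send the automatically generated email described above."""
            msg = EmailMessage()
            msg["Subject"] = subject
            msg["From"] = "alerts@example.com"
            msg["To"] = "oncall@example.com"
            msg.set_content(body)
            with smtplib.SMTP("smtp.example.com") as smtp:
                smtp.send_message(msg)

        def check_queue(queue_name: str, current_depth: int) -> None:
            """current_depth comes from your broker or metrics backend."""
            if current_depth > QUEUE_DEPTH_THRESHOLD:
                notify(
                    subject=f"[ALERT] {queue_name} backed up ({current_depth} messages)",
                    body=f"{queue_name} exceeded {QUEUE_DEPTH_THRESHOLD}.  Start the Phase 2 response.",
                )
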
  • Phase 2:  Response
    • IDEALLY, a tech-level staff member can go to a dashboard, see the issue, and then respond with pre-approved work instructions to fix it.

      Most organizations don't have this level of sophistication, or the issue is complicated and new.  So it becomes a crisis that needs Engineering team intervention.

    • Not ideally, a crisis call is started.
      • Priority? 
        • Return to Service (RTS).  We must get the system working again ASAP.
      • Who leads it? 
        • An Engineering or Operations Manager.  
      • How to involve and drive the supporting team(s)?
        • A rotation is needed.  You need to let folks know when they are on duty to answer a phone call.  (See the rotation sketch after this phase's list.)
      • What stakeholders should be notified, with what information, from whom?
        • Who:  Customer owners/stakeholders.
        • What information:  the exact time the incident started and the likely impact.
        • From:  the support organization or someone speaking for the lead of the crisis bridge. 
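
    Here is a minimal sketch of the rotation idea from above: given a roster and a starting week, it answers "who is on duty this week?"  The roster names and the epoch date are made-up examples.

        # Sketch of a weekly on-call rotation (Phase 2).
        # Roster and rotation epoch are illustrative assumptions.
        from datetime import date
        from typing import Optional

        ROSTER = ["alice", "bob", "carol", "dave"]   # hypothetical supporting team
        ROTATION_EPOCH = date(2021, 1, 4)            # a Monday chosen as week 0

        def on_duty(today: Optional[date] = None) -> str:
            """Return who answers the phone this week, rotating weekly through ROSTER."""
            today = today or date.today()
            weeks_elapsed = (today - ROTATION_EPOCH).days // 7
            return ROSTER[weeks_elapsed % len(ROSTER)]

        if __name__ == "__main__":
            print(f"On duty this week: {on_duty()}")
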
  • Phase 3:  Work the problem to return-to-service.
    • IDEALLY, the technician has already responded with the pre-approved return-to-service instructions and is following them, monitoring the system to ensure it is back in working order.
    • Not ideally, we are in a crisis call working the problem to resolution.
      • The mind-set:  return to service is all that matters!
        • For a future post, let's explore this.
      • Communicate: 
        • Good use of Slack/instant messaging.  Developers will be on this.
        • Good use of webshares (screen sharing).  Developers will be on this as well, to get extra eyes.  The leader may sometimes need to demand this.
      • Rules and guidelines on making a change in production.
        • As the Apollo 13 quote goes, "let's not make things worse by guessin'".  We have to communicate, and NO ONE makes a change in production without the approval of the crisis lead.

          A formal request for change needs to be submitted, and this ensures the crisis lead isn't alone in the approval chain.  (A sketch of such an approval gate follows this phase's list.)
      • Variations on wrapping up the call, watching, etc.
        • There are several ways the situation can wind down, be monitored, and be declared fully recovered; that's for another future post.
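
    As a sketch of the "no one changes production alone" rule, the gate below refuses to apply a change until the required approvals are recorded.  The ChangeRequest fields and the required roles are illustrative assumptions, not a specific change-management tool.

        # Sketch of a production change gate (Phase 3): the crisis lead must approve,
        # and a second approver keeps the lead from being alone in the chain.
        from dataclasses import dataclass, field

        REQUIRED_ROLES = {"crisis_lead", "change_manager"}

        @dataclass
        class ChangeRequest:
            description: str
            requested_by: str
            approvals: set = field(default_factory=set)   # roles that have signed off

        def approve(cr: ChangeRequest, approver_role: str) -> None:
            cr.approvals.add(approver_role)

        def apply_to_production(cr: ChangeRequest) -> None:
            missing = REQUIRED_ROLES - cr.approvals
            if missing:
                raise PermissionError(f"Change blocked; missing approvals: {sorted(missing)}")
            print(f"Applying change: {cr.description} (requested by {cr.requested_by})")

        if __name__ == "__main__":
            cr = ChangeRequest("Restart the order service with a larger pool", "dev_on_bridge")
            approve(cr, "crisis_lead")
            approve(cr, "change_manager")
            apply_to_production(cr)
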
  • Phase 4: After the incident
    • The outage reports, internal and external.
    • Following up with root cause analysis and corrective technical action.
    • Retrospective on organizational elements to make this run smoother.
      • Translate the actions taken into instructions that can be reused in the future if necessary.  (See the runbook sketch below.)
      • If the issue wasn't automatically detected, ask why.
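
    To close the loop on turning crisis-call actions into reusable instructions, here is a minimal sketch that records the steps taken as a small runbook file.  The step fields and the JSON format are assumptions for illustration.

        # Sketch of capturing return-to-service steps as a reusable runbook (Phase 4).
        # Field names and the output format are illustrative assumptions.
        import json
        from dataclasses import dataclass, asdict
        from typing import List

        @dataclass
        class RunbookStep:
            action: str          # what was done on the bridge
            verification: str    # how we confirmed it helped
            owner_role: str      # who may run it next time (e.g. technician)

        def save_runbook(title: str, steps: List[RunbookStep], path: str) -> None:
            """Write the steps as a pre-approved work instruction for the next incident."""
            with open(path, "w") as f:
                json.dump({"title": title, "steps": [asdict(s) for s in steps]}, f, indent=2)

        if __name__ == "__main__":
            steps = [
                RunbookStep("Restart the ingest worker", "Queue depth drops below threshold", "technician"),
                RunbookStep("Watch the dashboard for 30 minutes", "No new alerts fire", "technician"),
            ]
            save_runbook("Ingest queue backlog", steps, "runbook_ingest_backlog.json")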

Several more posts are needed on this topic.  Expect more refinement.
