November 23, 2012

A Checklist for Recovering from an Operational Emergency

Operating a service with perfect uptime is hard. No matter how superb the team, technology, and processes, operational emergencies happen. In such emergencies, I borrow a tool from aviators: the checklist. Aviators make extensive use of checklists, especially during takeoff and landing, to achieve extremely low rates of operational failure. I have found, in my work at Minus, that well-designed checklists are invaluable when stress and time pressure make it difficult to think systematically and rationally.

The following checklist is ordered from most to least urgent. Each goal corresponds to a progressively higher level of operational excellence.

  1. Remove in-flight availability regressions. Before improving anything, first bring the service back up, and "bringing up the service" means everything: the customer-facing systems, the availability monitoring systems, the performance monitoring systems, the business metrics systems, backups, and non-critical (but valuable) functions.

  2. Remove in-flight performance regressions. Once the system is fully up, the next priority is performance: if the service is running slower than before, bring it back to its previous performance level. Restoring previous performance is easier than improving on it, so I think we should first get back to the old baseline before thinking about improvements.

  3. Design and build features to improve forward availability. Once the service is up and as fast as before, we can plan and build. Availability is a combination of reliability and recovery time (e.g. total downtime is the number of failures * the time to recover from each failure). Improve the completeness of the monitoring; reduce the time it takes for errors to be detected and noticed by the right human (more alarms, routing failures to the right places more quickly); and reduce recovery time (how long it takes to switch a slave to master, how long it takes to restart the service).

  4. Improve performance monitoring. Once things are up and we can expect them to stay up, we can work on performance. Once availability reaches a certain level, the best way to make the service more available to the user is to improve performance (since while the site is loading, the service is not actually available to them). But before improving performance, we must be sure we have metrics that we trust and that measure the right things.

  5. Improve performance. Now we can make the performance optimizations, whether in hardware, software, architecture, or product design.

  6. Improve operational cost. Once the user experience is reliable and fast, we can finally think about how to reduce our costs without sacrificing our users' experience of our product.
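
To make the arithmetic in step 3 concrete, here is a minimal sketch of the downtime relationship (number of failures * time to recover from each failure). The function names and the sample numbers are my own illustrative assumptions, not measurements from any real service, and the model assumes every failure takes the same time to recover from.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a non-leap year


def expected_downtime_hours(failures_per_year: float, hours_to_recover: float) -> float:
    """Downtime = number of failures * time to recover from each failure."""
    return failures_per_year * hours_to_recover


def availability(failures_per_year: float, hours_to_recover: float) -> float:
    """Fraction of the year the service is up under the simple model above."""
    downtime = expected_downtime_hours(failures_per_year, hours_to_recover)
    return 1.0 - downtime / HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical baseline: 12 failures a year, 2 hours to recover from each.
    baseline = availability(12, 2.0)
    # Two ways to improve: recover twice as fast, or fail half as often.
    faster_recovery = availability(12, 1.0)
    fewer_failures = availability(6, 2.0)
    print(f"baseline:        {baseline:.5f}")
    print(f"faster recovery: {faster_recovery:.5f}")
    print(f"fewer failures:  {fewer_failures:.5f}")
```

Under this model, halving recovery time buys exactly as much availability as halving the number of failures, which is why faster detection and faster failover belong on the list alongside reliability work.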
