In the two years since launching Minus, operations has been a daily part of my life. Marked by moments of panic and periods of frustration, operations has also been a source of joy and profound satisfaction.
I find that the most insidious threat to operational excellence is the way emergencies, and the heroic efforts to resolve them, starve incremental improvement projects. Though heroes are both needed and much deserving of admiration, the impact of heroic efforts is easy to overestimate. This is because, while each instance of heroism has a larger impact than any single incremental project, incremental improvements accrete while heroism usually leaves scars. Compare:
A heroic effort. Our service goes down at 11pm on a Friday due to a catastrophic hardware failure, and our hero, responding within 5 minutes and working through the night, brings the service back up.
A systematic improvement. A project to facilitate server log analysis across the service's 100 machines, so that one can effectively run a single grep instead of grepping on each of the 100 machines.
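To make the "single grep" idea concrete, here is a minimal sketch. It assumes the logs from every machine have already been shipped to one place under a hypothetical layout of `log_root/<host>/<file>.log`; the function name `grep_logs` and that layout are illustrative assumptions, not a description of how Minus actually does it.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def grep_logs(log_root: Path, pattern: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (host, line_number, line) for each matching line under log_root.

    Assumes a hypothetical layout of log_root/<host>/<file>.log, i.e. the
    logs from all machines have already been collected centrally.
    """
    regex = re.compile(pattern)
    # Sort so that results come back in a stable host/file order.
    for log_file in sorted(log_root.glob("*/*.log")):
        host = log_file.parent.name
        with log_file.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield host, number, line.rstrip("\n")
```

One call such as `grep_logs(Path("/var/log/fleet"), "ERROR")` (the path is made up) then stands in for a hundred separate logins, and tagging each hit with its host is what makes detectors easy to build on top.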
We perceive the impact of this heroic effort with visceral clarity: the service goes down, our hero intervenes, the service recovers. We perceive the impact of the logging improvements more vaguely, since the service runs with or without them. Often, however, such improvements enable detectors to be built, and such detectors can detect problems early enough to entirely avert an impending emergency. On the other hand, the improvement may prove completely useless. We find it hard both to estimate such probabilities and to properly account for incremental but accretive gains.
We perceive the cost of this heroic effort to be something like 8 hours through the night. But perhaps, in addition to these 8 hours, we should also include the subsequent two days of burnout for our intrepid hero, plus 2 engineers for a day to bring the original database back up, plus a day to pay down the technical debt incurred in hurried code written during the emergency. We perceive the cost of the systematic improvement clearly: since it is pre-planned, it has a clear cost estimate, perhaps 3 days for two engineers.
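A quick tally of the figures above, as a rough sketch. It assumes an 8-hour engineer-day, counts burnout as lost working days, and assumes the day of debt cleanup is done by one engineer; none of these assumptions is spelled out in the estimates themselves.

```python
HOURS_PER_DAY = 8  # assumption: one engineer-day is 8 working hours


def engineer_days(items):
    """Sum (engineers, days) pairs into total engineer-days."""
    return sum(engineers * days for engineers, days in items)


heroic = [
    (1, 8 / HOURS_PER_DAY),  # the hero's 8 hours through the night
    (1, 2),                  # two days of burnout afterwards
    (2, 1),                  # two engineers for a day to restore the database
    (1, 1),                  # a day of debt cleanup (assumed: one engineer)
]
systematic = [(2, 3)]        # two engineers for three days

heroic_total = engineer_days(heroic)
systematic_total = engineer_days(systematic)
print(heroic_total, systematic_total)
```

Under these assumptions both come to 6 engineer-days, yet only the first of the heroic effort's six is the one we tend to notice.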
Heroic efforts have immediate benefits and distant costs, while systematic improvements have distant benefits and immediate costs. Since operational excellence is measured in months and years, not in moments of greatness, we should discount the felt immediacy of both the benefits of heroic efforts and the costs of systematic improvement. We should also account for the technical debt of emergency maneuvers and for the incremental, accretive benefits of systematic improvement.