In the two years since launching Minus, operations has been a daily part of my life. Marked by moments of panic and periods of frustration, operations has also been a source of joy and profound satisfaction.
I find that the most insidious threat to achieving operational excellence is how emergencies, and the heroic efforts to resolve them, starve incremental improvement projects. Though heroes are both needed and much deserving of admiration, the impact of heroic efforts is easy to overestimate. This is because, while each instance of heroism has a larger impact than any single incremental project, incremental improvements accrete while heroism usually leaves scars. Compare:
A heroic effort. Our service goes down at 11pm on a Friday evening due to a catastrophic hardware failure and our hero, responding within 5 minutes, working through the night, brings the service back up.
A systematic improvement. A project to facilitate server log analysis across the service's 100 machines so that one can effectively do a single grep instead of grepping on all 100 machines.
We perceive the impact of the heroic effort with visceral clarity: the service goes down, our hero intervenes, the service recovers. We perceive the impact of the logging improvements more vaguely, since the service runs with or without them. Often, however, such improvements enable detectors to be built, and such detectors can detect problems early enough to entirely avert an impending emergency. On the other hand, the improvement may prove completely useless. We find it hard to estimate probabilities and to properly account for incremental but accretive gains.
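To make the "single grep" and the detector idea concrete, here is a minimal sketch in Python. It assumes the logs from every machine have already been collected into one dictionary keyed by host name; the names `grep_all`, `hosts_over_threshold`, and the sample data are hypothetical and only illustrate the shape of such a tool.

```python
import re
from collections import Counter


def grep_all(logs_by_host, pattern):
    """Search every host's log lines in one call; yield (host, line) matches."""
    regex = re.compile(pattern)
    for host, lines in sorted(logs_by_host.items()):
        for line in lines:
            if regex.search(line):
                yield host, line


def hosts_over_threshold(logs_by_host, pattern, threshold):
    """A toy detector: hosts whose number of matching lines reaches the threshold."""
    counts = Counter(host for host, _ in grep_all(logs_by_host, pattern))
    return sorted(host for host, n in counts.items() if n >= threshold)


# Hypothetical sample data standing in for logs gathered from each machine.
sample_logs = {
    "web-01": ["GET /a 200", "GET /b 500", "GET /c 500"],
    "web-02": ["GET /a 200", "GET /b 200"],
    "web-03": ["GET /a 500"],
}

matches = list(grep_all(sample_logs, r" 500$"))
noisy = hosts_over_threshold(sample_logs, r" 500$", threshold=2)
```

Once the search is a single call, a detector is only a few more lines on top of it, which is the accretive effect described above.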
We perceive the cost of this heroic effort to be something like 8 hours through the night. But perhaps, in addition to these 8 hours, we should also include the subsequent two days of burnout for our intrepid hero, plus 2 engineers for a day to bring the original database back up, plus a day to revert the technical debt incurred in hurried code written during the emergency. We perceive the cost of the systematic improvement clearly: since it is pre-planned, it has a clear cost estimate, perhaps 3 days for two engineers.
Heroic efforts have immediate benefits and distant costs while systematic improvements have distant benefits and immediate costs. Since operational excellence is measured in months and years and not momentary greatness, we should discount the felt immediacy of the benefits of heroic efforts and the costs of systematic improvement and account for the technical debt of emergency maneuvers and the incremental and accretive benefits of systematic improvement.
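The arithmetic behind this comparison can be worked through with the numbers given above. The sketch below makes two assumptions that are not in the text: an engineer-day is 8 working hours, and the day spent reverting the emergency code is one engineer's day.

```python
HOURS_PER_DAY = 8  # assumption: one engineer-day is 8 working hours


def engineer_days(items):
    """Sum (engineers, days) pairs into total engineer-days."""
    return sum(engineers * days for engineers, days in items)


# Heroic effort: the cost we perceive versus the cost we should count.
perceived_heroic = engineer_days([(1, 8 / HOURS_PER_DAY)])
full_heroic = engineer_days([
    (1, 8 / HOURS_PER_DAY),  # the night of the outage
    (1, 2),                  # two days of burnout for the hero
    (2, 1),                  # two engineers for a day to restore the original database
    (1, 1),                  # assumed one engineer for a day to revert the hurried code
])

# Systematic improvement: 3 days for two engineers.
systematic = engineer_days([(2, 3)])

print(perceived_heroic, full_heroic, systematic)  # prints: 1.0 6.0 6
```

Under these assumptions the emergency costs about as much as the planned project, even though it feels six times cheaper.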
Operating a service with perfect uptime is hard. No matter how superb the team, technology, and processes, operational emergencies happen. In such emergencies, I borrow a tool from aviators: the checklist. Aviators make extensive use of checklists, especially during takeoff and landing, to achieve extremely low rates of operational failure. I have found, in my work at Minus, that well-designed checklists are invaluable when stress and time pressure make it difficult to think systematically and rationally.
The following checklist, progressing in urgency from most to least, describes goals matched to increasingly higher levels of operational excellence.
Remove in-flight availability regressions. Before improving things, it's necessary to first bring up the service, and "bringing up the service" means everything: the customer-facing systems, the availability monitoring systems, the performance monitoring systems, the business metrics systems, backups, and non-critical (but valuable) functions.
Remove in-flight performance regressions. After the system is fully up, the next priority, if the service is running slower than before, is to bring it back to its previous performance levels. Improving performance is more difficult than restoring previous performance. So before thinking of improving things, I think we should first return to previous levels.
Design and build features to improve forward availability. Once the service is up and fast, we can plan and build. Availability is the combination of reliability and recovery time (e.g. downtime is the number of failures * the time to recover from each failure). Improve the completeness of the monitoring; reduce the time it takes for errors to be detected and noticed by the right human (more alarms, routing failures to the right places more quickly); reduce recovery time (how long to switch slave to master, how long to restart the service).
Improve performance monitoring. Once things are up and we can expect them to stay up, we can work on performance. Once availability reaches a certain level, the best way to make the service more available to the user is to improve performance (since while the site is loading, the service is not actually available). But before improving performance, we must be assured we have metrics that we trust and that measure appropriately.
Improve performance. Only now do we do the performance optimizations, whether in hardware, software, architecture, or product design.
Improve operational cost. Once the user experience is reliable and fast, we can finally think about how to reduce our costs without sacrificing our users' experience of our product.
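The downtime formula in the forward-availability item (number of failures * time to recover from each failure) can be worked through with a short sketch. The failure counts and recovery times below are hypothetical, chosen only to show that reducing either factor buys the same availability.

```python
def yearly_downtime_hours(failures_per_year, hours_to_recover):
    """Downtime = number of failures * time to recover from each failure."""
    return failures_per_year * hours_to_recover


def availability(failures_per_year, hours_to_recover, hours_per_year=365 * 24):
    """Fraction of the year the service is up."""
    downtime = yearly_downtime_hours(failures_per_year, hours_to_recover)
    return 1 - downtime / hours_per_year


# Hypothetical numbers: 12 failures a year, 2 hours to recover from each.
baseline = availability(12, 2.0)

# Halving the number of failures (better reliability) and halving recovery
# time (faster detection, faster slave-to-master switch) remove the same
# amount of downtime.
fewer_failures = availability(6, 2.0)
faster_recovery = availability(12, 1.0)
```

This is why the checklist item treats monitoring, alerting, and recovery speed as availability work in their own right, alongside preventing failures.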
At the behest of teammates at Minus who like my Emacs Orgmode setup, I share my configuration. The following eLisp optimizes Emacs for my use cases: engineering design, project planning, and blogging. Pasting this into your Emacs init.el should be sufficient to incorporate these customizations into your Emacs experience.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; My Emacs Orgmode Configuration.
;;
;; Authored by Carl Hu 2012
;; Copyright 2012 Carl Hu
;;
;; This program is distributed under the terms of the GNU General Public License.
;;
;; This program is free software: you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.
;;
;; This program is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;; GNU General Public License for more details.
;;
;; You should have received a copy of the GNU General Public License
;; along with this program. If not, see http://www.gnu.org/licenses.

(require 'org)
(add-hook 'org-mode-hook
          (lambda ()
            (modify-syntax-entry (string-to-char "\u25bc") "w"))) ; Down arrow for collapsed drawer.
(setq org-startup-indented t)
(setq org-hide-leading-stars t)
(setq org-odd-level-only nil)
(setq org-insert-heading-respect-content nil)
(setq org-M-RET-may-split-line '((item) (default . t)))
(setq org-special-ctrl-a/e t)
(setq org-return-follows-link nil)
(setq org-use-speed-commands t)
(setq org-startup-align-all-tables nil)
(setq org-log-into-drawer nil)
(setq org-tags-column 1)
(setq org-ellipsis " \u25bc" )
(setq org-speed-commands-user nil)
(setq org-blank-before-new-entry '((heading . nil) (plain-list-item . nil)))
(setq org-completion-use-ido t)
(setq org-indent-mode t)
(setq org-startup-truncated nil)
(setq auto-fill-mode -1)
(setq-default fill-column 99999)
(setq fill-column 99999)
(global-auto-revert-mode t)
(prefer-coding-system 'utf-8)
(cua-mode t)                            ;; keep the cut and paste shortcut keys people are used to.
(setq cua-auto-tabify-rectangles nil)   ;; Don't tabify after rectangle commands
(transient-mark-mode nil)               ;; No region when it is not highlighted
(setq cua-keep-region-after-copy t)

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; python
(setq-default indent-tabs-mode nil)     ; use only spaces and no tabs
(setq default-tab-width 4)

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; a sample of how to call shell or python program from emacs.
(defun todo-export ()
  (interactive)
  (save-buffer)
  (save-some-buffers t)
  (call-process-shell-command
   "/bin/bash -c \"cd ~/Desktop/personal; git pull; git commit -am '-'; git push;\""
   nil 0))
;; (supply the key to bind inside the kbd string)
(global-set-key (kbd "") 'todo-export)

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; i like pressing f3 to switch to my todo list from wherever i am.
(defun switch-to-personal-todo ()
  (interactive)
  (find-file "path-to-file"))
(global-set-key (kbd "<f3>") 'switch-to-personal-todo)

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; i like pageup page down for buffer changes.
(global-set-key [(control next)] 'next-buffer)
(global-set-key [(control prior)] 'previous-buffer)

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Add copy a whole line to clipboard to Emacs, bound to meta-c.
(defun quick-copy-line ()
  "Copy the whole line that point is on and move to the beginning of the next line.
Consecutive calls to this command append each line to the kill-ring."
  (interactive)
  (let ((beg (line-beginning-position 1))
        (end (line-beginning-position 2)))
    (if (eq last-command 'quick-copy-line)
        (kill-append (buffer-substring beg end) (< end beg))
      (kill-new (buffer-substring beg end))))
  (beginning-of-line 2)
  (message "Line appended to kill-ring."))
(define-key global-map (kbd "M-c") 'quick-copy-line)

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Add inserting current date time.
(defun my-insert-date (prefix)
  "Insert the current date. With prefix-argument, use ISO format.
With two prefix arguments, write out the day and month name."
  (interactive "P")
  (let ((format (cond ((not prefix) "%Y-%m-%d %H:%M")
                      ((equal prefix '(4)) "%Y-%m-%d")
                      ((equal prefix '(16)) "%A, %d. %B %Y"))))
    (insert (format-time-string format))))
(define-key global-map (kbd "C-M-d") 'my-insert-date)
Org2Text is an Orgmode-to-Plaintext conversion tool to make it easier to share Orgmode documents with non-Emacs users. Although Orgmode files are indeed text files, they are difficult to read directly. Orgmode also offers many export filters, text among them, but the text filters did not satisfy my needs. What I wanted was a near-WYSIWYG rendering of Orgmode in text format: an apparently simple task, but one that few seem to have attempted.
Thus, Org2Text is two things: one, a plain text format for cleanly representing Orgmode outlines; and two, a software tool, written in Python, that converts Orgmode outlines to text.
The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. While Markdown’s syntax has been influenced by several existing text-to-HTML filters, the single biggest source of inspiration for Markdown’s syntax is the format of plain text email.
Let us say that I have created the following outline in Emacs's Orgmode:
My Todo List
* At work
  * Call John :URGENT:
  * Optimize the logging system :MEDIUM:
    * Eliminate callout and process switch
      so that context switch costs and cache
      coherency is improved.
* At home
  * Buy supplies :OPTIONAL:
    * Address
      Pearl Paint
      111 8th Avenue
      New York City
  * Pencil
  * Eraser
* Clean up apartment
I find this presentation of my thoughts clear and simple. That is, Orgmode's decisions on alignment and presentation seem to me to be near-optimal.
But what happens if I want to share this outline with a non-Emacs-using colleague? Well, one option is I send him the text file. Unfortunately, the raw text for this outline is:
My Todo List
* At work
** Call John :URGENT:
** Optimize the logging system :MEDIUM:
*** Eliminate callout and process switch
so that context switch costs and cache
coherency is improved.
* At home
** Buy supplies :OPTIONAL:
*** Address
Pearl Paint
111 8th Avenue
New York City
** Pencil
** Eraser
* Clean up apartment
This presentation of the outline has a few drawbacks: 'Drawers' (e.g. 'Pearl Paint...') are not properly aligned with the heading (e.g. 'Address'). No indentation is used. Long lines that wrap are not aligned with headers. The difference between star levels is hard to discern. And so on.
This is an important feature for me, so I searched far and wide on the net, but to no avail. So I designed and implemented a new solution, Org2Text: a plaintext presentation of Orgmode outlines together with a concise implementation in Python.
Use plain text. It's easy to present beautiful outlines in HTML, but I often need to use plain text email as well as save in README files that are meant to be read directly in text editors. So, the exported outline must be elegant to read with only a text editor or cat on a Unix command line.
Clean alignment. 4 spaces for every additional star. Don't use tabs because of varying presentation of tabs in email and editors. Smaller indentation is hard to discern without the color highlighting in Emacs. Long lines are wrapped and indented to align with the text of the heading.
Bullets can use Unicode for clarity. Through experimentation, I found that everyone I need to work with now uses editors and email clients that support Unicode. Use the following three increasingly small bullets: ●, •, and ∙.
Tags. Tags at the end of headings are hard to parse since they are not vertically aligned. Put them in the front and surround them with the Latin quote characters: ‹URGENT›.
Word wrap. The main way I send text to folks is through email. Without wrapping, consistent indentation of headings is lost. So this solution must implement word wrap.
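The five requirements above can be illustrated together on a single heading line. The following is a minimal sketch, not the Org2Text program itself: it leans on Python's standard `textwrap` module for the hanging indent, and the names `format_heading`, `SPACES_PER_LEVEL`, and `TRAILING_TAGS` are hypothetical. It assumes its input is a heading line that starts with one or more stars.

```python
import re
import textwrap

BULLETS = {1: "\u25cf", 2: "\u2022", 3: "\u2219"}  # the three bullets, largest first
SPACES_PER_LEVEL = 4                                # 4 spaces for every additional star
TRAILING_TAGS = re.compile(r"\s+((?::[A-Za-z0-9_]+)+:)\s*$")


def format_heading(line, width=60):
    """Render one Orgmode heading line as indented, wrapped plain text."""
    stars = len(line) - len(line.lstrip("*"))
    text = line[stars:].strip()
    # Move trailing :TAG: markers to the front, wrapped in Latin quote characters.
    match = TRAILING_TAGS.search(text)
    if match:
        tags = [t for t in match.group(1).split(":") if t]
        text = " ".join("\u2039%s\u203a" % t for t in tags) + " " + text[:match.start()]
    indent = " " * (SPACES_PER_LEVEL * (stars - 1))
    bullet = BULLETS.get(stars, BULLETS[3])
    # Wrapped lines hang under the heading text, not under the bullet.
    return textwrap.fill(text, width=width,
                         initial_indent=indent + bullet + " ",
                         subsequent_indent=indent + "  ")


print(format_heading("** Call John :URGENT:"))
```

Because `subsequent_indent` is two characters wider than the bullet's indent, a long heading that wraps stays aligned with the heading text, which is what keeps the outline readable in plain text email.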
The following is a transform of the above outline satisfying these five requirements.
● At work
    • ‹URGENT› Call John
    • ‹MEDIUM› Optimize the logging system
        ∙ Eliminate callout and process switch
● At home
    • ‹OPTIONAL› Buy supplies
        ∙ Address
          Pearl Paint
          111 8th Avenue
          New York City
    • Pencil
    • Eraser
The complete program, written in Python, that implements this transformation is below.
# -*- coding: utf-8 -*-
#
# Org2Text
#
# Authored by Carl Hu 2012
# Copyright 2012 Carl Hu
#
# This program is distributed under the terms of the GNU General Public License.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see http://www.gnu.org/licenses.

import sys
import os
import re

SPACES_PER_INDENT_LEVEL = 5
TABLE_MARKER = '!#TABLE#!'


def table_clean(line):
    ''' Line up tables to the indent. Make the corners nice. '''
    l = line.lstrip()
    if len(l) > 0 and l[0] == '|':
        return TABLE_MARKER + line.replace('|', ' ').replace('+', '-').replace('-', u'\u2015')[1:]
    else:
        return line


# Utilities used by both export and context-aware tag filtering
#
REGEX_HEADING = re.compile(r'^\*+ ')  # one or more leading stars followed by a space


def extract_level(line):
    stars = REGEX_HEADING.search(line)
    return len(stars.group().strip()) if stars else 0


TAG_REGEX = re.compile(' :[a-zA-Z0-9_]+:')


def extract_tags(line):
    ''' Regexp should handle :tag:, :=), 5:32:40. '''
    results = TAG_REGEX.search(line)
    if not results:
        return (set(), line)
    else:
        base_line = line[:results.start()].rstrip()
        tag_part = line[results.start():]
        tags = [tag.strip().rstrip(':') for tag in tag_part.split(':')
                if len(tag.strip().rstrip(':')) > 0]
        return (set(tags), base_line)


def render_tags(line):
    # get_indent calls render_tags, but the listing does not define it. This
    # version is an assumed reconstruction based on the tag requirement: move
    # the tags to the front of the line, each wrapped in Latin quote characters.
    tags, base_line = extract_tags(line)
    if not tags:
        return line
    return ' '.join(u'\u2039%s\u203a' % tag for tag in sorted(tags)) + ' ' + base_line


# Extraction
BULLETS = {1: u'\u25cf',   # Bigger black circle
           2: u'\u2022',   # Medium circle (bullet)
           3: u'\u2219'}   # Dot

whitespace_matcher = re.compile(r'^[\s%s]*' % ''.join(BULLETS.values()))


def count_leading_spaces(inp):
    ret = whitespace_matcher.search(inp)
    if ret:
        return len(ret.group())
    else:
        return 0


def get_indent(indent, line, is_drawer):
    if len(line.strip()) == 0:
        return (indent, line, is_drawer)
    level = extract_level(line)
    if level == 0:
        # Not a heading: drawer or body text under the current heading.
        base_line = render_tags(line)
        return (indent, table_clean(base_line), True)
    else:
        c = BULLETS.get(level, u'\u2219')
        base_line = render_tags(line[level:].lstrip())
        return (level, c + ' ' + base_line, False)


def canonicalize_word(word):
    if len(word.strip()) == 0:
        return word  # keep the spaces
    if word[-1] == ' ':
        return word
    else:
        return word + ' '


# Rendering
def render(block, width, hang, do_quote):
    lines = []
    quote_prefix, len_quote_prefix = (u'| ', 2) if do_quote else ('', 0)
    hang = hang + len_quote_prefix
    leader = count_leading_spaces(block)
    if block.startswith(TABLE_MARKER):
        return ' '*(hang + leader) + quote_prefix + block[len(TABLE_MARKER):]
    for paragraph in block.split('\n'):
        line = ''
        tokenlist = [' '*hang + quote_prefix + ' '*(leader if do_quote else 0)] + paragraph.split()
        for word in tokenlist:
            # word wrap.
            word = canonicalize_word(word)
            if len(line) + len(word) < width:
                line = line + word
            else:
                lines.append(line)
                curr_quote_prefix = quote_prefix if do_quote else ' '  # headings get spaces for wrapped lines.
                line = ' '*hang + curr_quote_prefix + ' ' * (leader if do_quote else 0) + word
        lines.append(line)
    return '\n'.join(lines)


def convert(input_file, output_file):
    indent_level = 0
    is_drawer = False
    # Python 3 text streams already yield str, so no explicit decode/encode is needed.
    for line in input_file.read().splitlines():
        indent_level, line_out, is_drawer = get_indent(indent_level, line, is_drawer)
        hang = SPACES_PER_INDENT_LEVEL * (indent_level - 1)
        output_file.write(render(line_out, 80, hang, is_drawer) + '\n')


# Main
f_output = sys.argv[1]
outfile = open(f_output, mode='w', encoding='utf-8')
convert(sys.stdin, outfile)
Typically, I am editing a large outline such as my Todo list and I'd like to share one part of it with a colleague who is helping me. I select that section and hit C-<f5>. The eLisp code below does this: it routes the highlighted text (or "region", in Emacs parlance) to the standard input of Org2Text, then routes the standard output of the Python program to a new Emacs buffer for use as desired.
(defun my-export-orgmode ()
  (interactive)
  (save-buffer)
  (save-some-buffers t)
  (shell-command-on-region
   (mark) (point)
   (format "/bin/bash -c \"cd /home/carlhu/Desktop/personal; /usr/bin/python /home/carlhu/Desktop/personal/export_org.py %s; \""
           (buffer-file-name))
   nil nil nil))

(add-hook 'org-mode-hook
          (lambda ()
            (define-key org-mode-map (kbd "C-<f5>") 'my-export-orgmode)))
Carl Hu likes to make things people use. He is an engineer and entrepreneur and has contributed to products from databases to mobile applications. He cofounded Minus in 2010 and serves as CEO. Prior to Minus, Hu served as the Principal Engineer of Amazon's RCX division (the 450 engineer team that runs Amazon.com employs one Principal Engineer); the Chief Technology Officer at Appzero.com; and Director of Technology at Progress Software. At Microsoft, Hu invented the lockless threadsafe queue widely used in the Windows NT Kernel and helped build SQL Server's query processing and storage engine. Hu graduated in Physics from the University of Toronto and has authored patents on lockless concurrent data structures, fault tolerant distributed locking, transactional replication, and others.
Feel free to contact me in the Minus app, username @carl.
In 2004, hardware was becoming cheap enough to make building high performance databases from commodity parts very attractive. However, the reliability was poor, making the construction of highly available clusters a pressing and challenging problem. Working with friends and colleagues Ben Rousseau, Rick Tompkins, and Gus Bjorklund over three years, we developed an approach for creating a distributed highly available database system.
We built our solution by creating middleware between the application and database that leveraged the Paxos consensus algorithm and total-ordered communications to achieve good performance (better than two-phase commit) and full transactional ACID guarantees. I authored this paper to describe our work.