October 13, 2012

Org2Text

Why Org2Text

Org2Text is an Orgmode-to-plaintext conversion tool that makes it easier to share Orgmode documents with non-Emacs users. Although Orgmode files are plain text files, they are difficult to read directly. Orgmode also offers many export filters, plain text among them, but the text filters did not satisfy my needs. What I wanted was a near-WYSIWYG rendering of Orgmode in text format, a seemingly simple task that apparently few have attempted.

Thus, Org2Text is two things: one, a plain text format for cleanly representing Orgmode outlines; and two, a software tool, written in Python, that converts Orgmode outlines to that format.

The Format

The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. While Markdown’s syntax has been influenced by several existing text-to-HTML filters, the single biggest source of inspiration for Markdown’s syntax is the format of plain text email.

Let us say that I have created the following outline in Emacs's Orgmode:

My Todo List
* At work
     * Call John :URGENT:
     * Optimize the logging system :MEDIUM:
           * Eliminate callout and process switch 
             so that context switch costs and cache 
             coherency is improved.
* At home
     * Buy supplies :OPTIONAL:
           * Address
             Pearl Paint
             111 8th Avenue
             New York City
           * Pencil
           * Eraser
     * Clean up apartment

I find this presentation of my thoughts clear and simple. That is, Orgmode's decisions on alignment and presentation seem to me to be near-optimal.

But what happens if I want to share this outline with a colleague who does not use Emacs? Well, one option is to send him the text file. Unfortunately, the raw text for this outline is:

My Todo List
* At work
** Call John :URGENT:
** Optimize the logging system :MEDIUM:
*** Eliminate callout and process switch 
so that context switch costs and cache 
coherency is improved.
* At home
** Buy supplies :OPTIONAL:
*** Address
Pearl Paint
111 8th Avenue
New York City
** Pencil
** Eraser
* Clean up apartment

This presentation of the outline has several drawbacks. 'Drawers' (e.g. 'Pearl Paint...') are not aligned with their heading (e.g. 'Address'). No indentation is used. Long lines that wrap are not aligned with their headings. The difference between star levels is hard to discern. And so on.

A readable plain text export is an important feature for me, so I searched far and wide on the net, but to no avail. So I designed and implemented a new solution, Org2Text: a plaintext presentation of Orgmode outlines, supplied with a concise implementation in Python. It is built around five requirements:

  1. Use plain text. It's easy to present beautiful outlines in HTML, but I often need to send plain text email and to save outlines in README files that are meant to be read directly in text editors. So the exported outline must be elegant to read with only a text editor or cat on a Unix command line.

  2. Clean alignment. Indent 5 spaces for every additional star. Don't use tabs, because email clients and editors render tabs differently. Smaller indentation is hard to discern without the color highlighting in Emacs. Long lines are wrapped and indented to align with the text of the heading.

  3. Bullets can use Unicode for clarity. Through experimentation, I found that everyone I need to work with now uses editors and email clients that support Unicode. Use the following three increasingly small bullets: ●, •, and ∙.

  4. Tags. Tags at the end of headings are hard to scan since they are not vertically aligned. Put them at the front and surround them with the latin quote characters: ‹URGENT›.

  5. Word wrap. The main way I send text to folks is through email. Without wrapping, consistent indentation of headings is lost. So this solution must implement word wrap.

The following is a transform of the above outline satisfying these five requirements.

● At work
     • ‹URGENT› Call John
     • ‹MEDIUM› Optimize the logging system
          ∙ Eliminate callout and process switch 
● At home 
     • ‹OPTIONAL› Buy supplies 
          ∙ Address 
            Pearl Paint 
            111 8th Avenue 
            New York City 
     • Pencil 
     • Eraser 
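
The indentation, bullet, and tag rules above can be sketched for a single heading as follows. This is only an illustration under stated assumptions, not the full program: the names format_heading, TAG_BLOCK, and SPACES_PER_LEVEL are hypothetical, the indent of 5 spaces per level is read off the sample output, and body text and word wrap are ignored.

```python
# -*- coding: utf-8 -*-
import re

SPACES_PER_LEVEL = 5  # assumption: taken from the sample output above
BULLETS = {1: u'\u25cf', 2: u'\u2022', 3: u'\u2219'}  # large, medium, small
# Trailing tag group such as " :URGENT:" or " :A:B:" at the end of a heading.
TAG_BLOCK = re.compile(r'\s+:([A-Za-z0-9_:]+):\s*$')

def format_heading(raw):
    """Render one raw heading, e.g. '** Call John :URGENT:'."""
    stars = len(raw) - len(raw.lstrip('*'))
    if stars == 0 or not raw[stars:].startswith(' '):
        return raw  # not a heading: return the line unchanged
    text = raw[stars:].strip()
    tags = []
    match = TAG_BLOCK.search(text)
    if match:
        tags = [t for t in match.group(1).split(':') if t]
        text = text[:match.start()]
    # Tags move to the front, each wrapped in single angle quotes.
    prefix = u''.join(u'\u2039%s\u203a ' % t for t in tags)
    bullet = BULLETS.get(stars, BULLETS[3])  # deeper levels reuse the dot
    indent = u' ' * (SPACES_PER_LEVEL * (stars - 1))
    return indent + bullet + u' ' + prefix + text
```

Calling format_heading on each heading of the raw outline gives lines shaped like the ones in the transformed outline; lines that do not start with stars pass through untouched.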

The Python Program

The complete program, written in Python, that implements this transformation is below.

# -*- coding: utf-8 -*-
#
# Org2Text
#
# Authored by Carl Hu 2012
# Copyright 2012 Carl Hu
# 
# This program is distributed under the terms of the GNU General Public License.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see http://www.gnu.org/licenses.

import sys
import os
import re

SPACES_PER_INDENT_LEVEL = 5
TABLE_MARKER = '!#TABLE#!'

def table_clean(line):
    ''' Line up tables to the indent. Make the corners nice. '''
    l = line.lstrip()
    if len(l)>0 and l[0]=='|':
        return TABLE_MARKER + line.replace('|', ' ').replace('+', '-').replace('-',u'\u2015')[1:]
    else:
        return line

# Utilities used by both export and context-aware tag filtering #
REGEX_HEADING = re.compile(r'^\*+ ')  # one or more leading stars, then a space
def extract_level(line):
    stars = REGEX_HEADING.search(line)
    return len(stars.group().strip()) if stars else 0

TAG_REGEX = re.compile(' :[a-zA-Z0-9_]+:')
def extract_tags(line):
    ''' Regexp should handle :tag:, :=), 5:32:40. '''
    results = TAG_REGEX.search(line)
    if not results:
        return (set(), line)
    else:
        base_line = line[:results.start()].rstrip()
        tag_part = line[results.start():]
        tags = [tag.strip().rstrip(':') for tag in tag_part.split(':') if len(tag.strip().rstrip(':')) > 0]
        return (set(tags), base_line)

# Extraction
BULLETS = { 1: u'\u25cf',  # Bigger black circle
            2: u'\u2022',  # Medium circle (bullet)
            3: u'\u2219' } # Dot

whitespace_matcher = re.compile(u'^[\s%s]*' % ''.join(BULLETS.values())) 
def count_leading_spaces(inp):
    ret = whitespace_matcher.search(inp)
    if ret: return len(ret.group())
    else: return 0

def render_tags(line):
    ''' NOTE: render_tags is called by get_indent but is missing from this
        listing. This version is a reconstruction based on requirement 4:
        move the tags to the front and wrap each in latin quotes. '''
    tags, base_line = extract_tags(line)
    if not tags:
        return line
    return u' '.join(u'\u2039%s\u203a' % tag for tag in sorted(tags)) + u' ' + base_line

def get_indent(indent, line, is_drawer):
    if len(line.strip()) == 0: 
         return (indent, line, is_drawer)
    level = extract_level(line)
    if level == 0:
        base_line = render_tags(line)
        return (indent, table_clean(base_line), True)
    else:
        c = BULLETS.get(level, u'\u2219')
        base_line = render_tags(line[level:].lstrip())
        return (level, c + ' ' + base_line, False)

def canonicalize_word(word):
    if len(word.strip()) == 0: return word # keep the spaces
    if word[-1]==' ': return word
    else: return word + ' '

# Rendering
def render(block, width, hang, do_quote):
    lines = []
    quote_prefix, len_quote_prefix = (u'| ', 2) if do_quote else  ('', 0)
    hang = hang + len_quote_prefix
    leader = count_leading_spaces(block)
    if block.startswith(TABLE_MARKER): 
        return ' '*(hang + leader) + quote_prefix + block[len(TABLE_MARKER):] 
    for paragraph in block.split('\n'):
        line = ''
        tokenlist = [' '*hang + quote_prefix + ' '*(leader if do_quote else 0)] + paragraph.split()
        for word in tokenlist: # word wrap.
            word = canonicalize_word(word)
            if len(line) + len(word) < width:
                line = line + word            
            else:
                # print 'LEADER', leader, 'HANG', hang, "PREFIXLEN:", ('APPEND/B:[%s]' % line).encode('utf-8')
                lines.append(line)
                curr_quote_prefix = quote_prefix if do_quote else '  ' # headings get spaces for wrapped lines.
                line = ' '*hang + curr_quote_prefix + ' ' * (leader if do_quote else 0) + word
        # print 'LEADER', leader, 'HANG', hang, "PREFIXLEN:", ('APPEND/B:[%s]' % line).encode('utf-8')
        lines.append(line)
    return '\n'.join(lines)

def convert(input_file, output_file):
    indent_level = 0
    is_drawer = False
    for line in input_file.read().splitlines():
        line = line.decode('utf-8')
        indent_level, line_out, is_drawer = get_indent(indent_level, line, is_drawer)
        hang = SPACES_PER_INDENT_LEVEL * (indent_level - 1)
        output_file.write((render(line_out, 80, hang, is_drawer) + '\n').encode('utf-8'))

# Main
f_output = sys.argv[1]
outfile = open(f_output, mode='w')
convert(sys.stdin, outfile)
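
The word wrap with a hanging indent that render performs by hand can also be sketched with the standard library's textwrap module. This is an alternative illustration, not the code above: wrap_heading is a hypothetical name, and it covers only headings, not the quoted drawer text or tables.

```python
# -*- coding: utf-8 -*-
import textwrap

def wrap_heading(text, hang, width=80, bullet=u'\u2022'):
    """Wrap one heading so continuation lines align with the heading text."""
    first = u' ' * hang + bullet + u' '   # indent, bullet, one space
    rest = u' ' * len(first)              # continuation lines skip the bullet
    return textwrap.fill(text, width=width,
                         initial_indent=first, subsequent_indent=rest)
```

For a third-level heading, hang would be 10 (two indent levels of 5 spaces), so wrapped lines start in column 12, directly under the first word of the heading.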

Typically, I am editing a large outline such as my Todo list and I'd like to share one part of it with a colleague who is helping me. I select that section and hit C-<f5>. The Emacs Lisp code below does this: it routes the highlighted text (or "region", in Emacs parlance) to the standard input of Org2Text, then routes the standard output of the Python program to a new Emacs buffer for use as desired.

(defun my-export-orgmode ()
  (interactive)
  (save-buffer)
  (save-some-buffers t)
  (shell-command-on-region (mark) (point)
                           (format
                               "/bin/bash -c \"cd /home/carlhu/Desktop/personal;  /usr/bin/python /home/carlhu/Desktop/personal/export_org.py %s; \""
                               (buffer-file-name)) nil nil nil))
(add-hook 'org-mode-hook
          (lambda () (define-key org-mode-map (kbd "C-<f5>") 'my-export-orgmode)))
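
The Emacs command above treats the converter as a filter: the region arrives on standard input and whatever the script prints comes back to Emacs. A minimal sketch of an entry point with that shape, written for Python 3, might look like the following. The names run_filter and transform are hypothetical; transform stands in for the per-line conversion and is not part of the listing above.

```python
import io

def run_filter(transform, in_stream, out_stream):
    """Read all lines from in_stream and write transform(line) to out_stream."""
    for line in in_stream.read().splitlines():
        out_stream.write(transform(line) + u'\n')

# Assumed usage as a script, wrapping the byte streams so that both ends
# are UTF-8 regardless of the locale:
#
#   import sys
#   stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
#   stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
#   run_filter(my_transform, stdin, stdout)
#   stdout.flush()
```

Because the streams are passed in as parameters, the same function can be exercised with in-memory streams such as io.StringIO without involving Emacs or a shell.
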