Org-babel: Uses

A Research Project

A research project typically produces one or more documents that describe or rely upon:

  • a data collection
  • computations and code used in data analysis or simulation
  • methodological conventions and assumptions
  • decisions among alternate analytic paths

The documents produced by a research project typically stand apart from the things they describe and rely upon, which makes it difficult for other researchers to understand fully or to reproduce the results of the research project.

A software solution to this problem was proposed by Gentleman and Temple Lang, who "introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data, …), and as a means for distributing, managing and updating the collection." 1 They summarize the uses and implications of a compendium:

  • it encapsulates the actual work of the author, not just an abridged version suitable for publication;
  • it can display different levels of detail in derived documents;
  • the computations included in it can be re-run by an interested reader, potentially with different inputs;
  • it contains explicit computational details that make it easier for an interested reader to adapt and extend the methods;
  • it enables programmatic construction of plots and tables;
  • its components can be treated as data or inputs to software and manipulated programmatically in ways perhaps not envisioned by the author.

Org-babel and Org-mode provide the tools needed to create a multi-language compendium in a single Org-mode file. The example described here is taken from a work in progress, one that has seen many changes in structure and organization. No claim is made that it is the best way to do things, but it works and is proving extremely useful in the conduct of the research project.

Products of the Org-babel Compendium

The example Org-babel compendium is designed to produce three derived documents:

  • a LaTeX document intended for publication in an academic journal
  • a Beamer slide show to accompany a conference presentation
  • a web page that chronicles the data entry process

The first two of these documents are held in Org-babel LaTeX code blocks, which are tangled to produce source files that can be compiled in the usual way by one of the LaTeX systems. The third uses the Org-mode HTML exporter to generate the document directly from the Org-mode file.

Organization of the Org-mode File

The Org-mode file is divided into eight sections:

* Org-mode Setup
* Software Setup
* Data
* Instructions for Use
* Documents
* Project Tracking
* Quality Control
* Notes

Org-mode Setup

The goal of the Org-mode setup is to specify the environment as completely as possible so the file exhibits the same behavior on different computers with their own Org-mode setups.

#+OPTIONS:   H:3 num:t toc:t \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t
#+OPTIONS:   TeX:t LaTeX:nil skip:nil d:nil todo:t pri:nil tags:not-in-toc
#+INFOJS_OPT: view:nil toc:nil ltoc:t mouse:underline buttons:0 path:http://orgmode.org/org-info.js
#+EXPORT_SELECT_TAGS: export
#+EXPORT_EXCLUDE_TAGS: noexport
#+STYLE: <link rel="stylesheet" type="text/css" href="http://orgmode.org/org.css" />
#+TAGS: export(e) noexport(n)
#+TODO:
#+TODO: TODO(t) STARTED(s) | DONE(d)    

Software Setup

The example project uses R software to analyze metric and categorical observations on a class of traditional Hawaiian stone tools known as adzes. The object of this section is to establish an R session and populate it with information from a remote MySQL server. Subsequent queries of the data for analysis are all local, which speeds up the process considerably.

  • The code block r-adze-session loads libraries for preparing graphics and tables, connects to the remote MySQL server with a call to another R code block, populates an R dataframe, whole.adze, and lists the R objects that were created.
    #+srcname: r-adze-session
    #+begin_src R :session adze :noweb yes
      library(ggplot2)
      library(xtable)
      <<r-connect>>
      <<r-complete-2>>
      objects()
    #+end_src
    
    #+results:
    | con        |
    | d.complete |
    | whole.adze |
    
    #+srcname: r-connect
    #+begin_src R 
      library(RMySQL)
      con <- dbConnect(MySQL(), user="user", password="password", dbname="dbname", host="host")
    #+end_src
    
    #+srcname: r-complete-2
    #+begin_src R 
      whole.adze <- dbGetQuery(con, "select * from adze where edge_present = 'true' AND poll_present = 'true'")
    #+end_src
    

Data

This section puts the adze data in an Org-mode table for the interested reader. This gives access to the data without giving access to the MySQL server.

#+srcname: data-dump
#+begin_src R :colnames yes :session adze
  whole.adze
#+end_src
#+results: data-dump
| id | identifier       | storage_location | site        | weight | adze_type | bevel  | edge_present | chin_present | shoulder_present | poll_present | length_poll | length_shoulder | length_chin | width_edge | width_shoulder_front | width_shoulder_back | thickness_shoulder | thickness_chin | edge_angle | bevel_shape | edge_shape_a | edge_shape_b | face_reduced | butt_angle | color_value | complete | broken | reworked | polish  |
|----+------------------+------------------+-------------+--------+-----------+--------+--------------+--------------+------------------+--------------+-------------+-----------------+-------------+------------+----------------------+---------------------+--------------------+----------------+------------+-------------+--------------+--------------+--------------+------------+-------------+----------+--------+----------+---------|
|  1 | OA B1-30-29      | Tray 1           | 50-Oa-B1-30 |    111 | primary   | single | true         | true         | true             | true         |          92 |              48 |          11 |         33 |                   29 |                  30 |                 16 |             11 |         36 | convex      | straight     | straight     | true         |         10 |           4 | complete |        |          | present |
|  2 | 50-OA-B1-30-T8-1 | Tray 1           | 50-Oa-B1-30 |     32 | secondary | single | true         | true         | true             | true         |          58 |              28 |          19 |         19 |                   18 |                  18 |                 11 |             10 |         35 | convex      | straight     | straight     | false        |          0 |           3 | complete |        | other    | present |
...
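
Because the table is named data-dump, a reader without access to the MySQL server can presumably hand it back to R, for example through an Org-babel :var header argument. The following is a minimal sketch of that idea in plain R, using only the two rows shown above. The name adze.sample is hypothetical and stands in for the full whole.adze data frame.

```r
# Two records copied from the data-dump table above (a subset of columns).
# adze.sample is a hypothetical stand-in for whole.adze.
adze.sample <- data.frame(
  id        = c(1, 2),
  weight    = c(111, 32),
  adze_type = c("primary", "secondary")
)

# The same kind of local query the compendium runs against whole.adze:
# the smallest and largest blade weight, in grams.
weight.range <- range(adze.sample$weight)
weight.range
```

With the complete table in place of the two sample rows, the same call would reproduce the weight range reported in the derived documents.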

Instructions for Use

This section gives the interested reader basic instructions on how to create the derived documents.

This can also be helpful for the author of the Org-mode document.

* Instructions for Use                                             :noexport:
** Generate HTML pages for adzes.tsdye2.com [/]
   - [ ] Run org-babel-execute-buffer, Ctrl-c Meta-b b, to refresh all
     the R code blocks
   - [ ] Publish, Ctrl-c Ctrl-e P
   - [ ] ftp, Mirror adzes subdomain
** Generate Print and Beamer documents [/]
   - [ ] Run org-babel-execute-buffer, Ctrl-c Meta-b b, to refresh all
     the R code blocks
   - [ ] Run org-babel-tangle, Ctrl-c Meta-b t, to generate
     adze_print.tex and adze_beamer.tex
   - [ ] Compile the tex files

Documents

Two documents are created with Org-babel code blocks. I find it easiest to outline the structure of the paper down to the level of the paragraph. The leaves of the Org-mode tree are paragraph topic sentences. This is a bit of work, but it is made easier by YASnippets for Org-babel code blocks and frequently used Beamer constructs. I find that the outlining process is an aid to writing and well worth the effort.

Note that the LaTeX code blocks each have a header argument :results silent so that Org-babel doesn't put the results of evaluating them in a #+results block.

* Documents
** Preamble
*** LaTeX Preamble
#+srcname: latex-preamble
#+begin_src latex :results silent :tangle adze_print.tex
\documentclass{article}
\author{A. N. Author}
\title{Article Title}

\begin{document}

\maketitle
#+end_src

*** Beamer Preamble
#+srcname: beamer-preamble
#+begin_src latex :results silent :tangle adze_beamer.tex
\documentclass{beamer}
\mode<presentation>
{
 \usetheme{Malmoe}
 \usecolortheme{default}
}
\usepackage[english]{babel}
\usepackage[latin1]{inputenc}
\usepackage{times} 
\usepackage[T1]{fontenc}
\institute{The Institute}          
\subject{An Interesting Subject}
\beamerdefaultoverlayspecification{<+->}
\usepackage{booktabs}

\title{A Presentation Title}
\author{A. N. Author}

\begin{document}

\maketitle

#+end_src

** Introduction
*** LaTeX Source
*** Beamer Source
*** R Code
*** Notes
** Methods

** Results

** Postamble

Text is entered in LaTeX code blocks using the full power of AUCTeX and RefTeX.

Note the use of noweb references to insert the results of R code blocks directly into the LaTeX document.

Many sections of the LaTeX document, such as this obligatory description of the artifact collection, can be written while data capture is underway. When data capture is complete, the document can be refreshed.

** Description of the Collection
*** Notes

**** TODO Work out a summary of cross sections, with graphics, relate to Duff
     SCHEDULED: <2009-11-09 Mon>
     - plot width_shoulder_front on x,  width_shoulder_back on y,
       thickness_shoulder on symbol size
     - quadrangular adzes will plot along x=y
*** LaTeX source

#+srcname: latex-desc-coll
#+begin_src latex :results silent :tangle adze_print.tex
  \section{Description of the Collection}
  \label{sec:desc-coll}
  
#+end_src 

#+srcname: latex-adze-wt
#+begin_src latex :results silent :noweb yes :tangle adze_print.tex
  The distribution of complete adze blade weights is shown in 
  figure~\ref{fig:complete-weight}.  The weight range is
  <<r-weight-min()>>--<<r-weight-max()>>~g.

  \begin{figure}[htb!]
    \centering
    \includegraphics[width=5in]{<<r-complete-weight-histogram-pdf()>>}
    \caption[Weights of complete adzes]{Weights of complete adzes on a
      logarithmic scale.}
    \label{fig:complete-weight}
  \end{figure}

#+end_src

*** R routines

**** Adze blade maximum weight (whole adzes)
   A simple retrieval of the maximum adze blade weight in grams.
#+srcname: r-weight-max
#+begin_src R :session adze :exports none
 max(whole.adze$weight)
#+end_src 

#+results: r-weight-max
: 3062

**** Adze blade minimum weight (whole adzes)
   A simple retrieval of the minimum adze blade weight in grams.
#+srcname: r-weight-min
#+begin_src R :session adze :exports none
 min(whole.adze$weight)
#+end_src 

#+results: r-weight-min
: 0

#+srcname: r-complete-weight-histogram-pdf
#+begin_src R :session adze :file r/adze_wt_log.pdf :exports none
  adze.wt <- ggplot(whole.adze, aes(x = weight))
  adze.wt + geom_histogram() + scale_x_log10()
  ggsave(file = "adze_wt_log.pdf", width = 5, height = 3)
#+end_src

#+results: r-complete-weight-histogram-pdf
file:r/adze_wt_log.pdf
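
The TODO item in the outline above describes a plot for summarizing cross sections. One way it might look with ggplot2 is sketched here, assuming the column names shown in the data table. The names adze.sample and cross.section are hypothetical, and the two rows are copied from the table in the Data section rather than drawn from the full collection.

```r
library(ggplot2)

# Two records copied from the data table; in the compendium the
# whole.adze data frame from the adze session would be used instead.
adze.sample <- data.frame(
  width_shoulder_front = c(29, 18),
  width_shoulder_back  = c(30, 18),
  thickness_shoulder   = c(16, 11)
)

# Front width on x, back width on y, shoulder thickness as symbol size.
# The dashed line marks x = y, where quadrangular adzes should plot.
cross.section <- ggplot(adze.sample,
                        aes(x = width_shoulder_front,
                            y = width_shoulder_back,
                            size = thickness_shoulder)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")
```

Printing cross.section in the session draws the plot. Saving it to a file would follow the same pattern as the histogram block above.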

I find it convenient to work on the Beamer slide show at the same time as I am writing the LaTeX document.

Figures created for the LaTeX document are often useful in the Beamer slide show.

Note the correspondence between the Beamer code block and the LaTeX code block above.

*** Beamer source
#+begin_src latex :results silent :noweb yes :tangle adze_beamer.tex
  \begin{frame}
    \frametitle{Description of the Collection}
    \begin{columns}
      \begin{column}{0.5\textwidth}
        The weight range is <<r-weight-min()>>--<<r-weight-max()>>~g
      \end{column}
      \begin{column}{0.5\textwidth}
        \begin{centering}
          \includegraphics[width =
          0.5\textwidth]{<<r-complete-weight-histogram-pdf()>>}\par 
        \end{centering}
      \end{column}
    \end{columns}
  \end{frame} 
#+end_src

Project Tracking

This section produces an HTML document that is made available to collaborators so they can track project progress.

Note that the R code blocks each have a header argument, :exports none, to keep the source out of the HTML document.

Graphics created in R are saved to file. A link to the file created by the header argument, :file, instructs the exporter to insert the graphic into the HTML document.

* Project Tracking                                                   :export:
** Complete Adzes
*** Adze Weight
   Quantiles of the complete adze blade weights:

#+srcname: r-weight-quantile-simple
#+begin_src R :session adze :exports none :results output
 quantile(whole.adze$weight)
#+end_src

#+results: r-weight-quantile-simple
:   0%  25%  50%  75% 100% 
:    0   22   38  280 3062


   The weights of complete adze blades are plotted on a log scale to
   differentiate among the lighter blades.

#+srcname: r-complete-weight-histogram-png
#+begin_src R :session adze :file r/adze_wt_log.png :exports none
  adze.wt <- ggplot(whole.adze, aes(x = weight))
  adze.wt + geom_histogram() + scale_x_log10()
  ggsave(file = "adze_wt_log.png")
#+end_src

#+results: r-complete-weight-histogram-png
file:r/adze_wt_log.png
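
The quantiles above report a minimum weight of 0 g, the kind of value the tracking page is meant to expose. Since log10(0) is -Inf, a record with zero weight cannot be placed on the logarithmic axis used by the histogram. A minimal sketch of a check that lists such records follows. The data frame adze.sample and its third row are hypothetical, and treating a zero as a missing measurement is an assumption.

```r
# Hypothetical sample: the first two weights come from the data table,
# the third row is invented to illustrate a suspect record.
adze.sample <- data.frame(
  id     = c(1, 2, 3),
  weight = c(111, 32, 0)
)

# Flag records whose weight is missing or not positive.
suspect <- adze.sample[is.na(adze.sample$weight) |
                       adze.sample$weight <= 0, ]
suspect
```

Run against whole.adze in the adze session, a block like this could be exported alongside the quantiles so collaborators see which records need attention.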

Quality Control

Quality control is achieved by:

  • assigning a version to the compendium
*** Compendium Version
    - Version 1
  • documenting the software versions used by the author to produce the derived documents
** Software Version Information                                    :noexport:
*** Org-mode
    - The org-version function yields sufficient information to
      identify the Org-mode and Org-babel code used by the author
#+srcname: org-version
#+begin_src emacs-lisp 
   (org-version nil)
#+end_src

#+results: org-version 
: Org-mode version 6.34trans (release_6.34c.221.gadb2)

*** R
    - The version of R used by the author  
    - Ideally, package versions would also be displayed

#+srcname: r-version
#+begin_src R :session adze :results output
  version
#+end_src

#+results: r-version 
#+begin_example
               _                           
platform       i386-apple-darwin8.11.1     
arch           i386                        
os             darwin8.11.1                
system         i386, darwin8.11.1          
status                                     
major          2                           
minor          9.2                         
year           2009                        
month          08                          
day            24                          
svn rev        49384                       
language       R                           
version.string R version 2.9.2 (2009-08-24)
#+end_example
  • marking results blocks as original products of the compendium, so readers can execute the code blocks and compare their results with the author's. The function compendium-results adds a stamp, COMPENDIUM, to the names of results blocks. When the reader executes code blocks, the new results can be compared with the COMPENDIUM blocks. To use the function, first evaluate the code block that defines it with C-c C-c, then run it with M-x compendium-results RETURN.
(defun compendium-results ()
  "Add COMPENDIUM to #+results: block names."
  (interactive)
  (query-replace-regexp "\\(#\\+results:.*\\)$" "\\1 COMPENDIUM"))

Notes

This section holds notes, TODO items, etc. It provides a high-level receptacle for items saved by remember or refiled.

Summary of the Org-babel Compendium

The Org-babel compendium has the characteristics specified by Gentleman and Temple Lang:

  • It encapsulates fully the actual work of the author, potentially down to the level of task scheduling and clock time
  • The derived documents display very different levels of detail, but can share components where they overlap
    • The LaTeX document for publication contains detail suitable for a journal article
    • The Beamer slide show contains detail suitable for a conference talk
    • The project tracking web site displays data as they are collected; data entry errors are caught at an early stage
  • The computations carried out for any of the derived documents can be re-run by an interested reader, either with the original data stored in Org-mode tables, or with altered data sets, and new computations can be carried out on the original data
  • Computational details are fully specified in the Org-babel compendium, which captures the data and parameters passed to functions, along with the version of the software that provides the functions
  • Plots and tables in each of the derived documents are constructed programmatically and inserted into the derived documents either through direct reference or using noweb syntax
  • The components of the Org-babel compendium can be treated as data or inputs to software, which either runs or can be made to run under Org-babel, thus allowing programmatic manipulation in ways different from those carried out by the author.

Gentleman and Temple Lang describe five kinds of software needed to create, manage, and distribute a compendium. Emacs with Org-mode and Org-babel carries out the tasks of all five kinds of software.

Authoring Software

Emacs with Org-mode and Org-babel leverages familiar tools to create a compendium in a single file. It provides easy integration and editing of code together with text. All the usual editing tools are available when editing both text chunks and code chunks; R code, for example, is edited using the facilities provided by Emacs Speaks Statistics. There is a simple mechanism for evaluating code chunks in the growing list of supported languages, i.e., place point in the code block and press C-c C-c.

Auxiliary Software

Because the components of the compendium all reside in the same Org-mode file, no auxiliary software, other than the external software applications needed to evaluate code chunks, such as R, is required.

Transformation Software

Org-mode and Org-babel provide the necessary "collection of filters" to generate the various outputs.

Quality Control Software

Quality control is meant to ensure that the reader of a compendium achieves the same results as its author. The compendium described here encourages the user to check the distribution file's md5 checksum, which provides a mechanism to ensure that the reader's file is identical to the author's. The compendium also records the version numbers of the software used by the author. In this example these include the output of (org-version nil), which includes an abbreviated description of the git HEAD, and the detailed output of the R version command. Ideally, the versions of R packages used in the computations would also be included. Finally, the compendium contains a small function that the author uses to tag #+results: blocks so they are not overwritten by subsequent executions of their source blocks. In this way, the reader can execute the source blocks and directly compare results with those obtained by the author.
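
Package versions could be recorded in much the same way as the R version. The following is a sketch only: the helper name package_versions is hypothetical, and the package list is taken from the library calls in the Software Setup section. A package that is not installed is reported as NA rather than stopping the run.

```r
# Hypothetical helper: look up the installed version of each package,
# returning NA for any package that is not available.
package_versions <- function(pkgs) {
  vapply(pkgs, function(p) {
    tryCatch(as.character(packageVersion(p)),
             error = function(e) NA_character_)
  }, character(1))
}

# Packages loaded by the r-adze-session and r-connect code blocks
package_versions(c("ggplot2", "xtable", "RMySQL"))
```

Base R's sessionInfo() reports similar information for every package attached to the running session, and may be the simpler choice when run inside the adze session itself.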

Distribution Software

The problems of distribution are largely solved by the fact that the Org-babel multi-language compendium can be distributed as a single ASCII text file. Because Emacs is ported to many operating environments, the compendium can be used by readers with a wide variety of hardware. In practice, the md5 checksum should provide adequate protection against file corruption.
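
A reader who already has R installed could check the md5 value without leaving the tools the compendium depends on. This is a minimal sketch using md5sum from R's tools package. The file name is hypothetical and a small temporary file stands in for the distributed compendium, so the checksum it produces is not the author's.

```r
library(tools)

# Stand-in for the distributed compendium; the file name is hypothetical.
compendium <- file.path(tempdir(), "adze-compendium.org")
writeLines(c("* Org-mode Setup", "* Software Setup"), compendium)

# The author would publish this value alongside the file.
published.md5 <- unname(md5sum(compendium))

# The reader recomputes the checksum on the received file and compares.
reader.md5 <- unname(md5sum(compendium))
identical(reader.md5, published.md5)
```

A mismatch would mean the reader's copy differs from the author's, whether through corruption in transit or later editing.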

Footnotes: