Org-babel: Uses
A Research Project
A research project typically produces one or more documents that describe or rely upon:
- a data collection
- computations and code used in data analysis or simulation
- methodological conventions and assumptions
- decisions among alternate analytic paths
The documents produced by a research project typically stand apart from the things they describe and rely upon, which makes it difficult for other researchers to understand fully or to reproduce the results of the research project.
A software solution to this problem was proposed by Gentleman and Temple Lang, who "introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data, …), and as a means for distributing, managing and updating the collection." [1] They summarize the uses and implications of a compendium:
- it encapsulates the actual work of the author, not just an abridged version suitable for publication;
- it can display different levels of detail in derived documents;
- the computations included in it can be re-run by an interested reader, potentially with different inputs;
- it contains explicit computational details that make it easier for an interested reader to adapt and extend the methods;
- it enables programmatic construction of plots and tables;
- its components can be treated as data or inputs to software and manipulated programmatically in ways perhaps not envisioned by the author.
Org-babel and Org-mode provide the tools needed to create a multi-language compendium in a single Org-mode file. This example is taken from a work in progress, one that has seen many changes in structure and organization. No claim is made that it is the best way to do things. But it works and is proving extremely useful in the conduct of the research project.
Products of the Org-babel Compendium
The example Org-babel compendium is designed to produce three derived documents:
- a LaTeX document intended for publication in an academic journal
- a Beamer slide show to accompany a conference presentation
- a web page that chronicles the data entry process
The first two of these documents are held in Org-babel LaTeX code blocks, which are tangled to produce source files that can be compiled in the usual way by one of the LaTeX systems. The third uses the Org-mode HTML exporter to generate the document directly from the Org-mode file.
Organization of the Org-mode File
The Org-mode file is divided into eight sections:
* Org-mode Setup
* Software Setup
* Data
* Instructions for Use
* Documents
* Project Tracking
* Quality Control
* Notes
Org-mode Setup
The goal of the Org-mode setup is to specify the environment as completely as possible so the file exhibits the same behavior on different computers with their own Org-mode setups.
Software Setup
The example project uses R software to analyze metric and categorical observations on a class of traditional Hawaiian stone tools known as adzes. The object of this section is to establish an R session and populate it with information from a remote MySQL server. Subsequent queries of the data for analysis are all local, which speeds up the process considerably.
- The code block r-adze-session loads libraries for preparing graphics and tables, connects to the remote MySQL server with a call to another R code block, populates an R data frame, whole.adze, and lists the R objects that were created.

#+srcname: r-adze-session
#+begin_src R :session adze :noweb yes
  library(ggplot2)
  library(xtable)
  <<r-connect>>
  <<r-complete-2>>
  objects()
#+end_src

#+results:
| con        |
| d.complete |
| whole.adze |

#+srcname: r-connect
#+begin_src R
  library(RMySQL)
  con <- dbConnect(MySQL(), user="user", password="password", dbname="dbname", host="host")
#+end_src

#+srcname: r-complete-2
#+begin_src R
  whole.adze <- dbGetQuery(con, "select * from adze where edge_present = 'true' AND poll_present = 'true'")
#+end_src
Data
This section puts the adze data in an Org-mode table for the interested reader. This gives access to the data without giving access to the MySQL server.
#+srcname: data-dump
#+begin_src R :colnames yes :session adze
  whole.adze
#+end_src

#+results: data-dump
| id | identifier | storage_location | site | weight | adze_type | bevel | edge_present | chin_present | shoulder_present | poll_present | length_poll | length_shoulder | length_chin | width_edge | width_shoulder_front | width_shoulder_back | thickness_shoulder | thickness_chin | edge_angle | bevel_shape | edge_shape_a | edge_shape_b | face_reduced | butt_angle | color_value | complete | broken | reworked | polish |
|----+------------------+------------------+-------------+--------+-----------+--------+--------------+--------------+------------------+--------------+-------------+-----------------+-------------+------------+----------------------+---------------------+--------------------+----------------+------------+-------------+--------------+--------------+--------------+------------+-------------+----------+--------+----------+---------|
| 1 | OA B1-30-29 | Tray 1 | 50-Oa-B1-30 | 111 | primary | single | true | true | true | true | 92 | 48 | 11 | 33 | 29 | 30 | 16 | 11 | 36 | convex | straight | straight | true | 10 | 4 | complete | | | present |
| 2 | 50-OA-B1-30-T8-1 | Tray 1 | 50-Oa-B1-30 | 32 | secondary | single | true | true | true | true | 58 | 28 | 19 | 19 | 18 | 18 | 11 | 10 | 35 | convex | straight | straight | false | 0 | 3 | complete | | other | present |
...
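A reader without access to the MySQL server could rebuild a small stand-in for whole.adze from the published table. The sketch below is only an illustration: it copies a handful of columns from the two rows shown above, and the choice of columns is an assumption, not part of the original compendium.

```r
# Stand-in for whole.adze, built by hand from the two rows shown in the
# results table above. Only a few of the published columns are copied.
whole.adze <- data.frame(
  id = c(1, 2),
  identifier = c("OA B1-30-29", "50-OA-B1-30-T8-1"),
  weight = c(111, 32),
  adze_type = c("primary", "secondary"),
  width_shoulder_front = c(29, 18),
  width_shoulder_back = c(30, 18),
  thickness_shoulder = c(16, 11),
  stringsAsFactors = FALSE
)

# The same kind of query the compendium runs against the full data set.
weight.range <- range(whole.adze$weight)
weight.range
```

With the complete table in place of these two rows, the remaining R code blocks can be evaluated locally in the same way.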
Instructions for Use
This section gives the interested reader basic instructions on how to create the derived documents.
This can also be helpful for the author of the Org-mode document.
* Instructions for Use                                             :noexport:
** Generate HTML pages for adzes.tsdye2.com [/]
   - [ ] Run org-babel-execute-buffer, Ctrl-c Meta-b b, to refresh all the R code blocks
   - [ ] Publish, Ctrl-c Ctrl-e P
   - [ ] ftp, Mirror adzes subdomain
** Generate Print and Beamer documents [/]
   - [ ] Run org-babel-execute-buffer, Ctrl-c Meta-b b, to refresh all the R code blocks
   - [ ] Run org-babel-tangle, Ctrl-c Meta-b t, to generate adze_print.tex and adze_beamer.tex
   - [ ] Compile the tex files
Documents
Two documents are created with Org-babel code blocks. I find it easiest to outline the structure of the paper down to the level of the paragraph. The leaves of the Org-mode tree are paragraph topic sentences. This is a bit of work, but it is made easier by YASnippets for Org-babel code blocks and frequently used Beamer constructs. I find that the outlining process is an aid to writing and well worth the effort.
Note that the LaTeX code blocks each have a header argument
:results silent so that Org-babel doesn't put the results of
evaluating them in a #+results block.
* Documents
** Preamble
*** LaTeX Preamble
    \documentclass{article}
    \author{A. N. Author}
    \title{Article Title}
    \begin{document}
    \maketitle
*** Beamer Preamble
    \documentclass{beamer}
    \mode<presentation>
    {
      \usetheme{Malmoe}
      \usecolortheme{default}
    }
    \usepackage[english]{babel}
    \usepackage[latin1]{inputenc}
    \usepackage{times}
    \usepackage[T1]{fontenc}
    \institute{The Institute}
    \subject{An Interesting Subject}
    \beamerdefaultoverlayspecification{<+->}
    \usepackage{booktabs}
    \title{A Presentation Title}
    \author{A. N. Author}
    \begin{document}
    \maketitle
** Introduction
*** LaTeX Source
*** Beamer Source
*** R Code
*** Notes
** Methods
** Results
** Postamble
Text is entered in LaTeX code blocks using the full power of AUCTeX and RefTeX.
Note the use of noweb references to insert the results of R code blocks directly into the LaTeX document.
Many sections of the LaTeX document, such as this obligatory description of the artifact collection, can be written while data capture is underway. When data capture is complete, the document can be refreshed.
** Description of the Collection
*** Notes
**** TODO Work out a summary of cross sections, with graphics, relate to Duff
     SCHEDULED: <2009-11-09 Mon>
     - plot width_shoulder_front on x, width_shoulder_back on y, thickness_shoulder on symbol size
     - quadrangular adzes will plot along x=y
*** LaTeX source
    \section{Description of the Collection}
    \label{sec:desc-coll}

    The distribution of complete adze blade weights is shown in
    figure~\ref{fig:complete-weight}. The weight range is
    <<r-weight-min()>>--<<r-weight-max()>>~g.

    \begin{figure}[htb!]
      \centering
      \includegraphics[width=5in]{<<r-complete-weight-histogram-pdf()>>}
      \caption[Weights of complete adzes]{Weights of complete adzes on a logarithmic scale.}
      \label{fig:complete-weight}
    \end{figure}
*** R routines
**** Adze blade maximum weight (whole adzes)
     A simple retrieval of the maximum adze blade weight in grams.
     max(whole.adze$weight)
     : 3062
**** Adze blade minimum weight (whole adzes)
     A simple retrieval of the minimum adze blade weight in grams.
     min(whole.adze$weight)
     : 0
     adze.wt <- ggplot(whole.adze, aes(x = weight))
     adze.wt + geom_histogram() + scale_x_log10()
     ggsave(file = "adze_wt_log.pdf", width = 5, height = 3)
     file:r/adze_wt_log.pdf
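The TODO item in the notes above describes a cross-section plot that is not shown in the example. One possible reading of it, as a minimal sketch: the data frame cross.sections is a hypothetical stand-in holding only the two rows published in the Data section, and the reference line drawn with geom_abline is an assumption about how "quadrangular adzes will plot along x=y" might be displayed.

```r
library(ggplot2)

# Hypothetical stand-in: shoulder measurements of the two adzes listed in
# the Data section. The full compendium would use whole.adze instead.
cross.sections <- data.frame(
  width_shoulder_front = c(29, 18),
  width_shoulder_back = c(30, 18),
  thickness_shoulder = c(16, 11)
)

# Front width on x, back width on y, thickness mapped to symbol size.
adze.xs <- ggplot(cross.sections,
                  aes(x = width_shoulder_front,
                      y = width_shoulder_back,
                      size = thickness_shoulder)) +
  geom_point() +
  # Reference line x = y, along which quadrangular adzes are expected to plot.
  geom_abline(intercept = 0, slope = 1)
```

Printing adze.xs, or passing it to ggsave as in the histogram routine, would render the figure.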
I find it convenient to work on the Beamer slide show at the same time as I am writing the LaTeX document.
Figures created for the LaTeX document are often useful in the Beamer slide show.
Note the correspondence between the Beamer code block and the LaTeX code block above.
*** Beamer source
    \begin{frame}
      \frametitle{Description of the Collection}
      \begin{columns}
        \begin{column}{0.5\textwidth}
          The weight range is <<r-weight-min()>>--<<r-weight-max()>>~g
        \end{column}
        \begin{column}{0.5\textwidth}
          \begin{centering}
            \includegraphics[width = 0.5\textwidth]{<<r-complete-weight-histogram-pdf()>>}\par
          \end{centering}
        \end{column}
      \end{columns}
    \end{frame}
Project Tracking
This section produces an HTML document that is made available to collaborators so they can track project progress.
- Here is an example project tracking page
Note that the R code blocks each have a header argument, :exports none, to keep the source out of the HTML document.
Graphics created in R are saved to file. A link to the file
created by the header argument, :file, instructs the exporter to
insert the graphic into the HTML document.
* Project Tracking                                                   :export:
** Complete Adzes
*** Adze Weight
    Quantiles of the complete adze blade weights:
    quantile(whole.adze$weight)
    :   0%  25%  50%  75% 100%
    :    0   22   38  280 3062
    The weights of complete adze blades are plotted on a log scale to differentiate among the lighter blades
    adze.wt <- ggplot(whole.adze, aes(x = weight))
    adze.wt + geom_histogram() + scale_x_log10()
    ggsave(file = "adze_wt_log.png")
    file:r/adze_wt_log.png
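The quantiles above include a minimum weight of 0 g, a value that cannot be placed on a logarithmic axis. Whether it reflects a data entry problem is an assumption here, but a tracking page could flag such records with a check along these lines. The vector weights is a stand-in made of the five reported quantile values, not the real data.

```r
# Stand-in weights: the five quantile values reported above, not the full
# data set held in whole.adze.
weights <- c(0, 22, 38, 280, 3062)

# log10(0) is -Inf, so a zero weight has no position on a log axis.
zero.on.log.scale <- log10(weights[1])

# Flag records whose weight is missing or not positive.
suspect <- is.na(weights) | weights <= 0
which(suspect)
sum(suspect)
```

In the compendium itself the same test could be applied to whole.adze$weight before the histogram is drawn.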
Quality Control
Quality control is achieved by:
- assigning a version to the compendium
*** Compendium Version
- Version 1
- documenting the software versions used by the author to produce the derived documents
** Software Version Information                                    :noexport:
*** Org-mode
    - The org-version function yields sufficient information to identify the Org-mode and Org-babel code used by the author
    (org-version nil)
    : Org-mode version 6.34trans (release_6.34c.221.gadb2)
*** R
    - The version of R used by the author
    - Ideally, package versions would also be displayed
    version
                   _
    platform       i386-apple-darwin8.11.1
    arch           i386
    os             darwin8.11.1
    system         i386, darwin8.11.1
    status
    major          2
    minor          9.2
    year           2009
    month          08
    day            24
    svn rev        49384
    language       R
    version.string R version 2.9.2 (2009-08-24)
- marking results blocks as original products of the compendium, so that readers can execute the code blocks and compare their results with the author's. The function compendium-results adds a stamp, COMPENDIUM, to the names of results blocks. When the reader executes code blocks, the new results can be compared to the COMPENDIUM blocks. To use this function, evaluate the code block that defines it with C-c C-c, then run it with M-x compendium-results RET.
(defun compendium-results ()
  "Adds COMPENDIUM to #+results: block names."
  (interactive)
  (query-replace-regexp "\\(#\\+results:.*\\)$" "\\1 COMPENDIUM"))
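The comparison between a reader's results and the stamped COMPENDIUM results can also be made in R rather than by eye. The following is a sketch under stated assumptions: author.results holds the minimum and maximum weights reported in the Documents section, and weights is a hypothetical stand-in for the reader's copy of the data, made of the quantile values shown in the Project Tracking section.

```r
# Values recorded by the author, taken from the results shown earlier:
# minimum weight 0 g and maximum weight 3062 g.
author.results <- c(min = 0, max = 3062)

# Hypothetical stand-in for the reader's copy of the weights.
weights <- c(0, 22, 38, 280, 3062)

# The reader's recomputation of the same two quantities.
reader.results <- c(min = min(weights), max = max(weights))

# all.equal tolerates tiny floating point differences; isTRUE turns its
# answer into a plain TRUE or FALSE.
results.match <- isTRUE(all.equal(author.results, reader.results))
results.match
```

A FALSE here would tell the reader that the data, the code, or the software versions differ from those the author used.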
Notes
This section holds notes, TODO items, etc. It provides a high level receptacle for items saved by remember or refiled.
Summary of the Org-babel Compendium
The Org-babel compendium fulfills the characteristics specified by Gentleman and Temple Lang:
- It encapsulates fully the actual work of the author, potentially down to the level of task scheduling and clock time
- The derived documents display very different levels of detail, but can share components where they overlap
  - The LaTeX document for publication contains detail suitable for a journal article
  - The Beamer slide show contains detail suitable for a conference talk
  - The project tracking web site displays data as they are collected; data entry errors are caught at an early stage
- The computations carried out for any of the derived documents can be re-run by an interested reader, either with the original data stored in Org-mode tables, or with altered data sets, and new computations can be carried out on the original data
- Computational details are fully specified in the Org-babel compendium, which captures the data and parameters passed to functions, along with the version of the software that provides the functions
- Plots and tables in each of the derived documents are constructed programmatically and inserted into the derived documents either through direct reference or using noweb syntax
- The components of the Org-babel compendium can be treated as data or inputs to software, which either runs or can be made to run under Org-babel, thus allowing programmatic manipulation in ways different from those carried out by the author.
Gentleman and Temple Lang describe five kinds of software needed to create, manage, and distribute a compendium. Emacs with Org-mode and Org-babel carries out the tasks of all five kinds of software.
Authoring Software
Emacs with Org-mode and Org-babel leverages familiar tools to
create a compendium in a single file. It provides easy integration
and editing of code together with text. All the usual editing
tools are available when editing both text chunks and code chunks;
R code, for example, is edited using the facilities provided by
Emacs Speaks Statistics. There is a simple mechanism for evaluating
code chunks in the growing list of supported languages, i.e., place
point in the code block and press C-c C-c.
Auxiliary Software
Because the components of the compendium all reside in the same Org-mode file, no auxiliary software, other than the external software applications needed to evaluate code chunks, such as R, is required.
Transformation Software
Org-mode and Org-babel provide the necessary "collection of filters" to generate the various outputs.
Quality Control Software
Quality control is meant to ensure that the reader of a compendium
achieves the same results as its author. The compendium described
here encourages the reader to check the distribution file's MD5
checksum, which confirms that the reader's file is identical to the
author's. The compendium also records the versions of the software
used by the author. In this example these include the output of
(org-version nil), which yields an abbreviated description of the
git HEAD, and the detailed output of the R version command. Ideally,
the versions of the R packages used in the computations would also
be included. Finally, the compendium contains a small function that
the author uses to tag #+results: blocks so they are not
overwritten by subsequent executions of their source blocks. In
this way, the reader can execute the source blocks and compare the
results directly with those obtained by the author.
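Recording package versions alongside the R version could look something like the sketch below. The package list is an assumption chosen so that the code runs on any R installation; the compendium itself loads ggplot2, xtable and RMySQL, and those names would replace the base packages used here.

```r
# Packages to report. Base packages are used only so the sketch runs
# anywhere; a real compendium would list the packages it loads.
pkgs <- c("stats", "utils", "tools")

# packageVersion returns a version object; convert it to text for display.
versions <- vapply(pkgs,
                   function(p) as.character(packageVersion(p)),
                   character(1))

version.table <- data.frame(package = pkgs, version = versions,
                            row.names = NULL)
version.table

# The R version itself, as a single line of text.
R.version.string
```

Calling sessionInfo() gives a similar report for every package attached to the current session.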
Distribution Software
The problems of distribution are largely solved by the fact that the Org-babel multi-language compendium can be distributed as a single ASCII text file. Because Emacs is ported to many operating environments, the compendium can be used by readers with a wide variety of hardware. In practice, the MD5 checksum should provide adequate protection against file corruption.
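The checksum comparison can be carried out from within R using the md5sum function in the tools package. This is a minimal sketch: verify_md5 is a hypothetical helper, and the temporary file stands in for a downloaded compendium whose checksum the author has published.

```r
# Hypothetical helper: TRUE when the file's MD5 checksum matches the
# published one. A missing file gives NA from md5sum, so the result is FALSE.
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  identical(actual, tolower(expected))
}

# A temporary file stands in for the distributed compendium.
compendium <- tempfile(fileext = ".org")
writeLines(c("* Org-mode Setup", "* Software Setup"), compendium)

# The checksum the author would publish alongside the file.
published <- unname(tools::md5sum(compendium))

# An untouched copy passes the check.
intact <- verify_md5(compendium, published)

# Any change to the file, here an appended heading, makes the check fail.
cat("* Notes\n", file = compendium, append = TRUE)
altered <- verify_md5(compendium, published)
```

The check detects accidental corruption in transit; it makes no claim about who produced the file.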