Org Mode: Data Collection and Analysis

Data Collection and Analysis

This example uses Org-babel to automate a repeated data-collection and analysis task. A Ruby code block scrapes data from the output of a computational experiment and writes it to an Org-mode table. A block of R code reads from this table and calculates lines of fit. Finally, a block of gnuplot code graphs both the raw data and the results of the R analysis. Because all of these steps are performed within an Org-mode document, working notes, discussion, and TODOs can be naturally interspersed with the code, and the results can easily be published to HTML or PDF for distribution.

Requirements

  • A working Ruby installation
  • A working R installation
  • A working gnuplot installation

Advantages

  • Org-babel handles passing the data between different programming languages
  • Raw data persists in tables in the Org-mode file
  • Working notes can be collocated with the code/results to which they refer
  • Tasks can be saved and updated from within the same file in which the work is being performed
  • Org-mode exporting facilities can be used to export the results to HTML or PDF for distribution

Disadvantages

  • This approach lets the experimenter use whatever language is most comfortable for each sub-task, which can result in an overly complicated workflow. For example, in the example below I never learned how to calculate the mean and standard deviation in R, because it was easier for me to do so in Ruby, even though a full R solution would have been more efficient.

Example

Code for running experiment and collecting the results

This portion is not repeatable, as it requires the entire experimental setup. It is provided for demonstration only.

Ruby run-timer-test: Runs the actual experiment. This is tangled to an external file and run on the command line – since these runs can take several days, I prefer to run them outside of Emacs (normally using screen).

DEFAULT_CMDLINE = "--swap 0 --del 0 --mut 0.1 example.c "

def run_and_package(cmdline, package)
  puts "#{package}: ../modify #{cmdline}"
  start_time = Time.now
  %x{../modify #{cmdline}}
  total_time = Time.now - start_time
  %x{echo "wall clock #{total_time}" >> gcd.c-.debug}
  %x{rake package[#{package}]}
end

100.times do |n|
  # run with default options
  run_and_package(DEFAULT_CMDLINE, "normal_#{n}")
  run_and_package("--pll_fit 2 "+DEFAULT_CMDLINE, "pll_2_#{n}")
  run_and_package("--pll_fit 3 "+DEFAULT_CMDLINE, "pll_3_#{n}")
  run_and_package("--pll_fit 4 "+DEFAULT_CMDLINE, "pll_4_#{n}")
  run_and_package("--pll_fit 5 "+DEFAULT_CMDLINE, "pll_5_#{n}")
  run_and_package("--pll_fit 6 "+DEFAULT_CMDLINE, "pll_6_#{n}")
  run_and_package("--pll_fit 7 "+DEFAULT_CMDLINE, "pll_7_#{n}")
  run_and_package("--pll_fit 8 "+DEFAULT_CMDLINE, "pll_8_#{n}")
end
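
The eight calls in the loop body differ only in the --pll_fit value and the package name. As a minimal sketch of that naming scheme, the following dry-run helper lists the command lines and package names for one repetition without executing anything. The name planned_runs is hypothetical and not part of the original script.

```ruby
# Hypothetical dry-run helper: lists the runs without executing anything.
DEFAULT_CMDLINE = "--swap 0 --del 0 --mut 0.1 example.c "

# Returns [cmdline, package] pairs for one repetition n, mirroring the
# loop body of run-timer-test: one default run plus --pll_fit 2 through 8.
def planned_runs(n)
  runs = [[DEFAULT_CMDLINE, "normal_#{n}"]]
  (2..8).each do |fit|
    runs << ["--pll_fit #{fit} " + DEFAULT_CMDLINE, "pll_#{fit}_#{n}"]
  end
  runs
end

# Print the same progress line that run_and_package prints.
planned_runs(0).each do |cmdline, package|
  puts "#{package}: ../modify #{cmdline}"
end
```

The package names produced here (normal_N and pll_K_N) are the directory names that the parse-output block later matches on.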

Ruby parse-output: The execution of run-timer-test leaves results distributed across many text log files. The following Ruby source code block is used to collect results from these files and dump them into an Org-mode file as a table.

def look(path)
  processors = if path.match(/normal/)
                 "1"
               elsif path.match(/pll_(\d+)_/)
                 $1
               else
                 0
               end
  results = File.read(File.join(path, "gcd.c-.debug"))
  generations =  results.match(/^Generations to solution: (\d+)/) ? Integer($1) : -1
  total = results.match(/^ +TOTAL +([\d\.]+) /) ? Float($1) : -1
  wall = results.match(/^wall clock ([\d\.]+)/) ? Float($1) : -1
  fitness = results.match(/^ +fitness +([\d\.]+) +([\d\.]+) /) ? Float($2) : -1
  mutation = results.match(/^ +mutation +([\d\.]+) +([\d\.]+) /) ? Float($2) : -1
  [path, processors, total, wall, fitness, mutation, generations]
end

# puts "| path | processors | total | wall | fitness | mutation | generations |"
# puts "|-----------"

# keep only the result directories, e.g. normal_0 or pll_2_16
Dir.entries('./').select{|e| e.match(/\A(?:normal|pll)[_\d]+/)}.
  map{|e| look(e)}.each{|row| puts "| "+row.join(" | ")+" |"}
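
Two parts of parse-output can be tried without the log files: deriving the processor count from a directory name, and formatting a row as an Org-mode table line. The sketch below isolates both. The helper names processors_for and org_row are hypothetical, and the directory names are taken from the naming scheme used by run-timer-test.

```ruby
# Hypothetical helper names; the logic mirrors the first lines of look.
def processors_for(path)
  if path.match(/normal/)
    "1"                      # a default run uses a single processor
  elsif (m = path.match(/pll_(\d+)_/))
    m[1]                     # the number after "pll_" is the processor count
  else
    0                        # unknown directory name, as in the original
  end
end

# Formats an array of cells as one Org-mode table row.
def org_row(cells)
  "| " + cells.join(" | ") + " |"
end

%w[normal_0 pll_2_16 pll_4_5].each do |path|
  puts org_row([path, processors_for(path)])
end
```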

Data

Here is fake example output from the parse-output Ruby source code block above.

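| normal_0  | 1 | 150.264 | 150.631066 | 163.0 | 1 |
| pll_2_0   | 2 |  40.025 |  40.698944 |  39.0 | 3 |
| pll_3_0   | 3 |   2.504 |  31.214553 |   2.0 | 1 |
| normal_5  | 1 |   1.499 |   1.866362 |   2.0 | 2 |
| pll_2_16  | 2 |    1.43 |   1.985152 |   1.0 | 1 |
| normal_31 | 1 |   1.501 |   1.867453 |   2.0 | 1 |
| pll_2_29  | 2 |   1.431 |   1.978312 |   1.0 | 1 |
| normal_22 | 1 |   4.562 |   4.929897 |   3.0 | 3 |
| pll_4_5   | 4 |   3.609 |   6.953026 |   4.0 | 1 |
| normal_4  | 1 | 161.097 | 161.464041 | 181.0 | 1 |
| pll_3_3   | 3 |   1.751 |  33.819836 |   2.0 | 1 |
| pll_4_2   | 4 |  99.546 |  102.20237 |  72.0 | 2 |
| pll_4_1   | 4 |   5.502 |  19.875383 |   3.0 | 1 |
| pll_3_1   | 3 |   1.976 |   3.540565 |   2.0 | 2 |
| pll_3_6   | 3 |   1.433 |   2.018572 |   1.0 | 1 |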

Analysis

The code blocks in this section will be repeatable as they rely on the fake data given above.

Ruby: calculate the mean and standard deviation of the wall-clock time (fourth column), grouped by processor count (second column)

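# raw holds the rows of the data table above (supplied by Org-babel)
# group the wall-clock times (column 4) by processor count (column 2)
by_procs = {}
raw.each do |row|
  by_procs[row[1]] ||= []
  by_procs[row[1]] << row[3]
end

# print one Org table row per processor count: | procs | mean | stddev |
by_procs.each do |key, vals|
  mean = vals.inject(0){|sum, n| sum + n} / vals.size
  # population standard deviation
  stddev = Math.sqrt(vals.inject(0){|sum, n| sum + (n - mean)**2} / vals.size)
  puts "| #{key} | #{mean} | #{stddev} |"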
end

| 1 | 64.1517638       | 75.1190856698136 |
| 2 | 14.8874693333333 | 18.2514689828405 |
| 3 | 17.6483815       | 14.9070317402304 |
| 4 | 43.0102596666667 | 42.1863032424348 |
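
The block above depends on the variable raw being supplied by Org-babel. As a self-contained sketch of the same calculation, the following code computes the population mean and standard deviation for the three two-processor runs in the example data. The helper name mean_and_stddev is hypothetical.

```ruby
# Population mean and standard deviation of a list of numbers.
def mean_and_stddev(vals)
  mean = vals.sum / vals.size.to_f
  variance = vals.sum { |v| (v - mean)**2 } / vals.size
  [mean, Math.sqrt(variance)]
end

# [path, processors, total, wall] for the two-processor runs in the example data
rows = [
  ["pll_2_0",  2, 40.025, 40.698944],
  ["pll_2_16", 2, 1.43,   1.985152],
  ["pll_2_29", 2, 1.431,  1.978312],
]

walls = rows.map { |row| row[3] }   # wall-clock time is the fourth column
mean, stddev = mean_and_stddev(walls)
puts "| 2 | #{mean} | #{stddev} |"
```

Up to floating-point rounding, the result should agree with the two-processor row of the results above (mean about 14.887, standard deviation about 18.251).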

R: find the curve that best fits these data

procs <- data$V1
times <- data$V2
df <- data.frame(procs, times)
nlsfit <- nls(times~c0 + (load/procs), data=df, start=list(load = 100, c0 = 20))
summary(nlsfit)
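
The model times = c0 + load/procs is linear in its two parameters once 1/procs is treated as the predictor, so the least-squares estimates can also be found in closed form. The sketch below does this in Ruby with ordinary least squares rather than R's nls, assuming the R block is fed the per-processor means computed above. The helper name fit_inverse_model is hypothetical.

```ruby
# Fit times = c0 + load / procs by ordinary least squares on x = 1 / procs.
# Returns [load, c0].
def fit_inverse_model(procs, times)
  xs = procs.map { |k| 1.0 / k }
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = times.sum / n
  # slope = covariance(x, y) / variance(x); the 1/n factors cancel
  sxy = xs.zip(times).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = xs.sum { |x| (x - mean_x)**2 }
  load = sxy / sxx
  c0 = mean_y - load * mean_x
  [load, c0]
end

# processor counts and mean wall-clock times from the results above
procs = [1, 2, 3, 4]
times = [64.1517638, 14.8874693333333, 17.6483815, 43.0102596666667]
load, c0 = fit_inverse_model(procs, times)
puts format("load = %.2f, c0 = %.2f", load, c0)
```

With these means the estimates come out at roughly load = 45.70 and c0 = 11.12, which are the constants used for the fit curve in the gnuplot block.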

gnuplot: plot the raw data, along with the error bars and the best-fit curve

set xrange [0.5:5]
set yrange [0:]
set ylabel "seconds"
set xlabel "processes"
plot data using 2:4 with points title 'raw' linecolor 8
replot mydata using 1:2:3 with errorbars title 'error' linecolor 1
replot 11.12 + 45.70/x title 'fit'

This produces the following graph: ../../../images/babel/example-graph.png

Distribution

Using Org-mode's exporting capabilities, it is easy to publish the entire working file, including source code and raw data, to share sections using `org-narrow-to-subtree', or even to share individual tables or graphs.

Documentation from the http://orgmode.org/worg/ website (either in its HTML format or in its Org format) is licensed under the GNU Free Documentation License version 1.3 or later. The code examples and css stylesheets are licensed under the GNU General Public License v3 or later.