Org Mode: Data Collection and Analysis

Data Collection and Analysis

This example uses Org-babel to automate a repeated data-collection and analysis task. A Ruby code block is used to scrape data from the output of a computational experiment. This data is then written to an Org-mode table. A block of R code reads from this table and calculates lines of fit. Finally a block of gnuplot code is used to graph the results of both the raw data and the R analysis. By performing all of these steps within an Org-mode document working notes, discussion, and TODOs can be naturally interspersed with the code, and the results can easily be published to HTML or PDF for distribution.

Requirement

A working Ruby installation
A working R installation
A working gnuplot installation

Advantages

Org-babel handles passing the data between different programming languages.
Raw data persists in tables in the Org-mode file.
Working notes can be collocated with the code/results to which they refer.
Tasks can be saved and updated from within the same file in which the work is being performed.
Org-mode exporting facilities can be used to export the results to HTML or PDF for distribution.

Disadvantages

This approach can allow the experimenter to use whatever language is most comfortable for each sub-task, sometimes resulting in an overly complicated work flow. For example, in the example below I did not have to learn how to calculate the mean and standard deviation in R since it was easier for me to do so in Ruby even though a full R solution would have been more efficient.

Example

Code for running experiment and collecting the results

This portion will not be repeatable as it would require the entire experimental setup. It is provided for demonstration.

Ruby run-timer-test: Runs the actual experiment. This is tangled to an external file and run on the command line – since these runs can take several days, I prefer to run them outside of Emacs (normally using screen).

DEFAULT_CMDLINE = "--swap 0 --del 0 --mut 0.1 example.c "

def run_and_package(cmdline, package)
  puts "#{package}: ../modify #{cmdline}"
  start_time = Time.now
  %x{../modify #{cmdline}}
  total_time = Time.now - start_time
  %x{echo "wall clock #{total_time}" >> gcd.c-.debug}
  %x{rake package[#{package}]}
end

100.times do |n|
  # run with default options
  run_and_package(DEFAULT_CMDLINE, "normal_#{n}")
  run_and_package("--pll_fit 2 "+DEFAULT_CMDLINE, "pll_2_#{n}")
  run_and_package("--pll_fit 3 "+DEFAULT_CMDLINE, "pll_3_#{n}")
  run_and_package("--pll_fit 4 "+DEFAULT_CMDLINE, "pll_4_#{n}")
  run_and_package("--pll_fit 5 "+DEFAULT_CMDLINE, "pll_5_#{n}")
  run_and_package("--pll_fit 6 "+DEFAULT_CMDLINE, "pll_6_#{n}")
  run_and_package("--pll_fit 7 "+DEFAULT_CMDLINE, "pll_7_#{n}")
  run_and_package("--pll_fit 8 "+DEFAULT_CMDLINE, "pll_8_#{n}")
end

Ruby parse-output: The execution of run-timer-test leaves results distributed across many text log files. The following Ruby source code block is used to collect results from these files and dump them into an Org-mode file as a table.

def look(path)
  processors = if path.match(/normal/)
                 "1"
               elsif path.match(/pll_(\d+)_/)
                 $1
               else
                 0
               end
  results = File.read(File.join(path, "gcd.c-.debug"))
  generations =  results.match(/^Generations to solution: (\d+)/) ? Integer($1) : -1
  total = results.match(/^ +TOTAL +([\d\.]+) /) ? Float($1) : -1
  wall = results.match(/^wall clock ([\d\.]+)/) ? Float($1) : -1
  fitness = results.match(/^ +fitness +([\d\.]+) +([\d\.]+) /) ? Float($2) : -1
  mutation = results.match(/^ +mutation +([\d\.]+) +([\d\.]+) /) ? Float($2) : -1
  [path, processors, total, wall, good_test, bad_test, compile, fitness, generations]
end

# puts "| path | processors | total | wall | fitness | mutation | generations |"
# puts "|-----------"

Dir.entries('./').select{|e| e.match(/[normalpll]+[_\d]+/)}.
  map{|e| look(e)}.each{|row| puts "| "+row.join(" | ")+" |"}

Data

Here is fake example output from the parse-output Ruby source code block above.

normal_0	1	150.264	150.631066	163.0	1
pll_2_0	2	40.025	40.698944	39.0	3
pll_3_0	3	2.504	31.214553	2.0	1
normal_5	1	1.499	1.866362	2.0	2
pll_2_16	2	1.43	1.985152	1.0	1
normal_31	1	1.501	1.867453	2.0	1
pll_2_29	2	1.431	1.978312	1.0	1
normal_22	1	4.562	4.929897	3.0	3
pll_4_5	4	3.609	6.953026	4.0	1
normal_4	1	161.097	161.464041	181.0	1
pll_3_3	3	1.751	33.819836	2.0	1
pll_4_2	4	99.546	102.20237	72.0	2
pll_4_1	4	5.502	19.875383	3.0	1
pll_3_1	3	1.976	3.540565	2.0	2
pll_3_6	3	1.433	2.018572	1.0	1

Analysis

The code blocks in this section will be repeatable as they rely on the fake data given above.

Ruby calculate mean and standard deviation over the second column

by_procs = {}
raw.each do |row|
  by_procs[row[1]] ||= []
  by_procs[row[1]] << row[3]
end

by_procs.each do |key, vals|
  mean = vals.inject(0){|sum, n| sum + n} / vals.size
  stddev = Math.sqrt(vals.inject(0){|sum, n| sum + ((n - mean).abs * (n - mean).abs)} / vals.size)
  puts "| #{key} | #{mean} | #{stddev} |"
end

1	64.1517638	75.1190856698136
2	14.8874693333333	18.2514689828405
3	17.6483815	14.9070317402304
4	43.0102596666667	42.1863032424348

R find the curve that best fits these data

procs <- data$V1
times <- data$V2
df <- data.frame(procs, times)
nlsfit <- nls(times~c0 + (load/procs), data=df, start=list(load = 100, c0 = 20))
summary(nlsfit)

gnuplot plot the raw data, along with the error bars and the best fit curve.

set xrange [0.5:5]
set yrange [0:]
set ylabel "seconds"
set xlabel "processes"
plot data using 2:4 with points title 'raw' linecolor 8
replot mydata using 1:2:3 with errorbars title 'error' linecolor 1
replot 11.12 + 45.70/x title 'fit'

Which produces the following

Distribution

Using Org-mode's exporting capabilities it is easy to publish the entire working file including source-code and raw data, to share sections using `org-narrow-to-subtree', or even to share individual tables or graphs.