Reproducible Research

We saw R is a powerful software for data analysis. Contrary to other softwares which provide you with ready, fixed, canned procedures to perform statistical analyses, R always let you have access to all computations' results. R can also be extended with packages. More than 6,000 of them are available via CRAN. Furthermore, you can easily create your own packages, or functions, to tackle your specific needs and/or problems.

All good but you have to be able to reproduce your analysis, redo your figures and tables i.e. have available the computer code which produce your research results (more on that for example in Peng et al.).

Saving Figures to File

R can generate graphics in any type of display (pdf, postscript, bitmap, svg, png, jpeg, tiff). Just tell R to save the current image displayed:

plot(1:10)
dev.copy(png, file = "myPlot.png")
dev.off()

If you want to save the plot without seeing it on screen:

pdf(file = "myPlot.pdf")
plot(1:10)
dev.off()

If your plot was made with ggplot2, you use:

ggsave("myPlot.pdf")

Saving Commands in R Scripts

You save your commands in an R script (e.g. myScript.R). You can then go through it line by line.

Sourcing Code

source() runs R commands held in an external source file, or R script. If you set the option to TRUE, it will print the commands and results to the console.

source('myScript.R')

You can also sink() or capture.output selected results from your source file to an output file. For example if our source file produce summary statistcs from dairy:

dairy.sum <- summary(dairy)
sink("summary.log")
cat("My summary stats", fill = TRUE)
show(dairy.sum)
sink()
dairy.sum <- summary(dairy)
capture.output({dairy.sum}, file = "summary.txt")

You can even source several R scripts into your main source file, like functions you wrote.

Source File from Shell

Next, you can also run your script from the command line. For example save the following as an R script:

library(epicalc)
with(health.wide, cc(da, parity,
   decimal = 2,
   graph = FALSE, 
   original = TRUE,
   design = "cohort",
   alpha = .05,
   fisher.or = FALSE, 
   exact.ci.or = TRUE))

with(health.wide, mhor(da, parity, mf,
               design = "cohort",
               decimal = 2,
               graph = FALSE))

Now you can call it from the command line and save its output with Rscript yourScript.R > yourOutput.Rout. You can also use the batch command R CMD BATCH yourScript.R yourOutput.txt. Contrary to the file redirection, R CMD BATCH captures warnings and errors. R CMD BATCH also adds a call to proc.time() to give the time the script took to run.

Stitch and Spin

Better yet is to embed the results and the code that generated it in the same document. For this we use knitr. In your script, everything written after a comment # will be converted to text and everything else will be assumed to be R code and executed with the spin() function. A markdown file is generated. It then can be converted into a MS Office document with pandoc. Try with the following and call it with Rscript -e "knitr::spin('yourScript.R')". Then to convert it to docx: pandoc -s -S yourScript.md -o results.docx (note: you might not have pandoc installed on your computer).

#' Displaced Abomasum

#+ echo=FALSE
health <- read.csv(
    file = "health.csv",
    header = TRUE,
    stringsAsFactors = FALSE)
health$presence <- factor(health$presence, labels = c("No", "Yes"))
health$parity <- cut(health$lactation,
                     breaks = c(0, 1, max(health$lactation)),
                     labels = c("1", "2+"))
library(reshape2)
health.wide <- dcast(
    data = health,
    formula = unique + age + lactation + parity ~ disease,
    value.var = "presence"
    )
library(stringr)
health.wide$id <- str_split_fixed(health.wide$unique, "-", 2)
health.wide$herd <- health.wide$id[, 1]
health.wide$cow <- health.wide$id[, 2]
health.wide$da <- with(health.wide, ifelse(da == "No", 0, 1))

#' The results of the logistic regression are the following:

#+ log, echo=FALSE
mod1 <- glm(da ~ parity,
             family = binomial("logit"),
             data = health.wide)  # "logit" can be omitted as it is the default

#+ log-sum
summary(mod1)

You can also use stitch() which insert the R code into a predefined template and create a simple report (Rscript -e "knitr::stitch_rhtml(yourScript.R')).

Literate programming with knitr

Basic idea: Write data + software + documentation (or in this case manuscripts, reports) together.

Analysis code can be divided into text and code "chunks". Doing so allows us to extract the code for machine readable documents (technically referred to as a tangle) or produce a human-readable document (also called weave).

Literate programming involves three main steps:

  1. Separate the narrative from the code
  2. Execute source code and return the results.
  3. Combine the results from the source code with the original narratives to produce a final document.

Why this is important?

Results from scientific research have to be easy to reproduce so others can verify results making them trustworthy. Otherwise we risk producing one off results that no one outside the original research group can reproduce. In this lesson we will learn reproducible research, which is one of the by products of dynamics report generation. However, this process alone will not always guarantee reproducibility.

Installing knitr

install.packages("knitr", dependencies = TRUE)

Knitr supports a variety of documentation formats including markdown, html and LaTeX. It also allows for easy export to PDF and HTML.

What is markdown?

Markdown is an incredibly simple semantic file format, not too dissimilar from .doc, .rtf, or .txt. Markdown makes it easy for even those without significant knowledge of markup languages like HTML or LaTeX to write any sort of text (including with links, lists, bullets, etc.) and have it parsed into a variety of formats.

When to implement reproducibility via literate programming?

  • You can do this anytime but it is always best to do this at the beginning of a project. Much like writing unit tests, waiting till after a project is completed is a bad idea.
  • Best used alongside version control like Git.
  • Use software like R where instructions are scripted.
  • Never save the output (only raw dataset with pre-processing code. Even cleaned datasets can be discarded but it might help to save them temporarily during intermediate steps).
  • Finally, store the data in a non-proporietary format (e.g. csv over xls). This will ensure that your data will be readable long into the future.

Creating a basic knitr document

In RStudio, choose new R Markdown file (easiest way) or you can create a new text file and save it with extension .Rmd.

A basic code chunk looks like this.


You can knit this document using the knit button or do it programmatically using the knit() function.

library(knitr)
knit("file.Rmd")

What just happened?

knitr read the Rmd file, then located and ran all the code chunks identified by the backticks, and replaced it with the output of R function calls. If figures are generated from any such calls, they will be included in markdown syntax.

Chunk labels

You can also name your code chunks. This allows you to keep all the code in a separate script and just refer to code chunks using meaningful names (e.g. data-processing, analysis, model-fitting, visualization, figures, tables)

Some rules on naming chunks

  • Chunk labels are supposed to be unique id’s in a document.
  • knitr will throw an error if two chunks have the same name.
  • If no chunk names are given, knitr will simply increment from chunk 1, 2,3 etc.

In addition to naming chunks within the curly braces, you can also add a bunch of other options on how that particular code chunk should behave.

Other options you can add to the tag

Option Description
echo = TRUE or FALSE to show or hide code.
eval = TRUE or FALSE to run or skip the code.
warning = TRUE or FALSE to show or hide function warnings.
message = TRUE or FALSE to show or hide function R messages.
results = "hide" will hide results. They will still be executed
fig.height = Height of figure
fig.width = width of figure

Once your output markdown files (.md) files are generated, you should never edit them because they are automatically generated. Next time you knit the original .Rmd files, all the changes in the .md file will get wiped out.

Write sentences in text with inline output

The mean is `r mean(1:5)`. 

Global options

Global options are shared across all the following chunks after the location in which the options are set, and local options in the chunk header can override global options.

Other Options

Dealing with long running process

By adding cache = TRUE to a code block definition. After the first run, results will be cached.

Working with graphics in knitr

If you use ggplot2 from the data visualization section, you can have that easily parsed into your document.

library(ggplot2)
p <- qplot(carat, price, data = diamonds) + geom_hex()
p 
# no need to explicitly print(p)

Pandoc

Pandoc is a universal document converter. In particular, Pandoc can convert Markdown to many other document formats, including LaTeX, HTML, Rich Text Format (.rtf), E- Book (.epub), Microsoft Word (.docx) and OpenDocument Text (.odt), etc. Pandoc is a command line tool. Linux users should be fine with it; for Windows users, the command window can be accessed via the Start menu, then Run cmd. Once we have opened a command window (or terminal), we can type commands like this to convert a Markdown file, say, test.md, to other formats: pandoc test.md -o test.html, pandoc test.md -o test.odt, pandoc test.md -o test.rtf, pandoc test.md -o test.docx, or pandoc test.md -o test.pdf.

You can add more options to the basic Pandoc call. To see a full list of options: pandoc --help.

A commonly used option is to add margins using the -V argument (in this case 1 inch): pandoc -V geometry:margin=1in test.md -o test.pdf.


Workflow

You could imagine splitting the different analytic operations (loading your data, tidying it, visualizing your data, and running your analysis) into different R scripts that you could source from a "master" script. These scripts could even be inserted into a R markdown or LaTeX file.

source("load.R")
source("tidy.R")
source("analyze.R")
# load
health <- read.csv(
    file = "health.csv",
    header = TRUE,
    stringsAsFactors = FALSE)

health$parity <- cut(health$lactation,
                     breaks = c(0, 1, max(health$lactation)),
                     labels = c("1", "2+"))
# tidy
library(reshape2)
health.wide <- dcast(
    data = health,
    formula = unique + age + lactation + parity ~ disease,
    value.var = "presence"
    )
# analyze
library(epicalc)
with(health.wide, cc(da, parity,
   decimal = 2,
   graph = FALSE, 
   original = TRUE,
   design = "cohort",
   alpha = .05,
   fisher.or = FALSE, 
   exact.ci.or = TRUE))