We saw R
is a powerful software for data analysis. Contrary to other softwares
which provide you with ready, fixed, canned procedures to perform statistical
analyses, R always let you have access to all computations' results. R can also be
extended with packages. More than 6,000 of them are available via
CRAN. Furthermore, you can easily
create your own packages, or functions, to tackle your specific needs and/or
problems.
All good but you have to be able to reproduce your analysis, redo your figures and tables i.e. have available the computer code which produce your research results (more on that for example in Peng et al.).
R
can generate graphics in any type of display (pdf, postscript, bitmap, svg,
png, jpeg, tiff). Just tell R
to save the current image displayed:
plot(1:10)
dev.copy(png, file = "myPlot.png")
dev.off()
If you want to save the plot without seeing it on screen:
pdf(file = "myPlot.pdf")
plot(1:10)
dev.off()
If your plot was made with ggplot2
, you use:
ggsave("myPlot.pdf")
You save your commands in an R
script (e.g. myScript.R
). You can then go
through it line by line.
source()
runs R
commands held in an external source file, or R
script. If
you set the option to TRUE
, it will print the commands and results to the
console.
source('myScript.R')
You can also sink()
or capture.output
selected results from your source file
to an output file. For example if our source file produce summary statistcs from
dairy
:
dairy.sum <- summary(dairy)
sink("summary.log")
cat("My summary stats", fill = TRUE)
show(dairy.sum)
sink()
dairy.sum <- summary(dairy)
capture.output({dairy.sum}, file = "summary.txt")
You can even source several R
scripts into your main source file, like functions
you wrote.
Next, you can also run your script from the command line. For example save the
following as an R
script:
library(epicalc)
with(health.wide, cc(da, parity,
decimal = 2,
graph = FALSE,
original = TRUE,
design = "cohort",
alpha = .05,
fisher.or = FALSE,
exact.ci.or = TRUE))
with(health.wide, mhor(da, parity, mf,
design = "cohort",
decimal = 2,
graph = FALSE))
Now you can call it from the command line and save its output with Rscript
yourScript.R > yourOutput.Rout
. You can also use the batch command R CMD BATCH
yourScript.R yourOutput.txt
. Contrary to the file redirection, R CMD BATCH
captures warnings and errors. R CMD BATCH
also adds a call to proc.time()
to
give the time the script took to run.
Better yet is to embed the results and the code that generated it in the same
document. For this we use knitr
. In your script,
everything written after a comment #
will be converted to text and everything
else will be assumed to be R
code and executed with the spin()
function. A
markdown file is generated. It then
can be converted into a MS Office document with
pandoc. Try with the following and call it
with Rscript -e "knitr::spin('yourScript.R')"
. Then to convert it to
docx: pandoc -s -S yourScript.md -o results.docx
(note: you might not
have pandoc
installed on your computer).
#' Displaced Abomasum
#+ echo=FALSE
health <- read.csv(
file = "health.csv",
header = TRUE,
stringsAsFactors = FALSE)
health$presence <- factor(health$presence, labels = c("No", "Yes"))
health$parity <- cut(health$lactation,
breaks = c(0, 1, max(health$lactation)),
labels = c("1", "2+"))
library(reshape2)
health.wide <- dcast(
data = health,
formula = unique + age + lactation + parity ~ disease,
value.var = "presence"
)
library(stringr)
health.wide$id <- str_split_fixed(health.wide$unique, "-", 2)
health.wide$herd <- health.wide$id[, 1]
health.wide$cow <- health.wide$id[, 2]
health.wide$da <- with(health.wide, ifelse(da == "No", 0, 1))
#' The results of the logistic regression are the following:
#+ log, echo=FALSE
mod1 <- glm(da ~ parity,
family = binomial("logit"),
data = health.wide) # "logit" can be omitted as it is the default
#+ log-sum
summary(mod1)
You can also use stitch()
which insert the R
code into a predefined template
and create a simple report (Rscript -e "knitr::stitch_rhtml(yourScript.R')
).
Basic idea: Write data + software + documentation (or in this case manuscripts, reports) together.
Analysis code can be divided into text and code "chunks". Doing so allows us to
extract the code for machine readable documents (technically referred to as a
tangle
) or produce a human-readable document (also called weave
).
Literate programming involves three main steps:
Results from scientific research have to be easy to reproduce so others can verify results making them trustworthy. Otherwise we risk producing one off results that no one outside the original research group can reproduce. In this lesson we will learn reproducible research, which is one of the by products of dynamics report generation. However, this process alone will not always guarantee reproducibility.
install.packages("knitr", dependencies = TRUE)
Knitr supports a variety of documentation formats including markdown
, html
and LaTeX
. It also allows for easy export to PDF
and HTML
.
Markdown is an incredibly simple semantic file format, not too dissimilar from .doc, .rtf, or .txt. Markdown makes it easy for even those without significant knowledge of markup languages like HTML or LaTeX to write any sort of text (including with links, lists, bullets, etc.) and have it parsed into a variety of formats.
csv
over
xls
). This will ensure that your data will be readable long into the future. In RStudio, choose new R Markdown file (easiest way) or you can create a new
text file and save it with extension .Rmd
.
A basic code chunk looks like this.
You can knit this document using the knit button or do it programmatically using
the knit()
function.
library(knitr)
knit("file.Rmd")
What just happened?
knitr read the Rmd file, then located and ran all the code chunks identified by the backticks, and replaced it with the output of R function calls. If figures are generated from any such calls, they will be included in markdown syntax.
You can also name your code chunks. This allows you to keep all the code in a separate script and just refer to code chunks using meaningful names (e.g. data-processing, analysis, model-fitting, visualization, figures, tables)
Some rules on naming chunks
In addition to naming chunks within the curly braces, you can also add a bunch of other options on how that particular code chunk should behave.
Other options you can add to the tag
Option | Description |
---|---|
echo = TRUE or FALSE | to show or hide code. |
eval = TRUE or FALSE | to run or skip the code. |
warning = TRUE or FALSE | to show or hide function warnings. |
message = TRUE or FALSE | to show or hide function R messages. |
results = "hide" | will hide results. They will still be executed |
fig.height = | Height of figure |
fig.width = | width of figure |
Once your output markdown files (.md
) files are generated, you should never
edit them because they are automatically generated. Next time you knit the
original .Rmd
files, all the changes in the .md
file will get wiped out.
Write sentences in text with inline output
The mean is `r mean(1:5)`.
Global options are shared across all the following chunks after the location in which the options are set, and local options in the chunk header can override global options.
Dealing with long running process
By adding cache = TRUE
to a code block definition. After the first run,
results will be cached.
If you use ggplot2
from the data visualization section, you can have that
easily parsed into your document.
library(ggplot2)
p <- qplot(carat, price, data = diamonds) + geom_hex()
p
# no need to explicitly print(p)
Pandoc is a universal document converter. In
particular, Pandoc can convert Markdown to many other document formats,
including LaTeX, HTML, Rich Text Format (.rtf), E- Book (.epub), Microsoft
Word (.docx) and OpenDocument Text (.odt), etc.
Pandoc is a command line tool. Linux users should be fine with it; for Windows
users, the command window can be accessed via the Start menu, then Run cmd. Once
we have opened a command window (or terminal), we can type commands like this to
convert a Markdown file, say, test.md, to other formats: pandoc test.md -o
test.html
, pandoc test.md -o test.odt
, pandoc test.md -o test.rtf
, pandoc
test.md -o test.docx
, or pandoc test.md -o test.pdf
.
You can add more options to the basic Pandoc call. To see a full list of
options: pandoc --help
.
A commonly used option is to add margins using the -V
argument (in this case 1
inch): pandoc -V geometry:margin=1in test.md -o test.pdf
.
You could imagine splitting the different analytic operations (loading your
data, tidying it, visualizing your data, and running your analysis) into
different R
scripts that you could source
from a "master" script. These
scripts could even be inserted into a R markdown
or LaTeX
file.
source("load.R")
source("tidy.R")
source("analyze.R")
# load
health <- read.csv(
file = "health.csv",
header = TRUE,
stringsAsFactors = FALSE)
health$parity <- cut(health$lactation,
breaks = c(0, 1, max(health$lactation)),
labels = c("1", "2+"))
# tidy
library(reshape2)
health.wide <- dcast(
data = health,
formula = unique + age + lactation + parity ~ disease,
value.var = "presence"
)
# analyze
library(epicalc)
with(health.wide, cc(da, parity,
decimal = 2,
graph = FALSE,
original = TRUE,
design = "cohort",
alpha = .05,
fisher.or = FALSE,
exact.ci.or = TRUE))