Sometimes you have multiple observations per subject, across time or for different chacteristics of the subject. It is said the data is in "wide" format if there's only one observation/row per subject with each measurement/characteristic set as a different variable. It is in "long" format if there's one observation/row per measurement (i.e. multiple rows per subject). Reshaping data is like transposing rows and columns, switching from wide to long format and vice versa.
Using the prod
dataset which gives the milk production at each monthly tests
(dim
= days in milk), what is its format?
prod <- read.csv(
file = "prod.csv",
header = TRUE,
stringsAsFactors = FALSE
)
We can go from wide to long format with melt
function in the
reshape2
package. You could do it with base R
functions but reshape2
has
simpler and more consistent functions for these operations.
library(reshape2)
prod.long <- melt(
data = prod,
id = "unique",
measure.vars = 2:11,
variable.name = "test",
value.name = "milk"
)
prod.long2 <- melt(
data = prod,
id = "unique",
measure.vars = 12:21,
variable.name = "test",
value.name = "dim"
)
# combining melt and dcast
prod.l1 <- melt(prod, id = "unique")
library(stringr)
prod.l1 <- transform(prod.l1, month = str_replace(variable, "^.*\\.", ""),
variable = str_replace(variable, "\\..*$", ""))
prod.l2 <- dcast(prod.l1, unique + month ~ variable)
# in 1 line with base::reshape
prod.l3 <- reshape(prod, dir = "long", varying = 2:21, sep = ".")
We can go from long to wide format with the dcast
function.
Transform the health dataset
from long to wide format.
health.wide <- dcast(
data = health,
formula = unique + age + lactation + parity ~ disease,
value.var = "presence"
)
We can use the stringr
package to manipulate strings. For example, unique
is
made of the herd id and the cow id separated by a dash. If you want to break it
in two to get back each pieces of id you could:
library(stringr)
health.wide$id <- str_split_fixed(health.wide$unique, "-", 2)
health.wide$herd <- health.wide$id[, 1]
health.wide$cow <- health.wide$id[, 2]
See help for stringr
to get all the possibilities. Note that you could also
use base R
function to do this.
You might want to join the two datasets so that we could link diseases and milk
production. Base R
has the merge
function for this.
health.prod <- merge(health.wide, prod, by = "unique")
plyr
library has also the join
function to merge together data
frames. plyr
has other interesting functions as well. For example, when you
melt
the prod
dataset, you noticed that month
could be better
formatted. plyr
has the mutate
function to add or replace existing columns,
and it would be more interesting to order columns with arrange
function:
library(plyr)
prod.long <- mutate(prod.long, test = sub("milk.", "", test))
prod.long2 <- mutate(prod.long2, test = sub("dim.", "", test))
prod.long <- join(prod.long, prod.long2, by = c("unique", "test"))
prod.long <- arrange(prod.long, unique, test)
Is it ordered by id
and month
? Why not?