1 Overview

1.1 Packages (and versions) used in this document

2 What is R Markdown?

3 How to make your work reproducible

3.1 Packages used
3.2 Data management

3.2.1 Importing data
3.2.2 Cleaning and tidying data
3.2.3 Manipulating data

3.3 Analyses

3.3.1 Include result tables
3.3.2 Inline code to display results

1 Overview

In this tutorial, we’ll cover how to use R Markdown to make your work in R reproducible. Whether you’re doing homework, writing a paper, or collaborating with people in your (or someone else’s) lab, you want to make sure that your work can be reproduced by others.

1.1 Packages (and versions) used in this document

## [1] "R version 4.0.2 (2020-06-22)"

##    Package    Version
##  rmarkdown        2.3
##  tidyverse      1.3.0
##      readr      1.3.1
##      tidyr      1.1.2
##      dplyr      1.0.2
##     papaja 0.1.0.9997

2 What is R Markdown?

“R Markdown is a file format for making dynamic documents with R. An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code.” - From Garrett Grolemund’s Introduction to R Markdown

R Markdown allows you to easily make documents inside of R that can be knitted to various formats such as MS Word, PDF, or HTML (all of these tutorials, for example) to name a few. Through R Markdown, you can easily share your code AND your results.

3 How to make your work reproducible

3.1 Packages used

I am a proponent of making as much information available as possible when it comes to research and learning. As you’ve likely noticed with the tutorials available on this site, this includes the versions of R and packages used. While updates to packages or R don’t typically have an adverse impact on the functionality of code, I’d much rather have the versions that I used documented in case it were to happen.

Because of this, I prefer to have a section in all of my R Markdown documents that details the versions of R and packages used. This can be seen above. Some handy functions for finding this information are provided below.

R.Version()[["version.string"]]

## [1] "R version 4.0.2 (2020-06-22)"

packageVersion("tidyverse")

## [1] '1.3.0'

3.2 Data management

One of the big benefits of using R Markdown to document your work is that it keeps a record of every single change you make to your data.

3.2.1 Importing data

As outlined in the tutorial on importing data, it’s possible to import your data directly from the internet. If your data is not copyrighted and you are able to store your raw data online in an accessible repository, such as Github, then this is a good first step to making your research reproducible. Importing your data directly from a link that readers are able to access lends transparency to all of the future steps conducted in your paper.

In the example used in this tutorial, we’ll use male stature data taken from the CDC Growth Charts for the United States.

dat <- read_csv(file = "https://raw.githubusercontent.com/KRR1114/public_data_files/master/male_ht.csv")

If you’re curious about the raw data used, you can copy the link provided in the code above to view the data in your internet brower. Alternatively, you can copy the code chunk above and run it in your own R window.

3.2.2 Cleaning and tidying data

Another benefit of using R Markdown is that you are able to very clearly, and reproducibly, clean, tidy, or manipulate your data without changing the original data file. If you’d like to learn more about this process, we have a tutorial on tidying data.

head(dat[2:9], n = 7)

## # A tibble: 7 x 8
##   age             mean    sd `3rd` `5th` `10th` `25th` `50th`
##   <chr>          <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1 Birth_NA        51.4  2.9   45.7  46.4   48.3   49.5   51.4
## 2 0–0.99_months3  52.0  2.46  48.3  48.3   48.8   50.8   52.1
## 3 1–1.99_months3  55.5  2.53  50.8  51.4   52.1   53.3   55.9
## 4 2–2.99_months3  58.7  2.5   54.6  54.6   55.9   56.8   58.4
## 5 3–3.99_months4  62.1  2.52  57.8  58.3   59     60.3   61.9
## 6 4–4.99_months4  64.2  2.58  59.6  60     61     62.5   64.1
## 7 5–5.99_months   66.7  3.36  62.7  63     63.5   65     66.6

For the most part, this data is tidy; however, there are some small things that we may wish to fix - mostly for the sake of illustration. You may have noticed that some age values have unnecessary characters attached to them. We’ll go ahead and remove these in the following code. Note: we’ll be editing our dataframe, dat, but the original data will remain unchanged - unlike editing a file in excel or SPSS (unless you save it as a completely new file).

dat <- dat %>% 
  mutate(age = str_replace(string = age, pattern = "Birth_NA", replacement = "Birth")) %>% 
  mutate(age = str_replace(string = age, pattern = "months3", replacement = "months")) %>% 
  mutate(age = str_replace(string = age, pattern = "months4", replacement = "months"))

Now, we can again look at the first 7 rows of the age variable in our dataframe, dat, and see that the unnecessary characters are gone. But unlike other methods of cleaning data, doing so inside of R Markdown leaves a complete paper trail of all the things that are done to change the original data.

head(dat$age, n = 7)

## [1] "Birth"         "0–0.99_months" "1–1.99_months" "2–2.99_months"
## [5] "3–3.99_months" "4–4.99_months" "5–5.99_months"

3.2.3 Manipulating data

In addition to cleaning variables, R Markdown allows us to leave a paper trail for any new variables that we add to the dataset. This example comes from the writing functions tutorial. Below, we’ll write a function for creating new standard deviations based on various intraclass correlation coefficients. We’ll then use this function to add new columns to our dataframe.

sd_from_icc <- function(sd, icc = 1) { 
  x <- sd^2
  sig_sq_err <- ((x - icc*(x))/icc)
  sig_sq_tot <- (x + sig_sq_err)
  new_sd <- sqrt(sig_sq_tot)
  return(new_sd)
}

dat$sd_icc80 <- sd_from_icc(sd = dat$sd, icc = .8)
dat$sd_icc70 <- sd_from_icc(sd = dat$sd, icc = .7)
dat$sd_icc50 <- sd_from_icc(sd = dat$sd, icc = .5)

As you can see below, we have new columns added to our dataframe, but we also have a complete record of how the values in those columns were calculated.

head(dat[c(2:4, 14:16)])

## # A tibble: 6 x 6
##   age            mean    sd sd_icc80 sd_icc70 sd_icc50
##   <chr>         <dbl> <dbl>    <dbl>    <dbl>    <dbl>
## 1 Birth          51.4  2.9      3.24     3.47     4.10
## 2 0–0.99_months  52.0  2.46     2.75     2.94     3.48
## 3 1–1.99_months  55.5  2.53     2.83     3.02     3.58
## 4 2–2.99_months  58.7  2.5      2.80     2.99     3.54
## 5 3–3.99_months  62.1  2.52     2.82     3.01     3.56
## 6 4–4.99_months  64.2  2.58     2.88     3.08     3.65

3.3 Analyses

In addition to making data management reproducible, R Markdown allows you to keep a detailed record of any and all analyses that you conduct for a project. For this section, we’re going to switch over to the mtcars dataset that is available in R. This dataset contains data from the 1974 Motor Trend US magazine.

Let’s say that we were wanting to regress mpg (our outcome variable - miles per gallon) onto hp (our predictor variable - gross horsepower). In many papers, you would find the results reported in a few sentences in the results section, such as:
“The result of our regression indicates that horsepower significantly predicted miles per gallon, b = -0.07, t(30) = -6.74, p < .001. Additionally, horsepower explained a significant proportion of the variance in miles per gallon, R² = .59, F(1, 30) = 45.46, p < .001.”

This is still a good method for reporting results; however, adding the use of R Markdown allows you to embed the results directly from your analysis. This helps to combat accidental typos, as well as more nefarious things such as altering p-values, etc. There are a number of ways to embed your results, but we’ll just touch on a few.

3.3.1 Include result tables

If you’re working on a non-formal paper (such as personal documentations of your work or for lab meetings), you may want to just embed the entire results table into your document by including code (shown below) that documents the analysis.

m1 <- lm(formula = mpg ~ hp, data = mtcars)
summary(m1)

## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

3.3.2 Inline code to display results

Alternatively, if you’re working on a formal paper that needs to be formatted a certain way, you can use inline R code to supplement your text. For example, we can use the apa_print() function from the papaja package (an incredibly useful tool for writing APA format papers) to make writing our results much simpler and less error prone. The following code returns a list that I’ve labeled m1.r.

m1.r <- apa_print(m1)
m1.r

## $estimate
## $estimate$Intercept
## [1] "$b = 30.10$, 95\\% CI $[26.76$, $33.44]$"
## 
## $estimate$hp
## [1] "$b = -0.07$, 95\\% CI $[-0.09$, $-0.05]$"
## 
## $estimate$modelfit
## $estimate$modelfit$r2
## [1] "$R^2 = .60$"
## 
## $estimate$modelfit$r2_adj
## [1] "$R^2_{adj} = .59$"
## 
## $estimate$modelfit$aic
## [1] "$\\mathrm{AIC} = 181.24$"
## 
## $estimate$modelfit$bic
## [1] "$\\mathrm{BIC} = 185.64$"
## 
## 
## 
## $statistic
## $statistic$Intercept
## [1] "$t(30) = 18.42$, $p < .001$"
## 
## $statistic$hp
## [1] "$t(30) = -6.74$, $p < .001$"
## 
## $statistic$modelfit
## $statistic$modelfit$r2
## [1] "$F(1, 30) = 45.46$, $p < .001$"
## 
## 
## 
## $full_result
## $full_result$Intercept
## [1] "$b = 30.10$, 95\\% CI $[26.76$, $33.44]$, $t(30) = 18.42$, $p < .001$"
## 
## $full_result$hp
## [1] "$b = -0.07$, 95\\% CI $[-0.09$, $-0.05]$, $t(30) = -6.74$, $p < .001$"
## 
## $full_result$modelfit
## $full_result$modelfit$r2
## [1] "$R^2 = .60$, $F(1, 30) = 45.46$, $p < .001$"
## 
## 
## 
## $table
## A data.frame with 5 labelled columns:
## 
##   predictor estimate                 ci statistic p.value
## 1 Intercept    30.10 $[26.76$, $33.44]$     18.42  < .001
## 2        Hp    -0.07 $[-0.09$, $-0.05]$     -6.74  < .001
## 
## predictor: Predictor 
## estimate : $b$ 
## ci       : 95\% CI 
## statistic: $t(30)$ 
## p.value  : $p$ 
## attr(,"class")
## [1] "apa_results" "list"

We’ll then be using inline R code to subset our list.

“The result of our regression indicates that horsepower significantly predicted miles per gallon, `r printnum(m1.r[["full_result"]][["hp"]])`. Additionally, horsepower explained a significant proportion of the variance in miles per gallon,`r printnum(m1.r[["full_result"]][["modelfit"]][["r2"]])`.”

The above text will render as follows:

“The result of our regression indicates that horsepower significantly predicted miles per gallon, , 95% CI , , , . Additionally, horsepower explained a significant proportion of the variance in miles per gallon, , , ”

Reproducibility through R Markdown

Original Author: Kyle R. Ripley

Updated on 6/25/2020 by Kyle R. Ripley