In this tutorial, we’ll cover how to use R Markdown to make your work in R reproducible. Whether you’re doing homework, writing a paper, or collaborating with people in your (or someone else’s) lab, you want to make sure that your work can be reproduced by others.
## [1] "R version 4.0.2 (2020-06-22)"
## Package Version
## rmarkdown 2.3
## tidyverse 1.3.0
## readr 1.3.1
## tidyr 1.1.2
## dplyr 1.0.2
## papaja 0.1.0.9997
“R Markdown is a file format for making dynamic documents with R. An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code.” - From Garrett Grolemund’s Introduction to R Markdown
R Markdown allows you to easily make documents inside of R that can be knitted to various formats such as MS Word, PDF, or HTML (all of these tutorials, for example) to name a few. Through R Markdown, you can easily share your code AND your results.
I am a proponent of making as much information available as possible when it comes to research and learning. As you’ve likely noticed with the tutorials available on this site, this includes the versions of R and packages used. While updates to packages or R don’t typically have an adverse impact on the functionality of code, I’d much rather have the versions that I used documented in case it were to happen.
Because of this, I prefer to have a section in all of my R Markdown documents that details the versions of R and packages used. This can be seen above. Some handy functions for finding this information are provided below.
R.Version()[["version.string"]]
## [1] "R version 4.0.2 (2020-06-22)"
packageVersion("tidyverse")
## [1] '1.3.0'
One of the big benefits of using R Markdown to document your work is that it keeps a record of every single change you make to your data.
As outlined in the tutorial on importing data, it’s possible to import your data directly from the internet. If your data is not copyrighted and you are able to store your raw data online in an accessible repository, such as Github, then this is a good first step to making your research reproducible. Importing your data directly from a link that readers are able to access lends transparency to all of the future steps conducted in your paper.
In the example used in this tutorial, we’ll use male stature data taken from the CDC Growth Charts for the United States.
dat <- read_csv(file = "https://raw.githubusercontent.com/KRR1114/public_data_files/master/male_ht.csv")
If you’re curious about the raw data used, you can copy the link provided in the code above to view the data in your internet brower. Alternatively, you can copy the code chunk above and run it in your own R window.
Another benefit of using R Markdown is that you are able to very clearly, and reproducibly, clean, tidy, or manipulate your data without changing the original data file. If you’d like to learn more about this process, we have a tutorial on tidying data.
head(dat[2:9], n = 7)
## # A tibble: 7 x 8
## age mean sd `3rd` `5th` `10th` `25th` `50th`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Birth_NA 51.4 2.9 45.7 46.4 48.3 49.5 51.4
## 2 0–0.99_months3 52.0 2.46 48.3 48.3 48.8 50.8 52.1
## 3 1–1.99_months3 55.5 2.53 50.8 51.4 52.1 53.3 55.9
## 4 2–2.99_months3 58.7 2.5 54.6 54.6 55.9 56.8 58.4
## 5 3–3.99_months4 62.1 2.52 57.8 58.3 59 60.3 61.9
## 6 4–4.99_months4 64.2 2.58 59.6 60 61 62.5 64.1
## 7 5–5.99_months 66.7 3.36 62.7 63 63.5 65 66.6
For the most part, this data is tidy; however, there are some small things that we may wish to fix - mostly for the sake of illustration. You may have noticed that some age
values have unnecessary characters attached to them. We’ll go ahead and remove these in the following code. Note: we’ll be editing our dataframe, dat
, but the original data will remain unchanged - unlike editing a file in excel or SPSS (unless you save it as a completely new file).
dat <- dat %>%
mutate(age = str_replace(string = age, pattern = "Birth_NA", replacement = "Birth")) %>%
mutate(age = str_replace(string = age, pattern = "months3", replacement = "months")) %>%
mutate(age = str_replace(string = age, pattern = "months4", replacement = "months"))
Now, we can again look at the first 7 rows of the age
variable in our dataframe, dat
, and see that the unnecessary characters are gone. But unlike other methods of cleaning data, doing so inside of R Markdown leaves a complete paper trail of all the things that are done to change the original data.
head(dat$age, n = 7)
## [1] "Birth" "0–0.99_months" "1–1.99_months" "2–2.99_months"
## [5] "3–3.99_months" "4–4.99_months" "5–5.99_months"
In addition to cleaning variables, R Markdown allows us to leave a paper trail for any new variables that we add to the dataset. This example comes from the writing functions tutorial. Below, we’ll write a function for creating new standard deviations based on various intraclass correlation coefficients. We’ll then use this function to add new columns to our dataframe.
sd_from_icc <- function(sd, icc = 1) {
x <- sd^2
sig_sq_err <- ((x - icc*(x))/icc)
sig_sq_tot <- (x + sig_sq_err)
new_sd <- sqrt(sig_sq_tot)
return(new_sd)
}
dat$sd_icc80 <- sd_from_icc(sd = dat$sd, icc = .8)
dat$sd_icc70 <- sd_from_icc(sd = dat$sd, icc = .7)
dat$sd_icc50 <- sd_from_icc(sd = dat$sd, icc = .5)
As you can see below, we have new columns added to our dataframe, but we also have a complete record of how the values in those columns were calculated.
head(dat[c(2:4, 14:16)])
## # A tibble: 6 x 6
## age mean sd sd_icc80 sd_icc70 sd_icc50
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Birth 51.4 2.9 3.24 3.47 4.10
## 2 0–0.99_months 52.0 2.46 2.75 2.94 3.48
## 3 1–1.99_months 55.5 2.53 2.83 3.02 3.58
## 4 2–2.99_months 58.7 2.5 2.80 2.99 3.54
## 5 3–3.99_months 62.1 2.52 2.82 3.01 3.56
## 6 4–4.99_months 64.2 2.58 2.88 3.08 3.65
In addition to making data management reproducible, R Markdown allows you to keep a detailed record of any and all analyses that you conduct for a project. For this section, we’re going to switch over to the mtcars
dataset that is available in R. This dataset contains data from the 1974 Motor Trend US magazine.
Let’s say that we were wanting to regress mpg
(our outcome variable - miles per gallon) onto hp
(our predictor variable - gross horsepower). In many papers, you would find the results reported in a few sentences in the results section, such as:
“The result of our regression indicates that horsepower significantly predicted miles per gallon, b = -0.07, t(30) = -6.74, p < .001. Additionally, horsepower explained a significant proportion of the variance in miles per gallon, R2 = .59, F(1, 30) = 45.46, p < .001.”
This is still a good method for reporting results; however, adding the use of R Markdown allows you to embed the results directly from your analysis. This helps to combat accidental typos, as well as more nefarious things such as altering p-values, etc. There are a number of ways to embed your results, but we’ll just touch on a few.
If you’re working on a non-formal paper (such as personal documentations of your work or for lab meetings), you may want to just embed the entire results table into your document by including code (shown below) that documents the analysis.
m1 <- lm(formula = mpg ~ hp, data = mtcars)
summary(m1)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
Alternatively, if you’re working on a formal paper that needs to be formatted a certain way, you can use inline R code to supplement your text. For example, we can use the apa_print()
function from the papaja
package (an incredibly useful tool for writing APA format papers) to make writing our results much simpler and less error prone. The following code returns a list that I’ve labeled m1.r
.
m1.r <- apa_print(m1)
m1.r
## $estimate
## $estimate$Intercept
## [1] "$b = 30.10$, 95\\% CI $[26.76$, $33.44]$"
##
## $estimate$hp
## [1] "$b = -0.07$, 95\\% CI $[-0.09$, $-0.05]$"
##
## $estimate$modelfit
## $estimate$modelfit$r2
## [1] "$R^2 = .60$"
##
## $estimate$modelfit$r2_adj
## [1] "$R^2_{adj} = .59$"
##
## $estimate$modelfit$aic
## [1] "$\\mathrm{AIC} = 181.24$"
##
## $estimate$modelfit$bic
## [1] "$\\mathrm{BIC} = 185.64$"
##
##
##
## $statistic
## $statistic$Intercept
## [1] "$t(30) = 18.42$, $p < .001$"
##
## $statistic$hp
## [1] "$t(30) = -6.74$, $p < .001$"
##
## $statistic$modelfit
## $statistic$modelfit$r2
## [1] "$F(1, 30) = 45.46$, $p < .001$"
##
##
##
## $full_result
## $full_result$Intercept
## [1] "$b = 30.10$, 95\\% CI $[26.76$, $33.44]$, $t(30) = 18.42$, $p < .001$"
##
## $full_result$hp
## [1] "$b = -0.07$, 95\\% CI $[-0.09$, $-0.05]$, $t(30) = -6.74$, $p < .001$"
##
## $full_result$modelfit
## $full_result$modelfit$r2
## [1] "$R^2 = .60$, $F(1, 30) = 45.46$, $p < .001$"
##
##
##
## $table
## A data.frame with 5 labelled columns:
##
## predictor estimate ci statistic p.value
## 1 Intercept 30.10 $[26.76$, $33.44]$ 18.42 < .001
## 2 Hp -0.07 $[-0.09$, $-0.05]$ -6.74 < .001
##
## predictor: Predictor
## estimate : $b$
## ci : 95\% CI
## statistic: $t(30)$
## p.value : $p$
## attr(,"class")
## [1] "apa_results" "list"
We’ll then be using inline R code to subset our list.
“The result of our regression indicates that horsepower significantly predicted miles per gallon,
`r printnum(m1.r[["full_result"]][["hp"]])`
. Additionally, horsepower explained a significant proportion of the variance in miles per gallon,`r printnum(m1.r[["full_result"]][["modelfit"]][["r2"]])`
.”
The above text will render as follows:
“The result of our regression indicates that horsepower significantly predicted miles per gallon, \(b = -0.07\), 95% CI \([-0.09\), \(-0.05]\), \(t(30) = -6.74\), \(p < .001\). Additionally, horsepower explained a significant proportion of the variance in miles per gallon, \(R^2 = .60\), \(F(1, 30) = 45.46\), \(p < .001\)”