This tutorial is hosted at davekleinschmidt.com/r-packages/.
`devtools` makes it surprisingly easy to make a data package. There are lots of good reasons why you should (reproducibility, convenience, sharing, documenting). The short version:

1. `devtools::create()` the package skeleton.
2. `devtools::use_data_raw()`, and move raw data into `data-raw/`.
3. Write a script in `data-raw/` which reads in the raw data and calls `devtools::use_data(<processed data>)` to save `.RData`-formatted data files in `data/`.
4. `devtools::load_all()` to access the data, or `devtools::install()` and then `library(<package name>)`.
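In code, the whole arc looks something like this (a sketch; `mydata` is a placeholder package name):

```r
## Sketch of the full workflow ('mydata' is a placeholder):
devtools::create("~/mydata")   # 1. create the package skeleton
setwd("~/mydata")
devtools::use_data_raw()       # 2. creates data-raw/; move raw files there
## 3. write a script in data-raw/ that ends with devtools::use_data(...)
devtools::load_all()           # 4. access the data during development...
## ...or install and attach it like any other package:
devtools::install()
library(mydata)
```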
The GitHub repository includes the Rmd source, and an example of how I've refactored an analysis along these lines, starting from an RMarkdown + CSV and ending up with an RMarkdown + a package.
If you want a sense of how this might work IRL, the example is based on this data package, which I’ve used for a couple of papers/presentations:
I’ve also compiled a few packages based on donated data to streamline my own workflow, some of which the originators of the data have generously agreed to release publicly:
An R package is the basic unit of reproducible code. There are lots of reasons you might want to make one:

1. You've written some generally useful functions that you want to share with other people.
2. You're developing a serious piece of statistical software for others to use (like `lme4`).
3. You want a convenient, documented, and reproducible way to store and access your own data and code.

This guide focuses on the third use case. There are lots of good guides for the first case (Hilary Parker's is the classic), and if you're in the second camp you probably already know what you're doing (and if not, Hadley Wickham's R packages book is an excellent and thorough guide).
We'll start from what used to be my default workflow (raw data file + R scripts to process it and make figures, etc.), and end up with an R package that gives you easy access to the pre-processed data while tracking how that data was generated from its raw form.
You'll need the `devtools` package, and (optionally) an existing analysis to refactor (an `.R` or `.Rmd` file, etc.).

To a first approximation, a minimal package is:

- a directory with R code in `R/`, and
- a `DESCRIPTION` file (what the package is called, who made it, what it does, what other packages it depends on).

When you load a package with `library` or `require`, R looks in the package directory and runs the stuff in the `.R` files in `R/`.
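On disk, that minimal layout might look like this (file names here are placeholders):

```
mypackage/
├── DESCRIPTION    # what it's called, who made it, what it depends on
└── R/
    └── code.R     # your functions
```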
If there's a `data/` subdirectory in the package directory, R will also make any data files[^1] there available. In R, the dataset has the same name as the data file. There are (at least) three ways to access data from a package:
**The `::` operator:**

```r
ggplot2::diamonds %>% head()
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.230 Ideal     E     SI2      61.5  55.0   326  3.95  3.98  2.43
## 2 0.210 Premium   E     SI1      59.8  61.0   326  3.89  3.84  2.31
## 3 0.230 Good      E     VS1      56.9  65.0   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4  58.0   334  4.20  4.23  2.63
## 5 0.310 Good      J     SI2      63.3  58.0   335  4.34  4.35  2.75
## 6 0.240 Very Good J     VVS2     62.8  57.0   336  3.94  3.96  2.48
```
**`library()`:** Then you can refer to datasets directly:
```r
library(ggplot2)
diamonds %>% head()
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.230 Ideal     E     SI2      61.5  55.0   326  3.95  3.98  2.43
## 2 0.210 Premium   E     SI1      59.8  61.0   326  3.89  3.84  2.31
## 3 0.230 Good      E     VS1      56.9  65.0   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4  58.0   334  4.20  4.23  2.63
## 5 0.310 Good      J     SI2      63.3  58.0   335  4.34  4.35  2.75
## 6 0.240 Very Good J     VVS2     62.8  57.0   336  3.94  3.96  2.48
```
But they're not added to the global environment:

```r
ls()
## character(0)
```
**`data()`:** This puts the requested dataset into the global environment:
```r
data("diamonds", package = "ggplot2")
ls()
## [1] "diamonds"
```
You can also get a listing of all the datasets in a package with `data(package = "ggplot2")`.
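If you want those dataset names programmatically (say, to loop over them), the object that `data()` returns has them in its `results` matrix:

```r
## Names of all the datasets bundled with ggplot2:
data(package = "ggplot2")$results[, "Item"]
```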
There are a few ways to set up the package skeleton.

**With RStudio (the easy way):**

1. Create a new R Package project. RStudio will ask what to call it (this will be your package's name), and where it'll live locally (defaults to `~/`, your home directory).
2. Open the generated `DESCRIPTION` file, and edit it.
3. Put your R code in the `R/` subdirectory.
**With `devtools` (the slightly less easy way):** The `devtools` package is the back-end that RStudio uses to set up your package. It provides a convenient set of functions for doing those steps manually if you don't like clicking on buttons (or don't want to use RStudio):

1. `devtools::create('~/mypackage')`.
2. Edit `DESCRIPTION`.
3. Put your R code in the `R/` subdirectory.
**By hand (the hard way):** You probably don't want to do this. These are the steps that `devtools` and RStudio automate for you:

1. Create a `DESCRIPTION` file in your package-to-be directory.
2. Create an `R/` subdirectory and put your code in there.
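For the curious, here's roughly what the by-hand route amounts to in the console (a sketch; every name and field value below is a placeholder):

```r
## By-hand package skeleton (all names/values are placeholders):
dir.create("~/mypackage/R", recursive = TRUE)
writeLines(c(
  "Package: mypackage",
  "Title: What the Package Does",
  "Version: 0.0.0.9000",
  "Description: A longer description of what the package does.",
  "License: MIT + file LICENSE"
), con = "~/mypackage/DESCRIPTION")
## ...then put your .R code files in ~/mypackage/R/
```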
**Package workflow:** The workflow when using a package is slightly different from what you might be used to.
If you're just working within the package directory, the steps are simple:

1. Edit your code.
2. `devtools::load_all()` (or ⌘-⇧-L in RStudio).

If you're working from an installed package (e.g. to use across multiple projects), when you edit the package code you need to build, install, and reload it:

1. `devtools::document('/path/to/pkg')`
2. `devtools::install('/path/to/pkg')`
3. `devtools::reload(inst('pkg'))`

Note that you only need to do this if you edit something in the package.
By default, all the functions and variables that are created in your package are private: they're not made available when you attach the package with `library()`. The `NAMESPACE` file tells R which things you want to export as part of the package's namespace. The easiest, best, and most foolproof way to generate this file is with special comments (processed by `roxygen2` via `devtools::document()`) before each function/variable you want to export; the `@export` tag marks it for export in `NAMESPACE`. For instance:
```r
#' Short description of what this does
#'
#' Longer description of what this does. Approximately a paragraph.
#'
#' @param x The first thing.
#' @param y The second thing.
#' @return The thing that comes out of this function.
#'
#' @export
a_function <- function(x, y) {
  return(x + y)
}
```
Then you run `devtools::document()`, which will update `NAMESPACE` and create help files in, e.g., `man/a_function.Rd`. Now you can call `mypackage::a_function()`, or just `a_function()` after `library(mypackage)`.
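A quick sanity check that the export worked, continuing the hypothetical `a_function` example from above:

```r
devtools::document()   # regenerates NAMESPACE and man/a_function.Rd
devtools::load_all()
a_function(1, 2)
## [1] 3
```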
See the R packages book on documentation for more information on this.
For our purposes, this isn't strictly necessary (if you're just doing `devtools::load_all()`), but it's important to know for later if you want to share this or `devtools::install()` your package.
**The very easy way:** Put data files (`.RData`, `.csv`, etc.) in the `data/` subdirectory of your package. Bam, your data is in your package.
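Scripted, that's just something like this (the object and file names are examples):

```r
## The "very easy way", run from the package directory:
dir.create("data", showWarnings = FALSE)
save(experiment1, file = "data/experiment1.RData")
```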
**With `devtools` (good):** The version of the data that's worth packaging is probably not pure, raw data, but data that's been cleaned, processed, summarized, or collated in some way. Just dropping pre-processed data files into your package is bad for the same reason that it's bad to do all your data analysis directly in the console: there's no record of what you've done, and there's no reliable way to reproduce it once you close R.
The solution is the same: create a script that covers all the steps you took from beginning (loading a CSV) to end (putting the final data files in `data/`). By including this script in your package along with the raw data, you get the convenience of having easy, fast access to the pre-processed data and all the benefits of reproducibility.
Here’s how:
1. Create a home for your raw data and preprocessing scripts to live in:

```r
devtools::use_data_raw()
## Creating data-raw/
## Next:
## * Add data creation scripts in data-raw
## * Use devtools::use_data() to add data to package
```
2. Create an R script in `data-raw/` that reads in the raw data, processes it, and puts it where it belongs. Such a script might look like this:
```r
library(dplyr)  # for %>% and mutate()

experiment1 <-
  read.csv('expt1.csv') %>%
  mutate(experiment = 1)

devtools::use_data(experiment1)
```
This saves `data/experiment1.RData` in your package directory (make sure you've `setwd()` to the package directory…).
3. Run this script (with `source()` or ⌘-⇧-S in RStudio) to actually create the data files. Now when you load the package, the dataset will be available as `experiment1`, already processed:
```r
devtools::load_all()
experiment1 %>% head()

## or use data() to put it in the global environment
data("experiment1")
```
You can save as many versions of the data as you'd like. For instance, if you want to have easy access to a summarized version of the dataset, you can save that, too:
```r
experiment1_summary <-
  experiment1 %>%
  group_by(subject, condition, block) %>%
  summarise(mean_rt = mean(rt))

devtools::use_data(experiment1_summary)
```
4. (Optionally) Commit the script in git. In my (and Hadley's) opinion, using git to track changes in your code is always a good idea, and it's integrated right into RStudio. I also like to commit at least the processed data, and the raw data if it doesn't have personally identifiable information in it. You'll need to do this if you're going to distribute the package over GitHub, etc. If the data files are very large, you can use something like Git Large File Storage.
5. (Optionally) Document your datasets. This works basically the same as documenting other objects, with the exception that the object you document is the name of the dataset (as a string). The convention is to put these in `R/data.R`:
```r
#' Data from Experiment 1
#'
#' This is data from the first experiment ever to try XYZ using Mechanical
#' Turk workers.
#'
#' @format A data frame with NNNN rows and NN variables:
#' \describe{
#'   \item{subject}{Anonymized Mechanical Turk Worker ID}
#'   \item{trial}{Trial number, from 1..NNN}
#'   ...
#' }
"experiment1"
```
Writing this documentation is slightly annoying, but a very good idea if you intend to share your data (and that includes with your advisor, students, labmates, or, especially, future-you).
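After another `devtools::document()`, the dataset gets its own help page:

```r
devtools::document()   # writes man/experiment1.Rd from the block above
?experiment1           # view the dataset's help page (once loaded/installed)
```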
You can, of course, do all of this from the R console. But, again, for the sake of reproducibility, it’s always better to put it in a script.
These scripts don't have to live in `data-raw/`, but that's where you put your raw data. And you don't want them to run every time the package is loaded, so they shouldn't go in `R/`.
Even if you're packaging raw data, there's still a good reason to do it this way: `.RData` files are much faster to read from disk than text-based formats like CSV. So every time you use this data, you're saving a little bit of time (or a lot of time, if your data is even medium-sized). This reduces the friction associated with re-compiling your `.Rmd` files (say), or creating new sessions/`.Rmd` files for each analysis, which in turn makes it way easier to make sure your analysis is really reproducible and self-contained.
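If you're curious what the difference looks like on your own data, a rough comparison (file names from the example above):

```r
## Rough timing comparison (file names are examples):
system.time(read.csv("data-raw/expt1.csv"))   # parse text: slower
system.time(load("data/experiment1.RData"))   # load binary: faster
```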
[^1]: Data files can be lots of things: `.RData`, `.csv`, `.R`, etc. See `?data`.