Analysis essentials: An example directory structure for an analysis using R
There are a lot of practical skills involved in doing an analysis that are essential but that I rarely (never?) see included in the curriculum, statistics or otherwise. These are skills like how to organize your data, how to approach QAQC, and how to set up a naming algorithm for files. We all need to do these things, but too often we end up learning these skills by muddling through on our own.
There have been some nice papers relevant to the topic of what I’m calling “analysis essentials” that have come out recently and I’ve found to be really useful. See, for example, Good enough practices in scientific computing by Wilson et al. and Data organization in spreadsheets by Broman and Woo. I’m glad to see folks thinking about these things!
In this post I go over something that took me a long time to 1., realize was pretty darn important and 2., actually use, which is organizing a directory of my data, scripts, and output when doing an analysis. I refer to this as setting up my directory structure.
Some of this can be generalized to any set of analysis tools, but the tools I use are specifically related to using R for the analysis.
The first thing I did for this project was to create a folder to hold all the files for the analysis. This folder is called the root directory. It is the top level of the hierarchy of folders that will store the analysis files. Since my project was a discrete set of analyses that I was going to do one time, this relatively simple set-up worked well for me.
I use RStudio Projects for my analyses, and I create a project file (ending in
.Rproj) in my root directory. I use RStudio Projects mostly for convenience, since they make it so easy to go back to my analysis where I left off. Using RStudio Projects also helps me use relative file paths in my work.
If I switch to a different computer while working the analysis project (like when I work on my home computer), I move the root directory and not only individual files.
Not so many years ago, I was laboriously writing out my file paths for setting the working directory in
setwd() or reading/writing files. And then I’d work at home one day and have to manually update all those file paths.
All my scripts had code like
# Directory when at work, uncomment as needed setwd("N:/Atwork/filepath/tomyfiles") # Directory when at home, uncomment as needed # setwd("C:/Users/Owner/Documents/Athome/forsomereason/thisfilepath/isreallylong")
When you can’t even collaborate with yourself efficiently you know there has to be a better way.
Enter the here package. This package contains the
here() function, which finds the root directory based on some simple heuristics. One of these heuristics involves finding the folder in the hierarchy that contains the
.Rproj file. Storing my
.Rproj file in my root directory allows me to write all file paths relative to that directory.
Nowadays I don’t set a working directory at all, but instead read and write files using the
For example, I can read data in from a sub-folder called “Data” within my root directory. This code works on any computer I move the directory to, no matter that the absolute file path has changed.
dat = read.csv(here::here("Data", "mydata.csv"))
Similarly I can write code to save a final plot in a folder called “Plots”.
ggsave(file = "myplot1.png", path = here::here("Plots"))
Package here is great for simple, hierarchical directory structures like mine. For more complicated ones you may need the advanced options available in package rprojroot.
(If you’re not convinced that relative file paths are useful, see Jenny Bryan’s blog post about project-oriented workflows.)
One of the standard sub-folders I create within my root directory for an analysis is the “Data” folder. This is where (surprise! 😆) I store the data I’m going to be analyzing. I keep both original datasets from the domain expert and any edited versions here. I read datasets into R for exploration and analysis from this folder as I demonstrated above via relative file paths and the here package.
I always spend a considerable amount of time doing what a mentor of mine called “becoming one with the dataset” 🧘. This work includes making many different data summaries as well as making a lot of exploratory graphics. I stored the files for this task in the “Exploration” sub-folder.
These days I do my R coding work for projects like this exclusively within R markdown (
.Rmd) files instead of R scripts. The exploration files, in particular, I write so that I can easily share them with not only my future self but also my domain-expert collaborators. This means I have to be extremely strict with myself about going back and writing lucid comments detailing my understanding of the project as a whole, of individual variables, and things I notice as I explore the data. I will talk more about my approach to using R markdown files for exploration and analysis in a future “analysis essentials” post so I won’t go into any more detail here.
For this particular project I was doing separate analyses for data collected at different sites. I used distinct files to explore the data at each of five sites, which is one of the reasons I made a separate sub-folder to store these in.
I adopted a naming algorithm involving dates and site names following some of the suggestions in Jenny Bryan’s “naming things” slides. For example, my R markdown file names looked like
2017-06_watershed_revisited_explore_site1_do.Rmd, and I had HTML output files with the same names that I could send to my collaborator for review. (This project involved dissolved oxygen and temperature data, which is why it looks like I made the odd choice to end my file names with the word “do”. 🤣)
Much of the time data exploration also involves data cleaning. However, data cleaning was not part of my role in this particular project and so I did not have any specific files or folders dedicated to it.
The “Analysis” sub-folder is where I stored the R code files for the statistical analysis. I again used R markdown files, this time with comments and discussion primarily written for my future self (these end up looking suspiciously like comments I make for collaborators other than my future self 😜). Having the HTML output files means I can easily go back and see what I did and why along with the output without having to re-run any code. I did separate analyses per site for two distinct research questions so I ended up with a fair number of analysis scripts.
I put specific scripts for making publication-ready graphics of the raw data for the manuscript in this folder, as well. Code for plots of statistical results was contained within the analysis scripts.
In this particular project I had some analyses that ended up being extremely time-consuming to run. I saved those model objects into the “Analysis” folder as
.Rdata files so I wouldn’t have to re-run the models later to extract output, etc. Looking back now it might have been nice to save these in a sub-folder within “Analysis” to avoid some clutter.
I wrote up a final description of the statistical methods for my collaborator, which I also saved in the “Analysis” folder.
I store all “publication-ready” plots in this sub-folder. I’ve found it’s easier to have plots stored separately, otherwise the “Analysis” folder gets so busy things are hard to find.
I didn’t use the “Results” sub-folder much for this project because I wasn’t asked to make any final tables of statistical results. I did save final estimates from some models as CSV files here so we would have these in an easy-to-use format if needed.
In past projects I used this folder (possibly with a different name, like “Tables”) as a place to save publication-ready tables I made using R.
Some files didn’t fit within the 5 sub-folders I describe above, and those files ended up floating around in my root directory. These files range from publications I tracked down while deciding on a statistical approach to drafts of the manuscript that I reviewed. I see now I probably should have had at least one additional sub-folder to help with the organization of these.
I finished this project a year ago and had to revisit things this week to pull out some example plotting code for a consulting client. This is one of the reasons to get the files organized; I was able to quickly find the code I was looking for and send it along.