At least once a year I meet with a graduate student who has many separate datasets that need to be combined into a single file. The data are usually from a series of data loggers (e.g., iButtons or RFID readers) that record data remotely over a specified time period. The researcher periodically downloads the data from each data logger and then redeploys it for further data collection.

I’m going to set up the background for my particular use case before jumping into the R code to perform this sort of task. Go straight to “List all files to read in” if you want to get right into R code.

What’s so hard about reading in many datasets?

For someone who is at least somewhat familiar with a programming language (e.g., SAS, Python, R), reading many datasets in and combining them into a single file might not seem like a big deal. For example, if I do a quick web search on “r read many datasets” I get at least 5 Stack Overflow posts (with answers) as well as several blog entries. These links show code for relatively simple situations of reading many identical dataset in to R (a couple SO examples can be found here and here).

However, in my experience this work doesn’t feel very simple to beginner programmers. Most of the graduate students I meet with have never worked with any sort of programming language prior to entering their degree program. By the time I meet with them they usually have had a very basic introduction to R in their intro statistics courses. They may have read in a dataset into R only a couple of times at most, and now they have hundreds of them to manage. To further complicate things, there is usually more work that needs to be done beyond reading the datasets in, such as adding important identifying information to each dataset.

Aside: I am couching this around R because that’s what taught in the intro courses in the Statisics Department at my university. I still get a few requests every year for help with SAS programming from faculty and post-docs and so far over six years I’ve had exactly one student client who worked primarily with Python

Why would I want to have to do this in R (or SAS or Python or …)?

That is the question I got from the first student I ever advised on this topic. She was collecting data using many data loggers in a field experiment set up as randomized complete block design with repeated measures. She had 300 comma-delimited files that needed to be concatenated together from her first field season and was planning on a second season that was at least twice as long (so would have at least twice as many files).

Her research collaborators had used these loggers previously, and had given her the following algorithm to follow:

  1. Open each file in Excel
  2. Manually delete the first 15 rows, which contained information about the data logger that wasn’t related to the study
  3. Add columns to indicate the physical study units the data was collected in (“Block”, “Site”, “Plot”)
  4. Copy and paste into a new, combined file
  5. Repeat with all datasets
  6. Name columns

Me:

But really this gave me a chance to discuss reproducibility and the convenience of using computers to do repetitive tasks with the next generation of researchers. While she would need to expend effort understanding R code, the effort in the short term would be valuable in the long term given she was going to do a second field season.

To be honest, she was pretty skeptical that it made sense to use a programming language to do the work. From her perspective, doing the work via R looked more difficult and more “black-box” than manually copying and pasting in Excel. It helped that we found mistakes in the files she’d already edited when when we were setting up the R code (it’s so easy to make mistakes when copying and pasting 300 times!).

Her skepticism continues to be a good reminder to me of what it feels like to be a beginner in a programming language, where you don’t quite trust that the program is doing what you want it to and you don’t exactly fully understand the code. The manual approach you already know how to do can look pretty darn attractive by comparison.

List all files to read in

When reading in many files, my first two tasks are:

  • Getting a list of files to read
  • Figuring out the steps needed to read and manipulate a single file

Here I’ll start by getting a list of the files. I’m using some toy files I made to mirror what the real files looked like. These CSV files are available on the blog GitHub repository.

Listing all files can be done in R via list.files or dir. I’ll use list.files here (no reason, just habit most likely based on what I learned first).

For this particular case I will use four of the arguments in list.files:

  1. The directory containing the files needs to be defined in the "path" argument of list.files. I’m working within an RStudio Project, and will use the here package to indicate the directory the files are in relative to the root directory. See Jenny Bryan’s post here on the merits of here and self-contained projects.

  2. The "pattern" argument is used to tell R which file paths should be listed. If you want to read in all CSV files in a directory, for example, the pattern to match might by ".csv". In the real scenario, there were additional CSV files in the directory that we didn’t want to read. All the files we wanted to read ended in “AB.csv”, so we first defined the pattern as "AB.csv". Later we realized that some file names were all lowercase, so used "AB.csv|ab.csv". The vertical pipe, |, stands for “or”.

  3. The "recursive" argument is used to indicate whether or not child folders in the parent directory should be searched for files to list or not. It defaults to FALSE. The files in this particular case are not stored in a single folder. Instead they are in child folders within an overall “data” directory. The names of the child folders actually indicate the study units (“Blocks” and “Sites”) the data were collected in.

  4. The "full.names" argument is used to indicate if the complete file paths should be returned or only the relative file paths. In this case, the only place in the information about some of the physical units of the study (“Blocks” and “Sites”) are in the directory path. We needed the full path names in order to extract that information and add it to the dataset.

library(here) # v. 0.1

The list.files function returns a vector of file paths.

( allfiles = list.files(path = here("static", "data"),
                        pattern = "AB.csv|ab.csv",
                        full.names = TRUE,
                        recursive = TRUE) )
# [1] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block1/Siteone/SIT1_17_12_21_5.2_AB.csv"
# [2] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block1/Siteone/sit1_17_12_31_2.2_ab.csv"
# [3] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block1/Siteone/SIT1_17_12_9_5.2_AB.csv" 
# [4] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block2/Sitenew/SIT1_17_12_10_3.2_AB.csv"
# [5] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block2/Sitenew/SIT1_17_12_21_3.2_AB.csv"
# [6] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block2/Sitenew/SIT1_17_12_31_5.2_AB.csv"

Practice reading in one file

I find things go more smoothly if I work out the file-reading process with a single file before I try to read a bunch of files. It’s an easy step to want to skip because it feels more efficient to do “everything at once”. I’ve never found that to actually be the case. 👅

I’ll practice with the first file listed in allfiles. The top 6 lines of the raw data file is all extraneous header information, which will be skipped via "skip". There are no column headers ("header") in the file, so those need to be added ("col.names").

( test = read.csv(allfiles[1], 
                skip = 6,
                header = FALSE,
                col.names = c("date", "temperature") ) )
#   date temperature
# 1   15           9
# 2   16           8
# 3   17          15
# 4   18           9
# 5   19          10

That went pretty smoothly, but things get a little hairy from here. The information on the physical units of the study, “Blocks” and “Sites”, are contained only in the file directory path. These need to be extracted from the file path and added to the dataset.

In addition, the “Plot” information is contained in the file name. Plot names are single numbers that are found directly before the period in the file name. In allfiles[1] that number is 5 (the file name is “SIT1_17_12_21_5.2_AB.csv”).

Last, the final two digits of the file name is a code to indicate where the data logger was located. This also needs to be added to the dataset. In the toy example these values are all “AB”, but in the larger set of files this wasn’t true.

All the tasks above are string manipulation tasks. I will tackle these with the functions from the stringr package.

library(stringr) # v. 1.2.0

Extract “Block” names from the file path

Since some information is located within the file path string, splitting the file path up into separate pieces seems like a reasonable first step. This can be done via str_split using "/" as the symbol to split on. As there is only a single character string to split for each dataset, it is convenient to return a matrix instead of a list via simplify = TRUE.

The result is a matrix containing strings in each column.

( allnames = str_split( allfiles[1], pattern = "/", simplify = TRUE) )
#      [,1] [,2]    [,3]    [,4]        [,5]      [,6]   [,7]      [,8]    
# [1,] "C:" "Users" "Owner" "Documents" "Aosmith" "Blog" "aosmith" "static"
#      [,9]   [,10]    [,11]     [,12]                     
# [1,] "data" "Block1" "Siteone" "SIT1_17_12_21_5.2_AB.csv"

The “Block” information is always the third column if counting from the end (it’s the 10th in this case if counting from the beginning). It’s “safer” (i.e., less likely to fail on a different computer) to count from the last column and work backwards here; on a different computer the full file paths may change length but the directory containing the files to read will be the same.

To extract the third column from the end I take the total number of columns and subtract 2.

allnames[, ncol(allnames) - 2]
# [1] "Block1"

This can be added to the dataset as a “block” variable.

test$block = allnames[, ncol(allnames) - 2]
test
#   date temperature  block
# 1   15           9 Block1
# 2   16           8 Block1
# 3   17          15 Block1
# 4   18           9 Block1
# 5   19          10 Block1

Extract “Site” names from the file path

This will be the same as above, except site names are contained in the second-to-last column.

test$site = allnames[, ncol(allnames) - 1]

Extract “Plot” names from the file name

The last character string in our matrix is the file name, which contains the plot name and logger location. In the test case the plot name is “5” and the logger location is “AB”.

allnames[, ncol(allnames)]
# [1] "SIT1_17_12_21_5.2_AB.csv"

This can be split on the underscores and periods and the “Plot” information extracted in much the same was as the “Block” information. I think this option can feel more approachable to beginners and is a reasonable way to solve the problem.

Another option is to define which part of the file name we want to pull out. This involves using regular expressions. I personally find regular expressions quite difficult and invariably turn to Stack Overflow to find the answers. I will use them here to demonstrate a wider breadth of options

A basic introduction to regular expressions is on the second page of the “Work with Strings” cheatsheet from RStudio. A more in depth set of examples can be found on the UBC ST 545 page here.

I used str_extract from stringr with a regular expression for the "pattern" to extract the plot number. In this case I used a lookaround following this Stack Overflow answer. These can be apparently be costly in terms of performance, which I did not find that to be a deterrent in my case. 😄

The regular expression I use indicates I want to extract a digit ([0-9]) that comes immediately before a period. The (?=\\.) is a positive lookahead, telling R that the digit to match will be followed by a period. The plot names are always just in front of a period, which is why this works.

str_extract(allnames[, ncol(allnames)], pattern = "[0-9](?=\\.)")
# [1] "5"

This can then be added to the dataset.

test$plot = str_extract(allnames[, ncol(allnames)], pattern = "[0-9](?=\\.)")

Extract data logger location from the file name

The last thing to do is extract the data logger location code from the file names. These are the last two digits of the file name, immediately before “.csv”.

The str_sub function from stringr is useful for extracting characters from fixed locations in a string. The logger location information is in the same position in every file name if counting from the end of the string. The str_sub function allows the user to pull information counting from the end of the string as well as from the beginning. Because our file names differ in length due to the way that dates are stored, the location data does not always have the same indices if counting characters from the beginning of the string.

Negative indices are used to extract from the end of the string. The location information is stored in the 5th and 6th positions from the end. The negative number largest in absolute value is passed to start and the smallest in absolute value to end. (I forget this pretty much every time I subset strings from the end.)

str_sub(allnames[, ncol(allnames)], start = -6, end = -5)
# [1] "AB"

The location data can then be added to the dataset. Since at least one of the file names is all lowercase, I make sure the data logger location information is converted to all caps via toupper.

test$logloc = toupper( str_sub(allnames[, ncol(allnames)], start = -6, end = -5) )

Here’s what the test dataset looks like now.

test
#   date temperature  block    site plot logloc
# 1   15           9 Block1 Siteone    5     AB
# 2   16           8 Block1 Siteone    5     AB
# 3   17          15 Block1 Siteone    5     AB
# 4   18           9 Block1 Siteone    5     AB
# 5   19          10 Block1 Siteone    5     AB

Make a function to read all the files

Once the process is worked out for one file, I can “functionize” it (i.e., make a function). This allows me to apply the exact same procedure to every dataset as it is read in.

The function I create below takes a single argument: the file path of the dataset. The function reads the file and then adds all the desired columns. It returns the modified dataset.

read_fun = function(path) {
     test = read.csv(path, 
                skip = 6,
                header = FALSE,
                col.names = c("date", "temperature") )
     allnames = str_split( path, pattern = "/", simplify = TRUE)
     test$block = allnames[, ncol(allnames) - 2]
     test$site = allnames[, ncol(allnames) - 1]
     test$plot = str_extract(allnames[, ncol(allnames)], pattern = "[0-9](?=\\.)")
     test$logloc = toupper( str_sub(allnames[, ncol(allnames)], start = -6, end = -5) )
     test
}

I’ll test the function with that first file path again to make sure it works like I expect it to.

read_fun(allfiles[1])
#   date temperature  block    site plot logloc
# 1   15           9 Block1 Siteone    5     AB
# 2   16           8 Block1 Siteone    5     AB
# 3   17          15 Block1 Siteone    5     AB
# 4   18           9 Block1 Siteone    5     AB
# 5   19          10 Block1 Siteone    5     AB

Looks good!

Read all the files

All that’s left to do now is to loop through all the file paths in allfiles, read and modify each one with my function, and stack them together into a single dataset. This can be done in base R with a for or lapply loop. If using either of those options, the final concatenation step can be done via rbind in do.call.

These days I’ve been using the map functions from package purrr for this, mostly because the map_dfr variant conveniently binds everything together by rows for me.

library(purrr) # v. 0.2.3

Here’s what using map_dfr looks like, looping through each element of allfiles to read and modify the datasets with the read_fun function and then stacking everything together into a final combined dataset.

( combined_dat = map_dfr(allfiles, read_fun) )
#    date temperature  block    site plot logloc
# 1    15           9 Block1 Siteone    5     AB
# 2    16           8 Block1 Siteone    5     AB
# 3    17          15 Block1 Siteone    5     AB
# 4    18           9 Block1 Siteone    5     AB
# 5    19          10 Block1 Siteone    5     AB
# 6     1          12 Block1 Siteone    2     AB
# 7     2          15 Block1 Siteone    2     AB
# 8     3          21 Block1 Siteone    2     AB
# 9     4          20 Block1 Siteone    2     AB
# 10    5          20 Block1 Siteone    2     AB
# 11    6          13 Block1 Siteone    2     AB
# 12    1          10 Block1 Siteone    5     AB
# 13    2          19 Block1 Siteone    5     AB
# 14    3          17 Block1 Siteone    5     AB
# 15    4           6 Block1 Siteone    5     AB
# 16    5           5 Block1 Siteone    5     AB
# 17    6          10 Block1 Siteone    5     AB
# 18    7          15 Block1 Siteone    5     AB
# 19    8          16 Block1 Siteone    5     AB
# 20    9          10 Block1 Siteone    5     AB
# 21    1           9 Block2 Sitenew    3     AB
# 22    2           8 Block2 Sitenew    3     AB
# 23    3          15 Block2 Sitenew    3     AB
# 24    5          10 Block2 Sitenew    3     AB
# 25    6           9 Block2 Sitenew    3     AB
# 26    7          10 Block2 Sitenew    3     AB
# 27    8           8 Block2 Sitenew    3     AB
# 28    1          11 Block2 Sitenew    5     AB
# 29    2          12 Block2 Sitenew    5     AB
# 30    3          13 Block2 Sitenew    5     AB
# 31    4          18 Block2 Sitenew    5     AB
# 32    5          19 Block2 Sitenew    5     AB
# 33    6          18 Block2 Sitenew    5     AB
# 34    8          19 Block2 Sitenew    5     AB
# 35    7          18 Block2 Sitenew    5     AB
# 36    9          19 Block2 Sitenew    5     AB
# 37   10          10 Block2 Sitenew    5     AB

Are we finally done?

Hopefully! 😄

In working with real data, the final “combining” step can lead to errors due to unanticipated complexities. In my experience, this most often happens because some of the datasets are physically different than the rest. I’ve worked on problems where, for example, it turned out some datasets were present in the directory but were empty.

In the real files for this particular example, it turned out some of the files had been previously modified manually to remove the header information. We ended up adding an if statement to the function to test each file as we read it in. If it had the header information we’d use skip while reading in the dataset and if it didn’t we wouldn’t. I did something similar in the case where some of the datasets were blank.

After the combined dataset has been created, you might want to save it for further data exploration and/or analysis. If working on an interim set of datasets (such as before a field season is over), saving the R object with saveRDS can be pretty convenient. Saving a final dataset as a CSV may be useful for sharing with collaborators once all datasets have been downloaded and combined, which can be done with, e.g., write.csv.