Importing data from multiple files simultaneously in R

A comparison between base R and Tidyverse methods for importing data from multiple files
Categories: r, tidyverse, base, purrr, import

Author: Gonzalo Garcia-Castro
Published: Sunday, July 5, 2020

TLDR

We need to import several CSV or TXT files and merge them into one data frame in R. Regardless of what function we use to import the files, vectorising the operation using purrr::map in combination with do.call or dplyr::bind_rows is the most time-efficient method (~0.15 s to import and merge 50 files with 10,000 rows each), compared to for loops (~1.2 s) or using lapply (~0.9 s). data.table::fread is the fastest function for importing data. Importing TXT files is slightly faster than importing CSV files.

Why this post

To analyse data in any programming environment, one must first import some data. Sometimes, the data we want to analyse are distributed across several files in the same folder. I work with eye-tracking data from toddlers, which means that I work with multiple files that have many rows. At a 120 Hz sampling frequency, we take 120 samples per second (one sample every ~8.33 ms). A session for one participant can take up to 10 minutes, so these files are somewhat big. These data also tend to be messy, requiring a lot of preprocessing. This means that I need to import the same large files many times during the same R session while wrangling my way through the data, which takes a few seconds each time. After some iterations, it gets annoying. I have decided to invest all my lost time into analysing which method for importing and merging large files is the fastest in R, so that the universe and I are even again.

Below I provide several options for importing data from the different files, using base R and tidyverse, among other tools. I will compare how long it takes to import and merge data using each method under different circumstances. You can find the whole code here in case you want to take a look, reproduce it or play with it1.

1 Ironically, this code is super inefficient and messy. It takes ages to run, and has been written by copy-pasting multiple times. I didn’t feel like doing anything more elegant. Also, I don’t know how. Help yourself.

How can I import large files and merge them?

So we have some files in a folder. All files have the same number of columns, the same column names, and are in the same format. I assume that data are tabular (i.e., in the shape of a rectangle defined by rows and columns). I also assume that data are stored as Comma-Separated Values (.csv) or Tab-separated Text (.txt or .tsv), as these formats are the most reproducible.

We want to import all files and bind their rows together to form a single long data frame. There are multiple combinations of functions we can use. Each function comes from a different package and does the job in a different way. Next, I will show some suggestions, but first let's create some data. We will create 50 datasets with 10 columns and 10,000 rows each, in .txt format. The variables included are numeric and made of 0s and 1s. There is also a column that identifies the data set. These files are created in a temporary directory using the tempdir function for reproducibility. After closing your R session, this directory and all of its contents will disappear.
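To give an idea of what this looks like, here is a minimal sketch of how such files could be generated (the seed, object names, and file names are my own choices, not necessarily those used in the original code):

# simulate 50 datasets with 10 numeric (0/1) columns and 10,000 rows,
# plus a column identifying the data set, and write them as
# tab-separated .txt files to a temporary directory
set.seed(123)
out_dir <- tempdir()

for (i in 1:50) {
  dat <- as.data.frame(matrix(rbinom(10000 * 10, size = 1, prob = 0.5),
                              nrow = 10000, ncol = 10))
  dat$dataset <- paste0("dataset_", i)
  write.table(dat,
              file = file.path(out_dir, paste0("dataset_", i, ".txt")),
              sep = "\t", row.names = FALSE)
}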

Base R: for loops

for loops are one of the fundamental skills in many programming languages. The idea behind for loops is quite intuitive: take a vector or list of length n, and apply a series of functions to each element in order: first to element 1, then to element 2, and so on, until we get to element n. Then, the loop ends. We will first make a vector with the paths of our files, and then apply the read.delim function to each element of the vector (i.e., to each path). Every time we import a file, we store the resulting data frame as an element of a list. After the loop finishes, we merge the rows of all elements of the list using a combination of the functions do.call and rbind.
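A minimal sketch of this approach, assuming the .txt files created above live in out_dir:

# vector with the paths of all .txt files in the folder
paths <- list.files(out_dir, pattern = "\\.txt$", full.names = TRUE)

# import each file and store the resulting data frame in a list
datasets <- list()
for (i in seq_along(paths)) {
  datasets[[i]] <- read.delim(paths[i])
}

# bind the rows of all elements of the list into one data frame
merged <- do.call(rbind, datasets)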

Base R: lapply

We will use the functions read.delim and read.csv in combination with the function lapply. The former are well known. The latter is part of a family of functions (together with sapply, mapply, and some others I can't remember) that take two arguments: a list and a function, which is applied to each element of the list in a vectorised way, without writing an explicit loop.
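A minimal sketch, again assuming the file paths are stored in paths as above:

# lapply applies read.delim to each path and returns a list of data frames
datasets <- lapply(paths, read.delim)
merged <- do.call(rbind, datasets)

# for comma-separated files, read.csv would be used instead:
# datasets <- lapply(csv_paths, read.csv)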

Tidyverse

The tidyverse is a family of packages that suggests a workflow for working in R. The use of pipes (%>%) is one of its signature moves; pipes allow you to chain several operations applied to the same object within the same block of code. In contrast, base R makes you choose between applying several functions to the same object in different blocks of code, or applying those functions in a nested way, so that the first functions you read are those applied last to your object (e.g., do.call(rbind, as.list(data.frame(x = "this is annoying", y = 1:100)))). We will use a combination of the dplyr and purrr packages to import the files listed in a vector, using read.delim and bind_rows.
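A minimal sketch of this approach (the pipe-based chaining is the point here; object names are my own):

library(dplyr)
library(purrr)

merged <- list.files(out_dir, pattern = "\\.txt$", full.names = TRUE) %>%
  map(read.delim) %>%  # import each file into a list of data frames
  bind_rows()          # bind all rows into a single data frame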

data.table

The rbindlist function from the data.table package also allows us to merge the datasets contained in a list. In combination with fread (from the same package), it can be very fast.
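A minimal sketch, using the same folder of .txt files:

library(data.table)

paths <- list.files(out_dir, pattern = "\\.txt$", full.names = TRUE)

# fread auto-detects the delimiter and returns a data.table
datasets <- lapply(paths, fread)

# rbindlist binds the rows of all data.tables in the list
merged <- rbindlist(datasets)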

What method is the fastest?

I will compare how long each combination of importing, vectorising, and merging functions needs to import 50 data sets with 10 columns and 10,000 rows each. Additionally, I will compare the performance of each method when working with CSV (.csv) and TSV (.txt) files. For each method, I will repeat the process 100 times, measuring how long it takes from the moment we list the files in the folder to the moment we finish merging the data sets.
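As an illustration, here is a simplified sketch of how one of these timings could be taken (the full, far less elegant benchmarking code is the one linked above; time_once is a hypothetical helper of mine):

library(dplyr)
library(purrr)

# time one replication of the purrr::map + bind_rows method,
# from listing the files to finishing the merge
time_once <- function(dir) {
  system.time({
    merged <- list.files(dir, pattern = "\\.txt$", full.names = TRUE) %>%
      map(read.delim) %>%
      bind_rows()
  })["elapsed"]
}

# repeat the measurement 100 times and summarise
times <- replicate(100, time_once(out_dir))
c(M = mean(times), SD = sd(times))

Here are the results: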

Figure 1: Mean time (and standard deviation) for each combination of methods and file formats across 100 replications

For more detail:

Table 1: Execution times (in seconds) to import and merge 50 datasets with 10 columns and 10,000 rows each.

                            for loop         lapply           purrr::map
package                     M      SD        M      SD        M      SD
do.call - .csv
  base                      1.34   0.06      1.11   0.07      0.15   0.03
  data.table                1.33   0.39      0.88   0.04      0.17   0.02
  readr                     1.38   0.05      1.07   0.13      0.15   0.02
dplyr::bind_rows - .csv
  base                      1.35   0.16      0.99   0.14      0.15   0.01
  data.table                1.14   0.04      0.84   0.06      0.16   0.01
  readr                     1.25   0.14      0.91   0.04      0.14   0.00
do.call - .txt
  base                      1.18   0.04      0.88   0.04      0.17   0.02
  data.table                1.17   0.04      0.80   0.05      0.15   0.02
  readr                     1.18   0.05      0.80   0.05      0.15   0.02
dplyr::bind_rows - .txt
  base                      1.13   0.03      0.84   0.06      0.20   0.02
  data.table                1.19   0.13      0.77   0.03      0.14   0.01
  readr                     1.13   0.04      0.77   0.03      0.14   0.01
Mean                        1.23   0.10      0.89   0.06      0.15   0.02

Figure 1 and Table 1 show the detailed timings. The grand mean time across all methods is ~0.76 seconds, but there are some differences.

  • It doesn’t really matter which function we use to merge data sets: do.call and dplyr::bind_rows perform similarly.
  • What makes the biggest difference is the function we use to vectorise the importing operation across file names. purrr::map is the fastest: it takes less than 0.3 seconds in all conditions. It is also the least sensitive to the format of the files and to the function we use to import them.
  • The next vectorising function in terms of temporal efficiency is lapply, which takes ~0.9 seconds. It performs slightly better when working with .txt files; when working with .csv files, its performance depends more on the function we use to import them: data.table::fread is much faster than its base and readr competitors. This post by Daniele Cook sheds some light on the advantage of data.table over other importing functions, and also covers the vroom package, which this post doesn’t.
  • Using for loops looks like the least efficient method for iterating across data sets when importing data. It also shows a similar profile to lapply: data.table::fread performs a bit better than the rest.

Conclusion

For the scenario in which I have simulated the data, using purrr::map in combination with do.call or dplyr::bind_rows to merge data sets seems to be the most time-efficient method. When using that combination, it doesn’t matter much which function we use to import the files, but data.table::fread seems like the best choice, as it is also the most flexible (take a look at the documentation of data.table to see all the features it offers).

If I have time, I may add another two dimensions: the number of rows in the files and the number of files, although I dare say similar results are to be expected. If anything, I would say that the differences may become greater as file size and the number of files increase. Also, it would be interesting to test whether pre-allocating the list in the for loop speeds up the process (see here what I mean). We shall see.

Hope this was useful, if not interesting!

Code

Session info

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
[3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
[5] LC_TIME=Spanish_Spain.utf8    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] gt_0.8.0          ggsci_2.9         forcats_0.5.2     stringr_1.5.0    
 [5] readr_2.1.3       tidyr_1.2.1       tibble_3.1.8      ggplot2_3.4.0    
 [9] tidyverse_1.3.2   data.table_1.14.6 purrr_1.0.0       dplyr_1.0.10     
[13] quarto_1.2       

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9          lubridate_1.9.0     ps_1.7.2           
 [4] assertthat_0.2.1    digest_0.6.31       utf8_1.2.2         
 [7] R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
[10] reprex_2.0.2        evaluate_0.19       httr_1.4.4         
[13] pillar_1.8.1        rlang_1.0.6         googlesheets4_1.0.1
[16] readxl_1.4.1        rstudioapi_0.14     rmarkdown_2.19     
[19] labeling_0.4.2      googledrive_2.0.0   munsell_0.5.0      
[22] broom_1.0.2         compiler_4.2.2      modelr_0.1.10      
[25] xfun_0.36           pkgconfig_2.0.3     htmltools_0.5.4    
[28] tidyselect_1.2.0    fansi_1.0.3         crayon_1.5.2       
[31] tzdb_0.3.0          dbplyr_2.2.1        withr_2.5.0        
[34] later_1.3.0         commonmark_1.8.1    grid_4.2.2         
[37] jsonlite_1.8.4      gtable_0.3.1        lifecycle_1.0.3    
[40] DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
[43] cli_3.6.0           stringi_1.7.8       farver_2.1.1       
[46] renv_0.16.0         fs_1.5.2            xml2_1.3.3         
[49] ellipsis_0.3.2      generics_0.1.3      vctrs_0.5.1        
[52] tools_4.2.2         glue_1.6.2          hms_1.1.2          
[55] processx_3.8.0      fastmap_1.1.0       yaml_2.3.6         
[58] timechange_0.1.1    colorspace_2.0-3    gargle_1.2.1       
[61] rvest_1.0.3         knitr_1.41          haven_2.5.1        
[64] sass_0.4.4         


Citation

BibTeX citation:
@online{garcia-castro2020,
  author = {Garcia-Castro, Gonzalo},
  title = {Importing Data from Multiple Files Simultaneously in {R}},
  date = {2020-07-05},
  langid = {en}
}
For attribution, please cite this work as:
Garcia-Castro, G. (2020, July 5). Importing data from multiple files simultaneously in R.