Benchmarking CSV-file Downloads

I recently had to import over 30 CSV-files from an online source into R and combine them. This got me asking: What is the fastest way to read several CSV-files from an online source into R?

I found several benchmarks online comparing different packages, functions and methods for importing CSV-files from disk into memory, such as this one from Appsilon, but I could not find any that compare import speed from an online source.

In this post I will try to do just that.

Method

  • I will use open data from the City Bikes in Bergen, Norway, to compare the import speed. The CSV-files can be found here.
  • As internet download speed is not constant, I will download 6 files (half a year's worth) for each function and run each test 100 times, so that fluctuations in download speed should average out.
  • I will only compare functions that can download the files directly, even though importing them and saving them as another file format might be faster.

Functions to compare

In this post, I will compare the following functions:

  • utils::read.csv() which is part of the pre-installed packages you get with R,
  • readr::read_csv() from the {tidyverse} family of packages, and
  • data.table::fread() which was the fastest of the three in Appsilon’s benchmark.
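All three functions accept a URL in place of a local file path, so they can read the remote CSV-files directly. As a quick check outside the benchmark, a single month can be read like this, using one of the URLs constructed in the Setup section below:

url <- "http://data.urbansharing.com/bergenbysykkel.no/trips/v1/2020/01.csv"

head(utils::read.csv(url))
head(readr::read_csv(url))
head(data.table::fread(url))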

Benchmark

Setup

First, I define a vector of URLs to the files. I will download the files from the first half of 2020.

library(dplyr)

base_url <- "http://data.urbansharing.com/bergenbysykkel.no/trips/v1/2020/"

urls <- paste0(base_url, sprintf("%02d", 1:6), ".csv")

urls[1:3]
## [1] "http://data.urbansharing.com/bergenbysykkel.no/trips/v1/2020/01.csv"
## [2] "http://data.urbansharing.com/bergenbysykkel.no/trips/v1/2020/02.csv"
## [3] "http://data.urbansharing.com/bergenbysykkel.no/trips/v1/2020/03.csv"

Next, I’ll define functions for each of the tests.

Each function uses purrr::map_dfr() to map one of the readers over the URLs and returns a single data.frame by row-binding the results.

# Each wrapper maps its reader over the URLs and row-binds the results
utils_read_csv   <- function() purrr::map_dfr(urls, utils::read.csv)
readr_read_csv   <- function() purrr::map_dfr(urls, readr::read_csv)
data_table_fread <- function() purrr::map_dfr(urls, data.table::fread)

Running benchmark

Next, we’ll use microbenchmark::microbenchmark() to benchmark the functions.

benchmark <- microbenchmark::microbenchmark(
  utils_read_csv(),
  readr_read_csv(),
  data_table_fread(), 
  times = 100
)

Evaluate the benchmark

Let’s print and plot the results:

print(benchmark)
## Unit: seconds
##                expr      min       lq     mean   median       uq      max neval
##    utils_read_csv() 4.739795 5.036515 5.427946 5.309074 5.795855 6.505738   100
##    readr_read_csv() 5.773443 6.032951 6.479659 6.373512 6.661082 8.561153   100
##  data_table_fread() 4.067120 4.416386 4.839889 4.707098 5.093114 9.581562   100
benchmark %>% ggplot2::autoplot()

When comparing benchmarks, it is often most useful to look at the minimum time (the fastest the function can go) and the median (a typical run, which is robust to outliers).
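For completeness, these two statistics can be pulled straight out of the benchmark object; a small sketch using the dplyr verbs already loaded (microbenchmark stores its raw timings in nanoseconds in a time column):

benchmark %>%
  as.data.frame() %>%
  group_by(expr) %>%
  summarise(
    min_seconds    = min(time) / 1e9,
    median_seconds = median(time) / 1e9
  )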

As we can see from the benchmark, data_table_fread() had both the shortest median time and the shortest minimum time, making it the winner of this benchmark.
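The gap can also be expressed as ratios rather than seconds: microbenchmark's print method accepts a unit argument, and unit = "relative" scales all timings to the fastest expression:

print(benchmark, unit = "relative")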

Thanks for reading this post!
