Imputing missing values in time series

I was recently looking into some open data from the City Bikes in Bergen, Norway, when I discovered something weird: data from the last day of each month was often missing. However, this proved to be the perfect opportunity to try out a package I've been wanting to test for a while, namely {imputeTS}. Let's try to impute the missing observations using this package.

Downloading the data

First, let’s import some packages we need.

library(dplyr)
library(imputeTS)
library(ggplot2)

Next, we’ll import the data.

The data is located here, and is stored in one CSV file per month. As of December 2020 there are 31 files. I found that the best way to download many CSV files from the web is to map over them with {purrr} and read each one with fread from {data.table} (see my benchmark here).

Note that if you want to download only one or a few of these months, you should check out the package {bysykkel}, which was made for this.

base_url <- "http://data.urbansharing.com/bergenbysykkel.no/trips/v1/"

list_urls <- 
  tidyr::expand_grid(year = seq(2018, 2020), month = seq(1, 12)) %>% 
  # We only have data from June 2018 and onwards
  filter(!(year == 2018 & month < 6)) %>% 
  # I'll skip December 2020 also, as it only contains a few days
  filter(!(year == 2020 & month == 12)) %>% 
  mutate(url = paste0(base_url, year, "/", sprintf('%0.2d', month), ".csv")) %>% 
  pull(url)

df_raw <- purrr::map_dfr(list_urls, data.table::fread) 

Let’s take a look at the structure of the data (normally I like to use skimr::skim(), but dplyr::glimpse() looks better online).

glimpse(df_raw)
## Rows: 2,029,260
## Columns: 13
## $ started_at                <dttm> 2018-06-29 10:45:12, 2018-06-29 10:53:59, …
## $ ended_at                  <dttm> 2018-06-29 11:05:10, 2018-06-29 11:11:28, …
## $ duration                  <int> 1197, 1048, 490, 478, 1779, 506, 518, 116, …
## $ start_station_id          <int> 3, 49, 58, 157, 58, 3, 132, 75, 157, 34, 83…
## $ start_station_name        <chr> "Grieghallen", "Studentsenteret UIB", "Tårn…
## $ start_station_description <chr> " ", " ", " ", "Høyteknologisenteret", " ",…
## $ start_station_latitude    <dbl> 60.38817, 60.38720, 60.39376, 60.38225, 60.…
## $ start_station_longitude   <dbl> 5.328335, 5.322980, 5.321792, 5.332332, 5.3…
## $ end_station_id            <int> 83, 75, 157, 83, 116, 82, 34, 75, 83, 58, 2…
## $ end_station_name          <chr> "Bergen jernbanestasjon", "Akvariet", "Høyt…
## $ end_station_description   <chr> " ", " ", "Høyteknologisenteret", " ", "Gam…
## $ end_station_latitude      <dbl> 60.39032, 60.39988, 60.38225, 60.39032, 60.…
## $ end_station_longitude     <dbl> 5.332440, 5.304686, 5.332332, 5.332440, 5.3…

As we can see, we have over 2 million rows and 13 columns.

Cleaning the data

Since full days are missing from the data, we aggregate it to the number of rides per day.

Further, to work with {imputeTS}, the missing dates must be explicit. That is, they should be encoded as NA in our data instead of simply not showing up. We therefore use {tidyr} to make the missing values explicit.

df <- 
  df_raw %>%
  # Aggregate to the number of rides per day
  mutate(date_started = lubridate::as_date(started_at)) %>% 
  count(date_started) %>% 
  # Make the missing dates explicit as NA rows
  tidyr::complete(date_started = tidyr::full_seq(date_started, period = 1))

Locating missing values

OK, let’s see what {imputeTS} can tell us about this dataset.

statsNA(df$n)
## [1] "Length of time series:"
## [1] 885
## [1] "-------------------------"
## [1] "Number of Missing Values:"
## [1] 22
## [1] "-------------------------"
## [1] "Percentage of Missing Values:"
## [1] "2.49%"
## [1] "-------------------------"
## [1] "Number of Gaps:"
## [1] 22
## [1] "-------------------------"
## [1] "Average Gap Size:"
## [1] 1
## [1] "-------------------------"
## [1] "Stats for Bins"
## [1] "  Bin 1 (222 values from 1 to 222) :      1 NAs (0.45%)"
## [1] "  Bin 2 (222 values from 223 to 444) :      6 NAs (2.7%)"
## [1] "  Bin 3 (222 values from 445 to 666) :      8 NAs (3.6%)"
## [1] "  Bin 4 (219 values from 667 to 885) :      7 NAs (3.2%)"
## [1] "-------------------------"
## [1] "Longest NA gap (series of consecutive NAs)"
## [1] "1 in a row"
## [1] "-------------------------"
## [1] "Most frequent gap size (series of consecutive NA series)"
## [1] "1 NA in a row (occuring 22 times)"
## [1] "-------------------------"
## [1] "Gap size accounting for most NAs"
## [1] "1 NA in a row (occuring 22 times, making up for overall 22 NAs)"
## [1] "-------------------------"
## [1] "Overview NA series"
## [1] "  1 NA in a row: 22 times"

From this we can see that we have 22 missing values (or about 2.5% of our data). The longest NA gap is 1 observation in a row, meaning that we never have two consecutive days of missing data.

Next, we plot the missing values using imputeTS::ggplot_na_distribution().

imputeTS::ggplot_na_distribution(
  x = df$n, 
  x_axis_labels = df$date_started,
  size_points = 1
) 

I like this plot, but I don't think the default scales are ideal for interpreting such a long time series, so I will adjust them. Since I'll have to do this for every plot from {imputeTS}, I'll write a small helper function for it.

adjust_scales <- function(gg) {
  
  # Add readable axis breaks, labels and titles to a ggplot from {imputeTS}
  gg + 
    scale_y_continuous(labels = scales::number) + 
    scale_x_date(
      breaks = pretty(df$date_started, n = 10), 
      labels = scales::date_format("%Y-%m")
    ) + 
    labs(x = "Date", y = "Number of rides") +
    theme(axis.text.x = element_text(angle = 30, hjust = 1))
  
}

Let’s plot the missing values again:

imputeTS::ggplot_na_distribution(
  x = df$n, 
  x_axis_labels = df$date_started,
  size_points = 1
) %>% 
  adjust_scales()

Apart from one day in 2018, the missing days seem to be rather regularly spaced from about halfway through 2019 onward.

Imputing the missing values

{imputeTS} offers the following algorithms for imputing missing values:

Function          Description
na_replace        Replace Missing Values by a Defined Value
na_random         Missing Value Imputation by Random Sample
na_mean           Missing Value Imputation by Mean Value
na_ma             Missing Value Imputation by Weighted Moving Average
na_interpolation  Missing Value Imputation by Interpolation
na_kalman         Missing Value Imputation by Kalman Smoothing
na_locf           Missing Value Imputation by Last Observation Carried Forward
na_seadec         Seasonally Decomposed Missing Value Imputation
na_seasplit       Seasonally Splitted Missing Value Imputation

In addition, one can use imputeTS::na_remove() to remove the missing values.
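
Note that na_remove() simply drops the NAs, returning a shorter vector rather than a filled-in series, so the result no longer lines up with the dates:

n_no_na <- imputeTS::na_remove(df$n)
length(n_no_na) # 863: the 885 observations minus the 22 NAs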

I will now go through them all and use the function ggplot_na_imputations() to plot the imputed values.

Since I will be using the same function over and over again with mostly the same options, I'll write a wrapper around it to avoid repeating myself:

plot_imputations <- function(imputed_values) {
  
  # Plot the original series with the imputed points highlighted
  ggplot_na_imputations(
    x_with_na = df$n, 
    x_with_imputations = imputed_values, 
    size_points = 1, 
    size_imputations = 2, 
    x_axis_labels = df$date_started
  ) %>% 
    adjust_scales()
  
}

Replace with defined value

Often NA means a specific value, such as 0 for numerical data. Let’s try replacing all missing values with 0.

imputeTS::na_replace(df$n, fill = 0) %>% 
  plot_imputations()

This doesn't look plausible for this data: there is no reason why there should be zero city bike rides on exactly these days.

Random imputation

OK, if a specific value didn’t fit, how about a random one? na_random() imputes the missing values with a random value between the minimum and maximum of the non-missing data.

imputeTS::na_random(df$n) %>% 
  plot_imputations()

Not surprisingly, this doesn't look good either.

Replacing missing time series observations with random values might work if your series closely resembles noise, but otherwise it probably won't.

Mean imputation

Next, let’s try another simple imputation method: replacing missing values with the mean of the non-missing observations.

imputeTS::na_mean(df$n) %>% 
  plot_imputations()

Again, this doesn't look good. Since the number of rides is clearly higher in 2019-2020 than in 2018, the mean overshoots in 2018 and undershoots most of the rest of the time.

While replacing with the mean or median often works well with cross-sectional data, it seldom works with time series.

Moving average imputation

A type of imputation which should work better with time series is a moving average. With na_ma(), {imputeTS} imputes missing values with an average of the non-missing observations before and after the missing observation.

The function lets you choose the window size and the weighting, but I will stick to the defaults: k = 4, which in na_ma() means 4 observations on each side of the missing value, and exponential weighting.

imputeTS::na_ma(df$n) %>% 
  plot_imputations()

As expected, this looks quite good! This is mainly due to the type of time series we have: there seem to be strong seasonal effects in the data, and probably a weekly pattern as well. For this kind of data a moving average should work fine, but a window spanning a full week around each point might capture the weekly pattern even better.
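
As a quick sketch of that idea: in na_ma(), k counts observations on each side of the missing value, so k = 3 makes the window span a full week (three days before, the missing day, and three days after). Simple, unweighted averaging is one possible choice here:

imputeTS::na_ma(df$n, k = 3, weighting = "simple") %>% 
  plot_imputations()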

There is one thing I would like to point out with this method: it uses information about the future to impute the missing value. This is bad news if you are building a prediction model, due to information leakage.

Last observation carried forward imputation

A method that does not use information from the future is na_locf(), Last Observation Carried Forward. This method assumes that a missing observation is equal to the previous day's value.

imputeTS::na_locf(df$n) %>% 
  plot_imputations()

Often, assuming tomorrow will be the same as today is the best we can do. In financial time series, for example, this has been the standard theoretical approach to modeling stock prices for many years (see, e.g., chapter 3.1 of my master's thesis for a review).

For this data it seems to work fine, but I suspect we could do better if we took the weekday or other variables into account.
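
As an aside, na_locf() also has a mirror-image variant, option = "nocb" (Next Observation Carried Backward), which fills each gap with the following day's value instead. Note that this variant does use information from the future:

imputeTS::na_locf(df$n, option = "nocb") %>% 
  plot_imputations()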

Interpolation imputation

The next method is interpolation. It assumes that a missing value lies somewhere between the observations before and after it. The function offers several interpolation algorithms, but I will use the default: linear.

imputeTS::na_interpolation(df$n) %>% 
  plot_imputations()

This seems to work well, just as the moving average did. However, I have the same reservation as before: it uses future values, and thus leaks information.
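
For completeness, the option argument also accepts "spline" and "stine" for smoother, non-linear interpolation. A quick sketch of the spline variant:

imputeTS::na_interpolation(df$n, option = "spline") %>% 
  plot_imputations()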

Kalman smoothing imputation

The next method uses Kalman smoothing to impute the missing values. To be honest, I do not know much about this method, but it seems to work well on this data:

imputeTS::na_kalman(df$n) %>% 
  plot_imputations()
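
By default, na_kalman() runs the smoother on a structural time series model fitted with StructTS(). It can also use the state-space representation of an ARIMA model instead; a sketch of that variant:

imputeTS::na_kalman(df$n, model = "auto.arima") %>% 
  plot_imputations()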

Seasonally decomposed and seasonally split missing value imputation

The next two methods are quite similar, so I will comment on them together. Both remove the seasonality from the time series before imputing, and then add the seasonal component back in afterwards.

na_seadec() removes the seasonality from the full time series, while na_seasplit() splits the time series into one series per season and imputes the missing observations within each season.

If your data is not a ts object with a given frequency, you can pass the argument find_frequency = TRUE to make the algorithm estimate the frequency of the data.

imputeTS::na_seadec(df$n, find_frequency = TRUE) %>% 
  plot_imputations()

imputeTS::na_seasplit(df$n, find_frequency = TRUE) %>% 
  plot_imputations()

Both algorithms seem to do well on this data.

Conclusion

I liked using {imputeTS} a lot. The methods cover what I need, and the visualizations are good. While I did not show it in the post, the functions can be used within dplyr::mutate() and similar verbs, making them easy to work with.
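
For instance, here is a minimal sketch of imputing directly inside a pipeline (with na_ma() as an arbitrary choice of method):

df_imputed <- 
  df %>% 
  mutate(n_imputed = imputeTS::na_ma(n))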
