TidyTuesday - Beer Production - 2020 - Wk 10
This week’s effort dives into beer production across America. You may find more information at the Alcohol and Tobacco Tax and Trade Bureau (TTB).
This EDA and time series analysis makes no claims to quality, validity, or alignment with typical time series practice. I’m simply messing around with both the TSstudio and forecastLM packages… just to give them some love.
library(tidyverse)
library(janitor)
library(lubridate)
library(plotly)
# Read in the data.
brewing_materials <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewing_materials.csv')
beer_taxed <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_taxed.csv')
brewer_size <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewer_size.csv')
beer_states <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_states.csv')
Let’s begin with a bit of data prep. I’m most interested in how this looks longitudinally, so I used make_date to combine the year and month into a single date variable, primarily for proper sorting.
# Average monthly usage by material type, dropping the aggregate categories.
hops1 <- brewing_materials %>%
  mutate(date = make_date(year, month)) %>%
  group_by(date, type) %>%
  summarize(avg = mean(month_current)) %>%
  filter(!type %in% c('Total Grain products', 'Other', 'Total Non-Grain products', 'Total Used'))
time_series1 <- hops1 %>%
  ggplot(aes(x = date, y = avg, group = type)) +
  geom_line(aes(color = type)) +
  theme(axis.text.x = element_text(angle = 90))
time_series1
Clearly something does not look right after 2016. From what I can find, it appears they changed the reporting criteria or structure. Although I was looking for some type of diverging trend to indicate a change in the popularity of beer types (my hypothesis), I’m not sure that is evident. The Hops (dry) variable obviously does something crazy in 2015-2016, but I’m not sure I trust that much of an increase between 2014 and 2015 either.
Nevertheless, let’s keep going with an effort now toward modeling this time series between 2008 and the end of 2013 (somewhat clean and valid data). Just to zoom in a bit, I’d like to examine the seasonality of Malt and malt products.
hops2 <- brewing_materials %>%
  mutate(date = make_date(year, month)) %>%
  filter(date >= '2008-01-01' & date <= '2013-12-01') %>%
  group_by(date, type) %>%
  summarize(avg = mean(month_current)) %>%
  ungroup() %>%
  filter(type == 'Malt and malt products') %>%
  mutate(date = ymd(date, truncated = 1))
Let’s take a look from just this vantage point.
time_series2 <- hops2 %>%
  ggplot(aes(x = date, y = avg, group = type)) +
  geom_line(aes(color = type)) +
  theme(axis.text.x = element_text(angle = 90))
time_series2
class(hops2)
## [1] "tbl_df" "tbl" "data.frame"
To model this, I’d like to push this time series into a tsibble to classify it as temporal data.
library(tsibble)
tsibble <- as_tsibble(hops2, index = date, key = type)
class(tsibble$date)
## [1] "Date"
head(tsibble$date)
## [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"
## [6] "2008-06-01"
I fought with this for a good while. Lesson learned. Although tsibble$date identifies as a 'Date' variable, it is not at the correct interval. Let’s take a look. Notice the structure of the date variable below: it is in %Y-%m-%d format, so the index is interpreted at a daily interval. In the end, this forecasts daily, not monthly. We must coerce this variable’s interval to monthly.
head(tsibble)
## # A tsibble: 6 x 3 [1D]
## # Key: type [1]
## date type avg
## <date> <chr> <dbl>
## 1 2008-01-01 Malt and malt products 374165152
## 2 2008-02-01 Malt and malt products 355687578
## 3 2008-03-01 Malt and malt products 399855819
## 4 2008-04-01 Malt and malt products 388639443
## 5 2008-05-01 Malt and malt products 411307544
## 6 2008-06-01 Malt and malt products 415161326
In this instance, I knew I wanted to use the forecastLM developmental package, which takes a tsibble or ts object. However, it does not have internal checks to determine whether your data align with the interval you’ve specified. In my case, I did not originally specify an interval for the tsibble object I created, and I ran into problems during forecasting.
How to set a year-month interval in tsibble using the yearmonth function:
tsibble <- tsibble %>%
  mutate(date = yearmonth(date))
class(tsibble$date)
## [1] "yearmonth" "Date"
This allows forecastLM to make predictions on a monthly basis, not daily…
class(tsibble)
## [1] "tbl_ts" "tbl_df" "tbl" "data.frame"
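As an extra sanity check, tsibble can report the index interval directly. A minimal sketch, assuming tsibble’s interval() helper; after the yearmonth() conversion I’d expect it to report a one-month interval rather than the one-day interval from before:
tsibble::interval(tsibble) # expect a 1-month interval now, not 1-day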
Let’s visualize using the TSstudio package from Rami Krispin, which takes a ts object as input for its graphics. So let’s make a ts object, as opposed to the tsibble.
library(TSstudio)
seasonal <- ts(data = tsibble$avg, start = c(2008, 1), end = c(2013, 12), frequency = 12)
ts_plot(seasonal,
        line.mode = "lines+markers",
        Ygrid = TRUE, slider = TRUE,
        title = "Malts Over Time",
        Ytitle = "Millions of Barrels")
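To confirm the ts object is aligned as intended, TSstudio’s ts_info() gives a quick summary; a minimal sketch (I’d expect a monthly series with frequency 12 spanning 2008-2013):
ts_info(seasonal) # prints the series class, frequency, and time span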
To better understand the seasonality, we can examine the autocorrelation and partial autocorrelation functions via their correlograms (such a great name!).
ts_cor(seasonal, lag.max = 36)
In this case (assuming approximate stationarity; I will not be venturing into differencing in this post), it is evident that this is an autocorrelated series with a clear declining trend, as shown by the ACF plot’s gradual decay over increasing lags. There also seems to be a higher-order autoregressive pattern in the data. To examine it we can use the PACF plot, which (essentially) shows the correlation at each lag after removing the effects of shorter lags; here it suggests that most of the autocorrelation across seasons stems from the strong lag-1 autocorrelation. We will keep this lag in the model.
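If you’d like to cross-check the correlograms outside of TSstudio, the base-R equivalents are a minimal sketch away:
# Base-R counterparts to ts_cor(); same series, same maximum lag.
acf(seasonal, lag.max = 36)   # autocorrelation function
pacf(seasonal, lag.max = 36)  # partial autocorrelation function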
ts_seasonal(seasonal, type = 'all')
From this, it appears there is a strong seasonality, with a bit of obvious variability. However, there does appear to be a noticeable decline across the entire year as time moves on. This trend is also evident in the plot above, where the series declines fairly linearly over time.
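Another way to separate that steady decline from the seasonal swing is a classical decomposition. A quick sketch using TSstudio’s ts_decompose (base R’s decompose() would do the same job):
ts_decompose(seasonal) # splits the series into trend, seasonal, and random components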
I’ll use the developmental R package forecastLM to examine the seasonality. You can find more details here.
Initial model.
library(forecastLM)
lm1 <- trainLM(input = tsibble,
               y = "avg",
               trend = list(linear = TRUE),
               seasonal = "month")
summary(lm1$model)
##
## Call:
## stats::lm(formula = f, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23430061 -8317270 439206 9092474 21751959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 360070174 5173988 69.592 < 2e-16 ***
## monthFeb -14381818 6706028 -2.145 0.0361 *
## monthMar 32469918 6707026 4.841 9.68e-06 ***
## monthApr 28487043 6708689 4.246 7.81e-05 ***
## monthMay 44733730 6711016 6.666 9.93e-09 ***
## monthJun 49653838 6714006 7.396 5.78e-10 ***
## monthJul 41669588 6717660 6.203 5.94e-08 ***
## monthAug 29131656 6721974 4.334 5.79e-05 ***
## monthSep 3273486 6726950 0.487 0.6283
## monthOct -2585210 6732584 -0.384 0.7024
## monthNov -30450898 6738876 -4.519 3.04e-05 ***
## monthDec -31600007 6745823 -4.684 1.70e-05 ***
## linear_trend -701225 66790 -10.499 4.02e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11610000 on 59 degrees of freedom
## Multiple R-squared: 0.9032, Adjusted R-squared: 0.8835
## F-statistic: 45.87 on 12 and 59 DF, p-value: < 2.2e-16
To see how much an AR(1) term can help, we can add a lag.
lm2 <- trainLM(input = tsibble,
               y = "avg",
               trend = list(linear = TRUE),
               seasonal = "month",
               lags = c(1))
summary(lm2$model)
##
## Call:
## stats::lm(formula = f, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20266688 -7569525 480062 7395374 26354611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.492e+08 4.140e+07 6.019 1.34e-07 ***
## monthFeb -2.274e+07 7.957e+06 -2.858 0.005945 **
## monthMar 2.881e+07 7.126e+06 4.042 0.000160 ***
## monthApr 9.417e+06 1.075e+07 0.876 0.384768
## monthMay 2.695e+07 1.036e+07 2.600 0.011851 *
## monthJun 2.651e+07 1.200e+07 2.208 0.031256 *
## monthJul 1.688e+07 1.252e+07 1.348 0.182990
## monthAug 6.942e+06 1.169e+07 0.594 0.554926
## monthSep -1.482e+07 1.043e+07 -1.421 0.160833
## monthOct -1.222e+07 8.198e+06 -1.490 0.141637
## monthNov -3.819e+07 7.793e+06 -4.900 8.31e-06 ***
## monthDec -3.021e+07 6.677e+06 -4.525 3.12e-05 ***
## linear_trend -4.461e+05 1.111e+05 -4.016 0.000175 ***
## lag_1 3.283e-01 1.256e-01 2.613 0.011447 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10970000 on 57 degrees of freedom
## Multiple R-squared: 0.9157, Adjusted R-squared: 0.8965
## F-statistic: 47.65 on 13 and 57 DF, p-value: < 2.2e-16
plot_res(lm2)
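Before moving on, a rough comparison of the two fits on information criteria can complement the adjusted R-squared values above. A minimal sketch (note that lm2 drops its first observation to build the lag, so the AIC comparison is only approximate):
# Compare the underlying stats::lm fits; lower AIC is better.
AIC(lm1$model, lm2$model)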
The forecastLM package also allows us to apply transformations, use a stepwise approach, and add in unique events or outliers. Although I don’t show it here, there is also an option to take a spline regression approach.
events <- list(outlier = c(as_date("2009-01-01"), as_date("2009-09-01"),
                           as_date("2010-05-01"), as_date("2010-12-01"),
                           as_date("2014-12-01"), as_date("2011-09-01")))
lm3 <- trainLM(input = seasonal,
               y = "avg",
               trend = list(log = TRUE),
               seasonal = "month",
               lags = c(1),
               step = TRUE,
               events = events)
## Start: AIC=2305.78
## seasonal ~ outlier + month + log_trend + lag_1
##
## Df Sum of Sq RSS AIC
## <none> 5.9128e+15 2305.8
## - lag_1 1 1.8686e+14 6.0997e+15 2306.0
## - outlier 1 2.2630e+14 6.1391e+15 2306.4
## - log_trend 1 2.6752e+15 8.5880e+15 2330.3
## - month 11 2.8860e+16 3.4773e+16 2409.6
summary(lm3$model)
##
## Call:
## stats::lm(formula = seasonal ~ outlier + month + log_trend +
## lag_1, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23297398 -6996325 1617514 6189965 23003813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.284e+08 4.814e+07 6.821 6.82e-09 ***
## outlier 7.745e+06 5.290e+06 1.464 0.148788
## monthFeb -1.936e+07 7.494e+06 -2.584 0.012416 *
## monthMar 3.080e+07 6.724e+06 4.580 2.64e-05 ***
## monthApr 1.921e+07 1.055e+07 1.821 0.073998 .
## monthMay 3.520e+07 1.010e+07 3.486 0.000962 ***
## monthJun 3.882e+07 1.195e+07 3.247 0.001971 **
## monthJul 3.013e+07 1.253e+07 2.405 0.019524 *
## monthAug 1.905e+07 1.166e+07 1.634 0.107799
## monthSep -7.180e+06 1.023e+07 -0.702 0.485840
## monthOct -5.984e+06 7.940e+06 -0.754 0.454228
## monthNov -3.288e+07 7.501e+06 -4.383 5.21e-05 ***
## monthDec -3.056e+07 6.238e+06 -4.899 8.60e-06 ***
## log_trend -1.459e+07 2.899e+06 -5.034 5.31e-06 ***
## lag_1 1.733e-01 1.303e-01 1.330 0.188805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10280000 on 56 degrees of freedom
## Multiple R-squared: 0.9274, Adjusted R-squared: 0.9093
## F-statistic: 51.11 on 14 and 56 DF, p-value: < 2.2e-16
plot_res(lm3)
Now that we have a prediction model that accounts for roughly 90% of the variability, we can forecast ahead to see what this series might have looked like with somewhat clean data and consistent reporting.
# Using the lm3 prediction model.
fc3 <- forecastLM(lm3, h = 24)
# Plot it.
plot_fc(fc3, theme = 'classic')
Again, this was just a venture into using these two packages, not a full EDA or true statistical analysis. In all honesty, I do not often inspect seasonality or such trends in my own work, so I’m still learning.
Cheers.