TidyTuesday - Beer Production - 2020 - Wk 14

This week’s effort dives into beer production across America. You can find more information at the Alcohol and Tobacco Tax and Trade Bureau (TTB).

This EDA and time series analysis doesn’t aim for quality or validity, nor does it follow typical time series practice. I’m simply messing around with the TSstudio and forecastLM packages… just to give them some love.

library(tidyverse)
library(janitor)
library(lubridate)
library(plotly)

# Read in the data. 
brewing_materials <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewing_materials.csv')
beer_taxed <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_taxed.csv')
brewer_size <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewer_size.csv')
beer_states <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_states.csv')

Let’s begin with a bit of data prep. I’m most interested in how this looks longitudinally, so I used make_date() to combine the year and month into a date variable, primarily for proper sorting.

hops1 <- brewing_materials %>%
  mutate(date = make_date(year, month)) %>%  # combine year + month into a Date
  group_by(date, type) %>%
  summarize(avg = mean(month_current)) %>%
  filter(!type %in% c('Total Grain products', 'Other', 'Total Non-Grain products', 'Total Used'))  # drop the aggregate categories

time_series1 <- hops1 %>%
  ggplot(aes(x = date, y = avg, group = type)) +
  geom_line(aes(color = type)) +
  theme(axis.text.x = element_text(angle = 90))
time_series1
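
Since plotly is already loaded, one nice option (a small extra, not part of the output shown here) is to make the chart interactive with ggplotly():

# Convert the static ggplot into an interactive plotly chart
ggplotly(time_series1)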

Clearly something does not look right after 2016. From what I can find, it appears they changed the reporting criteria or structure. Although I was looking for some type of diverging trend to indicate a change in the popularity of beer types (my hypothesis), I’m not sure that is evident. The Hops (dry) variable obviously does something crazy in 2015-2016, but I’m not sure I trust that much of an increase between 2014 and 2015 either.

Nevertheless, let’s keep going, turning now toward modeling this time series between 2008 and the end of 2013 (somewhat clean-ish/valid data). Just to zoom in a bit, I’d like to examine the seasonality of Malt and malt products.

hops2 <- brewing_materials %>%
  mutate(date = make_date(year, month)) %>%
  filter(date >= '2008-01-01' & date <= '2013-12-01') %>%  # keep the cleaner window
  group_by(date, type) %>%
  summarize(avg = mean(month_current)) %>%
  ungroup() %>%
  filter(type == 'Malt and malt products') %>%
  mutate(date = ymd(date, truncated = 1))  # keep date as a proper Date (day = 1)

Let’s take a look from just this vantage point.

time_series2 <- hops2 %>%
  ggplot(aes(x = date, y = avg, group = type)) + 
  geom_line(aes(color = type)) + 
  theme(axis.text.x = element_text(angle = 90))
time_series2

class(hops2)
## [1] "tbl_df"     "tbl"        "data.frame"

To model this, I’d like to convert this time series to a tsibble to mark it explicitly as temporal data.

library(tsibble)
tsibble <- as_tsibble(hops2, index = date, key = type)
class(tsibble$date)
## [1] "Date"
head(tsibble$date)
## [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"
## [6] "2008-06-01"

I fought with this for a good while. Lesson learned. Although this tsibble$date variable identifies as a ‘Date’ variable, it is not at the correct interval.

Let’s take a look. Notice the structure of the date variable below: it is in `%Y-%m-%d` format.

As a result, this forecasts daily, not monthly. We must coerce this variable’s interval to monthly.

head(tsibble)
## # A tsibble: 6 x 3 [1D]
## # Key:       type [1]
##   date       type                         avg
##   <date>     <chr>                      <dbl>
## 1 2008-01-01 Malt and malt products 374165152
## 2 2008-02-01 Malt and malt products 355687578
## 3 2008-03-01 Malt and malt products 399855819
## 4 2008-04-01 Malt and malt products 388639443
## 5 2008-05-01 Malt and malt products 411307544
## 6 2008-06-01 Malt and malt products 415161326

In this instance, I knew I wanted to use the forecastLM developmental package, which takes a tsibble or ts object. However, it does not have internal checks to confirm that your data align with the interval you’ve specified.

In my case, I did not originally specify an interval for the tsibble object I created, and I ran into problems during forecasting.
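
A quick sanity check with tsibble’s interval() helper would have caught this up front; it reports the interval the object registered (a daily 1D here, rather than the monthly 1M we want):

# Report the interval tsibble registered for the index (1D = daily)
interval(tsibble)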

Here is how to set a year-month interval in a tsibble using the yearmonth() function:

tsibble <- tsibble %>%
  mutate(date = yearmonth(date))
class(tsibble$date)
## [1] "yearmonth" "Date"

This allows forecastLM to make predictions on a monthly basis, not daily…

class(tsibble)
## [1] "tbl_ts"     "tbl_df"     "tbl"        "data.frame"

Let’s visualize using the TSstudio package from Rami Krispin, whose plotting functions take a ts object as input. So let’s make a ts object, as opposed to the tsibble.

library(TSstudio)
# Build a monthly ts object covering Jan 2008 through Dec 2013
seasonal <- ts(data = tsibble$avg, start = c(2008, 1), end = c(2013, 12), frequency = 12)
ts_plot(seasonal,
        line.mode = "lines+markers",
        Ygrid = TRUE, slider = TRUE,
        title = "Malts Over Time",
        Ytitle = "Millions of Barrels")

To better understand the seasonality, we can examine the autocorrelation and partial autocorrelation functions via their correlograms (such a great name!).

ts_cor(seasonal, lag.max = 36)

In this case (assuming approximate stationarity; I will not be venturing into differencing in this post), the ACF plot’s steady decline over the lags indicates an autocorrelated series with a clear downward trend. There also seems to be a higher-order autoregressive term in the data.

To examine this term we can use the PACF plot, which (essentially) looks at the correlation left over after shorter lags are accounted for. It suggests that most of the autocorrelation across seasons comes from the strong lag-1 autocorrelation, so we will keep a lag-1 term in the model.
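
If you prefer base R, the same correlograms are available from the stats package:

# Base-R equivalents of the correlograms above
acf(seasonal, lag.max = 36)
pacf(seasonal, lag.max = 36)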

ts_seasonal(seasonal, type = 'all')

From this, it appears there is strong seasonality, with a bit of obvious variability. However, there is a noticeable decline across the entire year as time moves on. This trend is also evident in the plot above, where the decline over time appears fairly linear.
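
To pull that trend apart from the seasonal swings, a classical decomposition is a quick check (base R’s decompose() works here; TSstudio’s ts_decompose() offers an interactive take on the same idea):

# Classical additive decomposition into trend, seasonal, and remainder
plot(decompose(seasonal))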

I’ll use the developmental R package forecastLM to examine the seasonality. You can find more details here.

Initial model.

library(forecastLM)
lm1 <- trainLM(input = tsibble,
               y = "avg",
               trend = list(linear = TRUE),
               seasonal = "month")
summary(lm1$model)
## 
## Call:
## stats::lm(formula = f, data = df1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -23430061  -8317270    439206   9092474  21751959 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  360070174    5173988  69.592  < 2e-16 ***
## monthFeb     -14381818    6706028  -2.145   0.0361 *  
## monthMar      32469918    6707026   4.841 9.68e-06 ***
## monthApr      28487043    6708689   4.246 7.81e-05 ***
## monthMay      44733730    6711016   6.666 9.93e-09 ***
## monthJun      49653838    6714006   7.396 5.78e-10 ***
## monthJul      41669588    6717660   6.203 5.94e-08 ***
## monthAug      29131656    6721974   4.334 5.79e-05 ***
## monthSep       3273486    6726950   0.487   0.6283    
## monthOct      -2585210    6732584  -0.384   0.7024    
## monthNov     -30450898    6738876  -4.519 3.04e-05 ***
## monthDec     -31600007    6745823  -4.684 1.70e-05 ***
## linear_trend   -701225      66790 -10.499 4.02e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11610000 on 59 degrees of freedom
## Multiple R-squared:  0.9032, Adjusted R-squared:  0.8835 
## F-statistic: 45.87 on 12 and 59 DF,  p-value: < 2.2e-16

To see how much an AR(1) term can help, we can put a lag in.

lm2 <- trainLM(input = tsibble,
               y = "avg",
               trend = list(linear = TRUE),
               seasonal = "month",
               lags = c(1))
summary(lm2$model)
## 
## Call:
## stats::lm(formula = f, data = df1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -20266688  -7569525    480062   7395374  26354611 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.492e+08  4.140e+07   6.019 1.34e-07 ***
## monthFeb     -2.274e+07  7.957e+06  -2.858 0.005945 ** 
## monthMar      2.881e+07  7.126e+06   4.042 0.000160 ***
## monthApr      9.417e+06  1.075e+07   0.876 0.384768    
## monthMay      2.695e+07  1.036e+07   2.600 0.011851 *  
## monthJun      2.651e+07  1.200e+07   2.208 0.031256 *  
## monthJul      1.688e+07  1.252e+07   1.348 0.182990    
## monthAug      6.942e+06  1.169e+07   0.594 0.554926    
## monthSep     -1.482e+07  1.043e+07  -1.421 0.160833    
## monthOct     -1.222e+07  8.198e+06  -1.490 0.141637    
## monthNov     -3.819e+07  7.793e+06  -4.900 8.31e-06 ***
## monthDec     -3.021e+07  6.677e+06  -4.525 3.12e-05 ***
## linear_trend -4.461e+05  1.111e+05  -4.016 0.000175 ***
## lag_1         3.283e-01  1.256e-01   2.613 0.011447 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10970000 on 57 degrees of freedom
## Multiple R-squared:  0.9157, Adjusted R-squared:  0.8965 
## F-statistic: 47.65 on 13 and 57 DF,  p-value: < 2.2e-16
plot_res(lm2)

The forecastLM package also allows us to apply transformations, use a stepwise approach, and add unique events or outliers. Although I don’t show it here, there is also an option to take a spline regression approach.

events <- list(outlier = c(as_date("2009-01-01"), as_date("2009-09-01"),
                           as_date("2010-05-01"), as_date("2010-12-01"),
                           as_date("2014-12-01"), as_date("2011-09-01")))
lm3 <- trainLM(input = seasonal,
               y = "avg",
               trend = list(log = TRUE),
               seasonal = "month",
               lags = c(1),
               step = TRUE,
               events = events)
## Start:  AIC=2305.78
## seasonal ~ outlier + month + log_trend + lag_1
## 
##             Df  Sum of Sq        RSS    AIC
## <none>                    5.9128e+15 2305.8
## - lag_1      1 1.8686e+14 6.0997e+15 2306.0
## - outlier    1 2.2630e+14 6.1391e+15 2306.4
## - log_trend  1 2.6752e+15 8.5880e+15 2330.3
## - month     11 2.8860e+16 3.4773e+16 2409.6
summary(lm3$model)
## 
## Call:
## stats::lm(formula = seasonal ~ outlier + month + log_trend + 
##     lag_1, data = df1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -23297398  -6996325   1617514   6189965  23003813 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.284e+08  4.814e+07   6.821 6.82e-09 ***
## outlier      7.745e+06  5.290e+06   1.464 0.148788    
## monthFeb    -1.936e+07  7.494e+06  -2.584 0.012416 *  
## monthMar     3.080e+07  6.724e+06   4.580 2.64e-05 ***
## monthApr     1.921e+07  1.055e+07   1.821 0.073998 .  
## monthMay     3.520e+07  1.010e+07   3.486 0.000962 ***
## monthJun     3.882e+07  1.195e+07   3.247 0.001971 ** 
## monthJul     3.013e+07  1.253e+07   2.405 0.019524 *  
## monthAug     1.905e+07  1.166e+07   1.634 0.107799    
## monthSep    -7.180e+06  1.023e+07  -0.702 0.485840    
## monthOct    -5.984e+06  7.940e+06  -0.754 0.454228    
## monthNov    -3.288e+07  7.501e+06  -4.383 5.21e-05 ***
## monthDec    -3.056e+07  6.238e+06  -4.899 8.60e-06 ***
## log_trend   -1.459e+07  2.899e+06  -5.034 5.31e-06 ***
## lag_1        1.733e-01  1.303e-01   1.330 0.188805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10280000 on 56 degrees of freedom
## Multiple R-squared:  0.9274, Adjusted R-squared:  0.9093 
## F-statistic: 51.11 on 14 and 56 DF,  p-value: < 2.2e-16
plot_res(lm3)

Now that we have a prediction model that accounts for roughly 90% of the variability, we can see what this series might have looked like with somewhat clean data or consistent reporting.

# Using the lm3 prediction model. 
fc3 <- forecastLM(lm3, h = 24)
# Plot it.
plot_fc(fc3, theme = 'classic')

Again, this was just a venture into using these two packages, not a full EDA or true statistical analysis. In all honesty, I do not often inspect seasonality or such trends in my own work, so I’m still learning.

Cheers.
