TidyTuesday - NFL Data - 2020 - Wk 6

library(tidyverse)
library(janitor)
library(readxl)
#library(tidylog)
library(skimr)
library(knitr)

This week’s tidytuesday data comes from Pro Football Reference and includes attendance, standings, and game stats for each game. Well do a quick EDA and generate a few ideas of what might be interesting to look at. As a note, I stopped this one once I started to dive into predictives, as I simply ran out of time for this week. I’ll catch this up one day…. perhaps.

# Import the data from tidytuesday: https://github.com/rfordatascience/tidytuesday

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')

First, lets take a look at the data and see what it can tell us at face value.


The attendance dataset


kable(head(attendance))
teamteam_nameyeartotalhomeawayweekweekly_attendance
ArizonaCardinals2000893926387475506451177434
ArizonaCardinals2000893926387475506451266009
ArizonaCardinals20008939263874755064513NA
ArizonaCardinals2000893926387475506451471801
ArizonaCardinals2000893926387475506451566985
ArizonaCardinals2000893926387475506451644296
attendance %>%
  skim()
Table 1: Data summary
NamePiped data
Number of rows10846
Number of columns8
_______________________
Column type frequency:
character2
numeric6
________________________
Group variablesNone

Variable type: character

skim_variablen_missingcomplete_rateminmaxemptyn_uniquewhitespace
team015130320
team_name014100320

Variable type: numeric

skim_variablen_missingcomplete_ratemeansdp0p25p50p75p100hist
year01.002009.535.7520002005.020102015.002019▇▇▇▇▇
total01.001080910.0372876.977606441040509.010810901123230.001322087▁▁▇▆▁
home01.00540455.0166774.65202687504360.0543185578342.00741775▁▁▅▇▁
away01.00540455.0125509.33450295524974.0541757557741.00601655▁▂▇▇▂
week01.009.004.9015.0913.0017▇▆▆▆▇
weekly_attendance6380.9467556.889022.022312763245.56833472544.75105121▁▁▇▃▁

We certainly have some interesting statistics here to mess with. It looks like the ‘total,’ ‘home,’ and ‘away’ attendance columns represent the entire season, while the ‘weekly_attendance’ field is for each week. Therefore, it looks like each row is a game. So there should be some variability in how many rows there are amongst the teams, as some make it to the playoff and some do not (therefore they’d have more games). A look at the summary statistics should tell us more… if that is the case or not.

So, as was expected, there does seem to be some variability associated with how many games they play. Not to worry, I’ll probably stay away from analyzing just that, and focus on either more aggregated descriptives or simply other statistics herein.


The standings dataset


kable(head(standings))
teamteam_nameyearwinslosspoints_forpoints_againstpoints_differentialmargin_of_victorystrength_of_schedulesimple_ratingoffensive_rankingdefensive_rankingplayoffssb_winner
MiamiDolphins2000115323226976.11.07.10.07.1PlayoffsNo Superbowl
IndianapolisColts20001064293261036.41.57.97.10.8PlayoffsNo Superbowl
New YorkJets20009732132100.03.53.51.42.2No PlayoffsNo Superbowl
BuffaloBills200088315350-35-2.22.20.00.5-0.5No PlayoffsNo Superbowl
New EnglandPatriots2000511276338-62-3.91.4-2.5-2.70.2No PlayoffsNo Superbowl
TennesseeTitans20001333461911559.7-1.38.31.56.8PlayoffsNo Superbowl
standings %>% 
  skim()
Table 2: Data summary
NamePiped data
Number of rows638
Number of columns15
_______________________
Column type frequency:
character4
numeric11
________________________
Group variablesNone

Variable type: character

skim_variablen_missingcomplete_rateminmaxemptyn_uniquewhitespace
team015130320
team_name014100320
playoffs01811020
sb_winner011213020

Variable type: numeric

skim_variablen_missingcomplete_ratemeansdp0p25p50p75p100hist
year012009.535.762000.02005.002010.02014.752019.0▇▇▇▇▇
wins017.983.080.06.008.010.0016.0▂▆▇▆▂
loss017.983.080.06.008.010.0016.0▂▆▇▆▂
points_for01350.2871.40161.0299.00348.0396.00606.0▂▇▇▂▁
points_against01350.2859.55165.0310.00347.0391.50517.0▁▃▇▆▁
points_differential010.00101.09-261.0-75.001.572.75315.0▂▆▇▅▁
margin_of_victory010.006.32-16.3-4.700.14.5719.7▂▆▇▅▁
strength_of_schedule010.001.63-4.6-1.100.01.204.3▁▅▇▅▁
simple_rating010.006.20-17.4-4.470.04.5020.1▁▆▇▅▁
offensive_ranking010.004.34-11.7-3.180.02.7015.9▁▇▇▂▁
defensive_ranking010.003.57-9.8-2.400.12.509.8▁▅▇▅▁

So, look at this snapshot of the data, there seems to be all kinds of interesting statistics between attendance and what appears to be by team year statistics.


The games dataset


kable(head(games))
yearweekhome_teamaway_teamwinnertiedaydatetimepts_winpts_lossyds_winturnovers_winyds_lossturnovers_losshome_team_namehome_team_cityaway_team_nameaway_team_city
20001Minnesota VikingsChicago BearsMinnesota VikingsNASunSeptember 313:00:00302737414251VikingsMinnesotaBearsChicago
20001Kansas City ChiefsIndianapolis ColtsIndianapolis ColtsNASunSeptember 313:00:00271438622801ChiefsKansas CityColtsIndianapolis
20001Washington RedskinsCarolina PanthersWashington RedskinsNASunSeptember 313:01:00201739602361RedskinsWashingtonPanthersCarolina
20001Atlanta FalconsSan Francisco 49ersAtlanta FalconsNASunSeptember 313:02:00362835913391FalconsAtlanta49ersSan Francisco
20001Pittsburgh SteelersBaltimore RavensBaltimore RavensNASunSeptember 313:02:0016033602231SteelersPittsburghRavensBaltimore
20001Cleveland BrownsJacksonville JaguarsJacksonville JaguarsNASunSeptember 313:02:0027739802491BrownsClevelandJaguarsJacksonville

From this vantage, it looks like I could connect the team name from the attendance data to the home team of the games data. To do so, I’ll have to make a key that matches. Most readily, it looks like I’ll need to combine a few the ‘team’ and ‘team_name’ columns in the attendance data.

att_reshape <- attendance %>% 
  mutate(t_name = str_c(team, team_name, sep = " ")) %>%
  select(-team, -team_name)

# to merge these datasets, I needed 'week' to be numeric on both... 
# and for whatever reason it was a character field to begin with...
games <- games %>%
  mutate(week = as.numeric(week))

att_games <- left_join(att_reshape, games, 
                       by = c('t_name' = 'home_team', 'year' = 'year', 'week' = 'week'))

kable(head(att_games))
yeartotalhomeawayweekweekly_attendancet_nameaway_teamwinnertiedaydatetimepts_winpts_lossyds_winturnovers_winyds_lossturnovers_losshome_team_namehome_team_cityaway_team_nameaway_team_city
2000893926387475506451177434Arizona CardinalsNANANANANANANANANANANANANANANANA
2000893926387475506451266009Arizona CardinalsDallas CowboysArizona CardinalsNASunSeptember 1020:35:00323132213302CardinalsArizonaCowboysDallas
20008939263874755064513NAArizona CardinalsNANANANANANANANANANANANANANANANA
2000893926387475506451471801Arizona CardinalsGreen Bay PackersGreen Bay PackersNASunSeptember 2416:06:0029345512094CardinalsArizonaPackersGreen Bay
2000893926387475506451566985Arizona CardinalsNANANANANANANANANANANANANANANANA
2000893926387475506451644296Arizona CardinalsCleveland BrownsArizona CardinalsNASunOctober 816:15:00292131522400CardinalsArizonaBrownsCleveland

At first glance it looks like there is a bunch of missing data, but because I merged on the home team, almost all the missingness appears to be related to the team is away or if they had a by week. Which makes it less alarming… I was alarmed at first… “did my join work?”

Nevertheless, I think we’re almost at a point to start visualizing stuff. It would be helpful to have a column identifying if a team is home or away… just so I can sort on it later.

att_games_ha <- att_games %>%
  mutate(away_team_ind = if_else(is.na(away_team) & !is.na(weekly_attendance), 1, 0),
         away_team_ind = case_when(
           is.na(weekly_attendance) ~ 9999,
           TRUE ~ away_team_ind))

So, by doing all of this, I can see trends across time of how well a team does in terms of attendance between home and away games.

Lets take a look.

chiefs <- att_games_ha %>%
  filter(t_name == "Kansas City Chiefs") 

chiefs_gg <- chiefs %>%
  filter(away_team_ind != 9999) %>%
  group_by(year, week, away_team_ind) %>%
  summarize(average_att = mean(weekly_attendance)) %>%
  ggplot(aes(x = year, y = average_att)) +
  geom_point() + 
  facet_wrap(~away_team_ind)

chiefs_gg

Well, that’s interesting… but I wonder if the obvious dip in at home attendance between ~2007 and ~2013 is simply a localized artifact to the Chiefs.

all_gg <- att_games_ha %>%
  filter(away_team_ind != 9999) %>%
  group_by(t_name, year, away_team_ind) %>%
  summarize(average_att = mean(weekly_attendance)) %>% ungroup() %>%
  ggplot(aes(x = year, y = average_att, color = as.factor(away_team_ind))) +
  geom_point() + 
  facet_wrap(~t_name) + 
  theme(legend.position = "top")

all_gg

So obviously, each team has a fairly unique pattern of attendance, especially between home and away games. You can also see where some teams got a new stadium that pushed attendance up, such as the Dallas Cowboys.

gg1 <- att_games_ha %>%
  filter(away_team_ind != 9999) %>%
  group_by(t_name, year, week, away_team_ind) %>%
  summarize(average_att = mean(weekly_attendance)) %>%
  ggplot(aes(t_name, average_att, color = as.factor(away_team_ind))) +
  geom_boxplot(outlier.alpha = .5) + 
  coord_flip() + 
  theme(legend.position = "top")

gg1

Although both the previous graph do not exhibit any clear reoccuring or dicernable trends, it makes me wonder what the relationship between home attendance and season performance is. Said another way, does a strong home team crowd during the first half of the season, relate to eventual success?

Lets take a look.

standings <- standings %>%
  mutate(t_name = str_c(team, team_name, sep = " "))
  
predict_1 <- att_games_ha %>%
  filter(!is.na(away_team),
         week < 8) %>%
  group_by(year, t_name) %>%
  summarize(avg.att = mean(weekly_attendance)) %>% ungroup() %>%
  left_join(standings, by = c("year", "t_name")) %>%
  mutate(perc.win = wins/(wins+loss))

cor(predict_1$avg.att, predict_1$perc.win) #this is not promissing, but lets visualize
## [1] 0.09888602
predict_1_gg <- predict_1 %>%
  ggplot(aes(x = perc.win, y = avg.att)) + 
  geom_point() 

predict_1_gg

So much variability. I would assume this is because each team has different stadium sizes, and therefore the y-axis is not relative across teams. Lets see if we can control this.

Let’s first create a function to normalize data within group (year, team)

zscore <- function(m){
  (m - mean(m)) / sd(m)
}

Group it and then lets see how things relate.

# Full season results were used, although home and away games' attendance will count for more than one team at a time. 

predict_2 <- att_games_ha %>%
  # filter(!is.na(away_team)) %>%
  filter(!is.na(weekly_attendance)) %>%
  group_by(year, t_name) %>%
  mutate(avg.att = zscore(weekly_attendance)) %>% ungroup() %>%
  left_join(standings, by = c("year", "t_name")) %>%
  mutate(perc.win = wins/(wins+loss)) %>%
  # filter(week < 8) %>%
  # filter(week > 4 & week < 10) %>% #trying out different perspectives
  # filter(week > 8) %>%
  group_by(year, t_name, perc.win) %>%
  summarize(avg_first_half = mean(avg.att)) %>% ungroup()

Lets take a look now.

predict_2_gg <- predict_2 %>%
  ggplot(aes(x = perc.win, y = avg_first_half)) + 
  geom_point() + 
  geom_smooth(method = "auto")
  
  # geom_point() + 
  # geom_smooth(stat = "smooth")

predict_2_gg

Although first half season attendance does not seem to be extensively predictive, perhaps there are other indicators that could help improve this model. Lets look a bit deeper here.

cor(predict_2$perc.win, predict_2$avg_first_half, method = "pearson")
## [1] 0.0536111
lm1 <- lm(data = predict_2, perc.win ~ avg_first_half)
broom::tidy(lm1)
## # A tibble: 2 x 5
##   term           estimate std.error statistic   p.value
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    5.01e- 1  7.66e- 3     65.4  1.72e-284
## 2 avg_first_half 8.59e+14  6.34e+14      1.35 1.76e-  1
broom::glance(lm1)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1   0.00287       0.00131 0.193      1.83   0.176     2   146. -286. -273.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
lm2 <- lm(data = predict_2, perc.win ~ year + avg_first_half + year*avg_first_half)
broom::tidy(lm2)
## # A tibble: 4 x 5
##   term                 estimate std.error statistic p.value
##   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)          8.43e- 1  2.70e+ 0     0.312   0.755
## 2 year                -1.70e- 4  1.34e- 3    -0.126   0.899
## 3 avg_first_half       1.15e+17  2.34e+17     0.491   0.624
## 4 year:avg_first_half -5.67e+13  1.16e+14    -0.487   0.626
broom::glance(lm2)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1   0.00325      -0.00147 0.193     0.689   0.559     4   146. -283. -260.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
predict_3 <- predict_2 %>%
  left_join(standings, by = c("year", "t_name"))

lm3 <- lm(data = predict_3, perc.win ~ year + avg_first_half + points_for)
broom::tidy(lm3)
## # A tibble: 4 x 5
##   term            estimate std.error statistic   p.value
##   <chr>              <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     8.60e+ 0  1.81e+ 0      4.75 2.55e-  6
## 2 year           -4.39e- 3  9.04e- 4     -4.85 1.53e-  6
## 3 avg_first_half  4.38e+14  4.26e+14      1.03 3.04e-  1
## 4 points_for      2.03e- 3  7.29e- 5     27.9  2.68e-112
broom::glance(lm3)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.552         0.550 0.129      261. 3.77e-110     4   402. -793. -771.
## # … with 2 more variables: deviance <dbl>, df.residual <int>

Related