TidyTuesday - NFL Data - 2020 - Wk 6
library(tidyverse)
library(janitor)
library(readxl)
#library(tidylog)
library(skimr)
library(knitr)
This week’s tidytuesday data comes from Pro Football Reference and includes attendance, standings, and game stats for each game. Well do a quick EDA and generate a few ideas of what might be interesting to look at. As a note, I stopped this one once I started to dive into predictives, as I simply ran out of time for this week. I’ll catch this up one day…. perhaps.
# Import the data from tidytuesday: https://github.com/rfordatascience/tidytuesday
attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')
First, lets take a look at the data and see what it can tell us at face value.
The attendance dataset
kable(head(attendance))
team | team_name | year | total | home | away | week | weekly_attendance |
---|---|---|---|---|---|---|---|
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 1 | 77434 |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 2 | 66009 |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 3 | NA |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 4 | 71801 |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 5 | 66985 |
Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 6 | 44296 |
attendance %>%
skim()
Name | Piped data |
Number of rows | 10846 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2009.53 | 5.75 | 2000 | 2005.0 | 2010 | 2015.00 | 2019 | ▇▇▇▇▇ |
total | 0 | 1.00 | 1080910.03 | 72876.97 | 760644 | 1040509.0 | 1081090 | 1123230.00 | 1322087 | ▁▁▇▆▁ |
home | 0 | 1.00 | 540455.01 | 66774.65 | 202687 | 504360.0 | 543185 | 578342.00 | 741775 | ▁▁▅▇▁ |
away | 0 | 1.00 | 540455.01 | 25509.33 | 450295 | 524974.0 | 541757 | 557741.00 | 601655 | ▁▂▇▇▂ |
week | 0 | 1.00 | 9.00 | 4.90 | 1 | 5.0 | 9 | 13.00 | 17 | ▇▆▆▆▇ |
weekly_attendance | 638 | 0.94 | 67556.88 | 9022.02 | 23127 | 63245.5 | 68334 | 72544.75 | 105121 | ▁▁▇▃▁ |
We certainly have some interesting statistics here to mess with. It looks like the ‘total,’ ‘home,’ and ‘away’ attendance columns represent the entire season, while the ‘weekly_attendance’ field is for each week. Therefore, it looks like each row is a game. So there should be some variability in how many rows there are amongst the teams, as some make it to the playoff and some do not (therefore they’d have more games). A look at the summary statistics should tell us more… if that is the case or not.
So, as was expected, there does seem to be some variability associated with how many games they play. Not to worry, I’ll probably stay away from analyzing just that, and focus on either more aggregated descriptives or simply other statistics herein.
The standings dataset
kable(head(standings))
team | team_name | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Miami | Dolphins | 2000 | 11 | 5 | 323 | 226 | 97 | 6.1 | 1.0 | 7.1 | 0.0 | 7.1 | Playoffs | No Superbowl |
Indianapolis | Colts | 2000 | 10 | 6 | 429 | 326 | 103 | 6.4 | 1.5 | 7.9 | 7.1 | 0.8 | Playoffs | No Superbowl |
New York | Jets | 2000 | 9 | 7 | 321 | 321 | 0 | 0.0 | 3.5 | 3.5 | 1.4 | 2.2 | No Playoffs | No Superbowl |
Buffalo | Bills | 2000 | 8 | 8 | 315 | 350 | -35 | -2.2 | 2.2 | 0.0 | 0.5 | -0.5 | No Playoffs | No Superbowl |
New England | Patriots | 2000 | 5 | 11 | 276 | 338 | -62 | -3.9 | 1.4 | -2.5 | -2.7 | 0.2 | No Playoffs | No Superbowl |
Tennessee | Titans | 2000 | 13 | 3 | 346 | 191 | 155 | 9.7 | -1.3 | 8.3 | 1.5 | 6.8 | Playoffs | No Superbowl |
standings %>%
skim()
Name | Piped data |
Number of rows | 638 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
playoffs | 0 | 1 | 8 | 11 | 0 | 2 | 0 |
sb_winner | 0 | 1 | 12 | 13 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 2009.53 | 5.76 | 2000.0 | 2005.00 | 2010.0 | 2014.75 | 2019.0 | ▇▇▇▇▇ |
wins | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
loss | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
points_for | 0 | 1 | 350.28 | 71.40 | 161.0 | 299.00 | 348.0 | 396.00 | 606.0 | ▂▇▇▂▁ |
points_against | 0 | 1 | 350.28 | 59.55 | 165.0 | 310.00 | 347.0 | 391.50 | 517.0 | ▁▃▇▆▁ |
points_differential | 0 | 1 | 0.00 | 101.09 | -261.0 | -75.00 | 1.5 | 72.75 | 315.0 | ▂▆▇▅▁ |
margin_of_victory | 0 | 1 | 0.00 | 6.32 | -16.3 | -4.70 | 0.1 | 4.57 | 19.7 | ▂▆▇▅▁ |
strength_of_schedule | 0 | 1 | 0.00 | 1.63 | -4.6 | -1.10 | 0.0 | 1.20 | 4.3 | ▁▅▇▅▁ |
simple_rating | 0 | 1 | 0.00 | 6.20 | -17.4 | -4.47 | 0.0 | 4.50 | 20.1 | ▁▆▇▅▁ |
offensive_ranking | 0 | 1 | 0.00 | 4.34 | -11.7 | -3.18 | 0.0 | 2.70 | 15.9 | ▁▇▇▂▁ |
defensive_ranking | 0 | 1 | 0.00 | 3.57 | -9.8 | -2.40 | 0.1 | 2.50 | 9.8 | ▁▅▇▅▁ |
So, look at this snapshot of the data, there seems to be all kinds of interesting statistics between attendance and what appears to be by team year statistics.
The games dataset
kable(head(games))
year | week | home_team | away_team | winner | tie | day | date | time | pts_win | pts_loss | yds_win | turnovers_win | yds_loss | turnovers_loss | home_team_name | home_team_city | away_team_name | away_team_city |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2000 | 1 | Minnesota Vikings | Chicago Bears | Minnesota Vikings | NA | Sun | September 3 | 13:00:00 | 30 | 27 | 374 | 1 | 425 | 1 | Vikings | Minnesota | Bears | Chicago |
2000 | 1 | Kansas City Chiefs | Indianapolis Colts | Indianapolis Colts | NA | Sun | September 3 | 13:00:00 | 27 | 14 | 386 | 2 | 280 | 1 | Chiefs | Kansas City | Colts | Indianapolis |
2000 | 1 | Washington Redskins | Carolina Panthers | Washington Redskins | NA | Sun | September 3 | 13:01:00 | 20 | 17 | 396 | 0 | 236 | 1 | Redskins | Washington | Panthers | Carolina |
2000 | 1 | Atlanta Falcons | San Francisco 49ers | Atlanta Falcons | NA | Sun | September 3 | 13:02:00 | 36 | 28 | 359 | 1 | 339 | 1 | Falcons | Atlanta | 49ers | San Francisco |
2000 | 1 | Pittsburgh Steelers | Baltimore Ravens | Baltimore Ravens | NA | Sun | September 3 | 13:02:00 | 16 | 0 | 336 | 0 | 223 | 1 | Steelers | Pittsburgh | Ravens | Baltimore |
2000 | 1 | Cleveland Browns | Jacksonville Jaguars | Jacksonville Jaguars | NA | Sun | September 3 | 13:02:00 | 27 | 7 | 398 | 0 | 249 | 1 | Browns | Cleveland | Jaguars | Jacksonville |
From this vantage, it looks like I could connect the team name from the attendance data to the home team of the games data. To do so, I’ll have to make a key that matches. Most readily, it looks like I’ll need to combine a few the ‘team’ and ‘team_name’ columns in the attendance data.
att_reshape <- attendance %>%
mutate(t_name = str_c(team, team_name, sep = " ")) %>%
select(-team, -team_name)
# to merge these datasets, I needed 'week' to be numeric on both...
# and for whatever reason it was a character field to begin with...
games <- games %>%
mutate(week = as.numeric(week))
att_games <- left_join(att_reshape, games,
by = c('t_name' = 'home_team', 'year' = 'year', 'week' = 'week'))
kable(head(att_games))
year | total | home | away | week | weekly_attendance | t_name | away_team | winner | tie | day | date | time | pts_win | pts_loss | yds_win | turnovers_win | yds_loss | turnovers_loss | home_team_name | home_team_city | away_team_name | away_team_city |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2000 | 893926 | 387475 | 506451 | 1 | 77434 | Arizona Cardinals | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2000 | 893926 | 387475 | 506451 | 2 | 66009 | Arizona Cardinals | Dallas Cowboys | Arizona Cardinals | NA | Sun | September 10 | 20:35:00 | 32 | 31 | 322 | 1 | 330 | 2 | Cardinals | Arizona | Cowboys | Dallas |
2000 | 893926 | 387475 | 506451 | 3 | NA | Arizona Cardinals | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2000 | 893926 | 387475 | 506451 | 4 | 71801 | Arizona Cardinals | Green Bay Packers | Green Bay Packers | NA | Sun | September 24 | 16:06:00 | 29 | 3 | 455 | 1 | 209 | 4 | Cardinals | Arizona | Packers | Green Bay |
2000 | 893926 | 387475 | 506451 | 5 | 66985 | Arizona Cardinals | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2000 | 893926 | 387475 | 506451 | 6 | 44296 | Arizona Cardinals | Cleveland Browns | Arizona Cardinals | NA | Sun | October 8 | 16:15:00 | 29 | 21 | 315 | 2 | 240 | 0 | Cardinals | Arizona | Browns | Cleveland |
At first glance it looks like there is a bunch of missing data, but because I merged on the home team, almost all the missingness appears to be related to the team is away or if they had a by week. Which makes it less alarming… I was alarmed at first… “did my join work?”
Nevertheless, I think we’re almost at a point to start visualizing stuff. It would be helpful to have a column identifying if a team is home or away… just so I can sort on it later.
att_games_ha <- att_games %>%
mutate(away_team_ind = if_else(is.na(away_team) & !is.na(weekly_attendance), 1, 0),
away_team_ind = case_when(
is.na(weekly_attendance) ~ 9999,
TRUE ~ away_team_ind))
So, by doing all of this, I can see trends across time of how well a team does in terms of attendance between home and away games.
Lets take a look.
chiefs <- att_games_ha %>%
filter(t_name == "Kansas City Chiefs")
chiefs_gg <- chiefs %>%
filter(away_team_ind != 9999) %>%
group_by(year, week, away_team_ind) %>%
summarize(average_att = mean(weekly_attendance)) %>%
ggplot(aes(x = year, y = average_att)) +
geom_point() +
facet_wrap(~away_team_ind)
chiefs_gg
Well, that’s interesting… but I wonder if the obvious dip in at home attendance between ~2007 and ~2013 is simply a localized artifact to the Chiefs.
all_gg <- att_games_ha %>%
filter(away_team_ind != 9999) %>%
group_by(t_name, year, away_team_ind) %>%
summarize(average_att = mean(weekly_attendance)) %>% ungroup() %>%
ggplot(aes(x = year, y = average_att, color = as.factor(away_team_ind))) +
geom_point() +
facet_wrap(~t_name) +
theme(legend.position = "top")
all_gg
So obviously, each team has a fairly unique pattern of attendance, especially between home and away games. You can also see where some teams got a new stadium that pushed attendance up, such as the Dallas Cowboys.
gg1 <- att_games_ha %>%
filter(away_team_ind != 9999) %>%
group_by(t_name, year, week, away_team_ind) %>%
summarize(average_att = mean(weekly_attendance)) %>%
ggplot(aes(t_name, average_att, color = as.factor(away_team_ind))) +
geom_boxplot(outlier.alpha = .5) +
coord_flip() +
theme(legend.position = "top")
gg1
Although both the previous graph do not exhibit any clear reoccuring or dicernable trends, it makes me wonder what the relationship between home attendance and season performance is. Said another way, does a strong home team crowd during the first half of the season, relate to eventual success?
Lets take a look.
standings <- standings %>%
mutate(t_name = str_c(team, team_name, sep = " "))
predict_1 <- att_games_ha %>%
filter(!is.na(away_team),
week < 8) %>%
group_by(year, t_name) %>%
summarize(avg.att = mean(weekly_attendance)) %>% ungroup() %>%
left_join(standings, by = c("year", "t_name")) %>%
mutate(perc.win = wins/(wins+loss))
cor(predict_1$avg.att, predict_1$perc.win) #this is not promissing, but lets visualize
## [1] 0.09888602
predict_1_gg <- predict_1 %>%
ggplot(aes(x = perc.win, y = avg.att)) +
geom_point()
predict_1_gg
So much variability. I would assume this is because each team has different stadium sizes, and therefore the y-axis is not relative across teams. Lets see if we can control this.
Let’s first create a function to normalize data within group (year, team)
zscore <- function(m){
(m - mean(m)) / sd(m)
}
Group it and then lets see how things relate.
# Full season results were used, although home and away games' attendance will count for more than one team at a time.
predict_2 <- att_games_ha %>%
# filter(!is.na(away_team)) %>%
filter(!is.na(weekly_attendance)) %>%
group_by(year, t_name) %>%
mutate(avg.att = zscore(weekly_attendance)) %>% ungroup() %>%
left_join(standings, by = c("year", "t_name")) %>%
mutate(perc.win = wins/(wins+loss)) %>%
# filter(week < 8) %>%
# filter(week > 4 & week < 10) %>% #trying out different perspectives
# filter(week > 8) %>%
group_by(year, t_name, perc.win) %>%
summarize(avg_first_half = mean(avg.att)) %>% ungroup()
Lets take a look now.
predict_2_gg <- predict_2 %>%
ggplot(aes(x = perc.win, y = avg_first_half)) +
geom_point() +
geom_smooth(method = "auto")
# geom_point() +
# geom_smooth(stat = "smooth")
predict_2_gg
Although first half season attendance does not seem to be extensively predictive, perhaps there are other indicators that could help improve this model. Lets look a bit deeper here.
cor(predict_2$perc.win, predict_2$avg_first_half, method = "pearson")
## [1] 0.0536111
lm1 <- lm(data = predict_2, perc.win ~ avg_first_half)
broom::tidy(lm1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.01e- 1 7.66e- 3 65.4 1.72e-284
## 2 avg_first_half 8.59e+14 6.34e+14 1.35 1.76e- 1
broom::glance(lm1)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.00287 0.00131 0.193 1.83 0.176 2 146. -286. -273.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
lm2 <- lm(data = predict_2, perc.win ~ year + avg_first_half + year*avg_first_half)
broom::tidy(lm2)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.43e- 1 2.70e+ 0 0.312 0.755
## 2 year -1.70e- 4 1.34e- 3 -0.126 0.899
## 3 avg_first_half 1.15e+17 2.34e+17 0.491 0.624
## 4 year:avg_first_half -5.67e+13 1.16e+14 -0.487 0.626
broom::glance(lm2)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.00325 -0.00147 0.193 0.689 0.559 4 146. -283. -260.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
predict_3 <- predict_2 %>%
left_join(standings, by = c("year", "t_name"))
lm3 <- lm(data = predict_3, perc.win ~ year + avg_first_half + points_for)
broom::tidy(lm3)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.60e+ 0 1.81e+ 0 4.75 2.55e- 6
## 2 year -4.39e- 3 9.04e- 4 -4.85 1.53e- 6
## 3 avg_first_half 4.38e+14 4.26e+14 1.03 3.04e- 1
## 4 points_for 2.03e- 3 7.29e- 5 27.9 2.68e-112
broom::glance(lm3)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.552 0.550 0.129 261. 3.77e-110 4 402. -793. -771.
## # … with 2 more variables: deviance <dbl>, df.residual <int>