If you follow football analytics Twitter closely, you have probably seen terms like Offensive Rush EPA or Defensive Pass EPA. These terms refer to Expected Points Added (EPA), a metric that evaluates each play quantitatively and returns both a magnitude and a direction for the play’s effect on the game, relative to the mean expectation for the game state.
The EPA metric used in this article relies on a model that groups plays into down, distance and yard line bins and estimates the actual points expected from each of those game states. The EPA of a play is then the expected points of the game state after the play minus the expected points of the game state before it.
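As a rough illustration of that subtraction, here is a minimal sketch in base R. The expected-points values are made up for the example and do not come from the actual model.

```r
# Hypothetical expected-points values for illustration only; real values
# come from the fitted EPA model, not from this table.
plays <- data.frame(
  play      = c("1st & 10 run", "2nd & 6 pass", "3rd & 9 sack"),
  ep_before = c(1.5, 1.9, 1.2),
  ep_after  = c(1.9, 3.1, -0.4),
  stringsAsFactors = FALSE
)

# EPA is the change in expected points produced by the play:
# positive values helped the offense, negative values hurt it.
plays$epa <- plays$ep_after - plays$ep_before
print(plays)
```

The sign gives the direction of the play’s effect and the absolute value gives its magnitude.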
The purpose of this series will be to allow the fearless to get their hands on the data and perform some basic data manipulation to recreate some of the excellent sports analytics projects in the public sphere.
R & RStudio
This particular project requires R and RStudio. You can follow the guide from the FSU Library for installation instructions. If you’ve never used R, do that and come back here when you’re done. Create a folder in your documents folder named something like ‘NCAAFBepa’ and use it as your working directory. The quick rundown is that RStudio will want you to create a script to work in, which you can do by going to “File > New File > R Script” and saving the file as “cfbscrapR_tutorial.R”. Copy each code gist embedded on this page into this file, saving each time you add more code.
cfbscrapR is an R package for working with CFB data. It is an R API wrapper around https://collegefootballdata.com/. It lets users query a wide range of endpoints and supplement that data with additional information (Expected Points Added and Win Probability Added).
The Expected Points Added (EPA) Model
These “actual points expected” are determined using a multinomial logit model (follow the link for further reading on this general type of model). Such models can take hours or days to train on the number of seasons of data needed for an acceptable EPA model, so we will not cover the training directly in this article. If you want to understand this EPA model in more depth, take a look at @903124S’s GitHub repository, CFB_EPA_data. The great thing about the cfbscrapR package is that it can optionally include these EPA calculations when pulling play-by-play files from the CollegeFootballData.com API.
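To give a feel for how a multinomial logit model turns into an expected-points number, here is a minimal sketch in base R. It assumes seven “next score” outcomes and uses invented linear scores for a single game state; neither the outcome set nor the numbers are taken from the actual cfbscrapR model.

```r
# Assumed set of next-score outcomes and their point values
# (an illustrative assumption, not confirmed from the package).
outcome_points <- c(td = 7, fg = 3, safety = 2, no_score = 0,
                    opp_safety = -2, opp_fg = -3, opp_td = -7)

# Hypothetical linear scores (log-odds relative to no_score) that a
# fitted model might produce for one down/distance/yard line state.
linear_scores <- c(td = 0.9, fg = 0.4, safety = -3.0, no_score = 0,
                   opp_safety = -3.5, opp_fg = -0.6, opp_td = -0.2)

# Softmax converts the linear scores into outcome probabilities.
softmax <- function(x) {
  z <- exp(x - max(x))  # subtract the max for numerical stability
  z / sum(z)
}
probs <- softmax(linear_scores)

# Expected points is the probability-weighted sum of the point values.
expected_points <- sum(probs * outcome_points)
print(round(probs, 3))
print(expected_points)
```

Repeating this for the game state before and after a play, then subtracting, gives that play’s EPA.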
Most of this tutorial comes straight from Parker Fleming (@statowar) of SBNation’s TCU site, Frogs O’ War, whose excellent tutorial can be found here; the model is explained further in his EPA Primer. Both are more thorough and cover things in finer, step-by-step detail.
A) Install the packages and set your working directory
Follow the directions in the comment at the top of this gist to install the necessary packages. If R asks to restart your session, allow it, then re-run from where the command left off. Be sure to write the path to your working directory with forward slashes.
Note: if you would simply like all of the code included below altogether, copy the script here.
2) Getting the play by play data for 2019
This section uses cfbscrapR’s cfb_pbp_data() function to pull each week’s data into a data.frame() object for easy tabular manipulation. This function calls the College Football Data API directly. That means you are using someone’s free service, and it is considered poor form, at minimum, to make large repeated calls to an API. Do your best to run this section only once and then comment it out. If you like, save the data to disk and read it back in with read.csv() on later runs.
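One way to avoid repeated API calls is a small fetch-once-then-cache helper. The sketch below is an assumption about how you might structure it: load_or_fetch() and fake_fetch() are hypothetical names, and fake_fetch() stands in for the real cfb_pbp_data() loop so the example runs without touching the API.

```r
# Hypothetical helper: read the cached CSV if it exists, otherwise call
# fetch_fn() once and write the result to disk for next time.
load_or_fetch <- function(path, fetch_fn) {
  if (file.exists(path)) {
    return(read.csv(path, stringsAsFactors = FALSE))
  }
  data <- fetch_fn()
  write.csv(data, path, row.names = FALSE)
  data
}

# Stand-in for the week-by-week API pull; the columns are a toy subset.
fake_fetch <- function() {
  data.frame(week = c(1, 1, 2),
             play_type = c("Rush", "Pass", "Rush"),
             EPA = c(0.25, -0.5, 1.0),
             stringsAsFactors = FALSE)
}

cache_path <- file.path(tempdir(), "pbp_2019_demo.csv")
unlink(cache_path)  # start the demo from a clean state

pbp <- load_or_fetch(cache_path, fake_fetch)        # fetches and writes the CSV
pbp_again <- load_or_fetch(cache_path, fake_fetch)  # reads the CSV from disk
```

In your own script you would replace fake_fetch with a function that wraps the real play-by-play pull and point cache_path at a file in your working directory.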
You should see something like this after this section finishes running:
This gives you a view of all the columns for the first 6 rows of the dataframe.
D) Data Manipulation
From here the script executes the following:
- selects the needed columns from the play-by-play data
- filters to include only rushing or passing plays
- groups all offensive plays by team and computes summary statistics for offensive passing EPA and offensive rushing EPA
- does a similar grouping for the defensive plays, so that we can analyze both offensive and defensive EPA
- limits the chart to the following subsets of teams: major teams on FSU’s schedule for 2020, teams in the ACC, and Memphis
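The steps above can be sketched on a toy dataframe. This version uses base R so it runs without extra packages; the column names (offense_play, defense_play, play_type, EPA), the simplified play types, and all values are assumptions for illustration, not real 2019 data.

```r
# Toy play-by-play data; real data has many more columns and play types.
pbp <- data.frame(
  offense_play = c("Florida State", "Florida State", "Florida State",
                   "Clemson", "Clemson", "Clemson"),
  defense_play = c("Clemson", "Clemson", "Clemson",
                   "Florida State", "Florida State", "Florida State"),
  play_type = c("Rush", "Pass", "Punt", "Rush", "Pass", "Pass"),
  EPA = c(0.2, -0.4, 0.0, 0.6, 1.0, 0.2),
  stringsAsFactors = FALSE
)

# Select the columns we need and keep only rushing or passing plays.
plays <- pbp[pbp$play_type %in% c("Rush", "Pass"),
             c("offense_play", "defense_play", "play_type", "EPA")]

# Mean EPA per play for each offense, split by play type.
offense <- aggregate(EPA ~ offense_play + play_type, data = plays, FUN = mean)

# The same summary from the defensive side.
defense <- aggregate(EPA ~ defense_play + play_type, data = plays, FUN = mean)

# Limit the result to a chosen subset of teams.
teams <- c("Florida State", "Clemson")
offense <- offense[offense$offense_play %in% teams, ]
print(offense)
print(defense)
```

For a defense, a lower EPA allowed per play is better, which is worth keeping in mind when reading the defensive summary.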
Norvell) Visualization - Offensive Total - Passing EPA and Rushing EPA
This part of the script first joins the offensive and defensive dataframes using a left join, matching values of “offense_play” to values of “defense_play”. A left join keeps every row of the offensive dataframe and attaches the defensive values wherever there is an exact match; offensive rows without a match get missing values, and defensive rows with no exact match in “offense_play” are dropped. The same kind of join is performed to bring in the logos, which will serve as markers on our scatter plot.
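To see the left-join behavior concretely, here is a minimal sketch using base R’s merge() with all.x = TRUE, which is a base R stand-in for a left join rather than the call used in the script. The team rows and EPA values are invented.

```r
# Toy summaries; values are invented for illustration.
offense <- data.frame(offense_play = c("Florida State", "Clemson", "Memphis"),
                      off_epa = c(0.05, 0.30, 0.20),
                      stringsAsFactors = FALSE)
defense <- data.frame(defense_play = c("Florida State", "Clemson", "Boise State"),
                      def_epa = c(0.10, -0.15, 0.02),
                      stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of the left (offense) dataframe.
joined <- merge(offense, defense,
                by.x = "offense_play", by.y = "defense_play",
                all.x = TRUE)
print(joined)
```

Memphis stays in the result with a missing def_epa because it has no defensive row, while Boise State is dropped because it has no offensive row.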
Note: per Play and per Attempt are used interchangeably throughout, but the latter is a more accurate descriptor.
Ham) Visualization - Defensive Total - Passing EPA and Rushing EPA
Now we simply perform the join in the opposite direction to visualize the Defensive Passing and Rushing EPA.
Voila! Now we have a clear look at how FSU and their opponents fared in 2019 Offensive and Defensive Rush/Pass Expected Points Added per Play. Continue to play around with the code to filter for other teams, situations, field positions, etc.
Hopefully this gave you a quick look at what can be accomplished in mere minutes using the cfbscrapR package. If you’d like further examples and visualizations working with this data, please take a look at Parker Fleming’s tutorial or his EPA Primer. The following two tables are numerical outputs for the Offensive Passing/Rushing EPA charts above, in case the numbers are clearer to you than the visuals.