*If you missed Part 1 on the College Football Expected Points model fundamentals, you can **check it out here.***

There have been some developments since I started talking about college football expected points a couple months ago, so let’s start with some housekeeping.

I have spent the past month or two working with the package creator on improving the capabilities and accuracy of the expected points model as well as the win probability model. Our initial roll-out of the updated package was used for Part 1 of the College Football Expected Points model fundamentals series.

— cfbscrapR (@cfbscrapR) April 19, 2020

We have a new version (2) of cfbscrapR! You can find it on github. It's a big improvement over the first version. That's what iterations all about though!

Got a team now too. Meet @SaiemGilani and @statsowar.

Peep the thread for all of the changes.

In Part 1, we left off discussing field position and expected points, including a breakdown by down. What was presented there was a top-level discussion of the results and the model definition, without much depth on how we arrived at the model or on how the model's predictions generate the expected points.

In order to remedy that, I feel it necessary for us to cover some model building methods. This article will show you how we arrive at the model that we currently have from a modeling perspective. The entire purpose of this series of articles is to show you as transparently as I can how the Expected Points model works, how well it works, and how the model is limited.

Additionally, and maybe most importantly, this research is **reproducible**. The following link contains an R notebook and supporting figures for the article, which should be sufficient to work through it (it is large, ~284 MB when unzipped). Link

**The Data**

We will be acquiring data from CollegeFootballData.com, courtesy of @CFB_data, using the cfbscrapR package, created by Meyappan Subbaiah (@msubbaiah1) and collaborators Saiem Gilani (@SaiemGilani) and Parker Fleming (@statsowar).

**Warning:** *I have been ~~told~~ advised to give the readers an alert that things are about to get nerdy. This might be true, but the theme of the next two parts is motivation, both personal and intellectual. Do you want to learn and excel? Then continue. Otherwise, just scroll through the visuals and skip the technical math details. (Editor’s note: I jumped right to the pretty pictures, don’t be ashamed for admitting your lesser.)*

**Regression Methods**

Regression is a set of techniques for estimating the relationship between one or more predictor variables and a quantitative target variable, and our focus will start with one of the simplest types of relationship: linear.

**Linear Regression**

**Assumptions**

- Dependent variable is continuous, unbounded, and measured on an interval or ratio scale
- Model has linear relation between independent and dependent variables
- No outliers present
- Independence Assumption: Sample observations are independent
- Absence of multicollinearity between the predictor variables
- Constant Variance Assumption (homoscedasticity)
- Normal Distribution of error terms
- Little or no auto-correlation in the residuals

Scores in football only happen in increments of 2, 3, and 6 (+0, +1, or +2, with the additional points in parentheses resulting from extra point and two-point conversion attempts), so the football scoring scheme is not continuous over an interval or ratio scale without transformation of the target variable.

While pretending I was unaware of the continuous dependent variable assumption of linear regression, I looked into producing a linear regression model using down, distance, and yards-to-goal as independent variables, using a similarly treated college football dataset but excluding all 4th down plays. I tried predicting either the next score in the half or the points on the drive. The drive points model demonstrated the highest adjusted R-squared at 0.5143, while the others were quite low.

Adjusted R-squared measures the percentage of the variation in the dependent variable that is explained by the independent variables, adjusted for the number of predictors in the model. In this case it is 51.43%. Below is the model summary of this fit, which has no intercept.
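For reference, here is the usual textbook definition of R-squared and adjusted R-squared for a model with an intercept, *n* observations, and *p* predictors. The symbols are my own shorthand, not the notebook's. For a fit without an intercept (as here), software typically measures R-squared against an uncentered total sum of squares, so treat this as the general idea rather than the exact computation behind 0.5143.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$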

Additionally, the linear model indicates that all three factors are significant at the *p* < 0.001 level, which is at least some evidence that our variables have a relationship with drive points. Here is a plot of the linear regression model fit.

In the plot on the top left, there are 4 distinct scoring types clearly visible, and the model is trying to thread a single line through all of them. The red line in the top-left plot would be relatively flat if the residuals of the model fit had constant variance. While there are two other non-zero scoring types, Field Goals and Opponent Field Goals, the data excluded 4th down, so they do not appear on the plot. For clarity, there are in fact 7 next score types once we include “No Score”, since the absence of a scoring event is also a type of next score.

Upon viewing these plots, I quickly realized that several assumptions of linear regression are being violated here, namely the constant variance assumption and normal distribution of error terms (see Figure 4), at minimum. We need to keep adding to our regression toolbox, so let us now take a look at a type of regression that does not restrict us to these assumptions.

**Logistic Regression**

Suppose we have a binary output variable Y. Let’s say Y gives a response of 1 if the next score in the half is a TD for the offense and 0 otherwise. If we wanted to predict the probability that the next score in the half is a TD for the offense, one of the prime candidate models would be logistic regression.

**Assumptions**

- Binary logistic regression requires the dependent variable to be binary (i.e. 0/1)
- There is a linear relationship between the log-odds of the outcome and each of the predictor variables.
- No outliers present
- Independence Assumption: Sample observations are independent
- Absence of multicollinearity between the predictor variables
- ~~Constant Variance Assumption (homoscedasticity)~~
- ~~Normal Distribution of error terms~~
- ~~Little or no auto-correlation in the residuals~~

Now we are attempting to calculate the **probability** of the next scoring event directly, an essential component of an **expected** points model. Once we have a model capable of calculating the expectation of the scoring event, e.g. the probability of the next score being an offense touchdown, then we simply have to multiply the probability by the **point value of the score** to get the expected **points**.
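As a one-line worked example, assume a touchdown is worth 7 points (an illustrative value, since the exact point value the model uses is not stated here) and take a hypothetical probability of 0.40 that the next score is an offense touchdown:

$$
\text{EP}_{\text{Off. TD}} = P(\text{next score is Off. TD}) \times \text{points}_{\text{TD}} = 0.40 \times 7 = 2.8
$$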

With that background, we can build our model equation with whatever independent variables we choose to include in the model as shown in Figure 8.
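In standard notation, a logistic regression model with predictors $x_1, \dots, x_k$ can be written in two equivalent ways: as a linear model for the log-odds, or as a probability. The symbols below are generic and may differ from those in the figure.

$$
\log\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
$$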

Below is a model fit to the next score offense touchdown variable using the independent variables yards-to-goal, down, distance, and the interaction between down and distance.
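To make the structure of such a model concrete, here is a minimal Python sketch of how a fitted logistic regression turns a game situation into a probability. Everything specific in it is an assumption for illustration: the coefficient values are made up (they are not the fitted values from the notebook), the function names are hypothetical, and down is treated as a plain number for brevity, whereas a fitted model might treat it as a categorical factor.

```python
import math

# Hypothetical coefficients for illustration only; these are NOT the
# fitted values from the article's model.
COEFS = {
    "intercept": 0.5,
    "yards_to_goal": -0.02,
    "down": -0.25,
    "distance": -0.03,
    "down_x_distance": -0.01,  # interaction between down and distance
}


def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_offense_td(yards_to_goal, down, distance, coefs=COEFS):
    """Probability that the next score in the half is an offense TD."""
    log_odds = (
        coefs["intercept"]
        + coefs["yards_to_goal"] * yards_to_goal
        + coefs["down"] * down
        + coefs["distance"] * distance
        + coefs["down_x_distance"] * down * distance
    )
    return sigmoid(log_odds)


# Same down and distance, two different field positions.
p_near = prob_offense_td(yards_to_goal=10, down=1, distance=10)
p_far = prob_offense_td(yards_to_goal=75, down=1, distance=10)
print(round(p_near, 3), round(p_far, 3))
```

With a negative yards-to-goal coefficient, the probability rises as the offense moves closer to the goal line, which is the general shape one would expect from a model like this.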

Once again we see that all variables in the model fit are significant at the *p* < 0.001 level. In Figure 10 below, we can see the probability of the next score in the half being a touchdown in relation to field position (yards-to-goal) as the offense progresses down the field.

You might be asking, “Yeah, that’s great, but you’re still only predicting the probability of one scoring type relative to another. You’d need to do this like 6 times, right?” Well, fair point. How *does* one do logistic regression on a categorical variable that has more than two classes?

**Multinomial Logistic Regression (or Softmax regression)**

A multinomial logistic regression model uses the independent (predictor) variables and target variable data from the training set to build relationships between the independent variables and **each of the classes of the target** variable.

The primary difference between logistic and multinomial logistic regression is the use of the softmax function, which normalizes the outputs of the individual models so that the resulting class probabilities add to one, as seen in Figure 12 below.
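A minimal Python sketch of the softmax function is shown here, assuming each class has already been given a real-valued linear score. The scores in the example are made up, and the reference class is fixed at a score of 0 by convention.

```python
import math


def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear scores for three classes; the first is the
# reference class, whose score is fixed at 0.
scores = [0.0, 1.2, -0.4]
probs = softmax(scores)
print(probs, sum(probs))
```

Larger scores get larger shares of the probability, and the shares always add to one regardless of how many classes there are.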

More specifically, a multinomial logistic regression model is an extension of the binomial logistic regression model because it is a **series of logistic regression models** estimated simultaneously with the same reference outcome.

The college football expected points (EP) model is a multinomial logistic regression model which generates probabilities for the possible types of **next score** events within the same half. In our case, we build 6 logistic regression models fit to the next score types — Offense FG, Offense TD, Offense Safety, Opponent TD, Opponent FG, and Opponent Safety — all except for the class that is used as the base case (i.e. No Score), since that is accounted for in the intercept, as mentioned in Part 1. Ron Yurko, Sam Ventura, and Max Horowitz originally proposed the multinomial logistic regression expected points model for football in 2017, which we will learn more about next time.
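Written out in standard notation (the symbols are my own shorthand), with "No Score" as the reference class and one coefficient vector $\beta_k$ for each of the six other next score types, the class probabilities take the following form:

$$
P(Y = k \mid x) = \frac{e^{x^\top \beta_k}}{1 + \sum_{j=1}^{6} e^{x^\top \beta_j}} \quad (k = 1, \dots, 6),
\qquad
P(Y = \text{No Score} \mid x) = \frac{1}{1 + \sum_{j=1}^{6} e^{x^\top \beta_j}}
$$

The seven probabilities add to one by construction, which is the softmax normalization with the reference class score fixed at zero.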

Now that we have a way to calculate the probabilities of scores, we can calculate the expected points. We will talk about this in Part 3.