# Series on Sports Analytics: College Football Expected Points Model Fundamentals, Part 2

A lesson in regression

If you missed Part 1 on the College Football Expected Points model fundamentals, you can check it out here.

There have been some developments since I started talking about college football expected points a couple months ago, so let’s start with some housekeeping.

I have spent the past month or two working with the package creator on improving the capabilities and accuracy of the expected points model as well as the win probability model. Our initial roll-out of the updated package was used for Part 1 of the College Football Expected Points model fundamentals series.

In Part 1, we left off discussing field position and expected points, including a breakdown by down. What was presented there was largely top-level results and model definitions, without much depth on how we arrived at the model or on how the model's predictions generate the expected points.

In order to remedy that, I feel it necessary for us to cover some model building methods. This article will show you how we arrive at the model that we currently have from a modeling perspective. The entire purpose of this series of articles is to show you as transparently as I can how the Expected Points model works, how well it works, and how the model is limited.

Additionally, and maybe most importantly, this research is reproducible. The following link contains an R notebook and supporting figures for the article, which should be sufficient to work through (it is large, ~284 MB when unzipped). Link

### The Data

We will be acquiring data from CollegeFootballData.com, courtesy of @CFB_data, using the cfbscrapR package, created by Meyappan Subbaiah (@msubbaiah1) and collaborators Saiem Gilani (@SaiemGilani) and Parker Fleming (@statsowar).

Warning: I have been advised to alert readers that things are about to get nerdy. This might be true, but the theme of the next two parts is motivation, both personal and intellectual. Do you want to learn and excel? Then continue. Otherwise, just scroll through the visuals and skip the technical math details. (Editor’s note: I jumped right to the pretty pictures; don’t be ashamed to admit you’re lesser.)

## Regression Methods

Regression is a set of techniques for estimating the relationship between one or more predictor variables and a quantitative target variable, and our focus will be on one of the simplest types of relationships: linear.

### Linear Regression

#### Assumptions

• Dependent variable is continuous, unbounded, and measured on an interval or ratio scale
• Model has linear relation between independent and dependent variables
• No outliers present
• Independence Assumption: Sample observations are independent
• Absence of multicollinearity between the predictor variables
• Constant Variance Assumption (homoscedasticity)
• Normal Distribution of error terms
• Little or no auto-correlation in the residuals
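For reference, the model these assumptions attach to is the standard linear model, which for predictors $x_{i1}, \dots, x_{ip}$ can be written as:

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)
```

The constant-variance and normality assumptions above are statements about the error term $\varepsilon_i$, which is why we will be checking the residuals of the fit.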

Since scores in football only happen in increments of 2, 3, and 6 (+0, +1, +2), with the additional points in parentheses resulting from extra point attempts, the football scoring scheme is not continuous over an interval or ratio scale without transformation of the target variable.

While pretending I was unaware of the continuous-dependent-variable assumption of linear regression, I took a look at producing a linear regression model using down, distance, and yards-to-goal as independent variables on a similarly treated college football dataset, excluding all 4th down plays. I tried predicting both the next score in the half and points on the drive; the drive-points target produced the highest adjusted R-squared, at 0.5143, while the others were quite low.

Adjusted R-squared is a measure of the percentage of the dependent variable's variation explained by the independent variables, in this case 51.43%. Below is the model summary of this fitting with no intercept.

Figure 3: Linear Regression model summary | Expected Drive-Points model using Down, Distance, and Field Position as factors
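To make the adjusted R-squared calculation concrete, here is a minimal sketch in Python with NumPy (the accompanying notebook is in R; the data below is synthetic stand-in data, not the actual play-by-play, and the coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Synthetic stand-ins for down, distance, and yards-to-goal
down = rng.integers(1, 4, n).astype(float)
distance = rng.integers(1, 21, n).astype(float)
yards_to_goal = rng.integers(1, 100, n).astype(float)
# Synthetic "drive points" with a linear signal plus Gaussian noise
y = 5.0 - 0.045 * yards_to_goal - 0.5 * down - 0.05 * distance + rng.normal(0, 1.5, n)

# Ordinary least squares fit (with an intercept column)
X = np.column_stack([np.ones(n), down, distance, yards_to_goal])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared: share of variation explained by the predictors
resid = y - X @ beta
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

# Adjusted R-squared penalizes for the number of predictors p
p = X.shape[1] - 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 3), round(adj_r2, 3))
```

The adjustment always pulls the raw R-squared down slightly, which guards against inflating the fit by piling on predictors.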

Additionally, the linear model indicates that all three factors are significant at the p = 0.001 level, which is at least some evidence that our variables have a relationship with drive points. Here is a plot of the linear regression model fitting.

Figure 4: Linear Regression model plots | Expected Drive-Points model using Down, Distance, and Field Position as factors

In the plot on the top left, there are four distinct scoring types clearly visible, and the model is trying to thread a single line through all of them. The red line in the top-left plot would be relatively flat if the residuals of the model fit had constant variance. While there are two other non-zero scoring types, Field Goals and Opponent Field Goals, the data excluded 4th down plays, so they do not appear on the plot. For clarity, there are in fact 7 next-score types when we include the absence of a score, i.e. “No Score”, since the absence of a scoring event is also a type of next score.

Upon viewing these plots, I quickly realized that several assumptions of linear regression are being violated here, namely the constant variance assumption and normal distribution of error terms (see Figure 4), at minimum. We need to keep adding to our regression toolbox, so let us now take a look at a type of regression that does not restrict us to these assumptions.

### Logistic Regression

Suppose we have a binary output variable Y, let’s say Y is a variable that gives a response of 1 if the next score in the half is a TD for the offense and a 0 otherwise. If we wanted to predict the probability that the next score in the half is a TD for the offense, one of the prime candidate models would be logistic regression.
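In this binary setting, logistic regression models the log-odds of the outcome as a linear function of the predictors, which is equivalent to passing the linear combination through the logistic function:

```latex
\log\!\left(\frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)}\right)
= \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,
\qquad
P(Y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}
```

The logistic function keeps the predicted value between 0 and 1, so the output can be read directly as a probability.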

#### Assumptions

• Binary logistic regression requires the dependent variable to be binary (i.e. 0/1)
• There is a linear relationship between the log-odds of the outcome and each of the predictor variables
• No extreme outliers present
• Independence Assumption: Sample observations are independent
• Absence of multicollinearity between the predictor variables

Note that, unlike linear regression, logistic regression does not require constant variance (homoscedasticity) or normally distributed error terms, which is exactly what we were looking for.

Now we are attempting to calculate the probability of the next scoring event directly, an essential component of an expected points model. Once we have a model capable of calculating the probability of a scoring event, e.g. the probability of the next score being an offense touchdown, we simply multiply that probability by the point value of the score to get its contribution to expected points.
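As a toy illustration of that multiplication step, summed over all seven next-score types (the probabilities below are hypothetical, not output from the actual model, and a touchdown is valued at 7 to fold in the expected extra point):

```python
# Point values for each next-score type, from the offense's perspective;
# opponent scores carry negative values.
score_values = {
    "TD": 7, "FG": 3, "Safety": 2, "No Score": 0,
    "Opp Safety": -2, "Opp FG": -3, "Opp TD": -7,
}

# Hypothetical model-output probabilities for a single game state
probs = {
    "TD": 0.35, "FG": 0.18, "Safety": 0.01, "No Score": 0.16,
    "Opp Safety": 0.01, "Opp FG": 0.09, "Opp TD": 0.20,
}

# Expected points = sum of (probability x point value) over all score types
expected_points = sum(probs[k] * score_values[k] for k in probs)
print(round(expected_points, 2))  # -> 1.32
```

The seven probabilities must sum to 1, since exactly one of the next-score outcomes (including "No Score") happens in every half.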

With that background, we can build our model equation with whatever independent variables we choose to include in the model as shown in Figure 8.

Below is a model fit to the next score offense touchdown variable using the independent variables yards-to-goal, down, distance, and the interaction between down and distance.

Once again we see that all variables in the model fit are significant at the p < 0.001 level. In Figure 10 below, we can see the probability of the next score in the half being a touchdown in relation to field position (yards-to-goal) as the offense progresses down the field.

You might be asking, “Yeah, that’s great, but you’re still only predicting the probability of one scoring type relative to another. You’d need to do this like 6 times, right?” Well, fair point. How does one do logistic regression on a categorical variable that has more than two classes?

### Multinomial Logistic Regression (or Softmax regression)

A multinomial logistic regression model uses the independent (predictor) variables and target variable data from the training set to build relationships between the independent variables and each of the classes of the target variable.
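Concretely, with K classes of the target variable, the softmax formulation gives each class k its own coefficient vector and converts the K linear scores into class probabilities:

```latex
P(Y = k \mid x) = \frac{e^{\beta_k^\top x}}{\sum_{j=1}^{K} e^{\beta_j^\top x}},
\qquad k = 1, \dots, K
```

By construction these probabilities are positive and sum to 1 across the classes, which is exactly what we need for the 7 next-score types: one model call yields the full probability distribution over scoring outcomes, rather than 6 separate binary models.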