If you missed part 1 on the College Football Expected Points model fundamentals, you can check it out here. For part 2, click here.
In part 1, we left off discussing field position and expected points, including a breakdown by down. That material was the top-level view: the headline results and the model definition. In part 2, we went into some detail about regression model building and how the model generates the next score probabilities, with some depth on how we arrived at the model from a technical perspective. Today, I feel it necessary to cover some history of the expected points model in football. This article will show how we arrived at the current model factors from a historical perspective.
The entire purpose of this series of articles is to show you as transparently as I can how the College Football Expected Points model works, how well it works, and how the model is limited. Additionally, this research is reproducible.
We will be acquiring data from CollegeFootballData.com, courtesy of @CFB_data, using the cfbscrapR package, created by Meyappan Subbaiah (@msubbaiah1) and collaborators Saiem Gilani (@SaiemGilani) and Parker Fleming (@statsowar).
An Abridged History of Expected Points Models in Football
Field Position models
One of the earliest discussions and attempts at building an expected points model in football was made by Virgil Carter of the Cincinnati Bengals in his graduate research paper, “Operations Research on Football”. Carter found some time to write the research paper while pursuing his MBA at Northwestern in the off-season from his main gig of being quarterback for the Chicago Bears.
For the study, Carter and his wife Judy coded 8,373 plays with 53 variables — including time, down, and distance — from 56 games played during the first half of the 1969 season. From the paper:
... the field was divided into ten strips, namely, 99 to 91 yards to go, 90 to 81 yards to go, 80 to 71 yards to go, etc. These data are identified by their midpoints in Table I. The smallest number of data points in any set was 57 (this set centered on 95 yards to go), and the largest was 601 (this set centered on 75 yards to go). This condensation led to a system of ten equations in 10 unknowns.
The analysis was based on a study of 2,852 first-and-ten plays. An independent calculation was performed for the subset of 1,258 first downs immediately following a turnover (i.e. the start of a series), and the average absolute difference was found to be less than a quarter of a point. (Carter and Machol, 1971)
This smaller dataset of first-and-ten plays is the basis for the model described in the paper, and the results are shown graphically in the figure below.
For complete clarity, the study averaged the value of the next scores for each of the 10 strips listed to give the data in the table.
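The binning-and-averaging step described above can be sketched in a few lines. This is a minimal illustration, not Carter's actual procedure or data: the function names and the toy plays are hypothetical, and the only thing taken from the paper is the idea of ten strips identified by their midpoints (95, 85, ..., 5 yards to go).

```python
from collections import defaultdict


def strip_midpoint(yards_to_go):
    """Map yards to go (1-99) to the midpoint label of its strip (5, 15, ..., 95)."""
    if not 1 <= yards_to_go <= 99:
        raise ValueError("yards_to_go must be between 1 and 99")
    return ((yards_to_go - 1) // 10) * 10 + 5


def average_next_score_by_strip(plays):
    """Average the next score value per strip.

    plays: iterable of (yards_to_go, next_score_value) pairs, one per
    first-and-ten play, where next_score_value is signed from the
    offense's point of view.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for yards_to_go, next_score_value in plays:
        midpoint = strip_midpoint(yards_to_go)
        totals[midpoint] += next_score_value
        counts[midpoint] += 1
    return {mid: totals[mid] / counts[mid] for mid in sorted(totals)}


# Hypothetical toy plays for illustration only (NOT Carter's data).
toy_plays = [(95, -7), (93, 0), (75, 7), (72, 3), (78, -3), (8, 7), (4, 3)]
strip_averages = average_next_score_by_strip(toy_plays)
```

With real play-by-play data, each strip average would play the role of one row in the paper's table.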
In 1988, the book The Hidden Game of Football by Bob Carroll, Pete Palmer, and John Thorn introduced a straight-line model of valuing field position in two-point increments.
In other words, starting from your own goal line (or 100 yards-to-goal) is worth -2 expected points, possession at your own 25-yard line is worth 0 expected points, and possession at midfield is worth 2 points, etc. until you get to the opponent’s goal line, where your expected points value is 6. Each 25 yards gained adds two points. This is a simplified model based in principle on the Carter model above.
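As a quick sketch of the straight-line idea, the function below takes only the two endpoint values quoted above (-2 at your own goal line, 6 at the opponent's goal line) and assumes a single straight line between them. The function name is hypothetical, and this is an illustration of the arithmetic rather than the book's own presentation.

```python
OWN_GOAL_LINE_VALUE = -2.0      # expected points at 100 yards-to-goal
OPPONENT_GOAL_LINE_VALUE = 6.0  # expected points at 0 yards-to-goal


def linear_expected_points(yards_from_own_goal):
    """Straight-line field position value between the two endpoint values."""
    if not 0 <= yards_from_own_goal <= 100:
        raise ValueError("yards_from_own_goal must be between 0 and 100")
    # 8 points spread evenly over 100 yards: 0.08 points per yard.
    slope = (OPPONENT_GOAL_LINE_VALUE - OWN_GOAL_LINE_VALUE) / 100.0
    return OWN_GOAL_LINE_VALUE + slope * yards_from_own_goal


# Under this assumption the line gains two points every 25 yards,
# so it crosses zero 25 yards from your own goal line.
for yard_line in (0, 25, 50, 75, 100):
    print(yard_line, round(linear_expected_points(yard_line), 2))
```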
The figure below is a visual comparison of the two models.
We have looked at the Carter model, which only looked at field position for a specific down and distance situation (namely, 1st-and-10) and modeled expected points as the average of the points scored in each of the 10 field position bins.
We additionally covered a model that ignored down and distance altogether and modeled the expected points as a function of field position. Now, let’s look at some of the more recent attempts at modeling that include both down and distance, in addition to field position.
Down-Distance-Field Position models
Belur V. Dasarathy wrote a paper in 1991 that used the three factors above with a nearest-neighbor approach, averaging the next points scored. I have not been able to get my hands on it.
Some time in 2009 or earlier, Brian Burke posited that every down-distance-field position combination had an average net point advantage. He built a model demonstrating this with a smoothing technique called locally estimated scatterplot smoothing (LOESS), but provided little other detail.
You can read more about his model at the link above, but among the noted limitations of the model was that it only included first and third quarter data, since he correctly noted that end-of-half drive plays were a factor that needed to be accounted for.
His other consideration was that he limited the data to plays where the score differential was within 10 points to limit the effect of garbage time.
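For readers unfamiliar with LOESS, here is a minimal sketch of the core idea: at each point of interest, fit a weighted straight line to the nearest observations, with weights that fall off with distance. This is a generic textbook-style version using tricube weights, not Burke's implementation (whose details were not published); the function name and the toy data are hypothetical.

```python
def loess_point(x0, xs, ys, k):
    """Locally weighted linear fit at x0 using the k nearest points."""
    if not 3 <= k <= len(xs):
        raise ValueError("k must be between 3 and the number of points")
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    bandwidth = abs(nearest[-1][0] - x0)  # distance to the k-th nearest point
    plain_mean = sum(y for _, y in nearest) / k
    if bandwidth == 0:
        return plain_mean
    # Tricube weights: near 1 close to x0, falling to 0 at the bandwidth.
    weights = [(1 - (abs(x - x0) / bandwidth) ** 3) ** 3 for x, _ in nearest]
    w_sum = sum(weights)
    if w_sum == 0:
        return plain_mean
    x_bar = sum(w * x for w, (x, _) in zip(weights, nearest)) / w_sum
    y_bar = sum(w * y for w, (_, y) in zip(weights, nearest)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, (x, _) in zip(weights, nearest))
    if sxx == 0:
        return y_bar
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, (x, y) in zip(weights, nearest))
    return y_bar + (sxy / sxx) * (x0 - x_bar)


# Hypothetical toy data lying exactly on the line y = 2x + 1, chosen so
# the smoother's output is easy to check by hand (not football data).
toy_x = list(range(0, 101, 10))
toy_y = [2 * x + 1 for x in toy_x]
smoothed = loess_point(35, toy_x, toy_y, k=5)
```

On real play data the inputs would be something like yards-to-goal and the next score value, filtered the way Burke describes, and the smoother would be evaluated at every yard line for each down and distance.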
Trey Causey did a quality write-up on a very similar style model and has done some interesting nearest-neighbors player similarity modeling that is worth reading and, at the time, was reproducible.
So all of these models are curve-fitting attempts, which leads us to the topic of regression. Let us survey different types of regression before continuing.
Game-State Situational Probability models
Alok Pattani’s 2012 writeup on ESPN’s NFL Expected Points model, which additionally included home-field advantage and time remaining, left out most of the detail needed to reproduce it.
However, in subsequent years, more details regarding the technical specifics of ESPN’s Bayesian hierarchical statistical expected points model were discussed in an article on the PlayStation Player Impact Rating.
Additionally, Kenneth Goldner wrote a paper in 2017 about modeling a possession as a Markov process, which I had been unable to get my hands on. EDIT: A very kind Columbia professor saw this article and sent me the paper. Among other things, the paper demonstrates that on 1st and 10 at any point on the field, the expected points value is positive.
In 2017, Ron Yurko, Sam Ventura, and Max Horowitz from the Carnegie Mellon University Statistics department started presenting “NFL Player Evaluation Using Expected Points Added with nflscrapR” at conferences and ultimately published the nflWAR paper in the Journal of Quantitative Analysis in Sports.
(That is a seminal work in football analytics, and if you were to click only one link to read further, that is the paper you should look at.)
The current cfbscrapR model provides expected points using the same model form, fit to college football data, with very similar adjustments.
Time to talk about the cfbscrapR model, finally!
Recall the following figure from part 2.
The college football expected points (EP) model is a multinomial logistic regression model which generates probabilities for our dependent (target) variable, the possible types of next score events within the same half.
In our case, we build 6 logistic regression models, one fit to each of the next score types: Offense FG, Offense TD, Offense Safety, Opponent TD, Opponent FG, and Opponent Safety. That is every class except the one used as the base case (i.e. No Score), since that is accounted for in the intercept, as mentioned in part 1.
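To make the base-case idea concrete, here is a minimal sketch of how six linear predictors turn into seven probabilities and then into a single expected points number. It is not the cfbscrapR code: the function names are hypothetical, and the point values (7 for a touchdown, 3 for a field goal, 2 for a safety, negative for the opponent) are an assumption made for illustration rather than something stated in this article.

```python
import math

# Assumed point values per next score type (illustrative assumption).
POINT_VALUES = {
    "No Score": 0.0,
    "Offense TD": 7.0,
    "Offense FG": 3.0,
    "Offense Safety": 2.0,
    "Opponent TD": -7.0,
    "Opponent FG": -3.0,
    "Opponent Safety": -2.0,
}


def next_score_probabilities(linear_predictors):
    """Convert six linear predictors into seven class probabilities.

    linear_predictors: dict mapping each non-base score type to its
    intercept-plus-coefficients sum for one play. The base case,
    No Score, has its linear predictor fixed at 0.
    """
    exps = {name: math.exp(eta) for name, eta in linear_predictors.items()}
    denominator = 1.0 + sum(exps.values())  # the 1.0 is exp(0) for No Score
    probabilities = {name: value / denominator for name, value in exps.items()}
    probabilities["No Score"] = 1.0 / denominator
    return probabilities


def expected_points(probabilities):
    """Probability-weighted sum of the assumed point values."""
    return sum(p * POINT_VALUES[name] for name, p in probabilities.items())


# With every linear predictor at 0, all seven outcomes are equally likely.
flat = {name: 0.0 for name in POINT_VALUES if name != "No Score"}
flat_probs = next_score_probabilities(flat)
flat_ep = expected_points(flat_probs)
```

The seven probabilities always sum to 1, which is why only six sets of coefficients need to be estimated.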
Since down is a categorical variable, it must be treated as a factor variable, meaning coefficients of the model are given relative to the base case of 1st down.
Thus, we have 2nd down, 3rd down, and 4th down coefficients, with the same reasoning applying to all other interaction variables which include down.
The independent variables are listed below, with the number of coefficients fitted for each given in parentheses:
- Down (3)
- Distance (to convert 1st down or goal), transformed as log(yards to convert 1st/Goal) (1)
- Yards-to-goal (or field position, or yards to opponent goal line) (1)
- Time remaining in half in seconds (1)
From these independent variables, we derive two other boolean variables:
- Goal-to-Go (1)
- Under two minutes in half (1)
We also include several interaction terms between our independent and derived variables:
- Interaction between distance as log(yards to convert 1st/Goal) and goal-to-go indicator (1)
- Interaction between distance as log(yards to convert 1st/Goal) and down (3)
- Interaction between yards from opponent’s end zone and down (3)
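The variable list above can be sketched as a function that turns one play into the row of numbers the model sees. This is a hypothetical illustration, not the cfbscrapR implementation: the function name, the column order, and the exact rules used for the two derived indicators (goal-to-go when the distance to convert reaches the goal line, and under two minutes as fewer than 120 seconds left in the half) are assumptions.

```python
import math


def model_row(down, distance, yards_to_goal, seconds_left_in_half):
    """Build the 15 model terms for one play (the intercept is separate)."""
    if down not in (1, 2, 3, 4):
        raise ValueError("down must be 1, 2, 3, or 4")
    if distance <= 0:
        raise ValueError("distance must be positive to take its log")

    log_distance = math.log(distance)
    # Down is a factor with 1st down as the base case: 3 dummy columns.
    down_dummies = [1.0 if down == d else 0.0 for d in (2, 3, 4)]
    # Assumed derivations of the two boolean variables.
    goal_to_go = 1.0 if distance >= yards_to_goal else 0.0
    under_two_minutes = 1.0 if seconds_left_in_half < 120 else 0.0

    row = list(down_dummies)                              # Down (3)
    row.append(log_distance)                              # Distance (1)
    row.append(float(yards_to_goal))                      # Yards-to-goal (1)
    row.append(float(seconds_left_in_half))               # Time remaining (1)
    row.append(goal_to_go)                                # Goal-to-Go (1)
    row.append(under_two_minutes)                         # Under two minutes (1)
    row.append(log_distance * goal_to_go)                 # log distance x goal-to-go (1)
    row.extend(log_distance * d for d in down_dummies)    # log distance x down (3)
    row.extend(yards_to_goal * d for d in down_dummies)   # yards-to-goal x down (3)
    return row


# Example: 3rd-and-goal from the 8 with 100 seconds left in the half.
example_row = model_row(down=3, distance=8, yards_to_goal=8, seconds_left_in_half=100)
```

Counting the terms gives 15 coefficients, so with the intercept each of the six next score types has 16 fitted parameters under this layout.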
The model outputs the probability predictions of each next score type for every play before the play. Then, using the play outcome to update our four main independent variables, it outputs the probability predictions for all 7 types after the play.
Next time, we will cover the underlying next score type probabilities and their influence on the change in expected points as the offense progresses down the field.