Introduction to Statistics: Football Edition

The past week has taught me that some of you out there still aren't seeing why predicting Florida State's record for the upcoming season should be done using percentages and not a simple win/lose proposition.  It's frustrated me so much that, just for today, I have decided to take off the jester hat and go back to my roots, which are actually in statistics.

When I'm teaching an Intro to Stats class and a significant proportion of the class fails a test, I often wonder if it's because I haven't done a good enough job explaining the concept.  So, using that same strategy, today I thought we should take a look at why we use the percentage model and why it is statistically the most rational method.  We're going to do that by introducing a few of the basic concepts that you'll find in any introductory level stats course and show how these concepts can be applied to the prediction of football outcomes.


Random Variables

A random variable is any variable whose value is determined by the outcome of a statistical experiment.  Random variables are highly important in the study of probability, stochastic processes, and game theory.  The values of a random variable are random, but they follow a known statistical distribution.  In other words, you know the exact probability of getting each individual value of the random variable.  It is important to note that the probabilities of all possible values of a given random variable sum to 1.

Random variables fall into two major categories: continuous and discrete.  Continuous random variables are variables that can take an uncountably infinite number of values.  Temperature on a given day can be considered a continuous random variable, because if we had a thermometer that was accurate enough we could show that the temperature was 75.0 degrees, or 75.01 degrees, or 75.0000000000000001 degrees; there are an infinite number of possibilities.  A discrete random variable can only take specific, separate values on the number scale.  The number of people in a classroom can be considered a discrete random variable, as there can only be 1, 2, 3, 4, ... people (fractions are not possible).  We will focus our attention on discrete random variables from here on.

The most basic discrete random variable model is the process of flipping a coin.  Let X represent the process of flipping a coin one time.  X can then be summarized by:

X = 1 if the coin lands heads, X = 0 if the coin lands tails

If we then know the probabilities associated with each outcome, we can give the probability mass function (pmf) of the random variable X.

P(X = 0) = 1/2, P(X = 1) = 1/2

The above pmf shows what we all know.  There is a 1/2 chance of getting tails when flipping a coin and a 1/2 chance of getting heads.  For your own reference, a discrete random variable with 2 distinct outcomes and known probabilities is said to follow a Bernoulli distribution.  Generally one outcome is considered a "success" and the other a "failure."  The Bernoulli distribution requires a known probability of a success, usually called p.  Since the probabilities of all possible values of a random variable sum to 1, the probability of a failure is given by (1-p).

Expected Value

The expected value of a random variable is formally the integral of the random variable with respect to its probability measure.  What this means for our discrete random variable is that the expected value is a weighted average of the possible values, with each value weighted by its probability.

Consider a single fair die.  If the die is rolled one time, the random variable can take the values 1, 2, 3, 4, 5, 6, each with probability 1/6 (one for each of the 6 sides).  Therefore, the expected value of this random variable is given by

E[X] = 1*(1/6) + 2*(1/6) + 3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6) = 21/6 = 3.5

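If you roll a fair die one time, you should expect to get around 3.5.  You can never actually roll a 3.5, of course, but over many rolls the average will settle around that number.  Sometimes you'll get higher, sometimes lower, but the baseline value you should expect to get is 3.5.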
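
For anyone who would rather see the weighted average worked out in code, here is a minimal Python sketch of the die calculation.  The function name expected_value and the dictionary layout (value mapped to probability) are just my own choices for illustration, not anything standard.

```python
from fractions import Fraction


def expected_value(pmf):
    """Weighted average of the values, each weighted by its probability.

    `pmf` is a dict mapping each possible value to its probability
    (a hypothetical layout chosen for this illustration).
    """
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(value * prob for value, prob in pmf.items())


# Fair six-sided die: each face 1..6 has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = expected_value(die_pmf)

print(die_mean, float(die_mean))  # 7/2 3.5
```

Exact fractions are used here so the check that the probabilities sum to 1 is not thrown off by floating-point rounding.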

Applying this same process to our Bernoulli example earlier, one flip of a coin results in a zero 50% of the time and a one 50% of the time.  The expected value of this random variable can then be given as 0*.5+1*.5=.5.  Your expectation before flipping the coin should be of getting a value of .5  (heads half the time, tails half the time).

Linear Property of Expected Values

One extremely important property of expectations is that they are linear, meaning, basically, that

E[X + Y] = E[X] + E[Y]

In words, the expectation of the sum of two random variables is equal to the sum of the expectations of the random variables.  This holds whether or not the two random variables are independent of each other.

Therefore, if we flipped a coin twice and wanted to find the expectation of the combined values, we could find the expected value of each trial (or flip) and add them together to get the overall expectation.  (Hmm, this does seem similar to the best method of predicting football games: taking the percentages and adding them up.)  Each individual trial has an expected value of .5, so their combined expectation would be equal to 1.
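
If you don't want to take the two-flip result on faith, a short Python sketch can check it by brute force.  It lists every pair of outcomes for two fair flips, coded as 0 and 1, and compares the expectation of the sum against the sum of the expectations.  The variable names are my own, and the enumeration assumes the two flips are independent so that the joint probability is just the product.

```python
from fractions import Fraction
from itertools import product

# One fair coin flip, coded as 0 or 1, each with probability 1/2.
coin_pmf = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Expected value of a single flip: 0*(1/2) + 1*(1/2).
single_flip_mean = sum(value * prob for value, prob in coin_pmf.items())

# Expected value of the sum of two flips, found by enumerating all four
# (first flip, second flip) pairs. Independence is assumed here only to
# get the joint probability as a product.
combined_mean = sum(
    (x + y) * px * py
    for (x, px), (y, py) in product(coin_pmf.items(), repeat=2)
)

print(single_flip_mean, combined_mean)  # 1/2 1
```

The long way (enumerating pairs) and the short way (adding .5 and .5) land on the same number, which is the whole point of linearity.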

Continue reading and we will apply all of these techniques to the process of predicting the outcomes of football games.

For this section, we're going to go back to the Bernoulli distribution one more time.  Since the addition of overtime to the college game in 1996, there are only two possible outcomes to a football game: a win or a loss.  However, before the game has happened we do not have the information to say which outcome will actually occur.  What we can generally do is judge which team should be favored to win, and by roughly how much.  In this regard, we can estimate the probability that Team A wins the game.

On TNation, we have done this using the proportional win shares model used by Las Vegas.  An ambitious statistics fan could potentially estimate these probabilities using a logistic regression model of off-season factors that lead to wins in the regular season or through some other technique. We're not asking you to do that.  Regardless, we can find a probability of Team A's success in a given game.

Therefore, the outcome of any single game can be considered a Bernoulli random variable.  We have two potential outcomes: Win (which we will use as our "success") and Lose (failure).

For the sake of simplicity of calculation, successes will be coded as 1 and failures will be coded as 0.  Our estimated probability that Team A wins the game is our value for p.  How likely is FSU to win that game?  That number is "p."

Therefore, the expected value for any single game is given by: 1*p+0*(1-p)=p.  In other words, for this particular game, we can expect to get p wins.

Assume we have 12 of these random variables (G1, ..., G12), one for each game of the football season.  Each variable has the win/lose (1/0) possible outcomes and its own probability of winning.  In order to find out how many wins a team should expect in the given season, we should find the expectation of G1 + G2 + ... + G12.

Due to the linearity of expectations we discussed before, this equates to finding:

E[G1 + G2 + ... + G12] = E[G1] + E[G2] + ... + E[G12]

We have already established that the expected value (expectation) of each individual game is the p of that specific game.  Therefore, if we want to find the expected number of wins for the entire season we should simply add up each of the 12 p's we have chosen.
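
In code, the whole season projection boils down to one sum.  Here is a minimal Python sketch.  The 12 probabilities are made-up numbers purely for illustration (they are not anyone's actual projections for any team), and expected_wins is just a name I picked.

```python
def expected_wins(win_probabilities):
    """Expected season win total: the sum of the per-game win probabilities."""
    for p in win_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a valid probability: {p}")
    return sum(win_probabilities)


# Hypothetical p for each of 12 games (invented for this example only).
hypothetical_schedule = [
    0.95, 0.80, 0.55, 0.70, 0.40, 0.65,
    0.90, 0.50, 0.75, 0.60, 0.85, 0.45,
]

season_expectation = expected_wins(hypothetical_schedule)
print(round(season_expectation, 2))  # 8.1
```

Notice that the answer does not have to be a whole number.  An expectation of 8.1 wins does not mean the team will win exactly 8.1 games; it is the long-run average you would see if this same season could be played over and over.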

Estimation of the correct probabilities of each game remains an open question.  At Tomahawk Nation, we use a fairly unscientific approach that calls for each member to project what they think the line of the game will be, in a method loosely related to the proportional win shares technique.  There are several other options that are open for debate as to how to accurately estimate Team A's chances of winning the game.  What cannot be mathematically debated, however, is that this technique is superior to the simple "We're better than Team B so that's 1 win, we're better than Team C so that's 2 wins, etc." method that many of you still inexplicably use.  That technique just isn't realistic and, given the better method, amounts to ignorance or willful blindness.  Vegas definitely appreciates you using that method when you throw down wagers on the "season win totals" bets they offer.
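
To see how far apart the two methods can land, consider a rough Python sketch with an invented team that is a modest favorite (p = 0.6, a number I made up) in all 12 games.  The function names are my own, and counting every game with p above .5 as a win is my reading of the "we're better, so that's a win" approach.

```python
def naive_win_count(win_probabilities):
    """The 'we're better, so that's a win' method: count every game where p > 0.5."""
    return sum(1 for p in win_probabilities if p > 0.5)


def expected_wins(win_probabilities):
    """The percentage method: add up the per-game win probabilities."""
    return sum(win_probabilities)


# Hypothetical team favored at 60% in every one of its 12 games.
slightly_favored = [0.6] * 12

naive_total = naive_win_count(slightly_favored)
expected_total = expected_wins(slightly_favored)

print(naive_total, round(expected_total, 1))  # 12 7.2
```

The win/lose method calls this team undefeated, while the percentage method says to expect about 7 wins.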

Hopefully this quick tutorial explains the statistical justification for this approach.  I hope this clears up any questions you might have had regarding why it is we are so set on using the percentage technique.