With Miami's victory over UNC, there has been more talk of a possible rematch between FSU and Miami in the ACCCG. This invariably leads to comments about how difficult it is for a team to beat the same opponent twice in one season. Usually these comments are propped up with nebulous statements like "everyone knows" or "statistics show".
These statements have been repeated so many times that they have passed into conventional wisdom. After all, when was the last time you saw a team beat another team twice in one season?
Most of the debate, if any, around the conventional theory seems to involve discussions of why this should be. Some of the reasons given seem sensible:
- The losing team has time to shore up any weaknesses that were exposed in the first game.
- A team learns more from a loss than a win.
- The losing team is more motivated than the winning team.
The list goes on from there. Most of these explanations seem plausible or even likely to me, but only if the original premise is actually sound.
The more I thought about it, however, the less sense the original premise made to me. Sure, in some games the motivation from a previous loss or adjustments made by the previous loser might sway the outcome, but on average, aren't most games simply won by the better team? In which case, wouldn't the better team be more likely to win a second game (again, on average)?
And what about margin of victory? If the winner absolutely destroyed the loser in the first game, wouldn't you expect them to win the second? Besides, whenever someone says "statistics show" without citing a source I tend to get suspicious.
So I decided to look for myself.
The Analysis
James Howell has compiled a comprehensive database of game scores going back to the 19th century. I used his data for this analysis. You can find his data here. I have not verified the accuracy of this data, but a quick spot check of some of last year's scores showed them to be correct.
I am far too lazy to sort this data by hand, so I wrote a short Ruby script to parse the data files and count the repeat matchups and the corresponding double victories. I'll post the code for this script at the end of this writeup for anyone who might want to repeat the calculation.
I first analyzed the games from the 2000 season through the 2009 season for various margins of victory. The results are given in the following table.
2000 - 2009
Margin of Victory | # of rematches | # of repeat wins | % |
---|---|---|---|
1 pt or more | 24 | 14 | 58% |
3 pts or more | 24 | 14 | 58% |
7 pts or more | 18 | 12 | 67% |
14 pts or more | 10 | 8 | 80% |
21 pts or more | 8 | 7 | 88% |
28 pts or more | 2 | 1 | 50% |
The first column indicates the range of the margin of victory. The second column gives the number of rematches involving that original margin of victory. The third column is the number of times the original winner won the rematch. The fourth column is the percentage of second victories based on the previous two columns.
So, from the first row we see that from 2000 to 2009 there were 24 rematches in which the original margin of victory was at least one point, i.e., the first game was not a tie. Of those 24 rematches, the original winner won the second game 14 times, or 58% of the time.
As the margin of victory goes up, the likelihood of the original winner winning the second game goes up. For instance, when the original winner won by two touchdowns or more, they won the second game 80% of the time. The exception to this is when the original margin of victory was four touchdowns or more (the sixth row). Here the original winner won the second game only 50% of the time. I think the explanation for this dip is straightforward: there were only two such rematches from 2000 to 2009, which is far too small a sample to draw any conclusion from.
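To get a feel for how little two rematches can tell us, here is a minimal sketch that puts a rough 95% interval around a few rows of the 2000 - 2009 table. It assumes a Wilson score interval is a reasonable yardstick for samples this small; the `wilson_interval` helper is hypothetical and is not part of the script used to build the tables.

```ruby
# Wilson score interval for a proportion (z = 1.96 gives roughly 95%).
# wins and games are counts copied from the 2000 - 2009 table.
def wilson_interval(wins, games, z = 1.96)
  p_hat = wins.to_f / games
  denom = 1.0 + z**2 / games
  center = (p_hat + z**2 / (2.0 * games)) / denom
  half_width = z * Math.sqrt(p_hat * (1.0 - p_hat) / games +
                             z**2 / (4.0 * games**2)) / denom
  [center - half_width, center + half_width]
end

rows = [
  ["14 pts or more", 8, 10],
  ["21 pts or more", 7, 8],
  ["28 pts or more", 1, 2]
]

rows.each do |label, wins, games|
  low, high = wilson_interval(wins, games)
  puts format("%-15s %d of %2d  interval: %.0f%% to %.0f%%",
              label, wins, games, low * 100, high * 100)
end
```

For the 28-point row the interval runs from roughly 9% to 91%, which is another way of saying that one win in two tries is compatible with almost any underlying rate.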
Extending the analysis to include games from 1980 to 2009 yields the following:
1980 - 2009
Margin of Victory | # of rematches | # of repeat wins | % |
---|---|---|---|
1 pt or more | 35 | 21 | 60% |
3 pts or more | 34 | 20 | 59% |
7 pts or more | 24 | 16 | 67% |
14 pts or more | 14 | 11 | 79% |
21 pts or more | 8 | 7 | 88% |
28 pts or more | 2 | 1 | 50% |
The results are remarkably consistent with the results from 2000 to 2009, although, admittedly, the last two victory margins involve exactly the same games, so they should be the same.
Extending the analysis all the way from 1900 to 2009 yields the following:
1900 - 2009
Margin of Victory | # of rematches | # of repeat wins | % |
---|---|---|---|
1 pt or more | 226 | 155 | 69% |
3 pts or more | 218 | 148 | 68% |
7 pts or more | 173 | 129 | 75% |
14 pts or more | 101 | 85 | 84% |
21 pts or more | 67 | 62 | 93% |
28 pts or more | 40 | 37 | 93% |
Using this extended date range we see even stronger evidence to dispel the myth. In rematch games where the original game did not end in a tie, i.e., the margin of victory was 1 point or greater, the original winner won the second game 69% of the time. When the margin of victory was four touchdowns or more, the original winner won 93% of the time.
The evidence seems pretty clear to me. I think it comes down to this: more often than not, the better team wins.
So why does the myth persist? For one thing, I think it is because at first blush it seems plausible. For another, I think rematches are so rare that people tend to confuse the rarity of even playing the same team twice in one season with the likelihood of the first winner winning again. Most importantly, though, I think it comes down to the fact that people are generally really bad at understanding conditional probability. People tend to confuse the prior likelihood of a team winning both games with the conditional probability of a team winning the second game once they have won the first.
The prior likelihood of a team winning twice against the same opponent in a season is small. After all, every time one team won both games, the other team lost both. Once a team wins the first game, however, it has won the second 69% of the time. The historical evidence is pretty compelling.
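The difference between the two probabilities can be made concrete with the counts from the first row of the 1900 - 2009 table. This is only a sketch: it assumes that "prior" means picking one of the two teams before either game is played, and it ignores the series that were dropped because the first game was a tie.

```ruby
# Counts from the 1900 - 2009 table, "1 pt or more" row.
rematches   = 226  # series in which the first game was not a tie
repeat_wins = 155  # rematches won by the original winner

# Conditional: given that a team won the first game,
# how often did it also win the second?
conditional = repeat_wins.to_f / rematches

# Prior: each series involves two teams and at most one of them can win
# both games, so a team picked before kickoff sweeps half as often.
prior_sweep = repeat_wins.to_f / (2 * rematches)

puts format("P(win 2nd | won 1st)      = %.0f%%", conditional * 100)
puts format("P(a given team wins both) = %.0f%%", prior_sweep * 100)
```

With these counts the conditional figure comes out at 69% and the prior at 34%, so the same data can sound discouraging or encouraging depending on which question is being asked.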
The Code
I have tested my analysis on a small sample file, but I have not done rigorous testing. It's quite possible I have completely screwed up somewhere, so please feel free to give me some peer review here. To make this easier, the script I used is given below. The data files must be downloaded from the original site. I ran the analysis on a MacBook Pro, but any Unix variant with Ruby installed should work. I'm not sure about Windows machines.
```ruby
#!/usr/local/bin/ruby

# margins of victory (in points) to report on, in ascending order
thresholds = [1, 3, 7, 14, 21, 28]
# total number of rematches for various margins of victory
rematch_count_by_threshold = {1=>0, 3=>0, 7=>0, 14=>0, 21=>0, 28=>0}
# total number of rematches won by the original winner for various margins of victory
double_winner_count_by_threshold = {1=>0, 3=>0, 7=>0, 14=>0, 21=>0, 28=>0}

# iterate over all the data files - one file per year
Dir["./*.txt"].each do |file|
  year = file[4,4]
  # map of games played by each team in the given year
  team_games = {}
  # open the data file and get its data
  IO.foreach(file) do |line|
    teamA = line[11,28].strip
    teamB = line[43,28].strip
    teamAScore = line[39,2].strip.to_f
    teamBScore = line[71,2].strip.to_f
    if team_games[teamA].nil?
      team_games[teamA] = {teamB=>(teamAScore-teamBScore)}
    elsif team_games[teamA][teamB].nil?
      team_games[teamA][teamB] = teamAScore - teamBScore
    else
      # found a repeated game
      # uncomment the following line for verbose information about rematches
      #puts "#{teamA} played #{teamB} twice in #{year}"
      original_victory_margin = team_games[teamA][teamB]
      # the original winner won again if both games went the same way
      same_winner = (teamAScore > teamBScore && original_victory_margin > 0) ||
                    (teamAScore < teamBScore && original_victory_margin < 0)
      # count the rematch under every threshold the original margin reaches
      thresholds.each do |threshold|
        next if original_victory_margin.abs < threshold
        rematch_count_by_threshold[threshold] += 1
        double_winner_count_by_threshold[threshold] += 1 if same_winner
      end
    end
    # insert entries for team B - no need to count the rematches here as they will have already been detected above
    if team_games[teamB].nil?
      team_games[teamB] = {teamA=>(teamBScore-teamAScore)}
    elsif team_games[teamB][teamA].nil?
      team_games[teamB][teamA] = teamBScore - teamAScore
    end
  end
end

# report the counts and the percentage for each margin of victory
thresholds.each do |threshold|
  wins = double_winner_count_by_threshold[threshold]
  rematches = rematch_count_by_threshold[threshold]
  puts "The original winner won in #{wins} of #{rematches} rematches where the original victory was by #{threshold} #{threshold == 1 ? 'point' : 'points'} or greater (#{wins.to_f / rematches.to_f * 100.0} %)."
end
```
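For anyone who wants to build a small sample file to check the script against, the sketch below constructs one fixed-width line and reads it back with the same column offsets the script uses (team A at 11, its score at 39, team B at 43, its score at 71). The `sample_line` and `parse_line` helpers, the date format, and the teams are all made up for illustration; the real data files may contain details this layout does not capture.

```ruby
# Build a hypothetical fixed-width line matching the offsets the script reads:
# columns 0-10 date, 11-38 team A, 39-40 score A, 43-70 team B, 71-72 score B.
def sample_line(date, team_a, score_a, team_b, score_b)
  date.ljust(11) + team_a.ljust(28) + score_a.to_s.rjust(2) + "  " +
    team_b.ljust(28) + score_b.to_s.rjust(2)
end

# Read the fields back using the same offsets as the analysis script.
def parse_line(line)
  {
    team_a:  line[11, 28].strip,
    score_a: line[39, 2].strip.to_i,
    team_b:  line[43, 28].strip,
    score_b: line[71, 2].strip.to_i
  }
end

# Two made-up games between the same teams, listed in opposite order.
first  = sample_line("09/01/2000", "Team X", 24, "Team Y", 7)
second = sample_line("12/01/2000", "Team Y", 10, "Team X", 31)

game_one = parse_line(first)
game_two = parse_line(second)
puts game_one.inspect
puts game_two.inspect
```

Writing a handful of such lines to a `.txt` file in the working directory gives a case where the expected counts can be worked out by hand and compared with what the script prints.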