Boyd's World-> Filing Cabinet-> Stuff They'd Be Better Off Not Knowing About the author, Boyd Nation

Stuff They'd Be Better Off Not Knowing

Executive Summary

Each year, when the selection committee sits down to choose the thirty-four at-large teams that receive bids to the tournament (and to make its other decisions about seeding and hosting), it is given a report (officially, if playfully, called The Nitty Gritty Report) that contains a number of statistics for each team. Other than the RPI and the team's overall win-loss record, each of these stats represents some subset of the team's season -- things like road record, record in the last ten games, or record against several RPI subsets. These sub-stats and their use are well known; you hear them all the time in bracketology discussions and in committee justifications of their decisions. It turns out that these sub-stats are detrimental to the process and shouldn't be used as part of the selection criteria.

Philosophy

It's easy to understand why the NCAA support staff provides the data and why the committee wants it -- in general, more data is a good thing during a decision-making process, so if more data were available, it would be a good thing. However, what is actually provided is not more data; it's the same data cut into smaller slices. The problem is that the college season, considered as a whole, is just barely long enough to give a chance of ranking teams, and any subset of the games throws away too much information to be useful.

Before getting into the methodology and proof, it's worth a moment to ask a question that I suspect relatively few of the committee members have asked themselves: What question are they trying to answer? There are three possibilities that I can think of, so we'll run through those to see which one they act like they're working on.

Methodology and Proof

One of the side effects of my favorite ranking system, the ISR, is that there is a formula for predicting how often a team will win against a given opponent based on the gap between their ISRs. There's a predictable adjustment for home field advantage, and, in the postseason, there's a documented advantage for postseason experience. If we apply this formula to a large set of games, such as all postseason games from 1999 to 2009, we can pick a given characteristic and see whether teams with that characteristic win about as often as the formula predicts; we'll call this a comparison of actual vs. expected wins.

If something is a useful predictor of postseason success beyond the information about the team's whole season contained in a full-season ranking like the ISR, then you would expect the actual wins to consistently exceed the expected wins. This is what happened with home field advantage and postseason experience.
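
A minimal sketch of the actual vs. expected comparison might look like the following. The ISR formula itself isn't reproduced here, so `win_probability` is a hypothetical logistic stand-in with an arbitrary `scale`, and the sample games are invented purely for illustration.

```python
import math

def win_probability(rating_gap, scale=10.0):
    """Hypothetical stand-in for the ISR formula (an assumption, not the
    real thing): a logistic curve in the rating gap."""
    return 1.0 / (1.0 + math.exp(-rating_gap / scale))

def actual_vs_expected(games):
    """Each game is (rating_gap, won), seen from the side of the team that
    has the characteristic being studied. Returns (expected, actual) wins."""
    expected = sum(win_probability(gap) for gap, _ in games)
    actual = sum(1 for _, won in games if won)
    return expected, actual

# Invented sample data, not real results.
sample_games = [(5.0, True), (-3.0, False), (12.0, True), (0.0, False)]
expected, actual = actual_vs_expected(sample_games)
print(f"expected {expected:.1f} wins, actual {actual}")
```

If a characteristic carried real extra information, the actual total would sit above the expected total consistently across a large set of games, not just in a handful like this.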

As a control illustration, if you look at each letter of the alphabet to see how often the team with more of that letter wins (how often does the team with more a's win, how often does the team with more b's win, and so on), you'll find that the results vary compared to the expected wins but generally stay pretty close (almost all of the actual win numbers are within 3% of the expected wins) and are more or less split between higher and lower. In other words, it looks like random variation around a predictable pattern.
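
For what it's worth, the letter control is easy to sketch. The helper below is hypothetical (its name and the sample team names are only for illustration); it picks out which of two teams has more of a given letter, which is the "characteristic" you would then feed into an actual vs. expected comparison.

```python
def more_of_letter(team_a, team_b, letter):
    """Return the team whose name contains more of `letter`, or None on a tie."""
    count_a = team_a.lower().count(letter)
    count_b = team_b.lower().count(letter)
    if count_a == count_b:
        return None  # neither team has the characteristic; skip the game
    return team_a if count_a > count_b else team_b

print(more_of_letter("Alabama", "Texas", "a"))
```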

Now, let's look at what happens with the different factors that are presented in the Nitty Gritty Report:

Stat                      Games   Expected Wins   Expected WP   Actual Wins   Actual WP

Record                     1477           952.3         0.645           941       0.637
Non-conf record            1489           965.8         0.649           945       0.635
Conf record                1488           824.2         0.554           819       0.550
Road record                1490           898.3         0.603           876       0.588
Last 10 games              1264           700.7         0.554           697       0.551
Base RPI                   1477           952.3         0.645           941       0.637
Non-conf RPI               1485          1008.8         0.679           967       0.651
Conf RPI                   1418           747.5         0.527           735       0.518
OWP                        1487           976.3         0.657           958       0.644
OOWP                       1477           994.3         0.673           947       0.641
Non-conf OWP               1487           857.0         0.576           834       0.561
Non-conf OOWP              1468           947.9         0.646           926       0.631
Record vs RPI 1-25         1461           965.6         0.661           980       0.671
Record vs RPI 26-50        1474           985.6         0.669           944       0.640
Record vs RPI 51-100       1478           989.3         0.669           953       0.645
Record vs RPI 101-150      1474           971.3         0.659           961       0.652
Record vs RPI 1-100        1490          1018.5         0.684           996       0.668
Record vs RPI 1-150        1493          1014.9         0.680           988       0.662
Record vs RPI 151-bottom   1472           948.2         0.644           935       0.635

As you can see, virtually all of the factors presented in the Nitty Gritty Report are counterindicated as predictors of postseason success. To take one example: based on their ISRs relative to their postseason opponents over the last 11 years, you would have expected the teams with the better road record to win 898 of their 1490 games, but they won only 876, a deficit of 1.5 percentage points. The only factor to outperform its expected win mark is record vs. RPI top 25, which comes in about 1 point ahead. Most of these are relatively small negatives, but the consistency of the data is telling; these are not factors that predict postseason success and, in fact, they tend to predict postseason underachievement instead.
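
The gaps quoted above can be recomputed directly from the table. This is just a quick sketch over three of its rows; the function name `gap_in_points` is my own label, not anything official.

```python
# (stat, games, expected wins, actual wins), copied from the table above
rows = [
    ("Road record", 1490, 898.3, 876),
    ("Last 10 games", 1264, 700.7, 697),
    ("Record vs RPI 1-25", 1461, 965.6, 980),
]

def gap_in_points(games, expected, actual):
    """Actual minus expected winning percentage, in percentage points."""
    return 100.0 * (actual - expected) / games

for stat, games, expected, actual in rows:
    print(f"{stat}: {gap_in_points(games, expected, actual):+.1f}")
```

A negative number means teams with the better mark in that stat won less often than their ISRs said they should.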

The reason for that is not particularly important, given that the facts above are enough reason to stop using these factors, but it's worth attempting an explanation to make things clearer -- in almost all of these cases, what's being looked at is a minority of the season. As a simplifying case, if you have two teams that are equal over the course of the season, the fact that one team is better in a small subset (like last ten games, for example) means that that team has actually been worse over the larger remainder of the season.
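
To put made-up numbers on that simplifying case, here is a small sketch with two hypothetical teams that both finish 40-16, one of which goes 8-2 in its last ten while the other goes 5-5. The records and the `rest_of_season` helper are assumptions for illustration only.

```python
def rest_of_season(total_wins, total_losses, subset_wins, subset_losses):
    """Record outside a subset, given the full-season and subset records."""
    return total_wins - subset_wins, total_losses - subset_losses

# Two hypothetical teams with identical 40-16 seasons.
hot_finish = rest_of_season(40, 16, 8, 2)   # 8-2 in the last ten
cold_finish = rest_of_season(40, 16, 5, 5)  # 5-5 in the last ten

for label, (wins, losses) in (("hot finish", hot_finish), ("cold finish", cold_finish)):
    print(f"{label}: {wins}-{losses} in the other {wins + losses} games")
```

The team with the better last ten comes out 32-14 over the other 46 games, against 35-11 for the team with the worse last ten, so rewarding the hot finish means ignoring the bigger sample where it was the weaker team.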

Throwing away data is a bad thing. Remember that the next time you want to argue a team's case based on some small portion of their season like record vs RPI top 100 or something.
