lmjohns3, 5 April 2013

It’s March Madness! I was looking at the brackets for the NCAA men’s basketball tournament the other day, marveling at the number of upsets in this year’s tournament. Then I remembered: wait, aren’t there a lot of upset victories every year in the tournament? Particularly in the early rounds?

Seriously, who hasn’t wondered about that before? Well, I thought I’d chip in with a graph to help visualize the madness around us, just in time for the looming final four weekend. What I specifically wanted to look at was: how does your team’s seed entering the tournament impact your course through the tournament? If you’re seeded number 1, is that a certain victory? Contrariwise, if you’re seeded 16, are you guaranteed to crash and burn?

I poked around a little bit on the interwebs and found beautiful tables of win/loss statistics for the past 25 years of NCAA men’s basketball tournament history. Thank you, Guru! I reformatted the data from the tables into a plain text format using my lovely text editor and got to work graphing.

I tried a few scatter plots with just the raw data to get a feel for how things looked. In my mind, the general idea for the plot was to create a 16-by-16 scatter plot, with each cell containing some percentage value indicating the likelihood that the seed along the horizontal axis would win that matchup. After some twiddling (see below), I got what I think is a pretty decent result.

In this plot, the shape of the marker indicates the round of the tournament, the size is proportional to the log of the number of times such a matchup occurred, and the color indicates the percentage chance of a win for that matchup. The first thing to notice is that the data have a built-in symmetry: if a 3-seed team wins against an 8-seed team, you could equally well say that an 8-seed team has lost against a 3-seed team. So the raw data only fill up half of the available space on the plot. To fill out the entire plot, I subtracted the win percentage from 100 to get something like the “loss percentage” for the other half of the plot.
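That mirroring step takes only a few lines of numpy. This is a sketch, not the original plotting script; the array name, the 16-by-16 shape, and the sample value are mine:

```python
import numpy as np

# Hypothetical 16x16 matrix: pct[i, j] holds the percentage of games the
# (i+1)-seed won against the (j+1)-seed; NaN marks the unfilled half.
pct = np.full((16, 16), np.nan)
pct[2, 7] = 60.0  # made-up value: say the 3-seed beat the 8-seed 60% of the time

# Mirror across the diagonal: one side's win percentage is 100 minus the other's.
i, j = np.nonzero(~np.isnan(pct))
pct[j, i] = 100.0 - pct[i, j]
```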

The next thing that I noticed was that not all matchups were even possible in the tournament. This is particularly true in the first round, which always pits (in each of the four subtrees of the bracket) the 1-seed team against the 16-seed, the 2-seed against the 15-seed, and so on. So, in the first round, there are a lot of games played, but in terms of seed matchups they align exactly along the diagonal of the graph. A similar pattern emerges for the second-round games: they are mostly along the “reflected secondary antidiagonal” (whatever that might mean; I bet there’s a math term for this, but I sure don’t know it) of the scatter plot, meaning that the 1-seed team often plays the 8-seed team, but sometimes plays the 9-seed team. The 2-seed team will tend to play the 7- or 10-seed, etc. Interestingly, this pattern falls off precipitously after the second round. Either that’s because of upsets specifically, or because by the third round enough probability has come into play that there are almost no specific guarantees about who will be playing whom.
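The bracket structure behind those diagonals is easy to write down. A sketch (the variable names are mine): first-round pairings always sum to 17, and if every favorite wins, the second round pairs seeds that sum to 9:

```python
# First-round pairings in each regional bracket: seed s always meets seed 17 - s.
first_round = [(s, 17 - s) for s in range(1, 9)]

# If every favorite wins round one, the second round pairs seeds summing to 9.
second_round_chalk = [(s, 9 - s) for s in range(1, 5)]
```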

Another thing the graph makes clear is that you’ve got a much better chance of winning if you’re a top-8 seed, especially if you’re playing against a bottom-8 seed (i.e., the top-left quadrant of the graph). With some notable exceptions (1-seed teams seem to lose often against 11-seed teams in the fourth round!), the high win percentages are held by teams at the top, particularly in the top 4 seeds.

Finally, and perhaps implied by the previous point, the odds of winning the entire tournament are way better if you’re a top-seeded team than if you’re a bottom-seeded team. There are some tiny stars (final four or championship games) on the graph in places like the 5-seed vs 7-seed, but more often than not, you’re going to see a 1- or 2-seed team end up at the top of the tournament bracket. And, perhaps unsurprisingly, but also anticlimactically, the most likely matchup at the end of the tournament is 1-seed vs 1-seed.

## Smoothing

But let’s go back to the beginning of the tournament, since that’s where the upsets happen. In the fullness of time, enough games have been played in the first round of the tournament to give us a really nice, smooth gradient of win percentages as you go down the seeds. Historically, the team that’s seeded number 3, for example, has an 86% chance of winning their first-round game against the 14-seed team. That is, over the past 25 years of tournament history, 112 3-seed vs 14-seed games have been played in the first round of the tournament, and 96 of those games were won by the 3-seed team. The chance of winning decreases all the way down to the 8-seed vs 9-seed first-round game, which is basically a toss-up: there’s a 48% chance that the 8-seed will win its first-round game.
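As a quick sanity check on that 86% figure, the unsmoothed percentage is just wins over games played:

```python
# 3-seed vs 14-seed in the first round: 96 wins in 112 games.
wins, games = 96, 112
pct = 100.0 * wins / games
# about 85.7%, which rounds to the 86% quoted above
```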

Now, that’s a fine and dandy way of guessing who’ll win these games, but there’s a bit of a problem at the other end of the spectrum. Of the 112 first-round games played between a 1-seed and a 16-seed, never once has the 16-seed team won. If you’re a frequentist, this means that if you’re seeded number 16 in the tournament, you might as well not show up! Empirically, there’s literally zero chance that you’ll win that game.

But zero is a pretty harsh statistical sentence to level against anyone, even a basketball team. So, we need to do something to fix that – after all, a 16-seed could theoretically win its first-round game, sending shock waves (and more Madness) throughout the sporting world.

There are tons of really complicated ways of dealing with this problem, all of which amount to some sort of smoothing : replacing the actual counts in the data with slightly altered counts, based on some prior knowledge or hunch about how the data should look. Usually, smoothing in this context implies that you’d like there to be fewer zero counts in your dataset, which you usually accomplish by moving some of the probability mass from high-probability events (like the 1-seed beating the 16-seed) towards low-probability ones (like the upset).

This being a quick jaunt into graph-land, I prefer to stick with the simple solution: Laplace smoothing. We want to make sure there’s at least some nonzero chance of any matchup coming out either way, but we also want the real data to have a chance to speak up. In other words, we don’t want to suddenly claim that there’s a 50% chance the 1-seed will win against their 16-seed opponent. In Laplace smoothing, we simply add a constant value to all of our empirically collected counts. Sounds simple, right? If we’ve counted $$w$$ wins and $$\ell$$ losses, the unsmoothed probability of a win would be $$\frac{w}{w+\ell}$$. After smoothing, the probability looks like $$\frac{w+\lambda}{(w+\lambda)+(\ell+\lambda)}$$. The nice thing about this method is that you can change the value of the smoothing parameter $$\lambda$$ depending on how much you want to muck about with the original data. For this plot I picked $$\lambda=0.1$$, which is about $$1/1000$$ the size of the largest count in our dataset, so it won’t have a big impact on the chance of a 16-seed beating a 1-seed, but it still has a decent impact on matchups that have only occurred once in the dataset.
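Here’s a minimal sketch of that formula (the function name is mine, not from the original script):

```python
def smoothed_win_prob(wins, losses, lam=0.1):
    """Laplace smoothing: add lam pseudo-counts to each outcome."""
    return (wins + lam) / (wins + losses + 2 * lam)

# The 1-seed vs 16-seed record (112-0) barely budges...
p_favorite = smoothed_win_prob(112, 0)
# ...but the 16-seed's chance of an upset is now small and nonzero.
p_upset = smoothed_win_prob(0, 112)
```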

There are probably lots of other cool ways to do smoothing in a dataset like this, especially if you take the seeds into account. For instance, there has been just one 4-seed vs 6-seed game in the past 25 years of the final four, and the 6-seed team happened to win it. So, empirically, the 4-seed has a zero percent chance of winning this matchup again, if it ever comes to pass. But something seems off about that, right? It would be better if we said, “Well, it’s pretty likely that that game would be close, though the 4-seed should theoretically have a slight advantage, but the 6-seed team that’s made it that far in the tournament must actually be pretty good.” Laplace smoothing doesn’t get anywhere near that sophisticated – it just replaces the $$1$$ win for the 6-seed with $$1+\lambda$$, and the non-win for the 4-seed with $$\lambda$$, changing the chance that the 6-seed wins a similar matchup again to $$\frac{1+\lambda}{1+2\lambda}$$. With $$\lambda=0.1$$, that comes out to be 91.7%. That’s probably still pretty high, but it yields a nice graph.

## Graph details

Right, I was talking about a graph! So, after smoothing the basketball data:

```python
# column 3 holds games played (w + l) and column 4 holds wins (w),
# so smoothing adds 2 * LAMBDA to the former and LAMBDA to the latter
data = np.loadtxt(...)
data[:, 3] += 2 * LAMBDA
data[:, 4] += LAMBDA
```

I got a much better-looking plot: suddenly lots of tiny dots appeared that hadn’t been there before, because those matchups had always gone one way, even if only one such matchup had ever been played!

Similarly, I ended up adding a small amount of jitter to each of the scatter points, to prevent matchups that occurred in different rounds of the tournament (particularly rounds 4 and 5) from overlapping exactly on-center; I think it’s a little easier to spot the tiny geometric shapes if they’re a bit outside the larger shape.
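The jitter itself is a one-liner per axis. A sketch, with made-up point coordinates and a jitter width I chose purely for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the jitter is reproducible

# Made-up seed coordinates for three points, two of them sharing a cell.
x = np.array([1.0, 1.0, 8.0])
y = np.array([16.0, 16.0, 9.0])

# Nudge each point by a small uniform offset so markers sharing a cell
# (same matchup, different rounds) don't land exactly on-center.
jitter = 0.15
x_plot = x + rng.uniform(-jitter, jitter, size=x.shape)
y_plot = y + rng.uniform(-jitter, jitter, size=y.shape)
```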

Finally, I wanted a way to distribute the code for making this graph without resorting to a zip file for pairing the script with the data. There are only about 100 data points in the set, so I figured I could compress it and just include the compressed data directly in the script. Sure enough, it’s really straightforward to do that even from the command line:

```shell
python2 -c 'import sys; print repr(sys.stdin.read().encode("zlib"))' \
    < madness.txt > madness.py
```
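The same round trip can be checked from inside Python. This sketch uses the zlib module directly (so it also runs on Python 3, where the str-to-str "zlib" codec no longer exists), with a made-up line of matchup data standing in for the real file:

```python
import zlib

# A made-up line of matchup data, compressed the way the one-liner does it;
# repr() of the compressed bytes is what would get pasted into the script.
raw = b"1 16 1 112 112\n"
packed = zlib.compress(raw)
unpacked = zlib.decompress(packed)
```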

Then I wrapped the zipped string in a pseudo-file that could be loaded into the always-beautiful numpy:

```python
from StringIO import StringIO  # Python 2

data = np.loadtxt(StringIO('x\x9c\x95...?_\xd2_\xec'.decode('zlib')))
```

This makes the data easy to fit directly into the source code for the graph, rather than mucking about with a separate graph-making script and dataset.
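(On Python 3 the same embedding trick still works, but the string codec is gone; a hedged sketch of the equivalent, with a tiny made-up dataset compressed on the fly instead of the real pasted string:

```python
import io
import zlib
import numpy as np

# Stand-in for the pasted string: two made-up rows, compressed inline.
packed = zlib.compress(b"1 16 1 112 112\n8 9 1 112 54\n")

# Unpack into a pseudo-file and hand it to numpy.
data = np.loadtxt(io.StringIO(zlib.decompress(packed).decode("ascii")))
```

The rest of the plotting code wouldn’t need to change at all.)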