March Madness Predictor

Can I predict who wins an NCAAB game?

Can I predict winners?  My previous brackets have led me to believe that no, I cannot.  However, I have a pretty smart computer; can that help me predict who will win a game?  Only one way to find out.

I am going to use machine learning to help predict winners.  In short, I want to feed my computer some information (like the two teams playing and some of their stats) and have it give me a predicted result (in my case, a predicted winner).  To do this, I will give my MACHINE old games and their results so that it can LEARN how those inputs lead to the result.  Then, when new games are put in, the computer can predict the outcome of those games based on how the previous games ended up.  Past performance usually gives some insight into the future, so new games should be similar to past games.

I compiled a data set of all games in March Madness since 2002.  Why 2002?  That’s what I had, and it’s when Ken Pomeroy started publishing his efficiency metrics.  Those stats are some of the most advanced basketball metrics available today, and he makes them available on his web site.  To do the machine learning, I broke this data set down into 3 populations:

  1. Games from 2002-2011
    1. This is the “training set”: the known games and results that the model learns from
  2. Games from 2012-2014
    1. This is the “test set”, and I try to “predict” these games based on their similarity to the 2002-2011 games.  I already know the results, so I can immediately compare predictions to what actually happened
  3. Games from 2015-2016
    1. I am holding these games out for now, so I can make brackets for these years

Predicting Things with KNN Analysis

Machine learning is a huge topic, and there are many different methods that I could use.  I went with KNN because it is relatively simple and I could wrap my head around it.  KNN stands for “K Nearest Neighbors”, where K is the number of neighbors.  The formal definition isn’t necessarily easy to comprehend, so I will use an example.

Let’s pretend you are a college wide receiver, and you think you have a shot to be successful in the NFL.  To see if you could do it, you will want to compare yourself to other college WRs who went pro.  By chance, you have a list of every WR who went pro, with their height and weight and whether they were successful or not (just successful or unsuccessful, not degrees of success).  You put their heights and weights on a chart, and you mark a big red cross at your own height and weight.  You think to yourself: if the guys closest to you were successful, you will probably be successful too.

[Chart: WRs’ heights and weights]

You look at this chart and compare yourself to everyone around you.  You make a list of how ‘close’ you are to everyone else, using the good old distance formula, d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}.  Sort that list to rank the players by who is the shortest distance from you, and read down it.  Looking at every player on that list isn’t very helpful, so you only look at the closest 3.  Rod Gardner and Allen Robinson have been successful, so you calculate that you have a 2/3 chance of being successful.  Congrats, you just did a simple KNN comparison, where you took the 3 “Nearest Neighbors” (K=3).
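Here is a minimal sketch of that comparison in Python.  The heights, weights, and extra player names are made up for illustration; only Rod Gardner and Allen Robinson come from the example above.

    import math

    # Hypothetical (name, height in inches, weight in lbs, successful?)
    # records -- the numbers are made up for illustration.
    players = [
        ("Allen Robinson", 75, 218, True),
        ("Rod Gardner",    74, 227, True),
        ("Some Other WR",  70, 185, False),
        ("Another WR",     77, 235, False),
    ]

    me = (75, 217)  # your height and weight

    def distance(a, b):
        """Good old distance formula between two (height, weight) points."""
        return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

    # Rank everyone by distance to you and keep the 3 nearest (K=3).
    neighbors = sorted(players, key=lambda p: distance(me, (p[1], p[2])))[:3]

    # Unweighted vote: the fraction of those 3 who were successful.
    chance = sum(p[3] for p in neighbors) / len(neighbors)
    print(f"Chance of success: {chance:.0%}")  # 2 of 3 -> 67%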

You could have chosen a different value for K, to compare yourself to more or fewer players.  K=1 is a pretty small sample size, but it may make sense in another context.  If you have a huge data set with millions of records, you could look at the closest 100 neighbors.  There is probably some rule of thumb, but I didn’t see one anywhere.  For now, let’s stick with K=3.

In the above example, I gave each of the 3 players an equal amount of influence on the result.  2 of the 3 were successful, so you have a 2/3 chance of being successful.  There is another option, where you give closer data points more weight.  Since you are closest to Allen Robinson, you may want his result to influence your prediction more than the other two.  If you choose this option, then you could say that you have a greater than 2/3 chance of being successful.
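Continuing the sketch above, a distance-weighted vote might look like this.  Weighting each neighbor by 1/distance is one common choice, not necessarily the only one.

    # Distance-weighted vote: nearer neighbors count for more.
    eps = 1e-9  # guard against dividing by zero on an exact match
    weights = [1.0 / (distance(me, (p[1], p[2])) + eps) for p in neighbors]
    weighted = sum(w for w, p in zip(weights, neighbors) if p[3]) / sum(weights)
    print(f"Weighted chance of success: {weighted:.0%}")  # above 2/3 here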

Obviously, there is a lot more to being an NFL wide receiver than just height and weight.  If you also have the 40 yard dash times for you and these players, you could compare those too.  To add that to the chart, you could set the 40 yard time as your z axis.  Again, you use the distance formula (now d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}) to find your nearest neighbors, now in three dimensions.  Everything else remains the same, even though you are comparing on another dimension.

It is difficult to visualize a chart with a 4th dimension, but that limitation doesn’t exist for computers or our distance formula.  In theory, you can continue to add more comparisons (TDs in college, college games started, number of Facebook friends…) and compute the nearest neighbors in as many dimensions as you have data points.
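In code, the distance formula generalizes the same way; a sketch:

    # The same formula in any number of dimensions: sum the squared
    # differences across every stat, then take the square root.
    def distance_nd(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # e.g. (height, weight, 40-yard time) -- hypothetical numbers
    print(distance_nd((75, 217, 4.48), (74, 227, 4.52)))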

So there you have it, that’s KNN in a nutshell.  You want to predict a result for a new situation, so you compare that situation to previous situations where you know the result.  On to basketball!

Predicting Who Will Win a Basketball Game

Comparing players is relatively straightforward, but quantifying a basketball matchup is a little trickier, since you can’t directly measure how good a basketball team is.  If you use the team’s seed as a proxy for team skill, you could just compare Team 1’s seed and Team 2’s seed and predict that the team with the lower seed will win.  Since I like to root for upsets, I know that the seed isn’t the only measurement needed to know who will win.

Again, I have all the March Madness games from 2002-2016, and some overall team stats.  I want to input two teams and some stats, and have the model predict the winner.  I will also need to pick a K value, and decide whether or not to weight the results.  I will use the actual results from 2002-2011 to predict the results of the games from 2012-2014.

I wrote a Python script to do the KNN analysis, where I input the teams from 2012-2014 and predict the outcomes based on their similarity to the games from 2002-2011.  I then calculate the accuracy rate, \text{Accuracy Rate} = \frac{\text{winners predicted correctly}}{\text{games predicted}}.  I tested 3 variables and compared the accuracy rate across them (a sketch of this sweep follows the list):

  1. List of metrics
    1. I have a bunch of metrics, but more data isn’t always better.  I went with gut instinct to make the different lists
  2. Weighting
    1. I assume it is better to weight the results, so more similar games have more of an effect on the prediction, but I want to test it
  3. K number
    1. I didn’t know what to pick, so I tested 1-13
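A rough sketch of that sweep, using scikit-learn as a stand-in for my actual script.  Here train and test are assumed to be pandas DataFrames of games with a margin column (Team 1 score minus Team 2 score), and metric_lists maps labels like “X4” to lists of stat columns; all of those names are placeholders.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Sweep the three variables: metric list, weighting, and K.
    results = {}
    for label, cols in metric_lists.items():           # lists of metrics
        for weights in ("uniform", "distance"):        # unweighted vs weighted
            for k in range(1, 14):                     # K = 1 through 13
                model = KNeighborsRegressor(n_neighbors=k, weights=weights)
                model.fit(train[cols], train["margin"])
                predicted = model.predict(test[cols])
                # A game counts as correct when the predicted margin has
                # the same sign as the actual margin (same winner).
                correct = np.sign(predicted) == np.sign(test["margin"])
                results[(label, weights, k)] = correct.mean()

    best = max(results, key=results.get)
    print(best, f"{results[best]:.2%}")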

[Chart: KNN inputs vs. accuracy rate]

To my surprise, the best I could do at predicting the winners was 70.05%.  That’s a lot better than I can do on my own.  I got this when I used 3 neighbors with unweighted comparisons, which is definitely not what I expected.  I also got this using X4 (my 4th list of metrics), which includes the seed, RPI, KenPom’s tempo, and KenPom’s efficiencies.  It was also the list with the second most metrics, but that is probably just a coincidence.

Even though I knew I was predicting 70% of games correctly, I wanted to look deeper.  Since I am predicting the final score, and I have the actual score, I wanted to see how close I was getting.  I created two calculated fields: actual margin of victory (positive for a Team 1 victory, negative for a Team 2 victory) and predicted margin of victory.  I made a scatter plot of each matchup, where the x axis is the actual margin and the y axis is the predicted margin.  Note: each game has two rows of data, so that every team appears in both the Team 1 spot and the Team 2 spot.  This chart shows only the games where Team 1 is predicted to win.

From this chart, there are some themes.  Games are clustered near the axes, meaning that many games are actually close and that I predict many games to be close.  This is why I love March Madness.  For the most part, any time I predict a big win, a big win happens.  Those are probably the 1 vs 16 games.

To control for expected big differences between teams, I pulled in the Las Vegas spreads and compared the actual and predicted margins to them.  I subtract the Vegas spread from both the actual and predicted margins, which controls for games that should be lopsided.  Also, if I can consistently predict against the spread, then I have a good model, and I can quit my job and become a professional gambler.
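The spread adjustment itself is simple; a sketch, assuming a games DataFrame with actual margin, predicted margin, and Vegas spread columns (the column names are placeholders):

    import numpy as np

    # Subtract the spread from both margins (everything is from Team 1's
    # perspective), then check whether both land on the same side.
    games["actual_vs_spread"] = games["actual_margin"] - games["spread"]
    games["pred_vs_spread"] = games["pred_margin"] - games["spread"]

    ats_correct = (np.sign(games["actual_vs_spread"])
                   == np.sign(games["pred_vs_spread"]))
    print(f"Accuracy against the spread: {ats_correct.mean():.1%}")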

I am only able to predict against the spread ~52% of the time, compared to picking a winner 70% of the time.  Again, I am only plotting the teams I predict to win.  The blue dots show when I pick against the spread correctly, whether that’s for them to cover or not.  The orange dots show when I am incorrect, like when I pick the team to cover and they fail to.

From this chart, two things are immediately obvious.  I predict my winner to cover the spread more often than not, even though they don’t actually cover very frequently.  After a quick check, I found that 70% of games’ final scores are within 2 points of the spread.  That’s madness.

I tried breaking the predictions down into ranges and into rounds in an attempt to fine-tune the model.  Unfortunately, going any deeper felt like over-fitting the data: good for one year, but it wouldn’t help me predict later games.


If I can predict the winner of a game, can I use that to make a good bracket?

In the last section, I made a model to predict the final result of a March Madness game.  In testing that model, I found that I picked games correctly 70% of the time, but I only picked correctly against the spread 52% of the time.  I made that model using games from 2002-2011, and “predicted” games from 2012-2014.  Now I will “predict” brackets from 2015 and 2016, but I will compare against all games from 2002-2014.  I am hoping that with more games, I will improve my win percentage.

If I use that model to make a bracket, how will I do?  It was really hard to find statistics on bracket scores for 2015 and 2016, but I was able to find average bracket scores.  So let’s see if I can do better than average!

2015

In 2015, my real bracket wasn’t very good.  I beat most of my friends, but that isn’t saying much.  I can’t find the actual score of my bracket, as Yahoo only shows the comparison for my groups, and I never join big groups.  If I had this predictor then, how would I have done?

That’s a lot more green than I am used to seeing.  For now, I am settling on comparing this bracket to the average, since I am unable to find anything else to compare it to.  According to NCAA.com, the average bracket score was 83.26 in the 1-2-4-8-16-32 scoring system (the standard scoring system at most large websites, where each round is worth the same total points).  In that system, this bracket would have scored 125, or 150% of average.  Not bad, but I really wish I knew what it took to be in the top 10%.

2016

Did anybody have a good bracket in 2016?  I sure didn’t, and NCAA.com had the average at 68.18 on the 1-2-4-8-16-32 system.  If you pick a Final Four team, and no other games correctly, that’s already 31 points, and the perfect bracket is worth 192.  Going straight chalk beat NCAA.com’s average: chalk was worth 71 up to the Final Four, and 87 if you picked UNC to go to the finals.  So yeah, apparently I wasn’t the only one with a bad bracket.  However, my KNN algorithm ended up with a pretty good bracket, coming in at 133.

There are some really, really bad parts of this bracket.  But overall, this is more green than any bracket I have ever filled out.  There are some crazy parts in this bracket, like a 10 seed in the Final Four, a 2 seed losing in the first round, and my KNN predictor taking A&M to the Final Four.  The KNN predictor predicts final scores, and it picked Villanova over UNC by 1.33 points, which is pretty good considering ‘Nova hit a 3 at the buzzer to win.  Close enough for me.


Updating the Model

I wanted to recheck my model with all of the games from 2002-2016, which is 945 games.  To keep with true machine learning principles, I am creating a training data set with 80% of the games, and the test data set will be the remaining 20%, chosen at random.  The process for this calculation is the same as what I did on the Prediction Methodology page.  Because each test uses a different training and test data set, I ran this simulation several times and averaged the results.
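A sketch of that resampling loop, assuming the same games DataFrame as before and a hypothetical evaluate() helper that runs the KNN prediction and returns the accuracy rate:

    from sklearn.model_selection import train_test_split

    # Repeat the random 80/20 split several times and average, since any
    # single split can be lucky or unlucky.
    scores = []
    for seed in range(10):
        train, test = train_test_split(games, test_size=0.2, random_state=seed)
        scores.append(evaluate(train, test))  # hypothetical helper
    print(sum(scores) / len(scores))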

Again, the set of metrics in my X4 did really well, and the uniform (unweighted) option again performed the best.  I also computed how well I did against the spread, which isn’t pictured here.  However, I found that the X4 metrics, weighted, with K=8 did really well (3rd best winning percentage, 8th best against the spread).  Since that seemed to work for both, that is what I will use going forward.

2017 Bracket

Well that sure was fun, wasn’t it?  I sure do enjoy it when a team can shoot to win at the end, holds the ball too long, and then puts up a terrible 3.  Regardless, I am not trying to call plays, but to predict overall outcomes.  For the most part, the model wasn’t too far off.  It didn’t predict the failures of Villanova, Duke, or Louisville, but I don’t think many people saw those coming at all.

In the end, this bracket finished 37,857th out of 215,727 (84th percentile) in NCAA.com’s overall challenge.  It scored 97 points on the traditional scoring, where each round has 32 points divided among its games.  I predicted 49 of 67 games correctly, which is 73%.  I am reasonably satisfied with this, even though it is worse than the previous years.  I blame Rick Pitino.


2018

IT IS FINALLY TIME!!!!!!!!!!!

Notes:

  • In my 2017 bracket, I used all of the previous data to predict games.  This resulted in predicting that the better team would win, since historically the better teams win more often
  • This year, I am only using the previous 5 years of games for my model.  This may be less accurate overall, but I hope it picks the upsets that should happen
  • I should have tested this more, but I didn’t
  • Next year, I hope to update my model to make win percentages, instead of just picking a winner


Post 2018 March Madness Note:

My bracket was terrible.  The prediction used seeds, and it turns out that lower-seeded teams generally win.  So the prediction was 99% straight chalk, and I am not proud of that.  I am changing this for 2019.


Also, #1 UVA losing to #16 UMBC was amazing.  I am not mad that I had UVA as my champion, because seeing that was worth having an awful bracket.


What Did I Learn?

  • I can retroactively predict the winner ~70% of the time
  • I can predict correctly against the spread 52% of the time
  • If I could predict the winner every time, March Madness would be a lot less fun
  • If I could predict correctly against the spread more often, Vegas wouldn’t be Vegas
  • My complete data set includes the score for each team.  I ran a simulation where I included ALL metrics in my data set, including the final score.  This led to me correctly predicting the winner 91% of the time.  Even with the score as an input, I was still wrong 9% of the time.
  • The KNN predictor makes better brackets than I do, and way better than NCAA.com’s average bracket
  • Adding 2012-2014 to the data set probably helped my accuracy rate, so continuing to add real games each year should improve my bracket every year
  • I am going to use this to beat all of my friends in 2017
  • Maybe next year’s model will beat my friends