March 22, 2019
It’s the most wonderful time of the year for college basketball fans across the nation. Whether you fill out a March Madness bracket using a wealth of in-depth basketball knowledge, or simply make your guesses based on how much you like a team’s colors or mascot, one thing is certain – the outcomes of the tournament are always unpredictable.
This year, as the official sponsor of NCAA March Madness, Google Cloud Platform (GCP) migrated over 80 years of historical and play-by-play data to the cloud. GCP tapped Neudesic, a Premier Partner with Data Analytics Specialization, to leverage their data and analytics tools to predict the winners of each round of the tournament.
The results are in and will be on display at Neudesic/Google Cloud watch parties in major cities like Atlanta, Orange County, Los Angeles and Las Vegas.
Read all the way down to discover our pick for this year’s NCAA Champion!
Here’s how we did it:
Neudesic’s Data Science Team used the Google Cloud Platform toolset, including Cloud Storage to store the raw data files and BigQuery to host data tables, model outputs, and the supporting dashboard data. Twitter capture and sentiment analysis relied on Container Registry (Docker), Kubernetes, the Natural Language API, and Cloud Pub/Sub, with direct streaming to BigQuery.
The team utilized three sources of data:
- Kaggle’s NCAA ML Competition 2019 (a partnership with Google Cloud) dataset: Provided the raw data for the predictive model’s training set. This included recorded historical regular season and tournament performance of Division-I teams starting with the 1984-85 season.
- Live Twitter and Sentiment Analysis: Used to calibrate team-vs-team win probabilities to reflect the effect of “hype” surrounding games being played while the dashboard is live.
- 2019 Pomeroy ratings from kenpom.com: Used as post hoc explanatory features (offensive and defensive ratings, as well as team ranks) within the dashboard to provide more context for predictions. These ratings were not used directly to train the predictive model; they were intended to give easy-to-visualize intuition into its predictions.
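To illustrate the "hype" calibration mentioned in the list above, here is a minimal sketch of one way a model's win probability could be nudged by Twitter sentiment. This is not the method Neudesic used, which is not described here. The function name `hype_adjusted_probability`, the `weight` parameter, and the idea of shifting the probability in log-odds space are all assumptions made for illustration. The only outside fact relied on is that the Natural Language API reports sentiment scores between -1.0 and 1.0.

```python
import math


def logit(p):
    """Convert a probability into log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Convert log-odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def hype_adjusted_probability(model_prob, sentiment_a, sentiment_b, weight=0.25):
    """Shift team A's win probability by the sentiment gap between the teams.

    model_prob  -- the model's probability that team A beats team B
    sentiment_a -- average tweet sentiment for team A, in [-1.0, 1.0]
    sentiment_b -- average tweet sentiment for team B, in [-1.0, 1.0]
    weight      -- hypothetical tuning knob: how strongly hype moves the odds
    """
    # Keep the probability away from 0 and 1 so the log-odds stay finite.
    eps = 1e-9
    p = min(max(model_prob, eps), 1.0 - eps)
    # A positive gap favors team A, a negative gap favors team B.
    shift = weight * (sentiment_a - sentiment_b)
    return sigmoid(logit(p) + shift)


if __name__ == "__main__":
    # Illustrative inputs only, not real tournament or Twitter data.
    print(hype_adjusted_probability(0.60, sentiment_a=0.8, sentiment_b=-0.4))
```

Working in log-odds keeps the adjusted value strictly between 0 and 1 and keeps the two teams' probabilities summing to 1, which a simple additive bump to the probability would not guarantee.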
The predictive model was built using the Python scikit-learn framework for machine learning. The team took an iterative approach to building it. We started with a naïve baseline model that automatically predicted a win for the better-seeded team. After a few iterations, we had significantly improved on the baseline with a predictive model trained on NCAA seeds along with the LRMC steady-state probabilities described by Kvam and Sokol (2006). Model outputs were loaded into BigQuery and subsequently visualized through the DataStudio dashboard.
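As a rough illustration of the starting point described above, the sketch below implements a naive "better seed wins" baseline and, for contrast, a probability curve based on seed difference. It uses only the Python standard library rather than scikit-learn, and it is not the team's model: the toy `GAMES` list, the `slope` value, and all function names are hypothetical, and the LRMC features are left out entirely. Log loss is included as a common way to score probabilistic predictions, not as a statement of how the team evaluated its model.

```python
import math

# Hypothetical toy games: (seed_a, seed_b, team_a_won). Not real results.
GAMES = [
    (1, 16, True),
    (2, 15, True),
    (5, 12, False),
    (8, 9, False),
    (3, 14, True),
    (4, 13, True),
]


def baseline_predict(seed_a, seed_b):
    """Naive baseline: team A wins only if its seed number is lower (better)."""
    return seed_a < seed_b


def baseline_accuracy(games):
    """Fraction of games where the naive baseline picked the actual winner."""
    hits = sum(baseline_predict(a, b) == won for a, b, won in games)
    return hits / len(games)


def seed_win_probability(seed_a, seed_b, slope=0.15):
    """Probability that team A wins, from a logistic curve on seed difference.

    The slope is an illustrative value; a trained model would fit it to data.
    """
    return 1.0 / (1.0 + math.exp(-slope * (seed_b - seed_a)))


def log_loss(games, predict_proba):
    """Average negative log-likelihood of the actual outcomes."""
    eps = 1e-15
    total = 0.0
    for a, b, won in games:
        p = min(max(predict_proba(a, b), eps), 1.0 - eps)
        total -= math.log(p if won else 1.0 - p)
    return total / len(games)


if __name__ == "__main__":
    print("baseline accuracy:", baseline_accuracy(GAMES))
    print("seed-curve log loss:", log_loss(GAMES, seed_win_probability))
```

On this toy data the baseline gets 4 of 6 games right, missing both upsets. A probabilistic model can express that an 8-versus-9 game is close to a coin flip, which is where extra features such as the LRMC probabilities would come in.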
Data Visualization and Insights Delivery Design
The initial dashboard concept came from considering what people want to know, or are most interested in, when watching NCAA basketball games. Through this process, we identified the data points that different types of college basketball fans value most; these interests were key to our design approach.
Neudesic’s IDX, 4D process was then implemented to research, discover, and ideate interesting features and views that would effectively surface the insights and help viewers understand our predictions more clearly.
Using Adobe XD to wireframe, Neudesic’s IDX team derived some initial concepts of what the data could look like and how the dashboard could be laid out. The wireframes were tested with users (in this case, NCAA basketball fans) to validate and define the best visual representation and value of the data.
Google BigQuery’s data exploration feature helped us perform exploratory data analysis and create an initial information architecture for the graphs and charts. It let us analyze the data and test its visual and functional capability within Google DataStudio.
Collaboration between data scientists, data engineers, and user experience designers at this stage is critical. Data structure and model creation must complement the data visualization and insights development process; the two are interdependent when it comes to surfacing data in a logical, easy-to-interpret manner. Google DataStudio supports this collaboration by allowing easy sharing with advanced permission settings.
Our final dashboards were created using Google DataStudio connected directly to Google BigQuery for a continuous data feed. DataStudio allowed us to connect to a variety of data models from different data sources, and it proved extremely useful for building quick, simple visualizations.
Neudesic’s model predicts Gonzaga University as the winner of this year’s Championship (with a 74% probability that they’ll beat North Carolina in the Big Game)!
These types of predictions are not limited to March Madness. Manufacturing companies are using machine and transfer learning for predictive maintenance, and healthcare patients are being diagnosed in real time. Predictive analytics is changing the way industries do business, helping companies better understand customers, markets, and competitors, refine operations, and improve existing business models or create new ones.
If Neudesic can do it for the NCAA, imagine the performance boost we can give your business.