How Neudesic built, tested and validated a model for predicting World Cup outcomes
July 10, 2018
Summary of the FIFA World Cup
The FIFA World Cup is a soccer competition between nations around the world that send their best players to represent their country. Broadly, qualification is based on the outcome of games played in the two years prior to the event. Thirty-two teams, selected in a round-robin style format, then compete in the World Cup in groups (group stage). The top two teams from each group advance to the round of sixteen and compete for the championship in a single-elimination format (knockout stage).
Building an ML Model to predict the outcomes of the 2018 World Cup
Neudesic’s Advanced Analytics and Data Science experts leveraged the Neudesic Advanced Analytics Solution framework to explore, model, and test legacy data (from before FIFA was established) and FIFA data provided by Kaggle for the training of the ML model. The team used Google ML to do their work. Here’s the process they went through:
Kaggle Datasets used for the competition
- This year, FIFA released the latest stats for each country: its ranking, its point score for that year, and which games were played away. The file also includes past rankings every four years, at the start of each FIFA tournament, from 1872 (legacy data prior to FIFA) to 2018.
- The second file released contains the results of all past FIFA games played in the tournament, including how many points were scored by each team in each game.
- The last file is the lineup at the start of the tournament. It includes the order in which teams play one another up to qualification for the round of 16.
Data aggregation & feature engineering
- First, all data files had to be checked for country names that had changed over time. For example, China is now referred to as People’s Republic of China, USSR is now Russia, etc.
- Second, because some countries only recently joined the World Cup playing field, their data was incomplete, so sample data had to be modeled to fill in the gaps.
- Then, data engineering took place. Through this process we:
  - Calculated the rank difference between the two teams
  - Took the average rank based on past games
  - Calculated the point difference of past games for the last year
  - Created a binary variable for whether or not the game was a friendly match
- This data engineering process, and the inclusion of the variables listed above, mimics the way FIFA officially ranks teams in order to qualify for the tournament.
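The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline. The alias table holds only the two examples mentioned above. The dictionary keys `rank` and `points`, the function names, and the reading of Average_Rank as the mean of the two teams' ranks are all assumptions made for the example. So is the choice to encode Is_stake as 1 for a non-friendly match.

```python
from statistics import mean

# Hypothetical alias table: only the two renames mentioned in the article.
NAME_ALIASES = {"China": "People's Republic of China", "USSR": "Russia"}


def normalize_country(name: str) -> str:
    """Map a historical or alternate country name to one canonical spelling."""
    cleaned = name.strip()
    return NAME_ALIASES.get(cleaned, cleaned)


def build_features(team_a: dict, team_b: dict, friendly: bool) -> dict:
    """Build the four model inputs for one match.

    team_a and team_b are dicts with the assumed keys 'rank' (current
    ranking) and 'points' (point score for the year).
    """
    return {
        # Assumption: average of the two teams' current ranks.
        "Average_Rank": mean([team_a["rank"], team_b["rank"]]),
        "Rank_Difference": team_a["rank"] - team_b["rank"],
        "Point_Difference": team_a["points"] - team_b["points"],
        # Assumption: 1 when something is at stake, 0 for a friendly.
        "Is_stake": 0 if friendly else 1,
    }


# Made-up numbers, purely for illustration.
features = build_features(
    {"rank": 3, "points": 1500}, {"rank": 7, "points": 1400}, friendly=False
)
print(features)
```

Keeping name normalization separate from feature construction means the alias table can grow without touching the feature logic.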
Training, testing and validation
- The training set consisted of data from all FIFA tournament games played from 1872 to 2002 (legacy data prior to 1930, FIFA data after 1930)
- The test set consisted of FIFA tournament games played during 2006, 2010, and 2014
- The validation set was existing data from 2018
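A year-based split like the one described above could be sketched as follows. The function names and the `year` key are hypothetical. The article does not say how games from years outside the listed ranges were handled, so this sketch raises an error for them rather than guessing.

```python
from collections import defaultdict

TRAIN_YEARS = range(1872, 2003)  # 1872 to 2002 inclusive
TEST_YEARS = {2006, 2010, 2014}
VALIDATION_YEAR = 2018


def assign_split(year: int) -> str:
    """Return which split a game belongs to, based only on its year."""
    if year == VALIDATION_YEAR:
        return "validation"
    if year in TEST_YEARS:
        return "test"
    if year in TRAIN_YEARS:
        return "train"
    # Years the article does not assign to any split.
    raise ValueError(f"no split defined for year {year}")


def split_matches(matches: list) -> dict:
    """Group match records (dicts with an assumed 'year' key) by split."""
    splits = defaultdict(list)
    for match in matches:
        splits[assign_split(match["year"])].append(match)
    return dict(splits)


# Made-up records, purely for illustration.
example = [{"year": 1930}, {"year": 2002}, {"year": 2010}, {"year": 2018}]
print({name: len(rows) for name, rows in split_matches(example).items()})
```

Splitting by year rather than at random keeps later tournaments out of the training data, so the test and validation sets behave like genuinely unseen future games.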
Building the ML model
The variables used in the creation of the model are as follows (note: qualifiers typically start 2 years prior to tournament commencement):
- Average_Rank: Average current ranking according to FIFA
- Rank_Difference: Difference between the two teams’ rankings for the current year
- Point_Difference: Point score difference for that given year
- Is_stake: Whether the match was a friendly or not
Logistic regression, which provides a winning probability (from 0 to 100) for two teams playing each other, was used. Then a predicted theoretical point threshold was calculated based on each team’s past scoring patterns. This calculation also ensured that each team’s predicted score was consistent with the predicted winning team’s probability.
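To make the probability step concrete, here is a minimal hand-rolled logistic regression fitted by gradient descent. It is a sketch under stated assumptions, not the team's model. The article does not describe the fitting procedure or tooling in this detail. The toy rows, the scaling of the rank difference, the learning rate, and the function names are all invented for illustration. The point threshold calculation is not covered.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = sigmoid(z) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def win_probability(weights, bias, x) -> float:
    """Return the first team's winning probability on a 0 to 100 scale."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 100.0 * sigmoid(z)


# Made-up toy rows: [rank difference / 10, Is_stake]; label 1 = first team won.
rows = [[-3.0, 1], [-1.5, 0], [-0.5, 1], [0.5, 1], [1.5, 0], [3.0, 1]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = fit_logistic(rows, labels)
print(round(win_probability(weights, bias, [-2.0, 1]), 1))
```

In this toy data a negative rank difference means the first team is ranked better, so the fitted model gives that team a probability above 50. A production model would more likely come from an established library than from a hand-written loop.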
The model, constructed prior to the quarter-final round, correctly predicted the outcomes of all four matches. Let’s see if the model holds for the semi-finals and the final. A reminder of the team’s predictions is displayed on the Google Data Studio dashboard below: