# Introduction

“I’m never betting again, that’s three losses in a row!”

Sound familiar? That’s probably because you are basing most of your bets on gut feeling, or on your emotions about the teams you generally favour or dislike. What if I told you there was a much more effective way to win money in the long run, using predictive analytics? Sound a bit naïve? Great, you’ve come to the right place! This post looks at how well football outcomes can be predicted over the course of a season using the Naïve Bayes algorithm, which is based on Bayes’ theorem.

**Figure 1:** Bayes Algorithm

I’ll spare you the heavy statistics in this post; however, if you are interested in further reading on this algorithm you can find it here: https://eight2late.wordpress.com/2010/03/11/bayes-theorem-for-project-managers/.

Football fans will be the first to tell you that the unpredictability of the beautiful game is what makes it so beautiful, and as a romantic fan I like to allow myself to believe the same, to an extent. On any given day one team can beat another, a David will trump Goliath, a tortoise will beat a hare and a Ray Houghton wonder goal will beat the Italians in Giants Stadium. Rejoice one and all, for it is these moments of ecstasy, in a game that is slipping further down the scale of honour, that rekindle our love for the sport, as fans go with hopes and dreams to France this summer fully expecting the boys in green to return home as triumphant heroes.

Before I go any further I would like to apologize to the late Johan Cruyff, the father of Total Football and of the Orange Generation that really made the game a joy to behold. For it is now that I will bring you crashing back down to earth with a revelation to shock the purists: football is as predictable as anything else we can measure, and over enough time trends appear that allow us to make such predictions. The law of large numbers (LLN), first proved by Jakob Bernoulli and published in 1713, states that the average obtained from a large number of trials should be close to the expected value, and that it tends to get closer as the number of trials increases. Similarly, the central limit theorem (CLT) implies that the arithmetic mean of a sufficiently large number of independent random variables will be approximately normally distributed, regardless of whether the data is discrete or continuous, provided it has a finite variance and expected value.
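If you would like to see both theorems in action before we get to the football, here is a minimal sketch using rolls of a fair die rather than match data. The die, the seed and the sample sizes are my own assumptions, chosen purely for illustration.

```python
import random
import statistics

def sample_mean(n_trials, rng):
    """Mean of n_trials rolls of a fair six-sided die (expected value 3.5)."""
    return statistics.fmean(rng.randint(1, 6) for _ in range(n_trials))

rng = random.Random(42)  # fixed seed so the run is repeatable

# Law of large numbers: the sample mean settles near 3.5 as trials grow.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n, rng), 3))

# Central limit theorem: the means of many independent samples cluster
# around 3.5, with a spread close to sigma / sqrt(n) (about 0.17 here).
means = [sample_mean(100, rng) for _ in range(2_000)]
print(round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

Plotting `means` on a histogram should give the familiar bell shape, even though a single die roll is anything but bell shaped.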

# A bit on Bayes

With that in mind I would like to explain a little about the data set I have used. The results of every Premier League game from the 06/07 season up to the 14th of February in the current season (15/16) were used to run the Naïve Bayes algorithm, with the dependent variable set to the full time result (FTR). Naïve Bayes predicts outcomes from the probabilities of particular patterns within the data. The fundamental assumption of the algorithm is that the attributes are conditionally independent of each other given the outcome. A penalty, for example, can only be awarded if there is a foul in the box, so those two events are clearly dependent; Naïve Bayes assumes that no such dependence exists amongst its attributes.
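For anyone who does want one line of statistics, the independence assumption can be written out as follows. The symbols are my own notation, not taken from the figure: $C$ is the full time result and $x_1, \dots, x_n$ are the attributes of a match.

$$
P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)} \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$

The first equality is Bayes’ theorem; the final step is the naïve part, where the joint probability of the attributes is replaced by a product of individual ones.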

“But hang on a second, are you saying that there is no link between the likelihood of a goal and the number of shots on target? Or the chances of more goals against a weaker opposition?”

No, not entirely, and this is somewhat of a paradox of the Naïve Bayes algorithm. Of course some things are related to one another; however, the algorithm actually copes with its assumption of independence quite well, even when we know there are clear links between attributes. The reason is that the classifier only needs to know which outcome has a greater probability than the rest; it is not concerned with the actual probability values of each outcome when generating predictions.
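To make that concrete, here is a rough from-scratch sketch of a Naïve Bayes classifier for discrete attributes. It is not the code behind the results in this post: the function names, the Laplace smoothing (`alpha`), the H/D/A result codes and the six toy rows are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math

def fit(rows, labels, alpha=1.0):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen, alpha

def log_scores(model, row):
    """Unnormalised log posterior per class: log P(C) + sum of log P(x_i | C)."""
    class_counts, value_counts, values_seen, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Laplace smoothing stops an unseen value zeroing the whole product.
            score += math.log((count + alpha) / (n_label + alpha * len(values_seen[i])))
        scores[label] = score
    return scores

def predict(model, row):
    """Pick the class with the highest score; only the ranking matters."""
    scores = log_scores(model, row)
    return max(scores, key=scores.get)

# Toy rows invented for illustration: (HTHG, HTAG, GD) -> FTR
rows = [(1, 0, 1), (2, 0, 2), (0, 1, -1), (0, 2, -2), (0, 0, 0), (1, 1, 0)]
labels = ["H", "H", "A", "A", "D", "D"]
model = fit(rows, labels)
print(predict(model, (2, 0, 2)))  # prints H
```

Notice that the goal difference is fully determined by the two goal counts, so the attributes here are about as dependent as they can be. The scores are therefore not trustworthy probabilities, but `predict` only asks which one is largest.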

The data used for this algorithm was discrete and contained no missing values, which suits the requirements of the Naïve Bayes algorithm.

# Results

**Data Frame 1:** Premier League Data

The image above shows the data that was used in the algorithm. The dependent variable was FTR, and the conditional probabilities of the independent attributes were used to predict it. HTHG and HTAG stand for half time home goals and half time away goals respectively, and GD is the half time goal difference between the teams.

Results of the Naïve Bayes algorithm from the confusion matrix:

**Table 1:** Confusion Matrix of Full Time Results for All Teams

Fraction of correct predictions from the algorithm:

**Figure 2:** Fraction of Correct Predictions
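For anyone unfamiliar with how a confusion matrix turns into a fraction of correct predictions, the arithmetic can be sketched as below. The H/D/A codes and the two short lists of results are made up purely to show the mechanics; they are not the figures from this post.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes=("H", "D", "A")):
    """Rows are actual results, columns are predicted results."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

def fraction_correct(matrix):
    """Diagonal entries (correct predictions) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels, not the post's results.
actual    = ["H", "H", "D", "A", "A", "H", "D", "A"]
predicted = ["H", "D", "D", "A", "H", "H", "H", "A"]

matrix = confusion_matrix(actual, predicted)
for row in matrix:
    print(row)
print(fraction_correct(matrix))  # 5 correct out of 8 -> 0.625
```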

The model was run a total of 1,000 times to illustrate what was stated earlier regarding the law of large numbers, and the results show that the mean of the 1,000 trials is very similar to the 60% value obtained above.

**Figure 3:** Fraction of Correct Predictions for 1,000 Samples

The results of the 1,000 models are also plotted on a histogram to highlight the bell shape achieved, which is consistent with an approximately normal distribution.

**Figure 4:** Histogram of Number of Observations vs Results Prediction
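One way to set up this kind of repeated experiment is to reshuffle the data into a fresh training and test split on every run and record the accuracy each time. The sketch below assumes that approach; the 30% test fraction, the stand-in "always predict the most common result" model and the synthetic labels are all hypothetical, there only so the example runs on its own.

```python
import random
import statistics
from collections import Counter

def repeated_holdout(rows, labels, fit, predict, runs=1_000, test_fraction=0.3, seed=0):
    """Accuracy on a fresh random train/test split for each run."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_test = max(1, int(len(rows) * test_fraction))
    accuracies = []
    for _ in range(runs):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, rows[i]) == labels[i] for i in test)
        accuracies.append(hits / n_test)
    return accuracies

# Stand-in model (hypothetical): always predict the most common training label.
def fit_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, row):
    return model

# Synthetic data, not real results: half of the matches are home wins.
rows = [(i % 3,) for i in range(200)]
labels = ["H"] * 100 + ["D"] * 50 + ["A"] * 50

accuracies = repeated_holdout(rows, labels, fit_majority, predict_majority)
print(round(statistics.fmean(accuracies), 3), round(statistics.stdev(accuracies), 3))
```

Any model with the same `fit` and `predict` shape can be dropped in, and `accuracies` is the list you would hand to a histogram.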

So what this model shows us is that over the course of 10 seasons, using only data available at half time, it can predict the full time result with around 60% accuracy. Well, that’s not very reliable, is it? No, it’s not, and this is where it is up to the data analyst to select the appropriate data to generate the best possible results. In this instance I would like to be able to analyse something a little more in depth: let’s say shots on goal, shots on target, or shots within the six yard box. The more relevant information we can add to the model, the better it should be able to predict, even with its naïve assumptions.

As I have no extra attributes to add to the current set, I reconfigured the data to include the results of just the so-called top 4 over the past decade (Arsenal, Manchester United, Chelsea, Liverpool) and applied the same criteria to see if this would improve my model.

Results of the Naïve Bayes algorithm from the confusion matrix:

**Table 2:** Confusion Matrix of Full Time Results for Top 4

Fraction of correct predictions from the algorithm:

**Figure 5:** Fraction of Correct Predictions Top 4

1,000 runs of the model yielded a result about 5% higher than the initial run, which illustrates the importance of multiple runs on the data set.

**Figure 6:** Fraction of Correct Predictions Top 4 for 1,000 Samples

Finally, a histogram of the model:

**Figure 7:** Histogram of Number of Observations vs Results Prediction for Top 4

It’s clear from the results that the level of prediction improves for clubs around a similar level compared to the league as a whole. The nice thing about the Naïve Bayes algorithm is that it is robust to outliers and good at handling new pieces of information when they become available.

# Conclusion

So there you have it: the Naïve Bayes algorithm can indeed be used to predict the outcomes of matches, and with some added information this model could look a whole lot better. So you see, we can tell that over a length of time the normal order of things will resume its course…

*Somewhere in the crowd someone asks* “But what about Leicester?”