Having recently finished a course in data analytics, I have decided to dedicate this post to an area that is fundamental to understanding data analysis and model building. It is aimed at anyone unsure of what data types they are analysing, anyone new to the subject, or anyone who needs a refresher on some of the core principles.

Have you ever been assigned a task where the instructions are clear (“I want you to run a report on…”) and all you have to do is follow them to the letter? The job gets done, and you can go about your business once again in whatever way you please. People are funny in that we are so diverse in how we respond to instructions, but in the case of an assignment like the one above, I find there tend to be three main types of approach.

- Do exactly what it says on the tin
- Something’s not quite right, but I haven’t the time to dwell on it for long
- Something’s not quite right….. (24 hours later) WHY DOESN’T THIS MAKE ANY SENSE!!!

Now I just want to say that none of these approaches is more right or wrong than any other. If you fall into category one, you have simply done exactly what you were told to do; if there is any fault, it is not on your part but in the instructions.

If you are part of the second category, you realise you aren’t quite happy with what you are being told; however, you are pressed for time, and if you cannot figure it out sooner rather than later you revert to category one and move on.

I fall into the last category, and a lot of the time I wish I were in one of the first two, given the headaches this can cause. You feel something is wrong, but for the life of you, you can’t find the solution; all you can confirm is that what you are being asked makes no sense whatsoever, so you frantically try to work out which point in the instructions is off and causing all the hassle. Actually, the ideal scenario would be category number four, not listed because these are the rare kind of people who glance at instructions, identify the problem, address it and make it better than it was originally, all in the space of a couple of minutes. I’m talking about you, Darren!

Anyway, without rambling on any further, I would like to get to the main objective of this blog: understanding data types and the appropriate analytical tests to use with them. In the field of data science it is unfortunately not always as easy as doing exactly what it says on the tin, so a solid understanding of the fundamental concepts of data can save you a big headache in the long run.

In relation to data types, they can be split into two categories: discrete data and continuous data. Discrete data, according to Stevens (1946), covers nominal (sometimes called categorical) and ordinal data. Nominal data can be thought of as categories with no inherent order, such as the counties of Ireland or the colours of cars in a car park. Ordinal data is similar to nominal except that it has an order or rank to it. For example, ordinal data would be a survey asking you to rate your stay as “very poor/poor/ok/good/very good”. These are five categories, but there is a ranking to them: very good is better than good, and so on. However, it is important to note that while they have a rank, they do not have a measurable value; we cannot say that very good is three times better than ok. Some textbooks also use the term dichotomous, but this simply describes nominal data with exactly two classes, for example “male and female”.

**Figure 1: **Discrete Data
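The distinction between nominal and ordinal is easy to see in code. Here is a minimal Python/pandas sketch (the county and rating values are invented for illustration; the blog’s own analysis uses R):

```python
import pandas as pd

# Nominal: labelled categories with no inherent order.
county = pd.Categorical(["Cork", "Galway", "Cork", "Kerry"])

# Ordinal: categories with a rank, but no measurable distance between ranks.
rating = pd.Categorical(
    ["good", "very poor", "ok", "very good"],
    categories=["very poor", "poor", "ok", "good", "very good"],
    ordered=True,
)

print(county.ordered)  # False -- nominal data cannot be ranked
print(rating.ordered)  # True  -- ordinal data can
print(rating.max())    # very good
```

Note that `rating.max()` works only because the categorical is ordered; asking for the “maximum county” would rightly raise an error.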

Stevens (1946) describes continuous data as measurable values with an infinite number of possibilities. It is split into interval and ratio data. Interval data is numerical data measured on a scale with no true zero; temperature in Celsius is a form of interval data, because 0°C does not mean “no temperature”. Ratio data is like interval data but with a meaningful zero that represents the complete absence of the quantity, for example the height or weight of a person. Because of this true zero, ratios between values are meaningful: 20cm to 10cm is a ratio of 2:1.

**Figure 2: **Continuous Data
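A quick way to see the interval/ratio difference is that ratios of ratio data survive a change of units, while ratios of interval data do not, because the zero point is arbitrary. A small Python check:

```python
# Ratio data (height in cm): a true zero, so ratios are meaningful
# and unit conversion (cm -> inches) preserves them.
print(20 / 10)                    # 2.0
print((20 / 2.54) / (10 / 2.54))  # 2.0 -- same ratio in inches

# Interval data (temperature): 0 C is arbitrary, not "no temperature",
# so the apparent 2:1 ratio disappears when we convert to Fahrenheit.
def c_to_f(c):
    return c * 9 / 5 + 32

print(c_to_f(20) / c_to_f(10))    # 68/50 = 1.36, not 2.0
```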

Below is a features table to differentiate between the two data types:

**Table 1: **Discrete vs Continuous Data

The above table lays out the basic principles for each of the two data types; however, there are a few exceptions regarding what warrants a continuous data type, which I will now explain.

One thing that had my head spinning at the time was being assigned the task of running an ANOVA test on an Airbnb data set. If you want to read more about it, you will find it here:

http://karl.dbsdataprojects.com/2016/04/01/airbnb/

Basically, this task required us to run an ANOVA test with a dependent variable that was countries of the world. Immediately I thought there was a problem here: countries are classified as discrete data, they can be counted but are simply not numeric, and even if they were, there is no particular order to them. However, according to some statisticians, by organising your data appropriately some forms of ordinal data can work well with tests designed for continuous data.

According to Newsom (2007), in instances where there are many categories of ordinal data (greater than four), the data can be treated as continuous. Newsom acknowledges that this is a topic of debate, so there will not always be full agreement on it, yet it can be a good rule of thumb, especially if you are only starting out.

Here is a diagram from Newsom’s paper showing which statistical tests to use depending on what type of data your dependent and independent variables fall into.

**Table 2: **Variable Classification and Statistical Procedures

To conclude, a good understanding of data types and how to apply them to particular tests is vitally important for anybody looking to work in the field of data analytics. A decent knowledge of these principles early on can help you comprehend things like the normal distribution, the binomial distribution, and parametric versus non-parametric testing, and enhance your statistical knowledge greatly. Statistics can be fascinating if understood and applied correctly. Otherwise you may end up doing presentations like this…

- Newsom, J. (2007). Levels of Measurement and Choosing the Correct Statistical Test. *Material de Aula.*
- Stevens, S.S. (1946). On the theory of scales of measurement. *Science*, 103, 677-680.

“I’m never betting again, that’s three losses in a row!”

Sound familiar? That’s just because you are basing most of your bets on gut feeling, or on your emotions about the teams you generally favour or dislike. What if I told you there was a much more effective way to win money in the long run with predictive analytics? Sound a bit naïve? Great, you’ve come to the right place! This post is going to cover the likelihood of predicting football outcomes over the course of a season using the Naïve Bayes algorithm, which is based on Bayes’ theorem.

**Figure 1: **Bayes Algorithm

I’ll spare you the heavy statistics in this post; however, if you are interested in further reading on the algorithm, you can find it here: https://eight2late.wordpress.com/2010/03/11/bayes-theorem-for-project-managers/.

Football fans will be the first to tell you that the unpredictability of the beautiful game is what makes it so beautiful, and as a romantic fan I like to allow myself to believe the same, to an extent. On any given day one team can beat another, a David will trump Goliath, a tortoise will beat a hare and a Ray Houghton wonder goal will beat the Italians in Giants Stadium. Rejoice, one and all, for it is these moments of ecstasy, in a game slipping further down the scale of honour, that rekindle our love for the sport as fans head with hopes and dreams to France this summer, fully expecting the boys in green to return home as triumphant heroes.

Before I go any further I would like to apologize to the late Johan Cruyff, the father of Total Football and the Orange Generation that really made the game a joy to behold. For it is now that I will bring you crashing back down to earth with a revelation to shock the purists. Football is as predictable as anything else we can measure, and over the course of enough time trends appear that allow us to make predictions. The law of large numbers (LLN), proved by Jakob Bernoulli in 1713, states that the average obtained from a large number of trials should be close to the expected value, and that it tends to get closer as more trials are performed. Similarly, the central limit theorem (CLT) implies that the arithmetic mean of independent random variables will, given a sufficiently large number of trials, be approximately normally distributed regardless of whether the underlying data is discrete or continuous, provided it has a defined variance and expected value.
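Bernoulli’s claim is easy to check with a quick simulation; this is a generic Python sketch (fair coin flips, nothing to do with the football data yet):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def mean_of_trials(n):
    """Average of n fair coin flips (1 = heads); the expected value is 0.5."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

# The sample mean drifts toward the expected 0.5 as the trials increase.
for n in (10, 1_000, 100_000):
    print(n, mean_of_trials(n))
```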

With that in mind, a little about the data set I have used. The results of every Premier League game from the 06/07 season up to the 14th of February in the current season (15/16) were used to run the Naïve Bayes algorithm, with the dependent variable set to the full time result (FTR). Naïve Bayes predicts outcomes from the probabilities of particular patterns within the data. The fundamental assumption of the algorithm is that the conditional probabilities of the individual attributes are independent of each other. A counter-example makes this clear: for a penalty to be awarded there must be a foul in the box, so those two attributes are dependent, which is exactly the kind of relationship the naïve independence assumption ignores.

“But hang on a second, are you saying that there is no link between the likelihood of a goal and the number of shots on target? Or the chances of more goals against a weaker opposition?”

No, not entirely, and this is somewhat of a paradox in applying Bayes’ theorem this way. Of course some attributes are related to one another; however, the Naïve Bayes algorithm actually handles these violated independence assumptions quite well, even when we know there are clear links between attributes. The reason is that the classifier only needs to know which outcome has a greater probability than the rest; it is not concerned with the exact probability values when generating predictions.

The data used for this algorithm was discrete and contained no missing values, meeting the requirements of the Naïve Bayes classifier.

**Data Frame 1: **Premier League Data

The image above demonstrates the data that was used in the algorithm. The dependent variable was FTR and the conditional probabilities of the independent attributes were used to calculate the rest. HTHG, HTAG stand for half time home goals and half time away goals respectively and GD is the half time goal difference between the teams.
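The mechanics of the classifier can be shown with a hand-rolled sketch in Python. The rows below are invented, and the half-time information is collapsed into a single bucketed feature rather than the actual HTHG/HTAG/GD columns, but the prior/likelihood arithmetic is the same idea:

```python
from collections import Counter, defaultdict

# Invented toy rows: (half-time position of the home side, full time result).
games = [("ahead", "H"), ("ahead", "H"), ("level", "D"), ("level", "H"),
         ("behind", "A"), ("behind", "A"), ("ahead", "H"), ("level", "A")]

priors = Counter(ftr for _, ftr in games)  # counts of each FTR
likelihood = defaultdict(Counter)          # counts of feature given FTR
for ht, ftr in games:
    likelihood[ftr][ht] += 1

def predict(ht):
    # P(FTR) * P(feature | FTR); Naive Bayes only compares these scores,
    # so there is no need to normalise them into true probabilities.
    scores = {ftr: (priors[ftr] / len(games)) *
                   (likelihood[ftr][ht] / priors[ftr])
              for ftr in priors}
    return max(scores, key=scores.get)

print(predict("ahead"))   # H -- sides ahead at half time mostly won
print(predict("behind"))  # A
```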

Results of the Naïve Bayes algorithm from the confusion matrix:

**Table 1: **Confusion Matrix of Full Time Results for All Teams

Fraction of correct predictions from the algorithm:

**Figure 2: **Fraction of Correct Predictions

The model was run a total of 1,000 times to demonstrate what was stated earlier regarding the law of large numbers, and the results show that the mean of the 1,000 trials is very close to the 60% value obtained above.

**Figure 3: **Fraction of Correct Predictions for 1,000 Samples

The results of the 1,000 models are also plotted on a histogram to highlight the bell shape, illustrating an approximately normal distribution.

**Figure 4: **Histogram of Number of Observations vs Results Prediction

So what this model shows is that, over the course of 10 seasons, using data available at half time, it can predict the full time result with 60% accuracy. Well, that’s not very reliable, is it? No, it’s not, and this is where it is up to the data analyst to select the appropriate data to generate the best possible results. In this instance I would like to be able to analyse something a little more in depth: say, shots on goal, shots on target, or shots within the six yard box. The more relevant information we can add to the model, the better it can predict, even with its naïve assumptions.

So, as I have no extra attributes to add to the current set, I reconfigured the data to include only the results of the so-called top 4 of the past decade (Arsenal, Utd, Chelsea, Liverpool) and applied the same criteria to see if this would improve the model.

Results of the Naïve Bayes algorithm from the confusion matrix:

**Table 2: **Confusion Matrix of Full Time Results for Top 4

Fraction of correct predictions from the algorithm:

**Figure 5: **Fraction of Correct Predictions Top 4

Across 1,000 runs the model averaged a result around 5% higher than the initial run, underlining the importance of repeating runs on the data set rather than trusting a single split.

**Figure 6: **Fraction of Correct Predictions Top 4 for 1,000 Samples

Finally, a histogram of the model:

**Figure 7: **Histogram of Number of Observations vs Results Prediction for Top 4

It’s clear from the results that predictions improve when the model is trained on clubs of a similar level rather than the league as a whole. The nice thing about the Naïve Bayes algorithm is that it is robust to outliers and good at incorporating new pieces of information as they become available.

So there you have it: Naïve Bayes can indeed be used to predict the outcomes of matches, and with some added information this model could look a whole lot better. So you see, over a long enough length of time the normal order of things will resume its course…

*Somewhere in the crowd someone asks* “But what about Leicester?”


Ever have the feeling that there are too many possible ways of dealing with a situation? Let’s take college workload as an example. So you’re coming into the last week of the semester and you have three separate modules, all with assignment deadlines coming thick and fast, plus you have to begin studying for your exams! What is the optimal approach? Do you try to get all of the assignments done by ordering them into bundles by deadline? Or maybe you should sort them by the marks each assignment is worth? Or why not order them from favourite subject to least favourite? So many possibilities, but so little time. Wouldn’t it be easy if there were some way to bundle all of your options into an optimal number of groups, or clusters? Well, that is the principle the K-means clustering algorithm is based on.

First up: K-means clustering. It is a form of unsupervised learning, designed to sort a data set into a number of clusters. It is not to be confused with K-nearest neighbours (KNN), which will also be covered in this post. Before going any further, it is worth laying down the basic difference between the two, as I had initially assumed the K’s were related in both algorithms. They are not: K-means is an unsupervised clustering algorithm, whereas KNN is a form of supervised learning, a classification algorithm used for predictions.

A real-life example of K-means might be the assignment scenario above: how can I break these tasks up into clusters? KNN, on the other hand, is more a case of finding patterns in historical data to try to predict future outcomes. In the college example, it might be reading through past papers to see if previous patterns help you identify what will be on your upcoming exam. Hopefully this is starting to make sense before we look at the data. If you want to look into either algorithm further, you can do so here:

http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/

And here:

https://www.datacamp.com/community/tutorials/machine-learning-in-r

The data set for this post is based on the physical attributes of basketball players, football players and jockeys. The data is all continuous, as required by both algorithms, and there are no missing values in the data set.

To determine the optimal number of clusters to split the set into, there are two common methods in RStudio. The first is a plot of the within-group sum of squares (SSW) against the number of clusters; the important thing is to identify where the curve starts to level out:

**Figure 1: **SSW vs Number of Clusters

The above graphic suggests that 3 is a good number of clusters to choose. Let’s also run the NbClust() function to double-check:

**Table 1: **NbClust table

The table above confirms what the graph has told us and we can also plot this information using a bar chart:

**Figure 2: **Bar chart of Number of Criteria vs Clusters

We shall select 3 clusters for our analysis.
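For readers who want to reproduce the elbow idea outside of R, here is a scikit-learn sketch in Python; the heights and weights are randomly generated stand-ins for the real player data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Invented (height cm, weight kg) samples for three body types.
X = np.vstack([
    rng.normal([160, 55], 4, size=(30, 2)),   # jockeys: small and light
    rng.normal([180, 78], 4, size=(30, 2)),   # footballers
    rng.normal([200, 95], 4, size=(30, 2)),   # basketball players: tall, heavy
])

# Within-group sum of squares (sklearn calls it inertia) for k = 1..6;
# the curve levels out sharply after k = 3, which is the "elbow".
for k in range(1, 7):
    ssw = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(ssw))
```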

Cross tabulation of the K-means algorithm reveals where each of the players from each sport was placed:

**Table 2: **Confusion matrix of K-means analysis

This shows that the jockeys were cleanly placed in one cluster, and nearly all of the footballers too, with a small amount of crossover among the basketball players. We can see how well the groups were divided by plotting the cluster assignments alongside the actual distribution of the players:

**Figure 3: **Cluster distribution vs Actual distribution

The left hand side shows how they were distributed with some crossover. The right is the actual distribution.

Finally, the randIndex() function in RStudio tells us the level of agreement between the cluster solution and the true groups via the adjusted Rand index: a score of 1 represents perfect agreement, while scores around 0 represent agreement no better than chance. The score here is 0.9, which is a very high level of agreement.

**Figure 4: **API Score
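Both the cross tabulation and the agreement score have close Python equivalents; scikit-learn’s `adjusted_rand_score` computes the same adjusted Rand index reported by randIndex() in R. Again the player data below is invented for the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
sport = np.repeat([0, 1, 2], 30)   # 0 jockey, 1 footballer, 2 basketballer
centres = np.array([[160, 55], [180, 78], [200, 95]])
X = centres[sport] + rng.normal(0, 4, size=(90, 2))

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cross-tabulate each cluster against the actual sports...
for c in range(3):
    print("cluster", c, np.bincount(sport[clusters == c], minlength=3))

# ...then score the overall agreement: 1 is perfect, near 0 is chance level.
print(round(adjusted_rand_score(sport, clusters), 2))
```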

The data set is the same as was used for the cluster analysis. Just by visualizing the data we can see a small positive correlation between height and weight across the sports with jockeys being lightest and smallest and basketball players being tallest and heaviest.

**Figure 5: **Scatter Plot of Weight(kg) vs Height(cm)

As was said previously, KNN is a prediction model, so the data is separated into a training set and a test set to see how accurate the prediction is. When running the KNN algorithm there are really two rules of thumb you should consider:

- Make sure k is odd so that, in a two-class problem, a tie cannot occur
- Try to make k not equal to, or a multiple of, the number of classes in your training data set, as that too may result in a tie

In the case of this data set k was set as equal to 5. The results are displayed below:

**Table 3: **KNN Cross Table

The cross table displays a high level of accuracy in predicting players by sport. The only group not predicted with 100% accuracy was the basketball players. It should be noted, however, that the basketball players were the youngest of the three groups and had the highest standard deviation; some of them were still young teens, so while they may have been tall, their weight may not yet have matched their height, leading to misclassification.
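The train/test split and the k = 5 classifier translate directly to scikit-learn; as before, the player measurements here are invented stand-ins for the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
sport = np.repeat([0, 1, 2], 40)   # 0 jockey, 1 footballer, 2 basketballer
centres = np.array([[160, 55], [180, 78], [200, 95]])
X = centres[sport] + rng.normal(0, 5, size=(120, 2))

# Hold back a quarter of the players to test the predictions on.
X_train, X_test, y_train, y_test = train_test_split(
    X, sport, test_size=0.25, random_state=0, stratify=sport)

# k = 5: odd, per the rules of thumb above.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(round(knn.score(X_test, y_test), 2))  # fraction predicted correctly
```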

To summarise, both of these algorithms are effective in their own right and can be used well on data sets containing continuous attributes. This particular data set worked very nicely to demonstrate both the clustering and classification algorithms, K-means and KNN.


The purpose of this report was to analyse the Airbnb dataset to see if there were any factors that may affect users’ choice of destination. Data was gathered from kaggle.com and scrubbed and cleaned before inferential statistical tests, such as one-way analysis of variance (ANOVA), were run on a set of independent variables, with the significance level set to p < 0.05. Statistical analysis was carried out using R version 3.2.3 in RStudio, and all figures and graphs were produced with this software. This paper concerns exploratory data analysis and quality of testing, with no reference to predictive analytics; however, it is the view of the author that a predictive model would be a good addition to the data gathered in the results section. High levels of significance were found across the independent variables, and conclusions were drawn accordingly.

*Keywords: Airbnb, ANOVA, Data Analytics, RStudio, Data Quality*

*Background*

Founded in 2008, Airbnb connects over 60,000,000 people across 34,000+ cities in 190+ countries around the globe (“Airbnb”, no date [1]). In November 2015 Airbnb announced that they would be using an “Expansive Suite of Personalized Tools to Empower Hosts” (“Airbnb”, no date [2]). Essentially, what the company aims to achieve in the coming year is a smoother transition from web searching to destination arrival, by using these tools to better empower hosts and personalise the experience to their individual needs. The business model aims for a more customer-friendly environment by allowing hosts to spend more time connecting with their guests than managing their listings. According to Joe Zadeh, VP of Product at Airbnb, the team have invested time and research in understanding the needs and requirements of hosts through feedback and testing. Zadeh highlights the importance of host input, acknowledging that Airbnb is a “company founded by hosts and led by hosts” and that tailoring the company towards the needs of the user will ultimately provide a greater product.

Data analytics is part of almost every organisation in the modern world; whatever the area, there is almost certainly an element of data analysis at the core of its industry. Using that analysis appropriately, however, is what separates successful organisations in the ever-expanding era of big data (Chen and Chiang, 2012; Keim and Qu, 2013).

According to a Bain & Company (2015) report, trends over the coming years indicate strong investment in digital transformation to confront risk, fuel innovation and growth, and master complexity. 64% of the executives interviewed were unanimous in their thinking on IT spending, indicating that it must begin to rise as a percentage of sales. The report highlights a positive trend in profit as a consequence of IT systems, and a majority of the surveyed population acknowledge they are using advanced analytics for business transformation and marketing strategy. Big data analytics is reported to help establish valuable insight and fuel growth from large volumes of data, with industries such as healthcare, telecommunications, financial services and technology all acknowledging the application of such techniques and tools to enable quicker and better decision making.

Surowiecki (2013) reported in the New Yorker’s business section the end of an era for a gargantuan player in the mobile phone industry. Nokia’s handset business was sold to Microsoft for a fifth of what it had been worth in 2007. The report describes how Nokia, at their peak, earned over 50% of all profits in the mobile phone industry; by the time of the sale, they were earning below 3%. A key factor the company acknowledged was the failure to design a smartphone good enough for the user. So although they were aware the times were changing, the R&D department couldn’t strategically analyse what the consumer wanted, and the Android and Apple technologies took the market by storm.

*Statement of Purpose*

The purpose of this investigation was to apply data analytics to an Airbnb sample customer data set with the intention of identifying any points of interest within the population data.

*Participants*

Data was collected from an Airbnb .csv file for a total of 213,451 subjects. All data was submitted to Airbnb under the participants’ agreement to the company policy. Participants were a global sample of anybody who completed the online booking sections of the Airbnb website.

*Procedure*

In order for subjects to be valid for selection, the data was scrubbed and organised so that participants ranged in age from 18 to 80 and gender was recorded as either male or female. The country_destination column was edited to remove all “NDF” values, leaving a total of 11 possible destinations. After this process, a valid population of 55,371 participants remained.

*Statistical Analysis*

A one-way analysis of variance (ANOVA) was used to determine any possible significance between groups, with country_destination set as the dependent variable. Weighted means were taken for each of the 11 independent variables across the 11 possible country destinations to perform the analysis, with significance set to p < 0.05.
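The equivalent test is available in Python as `scipy.stats.f_oneway`. The ages below are invented for three hypothetical destination groups, purely to show the shape of the call (the real analysis covered 11 destinations and was run in R):

```python
from scipy import stats

# Hypothetical participant ages for three destination groups.
us = [34, 29, 41, 38, 30, 45, 33]
fr = [27, 25, 31, 28, 24, 26, 29]
es = [23, 22, 26, 24, 21, 25, 27]

f_stat, p_value = stats.f_oneway(us, fr, es)
print(round(f_stat, 1), p_value < 0.05)  # the group means differ significantly
```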

*Tool Selection*

R version 3.2.3, run through RStudio, was used for all data cleaning and statistical analysis.

*Requirements*

Participants had to fit the selection criteria outlined in the Procedure section of the Methods.

**Figure 1: **One-way ANOVA

A one-way ANOVA was run across 11 variables, with the dependent variable set as country_destination. There was a significant difference between age and country_destination (F = 49.436, p < .05). Similarly, other independent variables such as signup_flow (F = 60.331, p < .05), affiliate_channel (F = 27.812, p < .05), affiliate_provider (F = 2.577, p < .05), first_affiliate_tracked (F = 2.016, p < .05), signup_app (F = 13.286, p < .05), first_device_type (F = 3.403, p < .05) and language (F = 3.541, p < .05) all displayed significant differences between groups in relation to country_destination. Gender, signup_method and first_browser were not significant at the chosen p value.

Based on the results of the ANOVA, there are quite a number of significant observations within the tested data: large statistical significance was found in 8 of the 11 variables tested. In relation to the age variable this is unsurprising, as there is quite a large range (18-80) in participant age, meaning that some destinations were more visited than others at certain ages.

Gender was perhaps another variable that could have been significant. In this case it was not significant below the p < 0.05 mark, though it would have been below p < 0.10, so it was quite close. One reason could be that couples tend to travel together, so across 11 country destinations a fairly even effect between genders would be expected, especially given that only 10 countries were individually classified, meaning the “other” category could be anywhere else in the world. A future study could examine gender in closer detail by looking at trends on a smaller scale, such as destinations within Europe for males and females under 40.

As for the significant differences in sign-up method, device type, browser, language and provider, these were again as expected, probably as a consequence of age, personal preference and country of origin. If Windows laptops sell for less than Macs in poorer countries, it is a fair assumption that the majority of that population will use a Windows product, or that Windows-based systems will dominate internet cafés in the area, and vice versa. Likewise, signing up via Facebook versus the basic method can be argued to be a generational difference, and the choice of internet browser is potentially down to whichever one comes built in on a subject’s laptop or desktop.

With regard to the data selection process, of the 213,451 entries recorded, only 25.94% were used for the results section of this study, for three main reasons. The age category had some values in the thousands that had to be struck from the analysis immediately, probably because some participants entered a year rather than an age, as 2014 and 2015 appeared frequently in the age category. Similarly, the gender category was broken into four parts, and finally the “NDF” entries for country of destination were unusable, as the dependent variable needed a specified destination for a record to be included in the study. For any future research into country destinations, the author proposes a maximum age limit of 100, preventing NDF selections from registering, and removing the options of “other” and “unknown” in the gender category.

The one-way ANOVA test was chosen to statistically analyse the data for several reasons. The dependent variable, country_destination, would in some cases be considered discrete data, since countries are nominal measures with no particular rank order between them. However, in instances where there are many categories (greater than four), the data can be treated as continuous, though it is acknowledged that this is a case of some debate amongst statisticians (Newsom, 2007). The dependent variable can also be treated as approximately normally distributed (see Appendix I for sample means testing): while the country destinations themselves are discrete, the behaviour of their sample means can be justified by the central limit theorem (CLT) and the law of large numbers (LLN). The LLN, proved by Jakob Bernoulli in 1713, states that the average obtained from a large number of trials should be close to the expected value, and that it tends to get closer as the number of trials increases. Similarly, the CLT implies that the arithmetic mean of independent random variables will, for a sufficiently large number of trials, be approximately normally distributed regardless of whether the data is discrete or continuous, provided it has a defined variance and expected value. This was demonstrated by running numerous tests on samples of the country_destination variable in RStudio, as shown in the appendix.

In conclusion, there were many significant factors determining the outcome of country destination in the Airbnb dataset. The one-way ANOVA was a good fit for highlighting significant differences among variables, given the size and nature of the data available. However, for any sort of prediction model further testing should be used, and the author would recommend non-parametric tests. It may also be advisable to analyse the US on its own, by state, and then the rest of the world, given the dominance of US destinations. Future recommendations would be to look deeper at the target audience, purpose of trip and occupation.

- “Airbnb” (no date). Available at: https://www.airbnb.ie/about/about-us (Accessed: 7 March 2016).
- “Airbnb” (no date). Available at: https://www.airbnb.ie/press/news/airbnb-unveils-expansive-suite-of-personalized-tools-to-empower-hosts (Accessed: 7 March 2016).
- Bain & Company (2015) “Management Tools & Trends 2015”, *Bain & Company Inc.* [Online]. (Accessed: 7 March 2016).
- Chen, H., Chiang, R. H. and Storey, V. C. (2012) ‘Business Intelligence and Analytics: From Big Data to Big Impact’, *MIS Quarterly*, 36(4), pp. 1165-1188.
- Keim, D., Qu, H. and Ma, K. L. (2013) ‘Big-Data Visualization’, *IEEE Computer Society*, 33(July/August 2013), pp. 50-51.
- Newsom, J. (2007). Levels of Measurement and Choosing the Correct Statistical Test. *Material de Aula.*
- Surowiecki, J. (2013). ‘Where Nokia Went Wrong’, *The New Yorker*, 3 September. Available at: http://www.newyorker.com/business/currency/where-nokia-went-wrong (Accessed: 4 December 2015).
- Ziegel, E. R. and Rice, J. (1995). *Mathematical Statistics and Data Analysis.*

**Figure 1: **Test for means of country_destination

**Abstract**

Data analysis is a process which analyses relationships and trends via statistical techniques in order to describe, evaluate and make inferences with relation to data. One such technique that is used to do this is simple linear regression, which uses correlation to assess any linear relationship between two continuous variables (Mukaka, 2012).

*Keywords:* Anscombe’s Quartet, Pearson’s Correlation Coefficient, Data Visualisation, Correlation


**Introduction**

*Background*

According to Mukaka (2012), correlation can be defined by Webster’s Online Dictionary as “a reciprocal relation between two or more things; a statistic representing how closely two variables co-vary; it can vary from −1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation)”.

The correlation technique helps to describe the degree to which two variables are related; however, it is important to understand that correlation does not imply causation. The key difference between regression and correlation is that regression aims to predict one variable from another, while correlation aims to describe the association between the variables. According to some experts there is a serious problem regarding the misuse of correlation; indeed, some statisticians wish the term had never been created at all (Mukaka, 2012; Altman, 1990).

Pearson’s Correlation Coefficient (PCC) is used to examine variables when the population being examined follows a normal distribution. The sign of “r” denotes the direction of the relationship: a “+” symbol indicates a direct relationship (both variables increase together), while a “-” symbol indicates an inverse relationship (as one variable increases, the other decreases). The magnitude of “r”, measured between +1 and -1 as indicated above, denotes the strength of the relationship. The problem with PCC is that it can be distorted by outliers, which can exaggerate or weaken the apparent relationship, something which Francis Anscombe demonstrated in his 1973 paper (Anscombe, 1973).
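This sensitivity to outliers is easy to demonstrate directly. The report itself used Excel and R Studio, so the following is purely an illustrative Python sketch: ten points lying exactly on a line give r = 1, and adding a single stray point drags r sharply down.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Ten points lying exactly on a straight line: r is (up to rounding) +1.
x = list(range(1, 11))
y = [2 * xi + 1 for xi in x]
print(pearson_r(x, y))          # effectively 1.0: perfect positive correlation

# A single outlier weakens the apparent relationship considerably.
x_out = x + [12]
y_out = y + [0]
print(pearson_r(x_out, y_out))  # well below 1: one point changes the picture
```

The same effect works in the other direction too: an outlier can manufacture an apparently strong correlation where none exists, as the fourth of Anscombe’s graphs shows.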

Anscombe highlighted the importance of data visualisation by using four datasets with almost identical statistical properties which, when graphed, appear completely different.

*Statement of Purpose*

This report will aim to highlight the importance of data visualization with particular reference to the Anscombe Quartet and PCC.


**Methods**

Four datasets were created based on the principles of Anscombe’s study, which required an equal number of data points in each set and an equal mean, variance, correlation and linear regression line. Microsoft Excel and R Studio were used to create the data and plot the graphs in the results section.

**Results**

**Figure 1:** Anscombe’s Quartet plotted in R

**Table 1:** Data used to plot the four graphs in Figure 1

**Discussion**

From the results in Figure 1, the differences among the four graphs can clearly be seen. The first graph, in the top left, appears to display a linear relationship between the variables and thus a positive correlation with a normal distribution. The graph on the top right does not follow a linear relationship, meaning PCC is of no use here and a more general regression would be appropriate in this instance. The graph on the bottom left displays a linear model which is offset by one large outlier; the use of a robust regression model would be appropriate in this instance. Finally, the graph on the bottom right displays how one outlier can totally distort the data when the relationship between the variables is clearly not linear.
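One simple robust option for the bottom-left case is the Theil–Sen estimator, which takes the median of all pairwise slopes and so largely ignores a single outlier, while ordinary least squares is pulled toward it. The sketch below (Python, for illustration only; not a method used in the report) applies both to Anscombe's third set.

```python
from itertools import combinations
from statistics import mean, median

# Anscombe's set III: ten points close to one line, plus a single large outlier.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

def ols_slope(x, y):
    """Ordinary least-squares slope: cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

def theil_sen_slope(x, y):
    """Median of all pairwise slopes: robust to a small number of outliers."""
    pts = list(zip(x, y))
    slopes = [(yj - yi) / (xj - xi)
              for (xi, yi), (xj, yj) in combinations(pts, 2) if xj != xi]
    return median(slopes)

print(round(ols_slope(x, y), 2))       # ~0.5, dragged toward the outlier
print(round(theil_sen_slope(x, y), 3)) # close to the trend of the other ten points
```

The least-squares slope of roughly 0.5 matches the fitted line the quartet was constructed to share, while the median pairwise slope follows the ten well-behaved points, which is exactly why a robust model suits this graph.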

**Conclusion**

In conclusion, the importance of data visualisation should not be underestimated: it provides a good means of identifying whether or not the relationship between variables is linear.

**References**

Altman, D.G. (1990) *Practical statistics for medical research*. CRC Press.

Anscombe, F.J. (1973) “Graphs in Statistical Analysis”, *American Statistician*, 27(1), pp. 17-21.

Mukaka, M.M. (2012) “A guide to appropriate use of Correlation coefficient in medical research”, *Malawi Medical Journal*, 24(3), pp. 69-71.
