Data analysis uses statistical techniques to describe, evaluate and draw inferences from the relationships and trends in data. One such technique is simple linear regression, which is used alongside correlation to assess any linear relationship between two continuous variables (Mukaka, 2012).
Keywords: Anscombe’s Quartet, Pearson’s Correlation Coefficient, Data Visualisation, Correlation
Mukaka (2012) cites the Webster’s Online Dictionary definition of correlation: “a reciprocal relation between two or more things; a statistic representing how closely two variables co-vary; it can vary from −1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation)”.
Correlation describes the degree to which two variables are related; however, it is important to understand that correlation does not imply causation. The key difference between regression and correlation is that regression aims to predict one variable from another, whereas correlation aims to describe the association between the variables. Correlation is also frequently misused, to the extent that some statisticians wish the term had never been created at all (Altman, 1990; Mukaka, 2012).
Pearson’s Correlation Coefficient (PCC), denoted “r”, is used when both variables being examined are normally distributed. The sign of r denotes the direction of the relationship: a “+” sign indicates a direct relationship (as one variable increases, so does the other), while a “−” sign indicates an inverse relationship (as one variable increases, the other decreases). The magnitude of r denotes the strength of the relationship, and r lies between −1 and +1 as indicated above. A weakness of PCC is that it is sensitive to outliers, which can exaggerate the extent of the relationship, something which Francis Anscombe demonstrated in his 1973 paper (Anscombe, 1973).
Anscombe highlighted the importance of data visualisation using four datasets that have almost identical statistical properties yet appear completely different when graphed.
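To make the sensitivity to outliers concrete, the following minimal Python sketch computes r from its definition for a small made-up sample, and again after one extreme point is added. The data values and the helper name `pearson_r` are illustrative assumptions; they do not come from this report (which used Excel and R) or from Anscombe's paper.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: co-variation of x and y scaled by the spread of each."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical sample with only a weak positive association.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 3, 2, 3, 2]
r_without = pearson_r(x, y)

# The same sample plus one extreme point far from the rest.
r_with = pearson_r(x + [20], y + [20])

# Negating y flips the sign (direction) but not the magnitude (strength).
r_negated = pearson_r(x, [-b for b in y])

print(f"r without outlier: {r_without:.2f}")
print(f"r with outlier:    {r_with:.2f}")
print(f"r with y negated:  {r_negated:.2f}")
```

On this made-up sample, the single added point moves r from a weak value to one close to +1, even though the other six points are unchanged.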
- Statement of Purpose
This report aims to highlight the importance of data visualisation, with particular reference to Anscombe’s Quartet and PCC.
Four datasets were created based on the principles of Anscombe’s study, which required each set to have an equal number of data points and the same mean, variance, correlation and linear regression line. Microsoft Excel and RStudio were used to create the data and plot the graphs in the results section.
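The properties listed above can be checked numerically. The sketch below is written in Python rather than the Excel and R used for this report, and it uses the values published in Anscombe (1973), not the datasets created for this report. The function name `summarise` and the dictionary layout are assumptions made for illustration.

```python
from math import sqrt

# Anscombe's (1973) published values, NOT this report's own Table 1 data.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Return count, means, sample variances, Pearson's r and the fitted line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx  # least-squares slope of y on x
    return {
        "n": n,
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarise(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places, the four summaries agree with one another, which is why the summary statistics alone cannot distinguish the datasets.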
Figure 1: Anscombe’s Quartet plotted in R
Table 1: Data used to plot four graphs in
Figure 1 shows clear differences among the four graphs. The first graph (top left) appears to display a linear relationship between the variables, and thus a positive correlation for which PCC is a reasonable summary. The graph on the top right does not follow a linear relationship, meaning that PCC is of no use here and a more general regression would be appropriate. The graph on the bottom left displays a linear relationship that is thrown off by one large outlier; a robust regression model would be appropriate in this instance. Finally, the graph on the bottom right shows how a single outlier can totally distort the result, suggesting a correlation when the relationship between the remaining points is clearly not linear.
In conclusion, the importance of data visualisation should not be underestimated: plotting the data provides a good means of identifying whether or not the relationship between variables is linear.
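The effect of the single outlier in the bottom-left and bottom-right graphs can also be shown numerically. The following Python sketch again uses Anscombe's (1973) published third and fourth datasets, which may differ from the data created for this report, and recomputes r after dropping the one extreme point from each. The helper names `pearson_r` and `drop_index` are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Return Pearson's r, or None when either variable has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # r is undefined if a variable never changes
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def drop_index(values, i):
    """Copy of the list without the element at position i."""
    return values[:i] + values[i + 1:]


# Anscombe's (1973) datasets III and IV, NOT this report's own data.
X3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

i3 = Y3.index(max(Y3))  # dataset III: the outlier has the largest y
i4 = X4.index(max(X4))  # dataset IV: the outlier is the only point with x != 8

r3_all = pearson_r(X3, Y3)
r3_trimmed = pearson_r(drop_index(X3, i3), drop_index(Y3, i3))
r4_all = pearson_r(X4, Y4)
r4_trimmed = pearson_r(drop_index(X4, i4), drop_index(Y4, i4))

print("III:", r3_all, "->", r3_trimmed)
print("IV: ", r4_all, "->", r4_trimmed)
```

In dataset III the remaining ten points lie almost exactly on a straight line, so r moves very close to +1 once the outlier is removed. In dataset IV the remaining x values are all identical, so r is undefined without the outlier; the correlation there depends entirely on one point.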
Altman, D.G. (1990) Practical Statistics for Medical Research. CRC Press.
Anscombe, F.J. (1973) “Graphs in Statistical Analysis”, American Statistician, 27(1), pp. 17–21. JSTOR 2682899.
Mukaka, M.M. (2012) “A guide to appropriate use of Correlation coefficient in medical research”, Malawi Medical Journal, 24(3), pp. 69–71.