Understanding Data

 

Overview

Having recently finished a course in data analytics, I have decided to dedicate this post to an area that is fundamental to data analysis and model building. It is aimed at anyone unsure of what data types they are analysing, people who are new to the subject, and anyone who needs a refresher on some of the core principles.

Introduction

Have you ever been assigned a task where the instructions are clear ("I want you to run a report on…") and all you have to do is follow them to the letter? The job is done, and you can go about your business once again in whatever way you please. People are a funny race in that we respond so differently to the same instructions, but in the case of an assignment like the one above, I find there tend to be three main types of approach.

  1. Do exactly what it says on the tin
  2. Something’s not quite right, but I haven’t the time to dwell on it for long
  3. Something’s not quite right… (24 hours later) WHY DOESN’T THIS MAKE ANY SENSE!!!

Now I just want to say that none of these approaches is more right or wrong than any other. If you fall into category one, you have simply done exactly what you were told to do. If there is any fault, it is not on your part but down to the instructions.

If you are part of the second category, you realise that you aren’t quite happy with what you are being told. However, you are pressed for time, so if you cannot figure it out sooner rather than later you revert to category one and move on.

I fall into the last category, and a lot of the time I wish I was in one of the first two given the headaches this can cause. You feel something is wrong, but for the life of you, you can’t find the correct solution. All you can confirm is that what you are being asked makes no sense whatsoever, so you frantically try to work out which point in the instructions is off and causing you all of this hassle. Actually, the ideal scenario would be to be in category number four. It is not listed because these are the rare kinds of people who glance at instructions, identify the problem, address it and make it better than it was originally, all in the space of a couple of minutes. I’m talking about you, Darren!

Anyway, without rambling on any further, I would like to get to the main objective of this blog: to understand data types and the appropriate analytical tests to use with them. In the field of data science it is unfortunately not always as easy as doing exactly what it says on the tin, so getting a solid understanding of the fundamental concepts of data can save you a big headache in the long run.

Data Types

Data types can be split into two broad categories: discrete data and continuous data. Using the scales of measurement set out by Stevens (1946), discrete data covers nominal (sometimes called categorical) and ordinal data. Nominal data can be thought of as categories with no inherent order, such as the counties of Ireland. (A count, such as the number of students in a class, is also discrete, but it is numeric rather than nominal.) Ordinal data is similar to nominal except that it has an order or rank to it. For example, ordinal data would be something like a survey asking you to rate your stay as “very poor/poor/ok/good/very good”. These are five categories, but there is a ranking to them: very good is better than good, and so on. However, it is important to note that while they have a rank, they do not have a measurable value, so we cannot say that very good is three times better than ok. Some textbooks also use the term dichotomous, but this simply describes nominal data that has two classes, for example “Male and Female”.


Figure 1: Discrete Data
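
To make the nominal/ordinal distinction concrete, here is a minimal Python sketch using only the standard library. The sample counties and survey answers are made up for illustration, and the names `RATING_ORDER` and `rank` are my own. The point is that nominal data only supports counting (for example, finding the most common category), while ordinal data can also be sorted and has a middle value.

```python
from collections import Counter
from statistics import median_low

# Nominal: categories with no inherent order (hypothetical sample of counties).
counties = ["Dublin", "Cork", "Galway", "Dublin", "Kerry", "Dublin"]
most_common_county, count = Counter(counties).most_common(1)[0]

# Ordinal: categories with a rank, but no measurable distance between them.
RATING_ORDER = ["very poor", "poor", "ok", "good", "very good"]
rank = {label: position for position, label in enumerate(RATING_ORDER)}

# Hypothetical survey answers.
ratings = ["good", "ok", "very good", "poor", "good"]

# Sorting is meaningful for ordinal data because the categories are ranked.
sorted_ratings = sorted(ratings, key=rank.__getitem__)

# The median is meaningful too; median_low always returns an actual category rank.
median_rating = RATING_ORDER[median_low(rank[r] for r in ratings)]

print(most_common_county, count)
print(sorted_ratings)
print(median_rating)
```

Note that the sketch never averages the ratings: a mean would assume the gaps between categories are equal, which ordinal data does not guarantee.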

Continuous data consists of measurable values with an infinite number of possibilities. Following Stevens (1946), it is split into interval and ratio data. Interval data can be thought of as numerical data measured on a scale with equal steps but an arbitrary zero. Temperature in degrees Celsius is an example, since 0°C does not mean “no temperature”. Ratio data is a form of interval data with a true zero, where 0 represents the absence of the thing being measured. An example would be the height or weight of a person. Because of the true zero, ratios between values are meaningful: 20cm to 10cm has a ratio of 2:1.


Figure 2: Continuous Data
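
A quick worked example may help show why the zero point matters. This is a small Python sketch of my own (the helper `celsius_to_kelvin` is not from any of the sources) comparing a ratio of lengths with a ratio of temperatures. Differences on an interval scale survive a change of units, but ratios do not.

```python
import math


def celsius_to_kelvin(celsius: float) -> float:
    """Convert degrees Celsius to kelvin, a scale with a true zero."""
    return celsius + 273.15


# Ratio data: 0 cm means "no length", so the ratio is meaningful.
length_ratio = 20 / 10  # 20cm is twice as long as 10cm

# Interval data: 0 degrees Celsius is an arbitrary point on the scale,
# so dividing the raw numbers gives a misleading answer.
naive_temperature_ratio = 20 / 10

# On the kelvin scale, which has a true zero, the ratio is very different.
kelvin_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# Differences, however, are the same on both scales.
celsius_difference = 20 - 10
kelvin_difference = celsius_to_kelvin(20) - celsius_to_kelvin(10)

print(length_ratio, naive_temperature_ratio, round(kelvin_ratio, 3))
print(math.isclose(celsius_difference, kelvin_difference))
```

In other words, 20°C is not “twice as hot” as 10°C, even though 20cm really is twice as long as 10cm.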

Below is a features table to differentiate between the two data types:


Table 1: Discrete vs Continuous Data

The above table lays out the basic principles for each of the two data types. However, there are a few exceptions to the rule regarding what can be treated as continuous data, which I will now explain.

Continuous data paradox

One thing that had my head spinning at the time was when I was assigned a task of running an ANOVA test on an Airbnb data set. If you want to read more about this you will find it here:

http://karl.dbsdataprojects.com/2016/04/01/airbnb/

Basically, this task required us to run an ANOVA test where the dependent variable was countries of the world. Immediately I was thinking there was a problem here. Countries are discrete data: they can be counted, but they are not numeric, and even if they were, there is no particular order to them. However, according to some statisticians, certain forms of ordinal data can be organised so that they work reasonably well with tests designed for continuous data.

According to Newsom (2007), in instances where ordinal data has a lot of categories (more than four), the data can be treated as continuous. Newsom acknowledges that this is a topic of debate, so there will not always be full agreement on it, yet it can be a good starting point, especially if you are only starting out.
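
As a rough illustration of that idea, the sketch below codes a five-category rating as the scores 1 to 5 and uses it as the dependent variable in a one-way ANOVA, with country as the grouping variable. Everything here is an assumption for illustration only: the ratings are invented (they are not the Airbnb data), the 1 to 5 coding assumes equally spaced categories, and `one_way_anova_f` is a hand-rolled helper that computes just the F statistic using the standard library.

```python
from statistics import mean

# Assumed coding: treat the five ordered categories as equally spaced scores.
RATING_SCORES = {"very poor": 1, "poor": 2, "ok": 3, "good": 4, "very good": 5}

# Hypothetical ratings grouped by country (the independent variable).
ratings_by_country = {
    "Ireland": ["good", "very good", "good", "ok"],
    "France": ["ok", "poor", "ok", "good"],
    "Spain": ["very good", "very good", "good", "very good"],
}


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


scored_groups = [
    [RATING_SCORES[rating] for rating in ratings]
    for ratings in ratings_by_country.values()
]
f_statistic = one_way_anova_f(scored_groups)
print(round(f_statistic, 2))
```

For this made-up data the F statistic works out at about 5.84. Turning that into a p-value needs the F distribution, which the standard library does not provide; in practice a function such as `scipy.stats.f_oneway` reports both values. Notice that the numeric variable is the rating and the country only defines the groups, which is why a nominal variable like country sits awkwardly as the dependent variable in an ANOVA.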

Here is a table from Newsom’s paper showing which tests to use depending on what type of data your dependent and independent variables fall into.


Table 2: Variable Classification and Statistical Procedures

Conclusion

To conclude, a good understanding of data types and of which tests suit them is vitally important to anybody looking to work in the field of data analytics. I would suggest that a decent knowledge of these principles early on can help you comprehend things like the normal distribution, the binomial distribution, and parametric and non-parametric testing, and can enhance your statistical knowledge greatly. Statistics can be fascinating if understood and applied correctly. Otherwise you may end up doing presentations like this…

stats joke

References

  1. Newsom, J. (2007). Levels of Measurement and Choosing the Correct Statistical Test. Material de Aula.
  2. Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
