The purpose of this report was to analyse the Airbnb dataset to see whether any factors affect the choice of destination. Data was gathered from kaggle.com and scrubbed and cleaned before inferential statistical tests, namely a one-way analysis of variance (ANOVA), were run on a set of independent variables, with the significance level set to p < 0.05. Statistical analysis was carried out using RStudio version 3.2.3, and all figures and graphs were produced using this software. This paper concerns exploratory data analysis and quality of testing, with no reference to predictive analytics. However, it is the view of the author that a predictive model would be a good addition to the findings presented in the results section. High levels of significance were found across the independent variables, and conclusions were drawn from this information.
Keywords: Airbnb, ANOVA, Data Analytics, RStudio, Data Quality
Founded in 2008, Airbnb connects more than 60,000,000 people across 34,000+ cities in 190+ countries (“Airbnb”, no date). In November 2015 Airbnb announced that it would be using an “Expansive Suite of Personalized Tools to Empower Hosts” (“Airbnb”, no date). Essentially, what the company aims to achieve in the coming year is a smoother transition from web searching to destination arrival. It seeks to do this by using these tools to empower hosts and to personalise the service to their individual needs. The business model aims to create a more customer-friendly environment by allowing hosts to spend more time connecting with their guests than managing their listings. According to Joe Zadeh, the VP of Product at Airbnb, the team has invested time and research in understanding the needs and requirements of hosts through feedback and testing. Zadeh highlights the importance of host input by acknowledging that Airbnb is a “company founded by hosts and led by hosts” and that tailoring the company towards the needs of the user will ultimately provide a better product.
Data analytics is a part of almost every organisation in the modern world; whatever the area, there is almost certainly an element of data analysis at the core of its industry. Using the analysis appropriately, however, is what sets a successful organisation apart in the ever-expanding era of big data (Chen, Chiang and Storey, 2012; Keim, Qu and Ma, 2013).
According to a Bain & Company (2015) report, trends over the coming years indicate a strong investment in digital transformation to confront risk, fuel innovation and growth, and master complexity. Of the executives interviewed, 64% agreed that spending on IT must begin to rise as a percentage of sales. The report highlights a positive trend in profit as a consequence of IT systems, and a majority of the surveyed population acknowledge that they are using advanced analytics for business transformation and marketing strategy. Big data analytics are reported to help establish valuable insight and fuel growth from large volumes of data, with industries such as healthcare, telecommunications, financial services and technology all acknowledging the application of such techniques and tools to enable quicker and better decision making.
Surowiecki (2013) reported in the New Yorker business section on the end of an era for a gargantuan player in the mobile phone industry. Nokia’s stock was sold to Microsoft for a fifth of the value it had been worth in 2007. The report indicates how Nokia, at its peak, was earning over 50% of all profits in the mobile phone industry. By the time of the sale it was earning below 3%. A key factor that the company acknowledged was its failure to design a smartphone good enough for the user. So although the company was aware that times were changing, the R&D department could not strategically analyse what the consumer wanted, and the Android and Apple technologies took the market by storm.
- Statement of Purpose
The purpose of this investigation was to apply data analytics to a sample Airbnb customer data set with the intention of identifying any interesting patterns within the population data.
Data was collected from an Airbnb .csv file for a total of 213,451 subjects. All data was submitted to Airbnb with the participants’ agreement to the company policy. Participants were drawn from a global sample of users who completed the online booking sections of the Airbnb website.
For subjects to be valid for selection, the data was scrubbed and organised so that participants ranged in age from 18 to 80 and gender was recorded as either male or female. The country_destination column was edited to remove all “NDF” values, leaving a total of 11 possible destinations. After this process a valid population of 55,371 participants remained.
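The selection criteria above can be sketched as a simple row filter. The original cleaning was done in RStudio; the Python below is only an illustrative stand-in. The sample rows are invented, and the value spellings ("MALE", "FEMALE", "-unknown-", "NDF") and the `id` column are assumptions about how the .csv encodes these fields.

```python
import csv
import io

# Hypothetical sample rows; column names follow the variables named in the
# report, and the value spellings ("MALE", "NDF", ...) are assumptions.
RAW_CSV = """id,age,gender,country_destination
u1,34,MALE,US
u2,2014,FEMALE,FR
u3,29,-unknown-,US
u4,41,FEMALE,NDF
u5,,MALE,IT
u6,80,FEMALE,other
u7,17,MALE,GB
"""

VALID_GENDERS = {"MALE", "FEMALE"}


def is_valid(row):
    """Return True if the row meets the selection criteria described above."""
    try:
        age = float(row["age"])
    except ValueError:
        return False  # missing or non-numeric age
    return (
        18 <= age <= 80
        and row["gender"] in VALID_GENDERS
        and row["country_destination"] != "NDF"
    )


def clean(csv_text):
    """Parse CSV text and keep only the rows that pass is_valid."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if is_valid(row)]


cleaned = clean(RAW_CSV)
print([row["id"] for row in cleaned])
```

In this toy sample only two of the seven rows survive: the others fail on an implausible age (a year entered as an age), a missing age, an under-18 age, an unspecified gender, or an "NDF" destination.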
- Statistical Analysis
A one-way analysis of variance (ANOVA) was used to test for significant differences between groups, with country_destination set as the dependent variable. Weighted means were taken from each of the 11 independent variables across the 11 possible country destinations to perform the analysis, with significance set to p < 0.05.
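As a rough illustration of what a one-way ANOVA computes, the sketch below derives the F statistic from the between-group and within-group sums of squares. It is written in Python rather than R and is not the author's script. It assumes one possible reading of the test, a numeric variable (here age) compared across destination groups, and the ages are invented. The p-value, which would come from comparing F against the F distribution with the returned degrees of freedom, is not computed here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of observations around their group mean
    for g in groups:
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((x - mean_g) ** 2 for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical ages grouped by three destinations (illustrative values only).
ages_by_destination = {
    "US": [30, 32, 34],
    "FR": [40, 42, 44],
    "IT": [50, 52, 54],
}
f_stat, df_b, df_w = one_way_anova_f(list(ages_by_destination.values()))
print(f_stat, df_b, df_w)
```

A large F indicates that the group means differ by more than the spread within each group would suggest.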
- Tool Selection
RStudio version 3.2.3 was used for all data cleaning and statistical analysis.
Participants had to fit the selection criteria outlined in section 1 of the Methods.
Figure 1: One-way ANOVA
A one-way ANOVA was run across 11 variables, with the dependent variable set as country_destination. There was a significant difference between age and country_destination (F = 49.436, p < .05). Similarly, other independent variables such as signup_flow (F = 60.331, p < .05), affiliate_channel (F = 27.812, p < .05), affiliate_provider (F = 2.577, p < .05), first_affiliate_tracked (F = 2.016, p < .05), signup_app (F = 13.286, p < .05), first_device_type (F = 3.403, p < .05) and language (F = 3.541, p < .05) all displayed a significant difference between groups in relation to country_destination. Gender, signup_method and first_browser were not significant at the p < .05 level in terms of country_destination.
Based on the results of the ANOVA, there are quite a number of significant observations within the tested data. Statistically significant differences were found in 8 of the 11 variables tested. In relation to the age variable this is unsurprising, as the large range in participant age (18-80) means that some destinations were visited more than others at certain ages. Gender was another variable that might have been expected to be significant. In this case it was not significant at the p < 0.05 level; however, it would have been significant at p < 0.10, so it was quite close to the mark. This could be because many couples tend to travel together, and because over the spread of 11 country destinations a more even effect across gender would be expected, especially given that only 10 countries were classified, meaning the “other” category could be anywhere else in the world. A future study could examine gender in closer detail by looking at trends on a smaller scale, such as destinations within Europe for males and females under 40. In relation to the significant differences in terms of signup flow, signup app, device type, language and affiliate provider, this was again as expected, probably as a consequence of age, personal preference and country of origin. If Windows laptops sell for less than Macs in poorer countries, then it is a fair assumption that the majority of that population will use a Windows product, or that Windows-based systems will be used in internet cafes in the area, and vice versa. Similarly, joining through Facebook versus basic signup methods may come down to a generational difference, and the choice of internet browser is potentially down to whichever one comes built in on a subject’s laptop or desktop.
With regard to the data selection process, of the 213,451 entries that were recorded, only 25.94% were used for the results section of this study, for three main reasons. First, the age category had some values in the thousands that had to be struck from the analysis immediately, probably because some participants entered years rather than ages, as there was a high number of 2014 and 2015 entries in the age category. Second, the gender category was broken into four parts. Finally, the NDF entries for country of destination were unusable, as the dependent variable of country destination needed to be specified for a record to be used in the study. For any future research into country destinations, the author proposes setting a maximum age limit of 100, not allowing NDF selections to register, and removing the options of both “other” and “unknown” in the gender category.
The one-way ANOVA test was chosen to statistically analyse the data for several reasons. The dependent variable used in trials was country_destination. In some cases it would be considered a form of discrete data, because countries can be classed as nominal measures with no particular rank order between them. However, in instances where there are a lot of categories (greater than 4), the data can be treated as continuous, though it is acknowledged that this is a matter of some debate amongst statisticians (Newsom, 2007). The dependent variable can also be classified as having a normal distribution (see Appendix I for means sample testing). While the data is treated as continuous in order to test for normality, the country destinations are still discrete, and so the argument rests on the central limit theorem (CLT) and the law of large numbers (LLN). The LLN, proved by Jakob Bernoulli in 1713, states that the average obtained from a large number of trials should be near the expected value, and that the likelihood of this increases with the number of trials. Similarly, the CLT implies that the arithmetic mean of a sufficiently large number of independent random variables will be approximately normally distributed, regardless of whether the data is discrete or continuous, provided it has a defined variance and expected value. This was checked by running numerous tests on samples of the country_destination variable in RStudio, the results of which can be seen in the appendix.
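The behaviour of sample means described above can be illustrated with a small simulation. This is a Python sketch, not the RStudio tests reported in the appendix, and the discrete population (four numeric codes with unequal weights) is hypothetical rather than taken from the Airbnb data. It shows only that the means of repeated samples cluster around the population mean (LLN) with a spread close to sigma / sqrt(n) (CLT); it says nothing about the distribution of the underlying variable itself.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical, heavily skewed discrete population: codes 1..4 with unequal
# weights (not taken from the Airbnb data).
codes = [1, 2, 3, 4]
weights = [0.7, 0.15, 0.1, 0.05]

pop_mean = sum(c * w for c, w in zip(codes, weights))
pop_var = sum(w * (c - pop_mean) ** 2 for c, w in zip(codes, weights))

sample_size = 100   # observations per sample
n_samples = 2000    # number of repeated samples

# Draw repeated samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.choices(codes, weights=weights, k=sample_size))
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = (pop_var / sample_size) ** 0.5  # CLT: sigma / sqrt(n)

print(round(pop_mean, 3), round(mean_of_means, 3))
print(round(expected_sd, 4), round(sd_of_means, 4))
```

Plotting a histogram of `sample_means` would be the natural next step to compare its shape against a normal curve.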
In conclusion, a number of significant factors were associated with the outcome of country destination for the Airbnb dataset. The one-way ANOVA was a good fit for highlighting significant differences among variables, given the size and nature of the data available. However, for any sort of prediction model further testing should be carried out, and the author would recommend non-parametric tests. It may also be advisable to analyse the US on its own, by state, and then the rest of the world, as US destinations are so dominant. Future work could look deeper at the target audience, purpose of trip and occupation.
- “Airbnb” (no date). Available at: https://www.airbnb.ie/about/about-us (Accessed: 7 March 2016).
- “Airbnb” (no date). Available at: https://www.airbnb.ie/press/news/airbnb-unveils-expansive-suite-of-personalized-tools-to-empower-hosts (Accessed: 7 March 2016).
- Bain & Company (2015) “Management Tools & Trends 2015”, Bain & Company Inc., [Online]. (Accessed: 7 March 2016).
- Chen, H., Chiang, R. H. and Storey, V. C. (2012) ‘Business Intelligence and Analytics: From Big Data to Big Impact’, MIS Quarterly, 36(4), pp. 1165-1188.
- Keim, D., Qu, H. and Ma, K. L. (2013) ‘Big-Data Visualization’, IEEE Computer Society, 33(July/August 2013), pp. 50-51.
- Newsom, J. (2007) ‘Levels of Measurement and Choosing the Correct Statistical Test’. Course material.
- Surowiecki, J. (2013) ‘Where Nokia Went Wrong’, The New Yorker, 3 September. Available at: http://www.newyorker.com/business/currency/where-nokia-went-wrong (Accessed: 4 December 2015).
- Ziegel, E. R. and Rice, J. (1995) ‘Mathematical Statistics and Data Analysis’.
Figure 1: Test for means of country_destination