Saturday, March 30, 2019

Programming for BIG Data Project

Liliam Faraon

Nowadays, the amount of information generated and stored has exceeded what can be analysed without the use of automated analysis techniques. The growth of data is greater than has ever been seen; extracting useful knowledge from all the data generated and transforming it into understandable and usable information is the challenge. This is where data mining assumes an essential role. Plenty of tools are available for data mining tasks, using artificial intelligence, algorithms, machine learning and many others.

In the present work two datasets were analysed, one with R and the other with Python. All the analysis was based on the basic CRISP-DM concepts: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment. The full methodology was not applied in the project, but understanding parts of its process was fundamental. The steps are pretty straightforward and give a really good idea of every stage that data mining has to go through and of the feedback brought from every stage.

The project scope is limited to identifying patterns in the data rather than predicting the future, which could be examined as part of further study of the subject matter.

The present project was divided into two parts: Part 1, R Dataset Analysis, and Part 2, Python Dataset Analysis. It also contains a brief contextualization of Big Data and the importance of data mining.

We live in a time when the pursuit of knowledge is indispensable. Today, information assumes a growing importance and is a necessity for any field of human activity, due to the many transformations we are witnessing.
At every moment we are facing new concepts and trends, and we are amazed at how quickly they are occurring and affecting our lives, such as the technology that influences all domains and social environments and touches every occupation and life on the planet.

The article written by Bernard Marr and published by Forbes last year brings some statistics that show that big data really needs attention:
More data has been created in the past two years than in the entire history of the human race.
By 2020, about 1.7 megabytes of new information will be generated every second for every human being on the planet.
Every second we create new data. A good example: on Google alone, 40,000 search queries are made every second, which adds up to the huge amount of 1.2 trillion searches a year.
Facebook users send on average 31.25 million messages and view 2.77 million videos every minute.
In 2015 alone, 1 trillion photos were taken and billions of them were shared online.
In 2015, over 1.4 billion smartphones were shipped, all capable of collecting different sorts of data, and by 2020 the world will have over 6.1 billion smartphone users.
Within five years there will be over 50 billion smart connected devices worldwide, all developed to collect, analyse and share data.
Retailers that leverage the full power of big data would be able to increase their operating margins by as much as 60%.
At the moment, less than 0.5% of data is analysed.

All the Big Data generated shares some characteristics: rapidly increasing volume, variety and velocity. Storing, transferring, processing and analysing it all has become a huge challenge, but specific programs designed to analyse the information with algorithms can overcome these challenges, and the output can be used to support the decision-making process.

For the R project, a very specific dataset was analysed: Tourists Visiting the South of Brazil. The information was obtained from the government website, in the Tourism section.

1.1 Business Understanding

Tourism is an important sector that has an impact on the development of a country's economy. For many countries, tourism is the most important source of income and job generation. Brazil is the fifth biggest country in the world, with 8,511,965 sq km of area, and the nation is divided into 5 regions: North, Northeast, Central-West, Southeast and South. Best in Travel 2014, by Lonely Planet, ranked Brazil as the best tourist destination in 2014. According to the official Brazilian tourism website, around 6 million people visit the country every year; it is considered the main tourist market in South America and the second in Latin America.
It is estimated that only around 17% of all tourists visiting Brazil go to the South region, made up of three states: Parana, Rio Grande do Sul and Santa Catarina. Having in mind those numbers, and the knowledge that the most visited places in Brazil do not include the South of the country, a dataset was analysed to find out how many visitors have been there and where they were from.

1.2 Data Understanding

Source data: http://www.dadosefatos.turismo.gov.br/estat%C3%ADsticas-e-indicadores.html
Format: csv, comma-separated
Size: 3.46 MB
Number of rows: 73,392
Columns:
1 Continent
2 Country
3 State
4 Year
5 Month
6 Count

The technologies used were Excel and RStudio.

1.3 Data Preparation

The first downloaded version had 534,792 rows. It included the tourism information from all 26 states and was based on data from 1989 to 2015. It was quite a huge dataset, from which it would not be convenient to extract useful outputs, as Brazil had been through many economic and social changes in this period. Excel was used to exclude the information from other states as well as the years before 2005.

As the dataset was provided entirely in Portuguese, code was used to make it easier to visualise. The next step was looking at the data: for a better understanding, code was written to show dimensions, names, classes and summaries. Some table code was written to count each combination of factor levels. The round function was run to specify the number of decimal places.

1.4 Modelling

A linear model was fitted to support better data visualization and an analysis of variance. Some graphs were generated to get a better understanding of how many tourists visited each of the states. A bar plot was generated for better visualisation, and the same parameters were used to generate pie charts. Parana with 33.01% and Santa Catarina with 29.48% have a very similar number of visitors, and Rio Grande do Sul is the most visited place with 37.51%.
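Each percentage quoted above is a state's total visitor count divided by the total for all three states. The original analysis was done in R and its code is not reproduced here, so the following is only a minimal Python sketch of that calculation using the standard library. The function name and the sample rows are hypothetical and invented for illustration; they are not the project's data.

```python
from collections import defaultdict


def visitor_share_by_state(records):
    """Return each state's share of total visitors, as a percentage."""
    totals = defaultdict(int)
    for state, count in records:
        totals[state] += count  # sum the Count column per State
    grand_total = sum(totals.values())
    return {state: round(100 * n / grand_total, 2) for state, n in totals.items()}


# Hypothetical (State, Count) rows, invented for illustration only.
sample_records = [
    ("Parana", 120),
    ("Parana", 180),
    ("Santa Catarina", 300),
    ("Rio Grande do Sul", 400),
]

shares = visitor_share_by_state(sample_records)
print(shares)
```

The same dictionary of shares could then feed a bar plot or a pie chart.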
With a little research the figures can be understood: Rio Grande do Sul is the largest of the three states, with more options for visitors, and some of the biggest manufacturing industries in the country are located in that area.

After visualizing where the tourists go, it is important to know where they come from. For that reason, some graphs were also generated, and the same parameters were used to generate some other graphics. After analysing each piece of information in isolation, a graph relating year and states was generated. A graphic listing all the countries whose residents visited the South of Brazil in the period was also generated.

A flowchart was designed to represent the algorithm workflow process: Preparing data for a plot.

1.5 Evaluation

Compiling the dataset into graphics and tables facilitated data visualization and brought some very important evidence that can be used for many purposes, especially marketing, in defining an action plan based on what can be done to bring more tourists to the South region.

The graphs showing the percentages of tourists were the ones that caught the attention. Europe had the largest number of visitors with 37.7%, followed by South America with 22%, Asia with 11.7%, Africa with 9.2%, Central America and the Caribbean with 8.8%, North America with 5.5% and lastly Oceania with 5.1%.

Looking at these proportions, a few questions were raised and research was necessary. Some important facts showed up. The dataset brings only the number of people travelling for leisure purposes; it does not count people on business, which could affect the numbers, especially from North America, as many of them visit the country for business purposes and extend their stay for a holiday.
Another very important factor is that the information was collected at the first stop in the country, and none of the three states in the South has a large airport. Visitors usually arrive on connecting flights from São Paulo or Rio de Janeiro, where the main international airports are situated. The last very important element that could affect the number of visitors is the fact that the South of Brazil does not have tight control of its borders and many people arrive by land, usually driving from other countries in South America.

As said before, the tourism sector can be explored much further and it can have an impact on tax revenue generation. According to the International Congress and Convention Association (ICCA), Brazil is the host of many international events in Latin America and the seventh in the world, so why not leverage the information obtained and bring some of those events to the South of Brazil?

The numbers in the dataset look a bit too similar from year to year in the count of people visiting the states, but the dataset still provides very useful information. It is also very important to observe that Brazil is also accessed by boat and by land, especially by tourists coming from Central and South America; as there is no border control, some of the numbers might be slightly different.

The project scope is limited to identifying patterns in the data rather than predicting the future, which could be examined as part of further study of the subject matter.

2.1 Business Understanding

Every time a famous person passes away the media makes news of it. Some deaths even take on elements of scandal, especially when a suicide is suspected, and people follow the reports all over the world. The year 2016 seemed to be very sad for famous people, with an unusual number of deaths observed.
An article from the 22nd of April 2016 on the BBC News website reported that by April the number of celebrity deaths was double that of the previous years, and even said that the number of significant deaths this year has been phenomenal. But compared to the years before, is it true?

Based on a dataset available on kaggle.com, which compiled information available on wikipedia.org, some questions were asked:
Did more celebrities die in 2016 than in the last 5 years?
Was suicide the most common cause of death?
What were the reasons for the deaths in 2016?
Were the reasons different from the 5 years before?
What would be the main causes of death for each age group?

2.2 Data Understanding

Source data: https://www.kaggle.com/hugodarwood/celebrity-deaths
Format: csv, comma-separated
Size: 1.47 MB
Number of rows: 14,880
Columns:
1 age
2 birth_year
3 cause_of_death
4 death_month
5 death_year
6 famous_for
7 name
8 nationality

The technologies used were Excel and Python 3.6.

2.3 Data Preparation

The original downloaded version had 21,562 rows. A quick look through the data showed a few abnormalities: a number of duplicated cells and rows were observed, some birth_year values did not correspond to a real birth year, and there were also some animals among the humans (especially racehorses and dogs).
Excel was used to exclude the duplicated data, to clear some odd information and to exclude the deaths from 2006 to 2010, as the project idea was to analyse only the past five years.

The first step was reading the table through pandas. Looking at the classes and the missing values, it is clear that there are many missing values for cause of death. Looking at the most common causes of death, it seems that many celebrities tend to die from cancer and heart failure.

2.4 Modelling

A bar plot was generated for better visualization. The article from the BBC was not entirely wrong: in 2016 more famous people died, compared to the 5 previous years.

To answer the second question (was suicide the main cause of death?), a bar plot of the suicide rates was generated. It cannot be said that suicide was the main reason for the deaths.

As seen in the previous graphic, there is a percentage of celebrities who commit suicide. To compare 2016 with the five previous years, and with natural deaths, a new bar plot was created. Compared to the previous years, 2016 did not seem as bad as the papers and social media claimed, as the suicide rate was higher only than that of 2014. In this way it cannot be affirmed that the main cause of celebrity deaths in 2016 was suicide.

Just for information, a graphic was created to illustrate the month in which more famous people tend to take their lives. As the bar plot displays, September is the month showing the highest level of suicide, while June appears as the lowest.

The figures generated from the dataset have brought some information so far, showing that 2016 was a sad year for famous people. They also showed that suicide was not the main cause of death.
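The checks described so far (reading the table, counting missing causes of death, counting deaths per year and working out the share of suicides) can be sketched as follows. This is not the project's original code: the original read the file through pandas, while this minimal sketch uses only the Python standard library so that it runs anywhere. The column names come from the list in section 2.2; the sample rows and the function names are hypothetical and invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the shape of the dataset's columns; not real data.
SAMPLE_CSV = """age,birth_year,cause_of_death,death_month,death_year,famous_for,name,nationality
69,1947,cancer,January,2016,musician,Person A,British
62,1954,,March,2016,actor,Person B,American
45,1971,suicide,September,2016,writer,Person C,French
80,1935,heart failure,June,2015,painter,Person D,Italian
53,1962,suicide,September,2015,singer,Person E,American
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_causes(rows):
    """Count rows whose cause_of_death field is empty."""
    return sum(1 for row in rows if not row["cause_of_death"].strip())


def deaths_per_year(rows):
    """Count deaths for each death_year."""
    return Counter(int(row["death_year"]) for row in rows)


def suicide_share_per_year(rows):
    """Fraction of each year's deaths whose cause mentions suicide."""
    totals = deaths_per_year(rows)
    suicides = Counter(
        int(row["death_year"])
        for row in rows
        if "suicide" in row["cause_of_death"].lower()
    )
    return {year: suicides[year] / totals[year] for year in sorted(totals)}


rows = load_rows(SAMPLE_CSV)
print("missing causes:", missing_causes(rows))
print("deaths per year:", dict(deaths_per_year(rows)))
print("suicide share:", suicide_share_per_year(rows))
```

Rows with a missing cause stay in the yearly totals here, so the suicide share is a share of all recorded deaths, not only of deaths with a known cause. Which denominator the original plots used is not stated, so this is an assumption.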
To find out what the main reasons were, a bar plot was created. It appears that cancer killed more famous people than any other cause, at least in 2016.

Still comparing 2016 to the five years before, the average number of deaths by cause was calculated. The comparison shows that, compared to the five years before, more famous people died due to cancer and traffic collisions; all the other reasons seem to follow the same pattern.

Just out of curiosity, and to get a better understanding of the facts, the dataset was divided into age groups. Some pie charts were created to illustrate the cause of death by age group. It is very important to note that in the child group there were only five rows, and that is why the percentages are very high.

It is very challenging to analyse the deaths by age group, as there were many missing data, especially when it comes to cause of death. As common sense suggests, the older people get, the more age-related diseases appear in the graphics.

A flowchart was designed to represent the algorithm workflow process: In cause_of_death column = suicide.

2.5 Evaluation

Compiling the dataset into graphics and tables facilitated data visualization and brought some very important information about celebrity deaths from 2011 to 2016. The missing values made a difference when trying to get deeper information, especially when it comes to cause of death.

It was pretty obvious from the data that since 2011 the number of famous people dying per year increased slightly; however, not all the celebrities in the list would really be considered as such by many people.
It became clear that the suicide rates are not as high as the media claims and that suicide is not the main cause of death. The increase in the amount of news about famous people's deaths may also be happening because more people have access to the internet and social media and seem to talk more about it.

It is important to remember that the project scope was limited to identifying patterns in the data rather than predicting the future.

I could not say it was an easy task choosing and analysing two datasets. As I am not a student with any IT background, some of my ideas as an outsider were completely mistaken, as I did not know how difficult it can be to write code and get information from datasets. It took me a while to understand the fundamentals of how Python and R work, and I consider I have done good work.

I can tell that I have been through an incredible learning journey since I started the Data Analytics course at field College, and I have learned a huge number of new skills. To get the present project done I watched countless videos and tried many different environments until I felt comfortable starting the project itself. It also took me a while to find the right dataset and the right questions, but after seeing the graphics and tables I realised I could really get through and do a good project.

As our course dedicated more time to Python, and I had always read about R as a very difficult data analytics tool, I confess I was terrified of it. That is why I decided to start the R project first. But I had a very good surprise: the program is easier to use than I thought, even with my very little knowledge. Working with a dataset that I am familiar with made it simpler as well. I have always worked in marketing environments and had the curiosity to know more about tourism in the South of Brazil, where I was raised.
I consider I found out important information that could be very valuable for companies investing in travel and tourism.

For the Python project, I decided to work with the celebrity-deaths dataset just out of curiosity, as nearly every single day during 2016 I saw celebritydeaths2016 on Twitter. But after analysing the dataset I found out that there is only slight evidence that more famous people died during 2016. It cannot be said that it was the worst year, nor can anything be predicted for the future. I have also found out that suicide is not the main reason for their deaths, contrary to what social media reports.

The idea of both projects was to identify and extract patterns in the data, which I believe has happened.

References

Big Data: 20 Mind-Boggling Facts Everyone Must Read. Available at http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/56eaf8456c1d. Accessed 10 December 2016.
Business Dictionary. Available at http://www.businessdictionary.com. Accessed 09 December 2016.
Estatísticas e Indicadores. Available at http://www.dadosefatos.turismo.gov.br/dadosefatos/home.html. Accessed 09 December 2016.
Lantz B., 2013, Machine Learning with R, Packt Publishing.
IBM, 2011, IBM SPSS Modeler CRISP-DM Guide, IBM Corporation. Available at http://www-staff.it.uts.edu.au/paulk/teaching/dmkdd/ass2/readings/methodology/CRISPWP-0800.pdf. Accessed 11 December 2016.
Ministério do Turismo. Available at http://www.turismo.gov.br/. Accessed 19 December 2016.
Skill Data Analysis. Available at https://15-5103.ca.uts.edu.au/skills/data-analysis/. Accessed 09 December 2016.
Why so many celebrities have died in 2016? Available at http://www.bbc.com/news/entertainment-arts-36108133. Accessed 26 December 2016.
Source data: http://www.dadosefatos.turismo.gov.br/estat%C3%ADsticas-e-indicadores.html
Source data: https://www.kaggle.com/hugodarwood/celebrity-deaths
