The Power of R

R_1

R is an open source statistical programming language that incorporates graphical tools to present your data.  Over the past decade it has become one of the most used business tools for statistical computing.  With R you can import almost any data format, from .CSV to .SAV.  That flexibility, and the fact that it’s free, is why so many data scientists use it.  Another important thing about R is that you don’t have to be a programming wizard to work with it.

There is an excellent article on R by Ashlee Vance in the New York Times from January 2009.

Jumping in

The best way to learn to swim is to jump in, and that is what I did with R.  The tutorial site Try R gave me a gradual introduction to the useful functions of the programming language, as well as being a bit of fun.  There are 7 levels in the tutorial, each introducing a function with easy-to-follow examples.  Having learned the basic skills to use R without getting wet, let’s see if it can answer one of the most asked questions: who is better, Messi or Ronaldo?

Project_R

WHO IS BETTER – MESSI or RONALDO

                           Lionel Andres Messi      Cristiano Ronaldo
Born                       24th June 1987 (28)      5th February 1985 (30)
Height                     1.70m                    1.85m
Team                       Barcelona                Real Madrid
Total Goals 2009 – 2015    232                      225

Nobody can dispute the quality of these fantastic players.  So to break it down we are going to look at their goal scoring ability, how they assist their team and how long it takes them to score one goal.  So let’s have a look at each player.  For this study we will only be looking at data from the 2009/10 season onwards, the season Cristiano Ronaldo joined Real Madrid.

GOALS PER SEASON
It truly is amazing looking at the fire power of each of these players.  Each of them scores more goals in one season than most centre forwards would score in two.  You also have to take into account that the graph below only shows the goals that were scored in La Liga for each season from 2009 to 2015.  For the time period covered Messi has scored 232 goals and Ronaldo has 225.  So it’s 1 nil to Messi.
Total Goals Scored

 

HOW OFTEN ARE THEY LIKELY TO SCORE

Here we look at the likelihood of each of them scoring over a 90 minute period, and again they break the norm.  Most world class strikers need more than 90 minutes, on average, to score a goal.  Sergio Aguero, who finished top goal scorer in the Premier League for 2015 with 26 goals, averages a goal every 98 minutes.  But the machines of Messi and Ronaldo each average a goal every 78 minutes.  It seems the only way to stop them from scoring is to not let them play!!!!!  It’s a draw on this one.  Messi still 1 nil up.

Average MinGoal
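
To put the minutes-per-goal figures on a per-match footing, here is a rough sketch in Python (my own illustration, not part of the original R analysis). The function name `goals_per_90` is made up for this example; the 78 and 98 minute figures are the ones quoted above.

```python
def goals_per_90(minutes_per_goal: float) -> float:
    """Convert 'one goal every N minutes' into expected goals per 90-minute match."""
    if minutes_per_goal <= 0:
        raise ValueError("minutes_per_goal must be positive")
    return 90 / minutes_per_goal


# Figures quoted in the text: Messi and Ronaldo at 78 minutes, Aguero at 98.
for player, minutes in [("Messi / Ronaldo", 78), ("Aguero", 98)]:
    print(player, round(goals_per_90(minutes), 2))
```

Anything above 1.0 means the player is expected to score more than once per full match, which is what a rate of under 90 minutes per goal amounts to.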

NO “I” IN TEAM

We come down to the final comparison between the two greats, and very little separates them. It’s not all about scoring goals, especially if you’re having an off day.  So let’s have a look at how they help their teammates out.  The graph below shows how many assists each player has had in each of the seasons covered, and it shows that there really is an I in team, and that is Mess(I).  He is a clear winner, contributing an average of 17 assists per season to Ronaldo’s 13.  So based on our analysis Messi is the better player.  You don’t have to agree!!!

Number of Assists

Below is a heat map showing the full dataset for each player.

Messi_Heat

DATA QUALITY – YOU CAN’T PUT DIESEL IN A PETROL ENGINE

Data Quality

So how good is your data and how are you using it?  Like any machine it is only as good as the fuel (data) you put into it.  You won’t get very far putting diesel into a petrol engine.  The same can be said of your database(s).  Your data return will only be useful if you have good input.

Quality information is defined as “information that is suitable for all of an organisation’s purposes, not just my purposes.”

What are the steps to securing good data?

  1. One of the key factors of good information is its relationship with the business in question. Make sure your data links up and answers the questions you want answered, so that you can provide the service required or the relevant information to your audience.
  2. Being aware of the data you are collecting is an essential part of good data. Having a process in place where data can be checked, cleansed and edited will improve your data quality. Allowing editing or checking without parameters in place puts your data in danger; it can lead to the loss of good data on one person’s judgement.
  3. Discarding documentation and design standards is a large problem. Over time data quality guidelines are discarded through employee turnover and data familiarity. The data can become “When do we ever use this”. If point number one is adhered to, it is useful information for some part of the business. To prevent this, frequent training and updating of data guidelines are necessary.
  4. The main link between all the points above is communication. Without good channels of communication your data quality will suffer. This has to start from the outset, involving all who will be using the data and supplying the data. This can seem like the most logical step but it can be the biggest hurdle to getting good information.
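
As a rough illustration of point 2, checking data against agreed parameters rather than one person’s judgement, here is a minimal Python sketch. The field names, the rules and the sample records are all hypothetical; real rules would come from your own data guidelines.

```python
from typing import Callable, Dict, List

# Hypothetical rules: field name -> check returning True when the value is acceptable.
RULES: Dict[str, Callable[[str], bool]] = {
    "county": lambda v: v.strip() != "",
    "population": lambda v: v.strip().isdecimal() and int(v) > 0,
}


def validate(record: Dict[str, str]) -> List[str]:
    """Return the names of the fields that break a rule (an empty list means clean)."""
    return [field for field, ok in RULES.items() if not ok(record.get(field, ""))]


# Made-up sample records, not real census values.
records = [
    {"county": "Laois", "population": "80000"},
    {"county": "", "population": "abc"},
]
for record in records:
    print(record, "->", validate(record))
```

Because the rules live in one place, anyone editing or cleansing the data applies the same checks, and a failing record is flagged rather than silently changed.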

When dealing with poor data quality, Marsh (2005) summarises the findings from various industry research as follows:

• “88 per cent of all data integration projects either fail completely or significantly over-run their budgets”

• “75 per cent of organisations have identified costs stemming from dirty data”

• “33 per cent of organisations have delayed or cancelled new IT systems because of poor data”

• “$611bn per year is lost in the US in poorly targeted mailings and staff overheads alone”

• “According to Gartner, bad data is the number one cause of CRM system failure”

• “Less than 50 per cent of companies claim to be very confident in the quality of their data”

• “Business intelligence (BI) projects often fail due to dirty data, so it is imperative that BI-based business decisions are based on clean data”

• “Only 15 per cent of companies are very confident in the quality of external data supplied to them”

• “Customer data typically degenerates at 2 per cent per month or 25 per cent annually”

• “Organisations typically overestimate the quality of their data and underestimate the cost of errors”

• “Business processes, customer expectations, source systems and compliance rules are constantly changing. Data quality management systems must reflect this”

• “Vast amounts of time and money are spent on custom coding and traditional methods – usually fire-fighting to dampen an immediate crisis rather than dealing with the long-term problem”

Data Quality 2

 

What Can Quality Data Do?
Good data is all about providing the tools to get information that will assist in making good decisions. With good quality data you can:

  1. Gain information
  2. Gain knowledge
  3. Make decisions
  4. Get results

With the world of data changing rapidly, the volume of data being collected has increased immensely.  As this happens the data becomes more difficult to manage, and this can allow quality levels to drop.  Various research has shown that good data quality can improve customer satisfaction, decrease running costs, assist in more efficient decision making and increase employee performance and job satisfaction (Kahn et al., 2003; Leo et al., 2002; Redman, 1998).

Data Quality 3

 

FUSION TABLES

Your data
Fusion Tables are used to combine tables and present your data in a more meaningful manner. Obviously you have to get your initial information from somewhere. For this example I used the CSO (Central Statistics Office) data from the 2006 Census and 2011 Census.  For each year I looked at the total population and the population aged 15 and over that had lost or had given up their previous job.  To present this data more graphically I also required a table containing geographical information for the counties of Ireland.  This file was available on the Irish Independent website.  Links to all datasets used can be found below.

Depending on your data source you may need to clean your data. Once you have a clean data file you need to save the file as a .CSV (Comma Separated Values) file.
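
A minimal Python sketch of that clean-then-save step is below. The cleaning rules (trim whitespace, remove thousands separators, drop rows that are blank or not numeric) and the sample rows are my own assumptions for illustration; your source may need different rules.

```python
import csv
import io


def clean_rows(rows):
    """Trim whitespace, strip thousands separators and drop blank or non-numeric rows."""
    cleaned = []
    for county, count in rows:
        county = county.strip()
        count = count.replace(",", "").strip()
        if county and count.isdecimal():
            cleaned.append((county.title(), int(count)))
    return cleaned


def to_csv(rows, header=("County", "Population")):
    """Return the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up raw rows, not real CSO figures.
raw = [("  laois ", "80,000"), ("", ""), ("ROSCOMMON", " 64,000 ")]
print(to_csv(clean_rows(raw)))
```

To write an actual file instead of a string, the same `csv.writer` can be pointed at `open("population.csv", "w", newline="")`, where the file name is again just an example.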

Data Links
http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/
http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2006/

Irish KMZ Datafile – There is no need to save this file in CSV format. http://www.independent.ie/editorial/test/map_lead.kml

Analysis
Analysing your data will explain what the data is saying. This can involve simple calculations, such as the averages or medians (mid-points) of some parts of your data, which make the data more understandable for your audience. The analysis below compares the population in Ireland in 2006 and 2011, showing the percentage increase in each county.  A further analysis also compares the unemployment rates in each county for the same time periods.

To calculate the unemployment rate I also needed the total population that was eligible to work.  This information can be found on the CSO website.
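
Both calculations are simple ratios. Here is a short Python sketch, using the Roscommon unemployment counts quoted later in this post (1,385 in 2006 and 5,409 in 2011) for the percentage change. The function names are my own, and the labour force figure in the second example is a made-up round number.

```python
def percent_change(old: float, new: float) -> float:
    """Percentage increase (or decrease, if negative) from old to new."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) * 100 / old


def unemployment_rate(unemployed: float, eligible_to_work: float) -> float:
    """Unemployed people as a percentage of the population eligible to work."""
    if eligible_to_work <= 0:
        raise ValueError("eligible_to_work must be positive")
    return unemployed * 100 / eligible_to_work


# Roscommon unemployed, 2006 vs 2011 (figures from this post).
print(round(percent_change(1385, 5409), 1))  # 290.5

# Hypothetical example: 13 unemployed out of every 100 eligible to work.
print(unemployment_rate(13, 100))  # 13.0
```

Note that a figure nearly four times its starting value is an increase of roughly 290%, not 400%; the percentage change measures the difference, not the ratio.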

Creating Fusion Tables

  1. You need to have a Google Drive account.
  2. Add the Fusion Tables app:
    Settings, Manage Apps, Connect to more apps
    FT1
  3. Create your table:
    i) New
    ii) More
    iii) Google Fusion Tables
    FT2
  4. After following the steps above you will be prompted to import and name the relevant files.
    FT3
  5. Repeat steps 3 and 4 above for each table.

Merging Tables
Follow the steps below to merge your data tables.

  1. Open one of the tables you wish to merge.
  2. From the “File” menu choose Merge.
  3. Select the second table you want to merge.
  4. Select the matching variables; this is how the tables are joined.
  5. Choose which variables you want displayed in the new merged table.
  6. Fusion Tables automatically creates a new table containing the merged data.
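
Conceptually, the merge steps above are a join on a shared column. The Python sketch below shows a simple inner join for illustration only; it is not how Fusion Tables is implemented, and its own handling of unmatched rows may differ. The table contents and the `County` key are made-up examples.

```python
def merge_tables(left, right, key):
    """Inner join: keep only rows whose key value appears in both tables."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Columns from both tables end up in one combined row.
            merged.append({**row, **match})
    return merged


# Made-up rows: a population table and a geometry table sharing a "County" column.
population = [
    {"County": "Laois", "Population": 80000},
    {"County": "Roscommon", "Population": 64000},
]
shapes = [{"County": "Laois", "geometry": "polygon-for-Laois"}]

print(merge_tables(population, shapes, "County"))
```

Here Roscommon drops out because it has no matching geometry row, which is why the matching variable needs to be spelled identically in both tables before you merge.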

Merge1

After merging the total population data with the Counties KMZ file you can show the population of each county by assigning a colour code to each population range.  The image below shows two heat maps created with Fusion Tables, each showing the population by county, one for 2006 and the other for 2011.

2006 V 2011 - Population

Editing the Map
The colour assigned to each county is based on the size of its population. This can be edited using the Change Map function on the Map of Geometry tab. From there you can assign the ranges and colours you wish to use.
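
The idea behind the ranges is simple bucketing: each population falls into one range, and each range has a colour. Here is a small Python sketch of that logic; the boundaries and hex colours are hypothetical choices of mine, not the ones used in the maps above.

```python
from bisect import bisect_right

# Hypothetical range boundaries and colours; there is one more colour than boundaries.
BOUNDS = [50_000, 100_000, 200_000]
COLOURS = ["#ffffb2", "#fecc5c", "#fd8d3c", "#e31a1c"]


def colour_for(population: int) -> str:
    """Return the colour of the range the population falls into.

    Ranges are: under 50,000; 50,000 to 99,999; 100,000 to 199,999; 200,000 and over.
    """
    return COLOURS[bisect_right(BOUNDS, population)]


for value in (49_999, 50_000, 150_000, 1_000_000):
    print(value, colour_for(value))
```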

POPULATION 2006 v 2011
The maps above show the population spread by county in Ireland using the CSO Census data from 2006 and 2011. In 2008 Ireland suffered a huge property and financial crash, triggering one of its worst recessions, one that Ireland is only starting to come out of seven years later.  It has been widely reported that emigration during the recession of 2008-2014 depleted the smaller counties of Ireland.  But comparing the census data from 2006 with 2011 shows an average national increase of 8% across all counties.  Laois recorded the highest level of increase at 20%.

In 2006 over one in three counties had an average population of between 54,000.  In 2011 this dropped to just over 25%.  Excluding Dublin and Cork, which have a combined population of 1.8 million, the remaining counties have a median population of 136,640.

 

Unemployment

As mentioned above, the Irish population had an average growth rate of 8% between 2006 and 2011.  At the time of the 2006 census the Irish economy was booming and employment rates among the population aged 15 years and over were high.  When the bubble burst in 2008 the small towns and counties were hit the hardest with job losses.  In 2006 the country had an unemployment rate of 4.9%; by the time of the 2011 census the rate had more than doubled to 13%.  The map and chart below compare the numbers of those unemployed in 2006 and 2011 by county.  The contrast between the two is staggering.  Counties saw their unemployment numbers more than double, and in some cases they trebled.  Roscommon had a registered unemployed population aged 15 years and over of 1,385 in 2006; by 2011 this figure was 5,409, nearly four times as many (an increase of about 290%).  The average national increase was 290% for the time period covered. This was devastating to rural areas.

Unemployed 2006 v 2011