Data Management and Visualization – Coursera Course
The embedded PDF includes the code along with its output.
Let us now walk through it step by step.
Although the assignment asks for three variables, I restricted the analysis to two. The GapMinder data set shared on the course website includes only quantitative variables and no categorical ones. I considered adding “Continent” as a categorical variable but could not do so for lack of time. Similarly, I could not find any established bin ranges for “Internet Usage” or “Urban Rate”, the two variables I am exploring as part of the research. I therefore simply experimented with different numbers of bins to explore the distribution of these variables.
First, let us look into “Internet Usage”.
I will skip the details of importing the required libraries and reading the data from a CSV file. We start by printing the dimensions of the data set, i.e. the count of rows and columns.
Next, we convert the variables of interest to numeric and print a descriptive summary.
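The actual code is in the PDF, so what follows is only a minimal sketch of these two steps, assuming Python with pandas. The column names `internetuserate` and `urbanrate` and the handful of rows are made-up stand-ins for the GapMinder CSV, not the real data.

```python
import io
import pandas as pd

# Made-up rows standing in for the GapMinder CSV (assumption, not real data).
# A blank or whitespace-only cell mimics a missing entry.
csv_text = """country,internetuserate,urbanrate
A,3.65,24.04
B,44.99,46.72
C, ,65.22
D,81.0,
E,9.99,56.7
"""

data = pd.read_csv(io.StringIO(csv_text))

# Dimensions of the data set: (rows, columns)
print(data.shape)

# Convert the variables of interest to numeric; anything that cannot be
# parsed (such as a whitespace-only cell) becomes NaN.
for col in ["internetuserate", "urbanrate"]:
    data[col] = pd.to_numeric(data[col], errors="coerce")

# Descriptive summary: count, mean, std, min, quartiles, max
print(data[["internetuserate", "urbanrate"]].describe())
```

Using `errors="coerce"` is one way to turn unparseable cells into missing values in the same pass as the conversion.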
As stated in the assignment, “Data management includes such things as coding out missing data, coding in valid data, recoding variables, creating secondary variables and binning or grouping variables. Not everyone does all of these, but some is required.”
Let us start with null values. It does not make sense here to impute the data (fill in the missing values with substitutes such as the average of the valid values), so all null values are removed.
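Dropping the nulls could look roughly like this in pandas. This is a sketch with made-up values; the series name `internetuserate` is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up values with two missing entries (illustration only)
internet = pd.Series([3.65, np.nan, 44.99, 81.0, np.nan, 9.99],
                     name="internetuserate")

# Remove the missing values instead of imputing them
clean = internet.dropna()

print(len(internet), "values before,", len(clean), "after dropping nulls")
```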
Next, the frequency (the number of observations falling within each bin) is printed. The bin number is also printed for each data value; for example, if the first value lies in the 3rd bin, 3 is printed at the first index.
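One possible way to get both outputs is `pd.cut` with equal-width bins. This is a sketch on made-up values and not necessarily the method used in the PDF.

```python
import pandas as pd

# Made-up values (illustration only)
values = pd.Series([2.0, 5.0, 12.0, 30.0, 45.0, 61.0, 78.0, 95.0])
n_bins = 5

# Equal-width bins spanning the observed range
binned = pd.cut(values, bins=n_bins)

# Frequency per bin, listed in bin order
counts = binned.value_counts().sort_index()
print(counts)

# 1-based bin number for each value, in the original order of the data
bin_number = pd.cut(values, bins=n_bins, labels=False) + 1
print(bin_number.tolist())
```

Here `labels=False` returns 0-based integer codes, so 1 is added to match the "3rd bin prints 3" convention described above.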
Next, a histogram with 5 bins is plotted, overlaid with a pink dashed line indicating the average.
(Please ignore the graph title, which says “Histogram with 4 bins”.)
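A plot of this kind can be sketched with matplotlib as follows. The values are made up, and the colors and labels are my own choices rather than those in the PDF.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Made-up, right-skewed values (illustration only)
values = np.array([1.2, 2.5, 3.7, 5.0, 8.9, 12.3, 20.1,
                   28.0, 36.5, 44.9, 61.0, 77.6, 90.1])

fig, ax = plt.subplots()

# Histogram with 5 equal-width bins
counts, edges, _ = ax.hist(values, bins=5, color="steelblue", edgecolor="white")

# Dashed vertical line at the mean
ax.axvline(values.mean(), color="pink", linestyle="--", linewidth=2, label="mean")

ax.set_title("Histogram with 5 bins")
ax.set_xlabel("Internet usage")
ax.set_ylabel("Frequency")
ax.legend()
# plt.show()  # uncomment when running interactively
```

Changing `bins=5` to `bins=10` or `bins=20` reproduces the bin-count experiments described in this post.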
Next, a histogram with 10 bins is plotted. The right skew becomes clearly evident.
Next, a histogram in outline form is plotted.
After experimenting with 20 bins, a histogram with cumulative probability on the y-axis is plotted.
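For reference, matplotlib can draw an outline-style cumulative histogram directly. The sketch below uses synthetic right-skewed data, not the GapMinder values, and assumes matplotlib is the plotting library.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic right-skewed data capped at 100 (illustration only)
rng = np.random.default_rng(0)
values = np.clip(rng.exponential(scale=25.0, size=200), 0.0, 100.0)

fig, ax = plt.subplots()

# histtype="step" draws only the outline; density=True together with
# cumulative=True scales the y-axis so the last bin reaches 1.
cum, edges, _ = ax.hist(values, bins=20, density=True, cumulative=True,
                        histtype="step", color="steelblue")

ax.set_title("Cumulative histogram with 20 bins")
ax.set_xlabel("Internet usage")
ax.set_ylabel("Cumulative probability")
# plt.show()  # uncomment when running interactively
```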
Second, let us look into “Urban Rate”.
The analysis follows the same steps as for the first variable: null values are removed, and then different bin counts are tried to examine the distribution of the variable.
The distribution of the data looks more uniform. Let us dig deeper by increasing the number of bins.
The hunch was right: the data is spread “almost” uniformly across the range.
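A quick numeric check of this impression is to compare the per-bin counts with the count each bin would hold under a perfectly uniform spread. The sketch below uses synthetic uniform data as a stand-in for the urban rate, so the numbers are illustrative only.

```python
import numpy as np

# Synthetic stand-in for the urban rate values (illustration only)
rng = np.random.default_rng(1)
urban = rng.uniform(10.0, 100.0, size=180)

results = {}
for n_bins in (5, 10, 20):
    counts, _ = np.histogram(urban, bins=n_bins)
    expected = len(urban) / n_bins  # count per bin if perfectly uniform
    results[n_bins] = counts
    print(f"{n_bins} bins: {counts.tolist()} (expected about {expected:.1f} per bin)")
```

If the counts stay close to the expected value as the number of bins grows, the distribution is roughly uniform; large, systematic deviations would point to skew or clustering.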
Instead of recoding the variables or creating secondary variables, I preferred to experiment with bin counts, as these give more visibility into the distributions and there are no pre-defined ranges that work for “Internet Usage” or “Urban Rate”.