Internet Use Rate Vs Urban Rate

Data Management and Visualization – Coursera Course

Assignment 3

The embedded pdf includes the code along with output.

(Coursera-Data Visualization-Assignment 3)

Let us now discuss it bit by bit.

Although the assignment asks for three variables, I restricted the analysis to two. The GapMinder data set shared on the course website includes only quantitative variables and no categorical ones. I thought of adding “Continent” as a categorical variable but could not do so for lack of time. Similarly, I could not find any established bin ranges for “Internet Usage” or “Urban Rate”, the variables I was exploring as part of the research. Therefore I simply experimented with various numbers of bins to explore the distribution of these variables.

  1. First, let us look into “Internet Usage”

I am skipping the details of reading data from a CSV file and importing the required libraries. We start by printing the dimensions of the data set, i.e. the count of rows and columns.

1
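For readers who want to reproduce this step, here is a minimal sketch. It assumes Python with pandas, and it assumes the column names `internetuserate` and `urbanrate`. The small in-memory CSV is made up for illustration and only stands in for the real GapMinder file; the actual code is in the PDF.

```python
import io

import pandas as pd

# Made-up stand-in for the GapMinder CSV (values are illustrative only).
# With the real file you would pass its path to pd.read_csv instead.
sample_csv = io.StringIO(
    "country,internetuserate,urbanrate\n"
    "A,3.65,24.04\n"
    "B,44.99,46.72\n"
    "C,,65.22\n"
    "D,81.0,88.92\n"
)

data = pd.read_csv(sample_csv)

# DataFrame.shape is a (rows, columns) tuple.
n_rows, n_cols = data.shape
print("rows:", n_rows)
print("columns:", n_cols)
```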

Next we set the variables of interest to numeric, and print a descriptive summary.
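A sketch of what this conversion could look like, again assuming pandas and the assumed column names `internetuserate` and `urbanrate`; the values are invented. `errors="coerce"` turns anything that cannot be parsed as a number (such as blanks) into NaN.

```python
import pandas as pd

# Invented values stored as text, with blanks standing in for missing data.
data = pd.DataFrame({
    "internetuserate": ["3.65", "44.99", " ", "81.0"],
    "urbanrate": ["24.04", "46.72", "65.22", ""],
})

# Convert each variable of interest to a numeric dtype;
# unparseable entries become NaN instead of raising an error.
for col in ["internetuserate", "urbanrate"]:
    data[col] = pd.to_numeric(data[col], errors="coerce")

# describe() reports count, mean, std, min, quartiles and max,
# ignoring NaN values.
summary = data[["internetuserate", "urbanrate"]].describe()
print(summary)
```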

As stated in the assignment, “Data management includes such things as coding out missing data, coding in valid data, recoding variables, creating secondary variables and binning or grouping variables. Not everyone does all of these, but some is required.”

Let us start with null values. It does not make sense to impute the data (fill in the missing values with substitutes such as the average of the valid values), so all null values are removed.

2
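Removing the nulls might look like the sketch below. It assumes pandas and drops missing values per variable, since each variable is examined on its own; whether the original code drops per column or per row is an assumption here. The values are invented.

```python
import numpy as np
import pandas as pd

# Invented values; NaN marks a missing observation.
data = pd.DataFrame({
    "internetuserate": [3.65, 44.99, np.nan, 81.0],
    "urbanrate": [24.04, np.nan, 65.22, 88.92],
})

# Drop missing values separately for each variable, so a country missing
# one variable still contributes to the other variable's distribution.
internet = data["internetuserate"].dropna()
urban = data["urbanrate"].dropna()

print("rows before:", len(data))
print("internet use rate values kept:", len(internet))
print("urban rate values kept:", len(urban))
```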

Next, the frequency (the number of data occurrences within each bin) is printed. The bin number is also printed against each data value: if the first value lies in the 3rd bin, 3 is printed at the first index.

3
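One way to get both the per-value bin number and the per-bin frequency is `pandas.cut` with equal-width bins. This is only a sketch under that assumption (the PDF may use a different function), and the values are invented.

```python
import pandas as pd

# Invented internet use rates, nulls already removed.
internet = pd.Series([3.65, 12.0, 44.99, 51.3, 81.0, 95.6])

n_bins = 5

# labels=False returns the 0-based bin index for every value;
# adding 1 gives the 1-based bin number used in the text.
bin_numbers = pd.cut(internet, bins=n_bins, labels=False) + 1

# Frequency per bin number (bins with no values do not appear).
bin_counts = bin_numbers.value_counts().sort_index()

print(list(bin_numbers))
print(bin_counts)
```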

Next, a histogram with 5 bins is plotted, overlaid with a dashed pink line indicating the average.

output_0_1

(Please ignore the title of the graph where it says Histogram with 4 bins)
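A plot like the one above can be sketched with matplotlib as follows. The library choice, the axis labels and the values are assumptions for illustration, not the code from the PDF.

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented, right-skewed internet use rates (nulls already removed).
values = np.array([1.2, 2.5, 3.7, 5.0, 8.1, 9.9, 12.3, 20.4,
                   28.0, 36.5, 44.9, 61.0, 77.6, 90.1])

fig, ax = plt.subplots()

# hist() returns the per-bin counts and the bin edges.
counts, edges, _ = ax.hist(values, bins=5, edgecolor="black")

# Overlay the mean as a dashed pink vertical line.
mean_value = values.mean()
ax.axvline(mean_value, color="pink", linestyle="dashed", linewidth=2,
           label=f"mean = {mean_value:.1f}")

ax.set_title("Histogram with 5 bins")
ax.set_xlabel("Internet use rate")
ax.set_ylabel("Number of countries")
ax.legend()
# In an interactive session, call plt.show() to display the figure.
```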

Next, a histogram with 10 bins is plotted. The skew to the right now becomes clearly evident.

output_0_2

Next, the histogram is plotted in outline.

output_0_3

After experimenting with 20 bins, a histogram with cumulative probability on the y-axis is plotted.

output_0_6
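The outline and cumulative variants can be sketched with the same matplotlib `hist` call by changing its options. As before, this assumes matplotlib and uses invented values. `histtype="step"` draws only the outline, and `density=True` together with `cumulative=True` makes the bar heights run up to 1.

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented, right-skewed internet use rates (nulls already removed).
values = np.array([1.2, 2.5, 3.7, 5.0, 8.1, 9.9, 12.3, 20.4,
                   28.0, 36.5, 44.9, 61.0, 77.6, 90.1])

fig, (ax_outline, ax_cum) = plt.subplots(1, 2, figsize=(10, 4))

# Outline-only histogram with 10 bins.
ax_outline.hist(values, bins=10, histtype="step")
ax_outline.set_title("Histogram in outline, 10 bins")

# Cumulative histogram with 20 bins; each bar is the share of
# observations at or below that bin, so the last bar is 1.
cum, edges, _ = ax_cum.hist(values, bins=20, density=True, cumulative=True)
ax_cum.set_title("Cumulative histogram, 20 bins")
ax_cum.set_ylabel("Cumulative probability")
# In an interactive session, call plt.show() to display the figure.
```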

  2. Second, let us look into “Urban Rate”

The analysis follows the same steps as for the first variable. Null values are removed, and then different bin counts are tried to examine the distribution of the variable.

output_0_8

The distribution of the data is more uniform … let us dig deeper by increasing the number of bins.

output_0_9

The hunch was right: the data is spread “almost” uniformly across the range.

Instead of recoding the variables or creating secondary variables, I preferred to experiment with bin counts, as these provide more visibility and there are no pre-defined ranges that work for “Internet Rate” or “Urban Rate”.
