GapMinder Dataset: Exploratory Analysis

Three numeric variables were selected: lifeexpectancy, employrate, and internetuserate. First, a subset of the dataset was created, since we were primarily interested in the countries with lifeexpectancy less than or equal to 70. All of the exploratory analysis was done on this subset.
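The subsetting step can be sketched as follows. This is a minimal illustration, assuming pandas (the post does not name its tooling); the small table and its values are invented stand-ins for the real GapMinder data, and only the three column names come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the GapMinder table; all values are invented.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "lifeexpectancy": [55.2, 70.0, 81.4, np.nan, 63.9],
    "employrate": [61.5, 48.3, 58.9, 72.1, np.nan],
    "internetuserate": [3.1, 12.4, 80.2, 5.6, 9.9],
})

cols = ["lifeexpectancy", "employrate", "internetuserate"]
# Coerce to numeric so that blank or malformed cells become NaN.
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

# Keep countries with lifeexpectancy <= 70.
# Rows with a missing lifeexpectancy fail the comparison and drop out.
sub = df[df["lifeexpectancy"] <= 70].copy()
print(sub)
```

Taking a copy of the filtered frame keeps later column assignments (such as the grouped variables) from touching the original table.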

Since the selected variables are continuous, each one was binned (grouped) into a few bins: the employrate and lifeexpectancy variables were binned into 4 equal-depth groups, whereas the internetuserate variable was binned into 3 equal-width groups. For each variable, a frequency table was computed and the missing values were coded out (set to NaN). We tried to understand how lifeexpectancy varied across the countries in the different groups of employrate and internetuserate.
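To make the difference between the two binning schemes concrete, here is a rough sketch, again assuming pandas and using invented values. Equal-depth binning (pd.qcut) puts roughly the same number of countries in each group, while equal-width binning (pd.cut) splits the value range into intervals of the same length. The column names ending in _grp are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
sub = pd.DataFrame({
    "employrate": [45.0, 52.0, 58.0, 61.0, 66.0, 71.0, 75.0, 80.0, np.nan],
    "internetuserate": [1.0, 2.5, 4.0, 10.0, 15.0, 20.0, 29.0, 30.0, 0.0],
})

# Equal-depth: 4 groups cut at the quartiles of the observed values.
sub["employrate_grp"] = pd.qcut(sub["employrate"], q=4)

# Equal-width: 3 intervals of the same length over the observed range.
sub["internetuserate_grp"] = pd.cut(sub["internetuserate"], bins=3)

# Frequency tables; dropna=False keeps a row for the missing values.
employ_counts = sub["employrate_grp"].value_counts(sort=False, dropna=False)
internet_counts = sub["internetuserate_grp"].value_counts(sort=False, dropna=False)
print(employ_counts)
print(internet_counts)
```

A missing employrate stays missing after binning, which is how NaN rows end up being counted separately in a frequency table.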

The frequency distributions of the managed (grouped) variables are shown below. As can be seen, the number of countries with (low) employrate less than 56 is 19, whereas the number of countries with (high) employrate greater than 70 is around 18. The table also shows that 3 countries in the subset had NaN values for employrate.


The heatmap of the crosstab below shows that a relatively high number of countries have low internetuserate but high employrate!
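The cross-tabulation behind such a heatmap can be sketched like this, assuming pandas. The group labels and counts are invented, and the _grp column names are hypothetical.

```python
import pandas as pd

# Invented grouped variables; the labels stand in for the real bins.
sub = pd.DataFrame({
    "employrate_grp": ["low", "low", "high", "high", "high", "mid"],
    "internetuserate_grp": ["low", "high", "low", "low", "low", "mid"],
})

# Count the countries falling in each (employrate, internetuserate) group pair.
ct = pd.crosstab(sub["employrate_grp"], sub["internetuserate_grp"])
print(ct)
```

The resulting table of counts is what a heatmap function (for example seaborn.heatmap, if seaborn is available) would colour cell by cell.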


The scatter plot of employrate vs. internetuserate above shows that countries with lifeexpectancy > 60 have higher internetuserate in general. The box plots below show that lifeexpectancy is highest on average for the countries in the second-lowest employrate group, but in the highest internetuserate group.
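The per-group comparison that the box plots summarise can also be checked numerically. The sketch below assumes pandas and uses invented values; the labels Q1 to Q3 and the employrate_grp column name are hypothetical.

```python
import pandas as pd

# Invented values; Q1..Q3 stand in for the employrate groups.
sub = pd.DataFrame({
    "lifeexpectancy": [50.0, 54.0, 66.0, 68.0, 58.0, 60.0],
    "employrate_grp": ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3"],
})

# Mean, median and count of lifeexpectancy within each employrate group.
summary = sub.groupby("employrate_grp")["lifeexpectancy"].agg(["mean", "median", "count"])

# Group with the highest average lifeexpectancy.
top_group = summary["mean"].idxmax()
print(summary)
print(top_group)
```

The same aggregation applied to the internetuserate groups gives the second half of the comparison.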




