Emmy Viewership Across Genres

I was wondering how TV viewership varies across genres. Searching for data, I could not find much besides Wikipedia, so I settled for US viewership across two genres – Drama and Comedy.

The first step was to scrape the data using R (the htmltab and stringr packages). I should have written a more refined function – a looping version is sketched after the code block below – but for now this is what it is:


## `````````````````````````````````````````````
#### Scrape Data ####
## `````````````````````````````````````````````

### creating a df of wikipedia URLs for all winners ####
 # 0. libraries ####
 library(dplyr)
 library(stringr)
 library(ggplot2)
 library(htmltab)
 # 1. wd ####
 # read list of winners from 2015 back to 2004
 setwd("") # as per your system
 # 2. df.winners ####
 df.winners = read.csv(
 "2. Data/winnerlist.csv",
 header = TRUE,
 stringsAsFactors = FALSE,
 na.strings = c("", "NA")
 )
 # 3. v.underscore ####
 # lower-case the series names and replace spaces with underscores
 v.underscore =
 str_trim(df.winners$name) %>%
 str_to_lower() %>%
 str_replace_all(" ","_")
 # 4. special cases ####
 # a. veep
 # capitalize it again
 v.underscore[2] = "Veep"
 # b. the office
 # https://en.wikipedia.org/wiki/List_of_The_Office_(U.S._TV_series)_episodes
 v.underscore[20] = "The_Office_(U.S._TV_series)"
 # c.everybody loves raymond
 # capitalize again
 v.underscore[22] = "Everybody_Loves_Raymond"
 # https://en.wikipedia.org/wiki/List_of_everybody_loves_raymond_episodes
 # https://en.wikipedia.org/wiki/List_of_Everybody_Loves_Raymond_episodes
 # 5. v.underscore.1 ####
 v.underscore.1 = sapply(v.underscore, function(x) {paste0("_",x,"_")}, simplify=TRUE, USE.NAMES=FALSE)
 # 6. base and tail URL ####
 base.url="https://en.wikipedia.org/wiki/List_of"
 tail.url = "episodes"
 # 7. v.url ####
 v.url = sapply(v.underscore.1, function(x) {paste0(base.url,x,tail.url)}, simplify=TRUE, USE.NAMES=FALSE)
 # 8. test valid url ####
 browseURL(sample(v.url,1))
 # 9. clean up ####
 rm(base.url)
 rm(tail.url)
 rm(v.underscore)
 rm(v.underscore.1)

### looping mechanism for going through all urls in df.winners ####
 # 1.vector of means ####
 # src:
 # http://stackoverflow.com/questions/22235809/append-value-to-empty-vector-in-r
 # <<- global assignment
 v.mean <<- numeric(NROW(df.winners))
 # 2.counter var ####
 # not required any more
 # i.count = 0
 # 3.url var ####
 i.url = 1
 # 4.reset i.mean ####
 i.mean = 0
 # 5.download table from wikipedia ####
 ## SRC
 ## http://stackoverflow.com/questions/7407735/importing-wikipedia-tables-in-r
 ## PROBLEM: we do not know in advance which table number on the
 ## episode-list page corresponds to the winning season

# helper function
 ## function get.mean() ####
 get.mean = function(i.url, i.table)
 {
 # if(i.url == 21)
 # browser()

df.temp = htmltab(v.url[i.url],i.table)

# remove row with description
 v.t =
 df.temp %>%
 select(contains("viewers"))

# assuming numeric entries have string length less than 6 (e.g. "10.88" is length 5); longer entries are description text
 v.t = v.t[(sapply(v.t,str_length))<6,]

# dump v.t for later
 f.name = paste0(i.url,".csv")
 f.name = paste0("4. Scraped Data/",f.name)

# storing Title Name
 s.url = sprintf("The url is %s",v.url[i.url])
 write.table(s.url,file=f.name, row.names=F,na="NA",append=T, quote= FALSE, sep=",", col.names=F)

# Storing Table Number
 s.table = sprintf("The table is %s",i.table)
 write.table(s.table,file=f.name, row.names=F,na="NA",append=T, quote= FALSE, sep=",", col.names=F)

# Storing Table Values
 s.value = sprintf("The values are")
 write.table(s.value,file=f.name, row.names=F,na="NA",append=T, quote= FALSE, sep=",", col.names=F)
 write.table(v.t,file=f.name, row.names=T,na="NA",append=T, quote= FALSE, sep=",", col.names=F)

# calculating mean of the col containing viewers
 i.mean =
 v.t %>%
 as.matrix() %>%
 as.vector() %>%
 as.numeric() %>%
 mean(.,na.rm=TRUE) %>%
 round(.,2)

return(i.mean)

} # end get.mean

### scrape data one by one for each winner ####

# 2015 d ####
 # for game of thrones, season 5 is the winning season
 i.table = 5
 i.mean = get.mean(i.url,i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2015 c ####
 # for veep, season 4 is the winning season
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2014 d ####
 # for breaking bad
 #browseURL(v.url[3])
 i.table = 6
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2014 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 6
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2013 d ####
 # for breaking bad
 #browseURL(v.url[3])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2013 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2012 d ####
 # for homeland
 #browseURL(v.url[3])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2012 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 4
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2011 d ####
 # for mad men
 #browseURL(v.url[3])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2011 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2010 d ####
 # for mad men
 #browseURL(v.url[3])
 i.table = 4
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2010 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 2
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2009 d ####
 # for mad men
 #browseURL(v.url[3])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2009 c ####
 # for 30 Rock
 # browseURL(v.url[i.url])
 i.table = 4
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2008 d ####
 # for mad men
 #browseURL(v.url[3])
 i.table = 2
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2008 c ####
 # for 30 Rock
 # browseURL(v.url[i.url])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2007 d ####
 # for the sopranos
 #browseURL(v.url[3])
 i.table = 7
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2007 c ####
 # for 30 Rock
 # browseURL(v.url[i.url])
 i.table = 2
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2006 d ####
 # for 24
 #browseURL(v.url[3])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2006 c ####
 # for The Office
 # browseURL(v.url[i.url])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2005 d ####
 # for Lost
 #browseURL(v.url[3])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2005 c ####
 # for Everybody Loves Raymond
 # browseURL(v.url[i.url])
 i.table = 10
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2004 d ####
 # for The Sopranos
 #browseURL(v.url[3])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2004 c ####
 # for Arrested Development
 # browseURL(v.url[i.url])
 i.table = 2
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

## `````````````````````````````````````````````
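In hindsight, the repeated per-year blocks could be collapsed into a single loop. A minimal sketch (not the code I actually ran), assuming the hand-checked table numbers are collected into one vector in the same order as df.winners:

 # hypothetical refactor: table numbers as verified above, one per winner (2015 d, 2015 c, ..., 2004 c)
 v.table = c(5, 5, 6, 6, 5, 5, 3, 4, 5, 3, 4, 2, 3, 4, 2, 3, 7, 2, 5, 3, 3, 10, 5, 2)
 v.mean = numeric(NROW(df.winners))
 for (i in seq_along(v.table)) {
  v.mean[i] = get.mean(i, v.table[i])
 }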

Once we have scraped the data, it's time to visualize it. Here is how it goes:

## `````````````````````````````````````````````
#### Visualization ####
## `````````````````````````````````````````````

## setup ####

## appending the mean to the df
df.winners$mean.us.viewers.m = v.mean

# making genre a factor
df.winners$genre = as.factor(df.winners$genre)

## clean up
#rm(i.count)
rm(i.mean)
rm(i.table)
rm(i.url)
rm(v.mean)
rm(v.url)

## writing out the csv file
# write.csv(df.winners,"winnersList.csv")

# for plotting setting Year as Date Type (else treated as integer)
df.winners$year2 = as.Date(as.character(df.winners$year),format="%Y")

## color codes ####
col.grey = "#707070"
col.teal = "#368C8C"
col.blue = "#4682B4"
col.mid.green = "#98EFC1"
col.lig.green = "#B8FFD1"
col.dark.red = "darkred"


## g.1 ####

g.1 = ggplot(data=df.winners) + theme_minimal()

g.1 = g.1 + geom_line(data = filter(df.winners, grepl('COMEDY', genre)),
aes(x = year2, y = mean.us.viewers.m),alpha=0.8,size=0.5,col=col.lig.green)

g.1 = g.1 + geom_point(data = filter(df.winners, grepl('COMEDY', genre)),
aes(x = year2, y = mean.us.viewers.m),alpha=0.6,size=2,col=col.blue)

g.1 = g.1 + geom_line(data = filter(df.winners, grepl('DRAMA', genre)),
aes(x = year2, y = mean.us.viewers.m),alpha=0.8,size=0.5,col=col.lig.green)

g.1 = g.1 + geom_point(data = filter(df.winners, grepl('DRAMA', genre)),
aes(x = year2, y = mean.us.viewers.m),alpha=0.4,size=2,col=col.dark.red)

#g.1 = g.1 + geom_point(aes(x=year2,y=mean.us.viewers.m, col=genre),size=2)

g.2 = g.1 + theme (
panel.background = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
#plot.caption=element_text(size=8, margin=margin(t=24),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
#axis.title.x = element_blank(margin=margin(t=-24)),
#axis.title.y = element_blank()
axis.title.y = element_text(size = 8),
#axis.title.x = element_text(margin=margin(t=-20)),
axis.title.x = element_text(margin=margin(0,0,-10,0), vjust=-5),

# src:
# https://rud.is/b/2016/06/16/your-data-vis-spidey-sense-the-need-for-a-robust-utility-belt/
plot.subtitle=element_text(size=9.5, margin=margin(b=10)),
plot.caption=element_text(size=7, margin=margin(t=-10)),
# margin around entire plot (‘unit’ with the sizes of the top, right, bottom, and left
# margins)
plot.margin=margin(10,10,10,10)

)

g.2 = g.2 +
labs(
x = NULL,
y = NULL,
title = "How viewership varies for Emmy Winners",
subtitle = "Mean of \"US Viewers Per Episode\", in Millions across the years (Wikipedia)"#,
# overlaps x axis
#caption = "Source: Wikipedia"
)

g.2 = g.2 +
scale_x_date(
breaks = seq(as.Date("2004/1/1"), as.Date("2016/1/1"), by = "years"),
labels = c("2004","","06","","08","","10","","12","","14","","16")
)

g.2 = g.2 + scale_y_continuous(limits=c(0,20))

# label for genre DRAMA
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2004/2/1"), y=10, label="DRAMA Genre"),col=col.dark.red,alpha=0.4,fontface = "bold")

# label for genre COMEDY
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2004/2/1"), y=5, label="COMEDY Genre"),col=col.blue,alpha=0.6,fontface = "bold")


# label for cross over point 1
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2005/12/1"), y=19, label="Cross-over \n Point 1"))

# highlight circle 1 for cross over point 1
g.2 = g.2 + geom_point(data=data.frame(),aes(x=as.Date("2005/7/1"), y=19),alpha=0.1,size=8,col=col.grey)

# label for cross over point 2
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2007/9/1"), y=4.3, label="Cross-over \n Point 2"))

# highlight circle 2 for cross-over point 2
g.2 = g.2 + geom_point(data=data.frame(),aes(x=as.Date("2007/12/1"), y=5.7),alpha=0.1,size=8,col=col.grey)

# label for cross over point 3
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2014/1/1"), y=5.3, label="Cross-over \n Point 3"))

# highlight circle 3 for cross-over point 3
g.2 = g.2 + geom_point(data=data.frame(),aes(x=as.Date("2015/1/1"), y=5.7),alpha=0.1,size=8,col=col.grey)

# Big bold line at y=0
# src
# http://t-redactyl.io/blog/2016/01/creating-plots-in-r-using-ggplot2-part-3-bar-plots.html
g.2 = g.2 + geom_hline(yintercept=0,size=1.2,colour="#535353")

g.2


ggsave(filename="538.png",dpi=600)

 
And here is the final result:

Plot of Viewership for Emmy Winners

For formatted code: https://github.com/patternproject/r.emmyWinners

IKMC Data Scraping

My son recently appeared for IKMC, a mathematics competition that starts at the inter-class level and then scales up to the global level.

I was exploring their website when I came across some global data on participation in the competition. I thought of scraping it and doing some visualization.

This part explores the scraping bit.

# Note: adding four # at the end of a comment turns it into a foldable section in RStudio
## loading libraries ####
library(dplyr)
# install.packages("rvest") # run once if rvest is not already installed
library(rvest)
library(stringr)
## setting up the url ####
url = "http://ikmc.kangaroo.org.pk/global-statistics.html"
css = "#DivData td , strong"
css.country = "td:nth-child(1) strong" # only the country names
# testing valid url
browseURL(url)
## reading in the data ####
# reading the url html
page <- read_html(url)
## country list ####
raw.country <- page %>%
 html_nodes(css.country) %>%
 html_text(trim = TRUE) %>%
 unlist()
# except the top heading "Country" all the rest are valid entries
raw.country <- raw.country[-1]; # without 1st element which is simply "Country"
# total count of countries, there is no repetition here
total.rows = length(raw.country)
## years list ####
# no of years, goes from 1994 to 2014
total.columns = 2014 - 1994 + 1
## all data read in ####
# the complete data matrix
raw.data <- page %>%
 html_nodes(css) %>%
 html_text(trim = TRUE) # %>%
 #str_replace_all("[^[:digit:]]","") # removing extra text
data.1 <- unlist(raw.data)
## generic analysis ####
# match() returns the first position of a value in a vector, e.g. match('b', v)
first.country.start = match(raw.country[1],data.1)
first.country.end = match(raw.country[2],data.1)
i.temp = seq(1,first.country.start-1) # header cells before the first country (country names are captured twice, so keep both copies)
value.data <- raw.data[-i.temp]
# each country row spans 23 scraped cells (the name is captured twice, plus 21 yearly values)
mydf <- data.frame(matrix(value.data, nrow=total.rows, ncol=23, byrow=TRUE))
# subset to remove the 1st column
my.df = mydf[-1]
# setting the names
df.names = as.character(seq(2014,1994))
df.names = c("Country",df.names)
names(my.df) = df.names
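
With the wide table built, a natural next step (not part of the original scrape) would be to reshape it to a long format for plotting. A minimal sketch, assuming my.df as constructed above and that the yearly cells hold participant counts stored as text:

library(tidyr)
# hypothetical reshape: one row per (Country, Year)
df.long <- my.df %>%
 gather(key = "Year", value = "Participants", -Country) %>%
 mutate(Year = as.integer(Year),
        Participants = as.numeric(str_replace_all(Participants, "[^[:digit:]]", "")))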

 

21 March – MakeOverMonday

Andy suggested a makeover of the Guardian's dynamic visualization of women's rights across various countries.

Instead of keeping all the metrics together, I made a separate box plot for each one, such as Abortion, and then combined all six for the final plot.

For the individual charts, I counted the number of Yes answers to the various questions, and this "YesCount" is plotted on the x-axis. I mapped the countries to regions using the country index included with the data set; however, 24 countries had no index and were excluded from further analysis.

Women's Rights

I have overlaid the individual observations as dots on top, with a little jitter to show their density. I am not too sure what happened with “Equality”.

Here is the R Code:

library(dplyr)
library(ggplot2)
library(tidyr)
# --------------------------------------------------------------------
# Global Settings
# --------------------------------------------------------------------
title.color = "darkolivegreen"
# --------------------------------------------------------------------
df.region = read.csv("Region Info.csv")
df.region <- df.region %>%
 select(Region,Country,Country.Code)
# --------------------------------------------------------------------
# Abortion
# --------------------------------------------------------------------
df.abortion = read.csv("Abortion.csv")
names(df.abortion) = c("Country","Q1","Q2","Q3","Q4","Q5","Q6","Q7")
# reshape to long format: one row per (Country, Question)
df.m.abortion <- df.abortion %>%
 gather(key = "Question", value = "value", -Country)
df.m.abortion.reduced <- df.m.abortion %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.abortion.region <- left_join(df.m.abortion.reduced, df.region)
df.abortion.2 = df.abortion.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.abortion.2$Region = as.factor(df.abortion.2$Region)
p1.1 <- ggplot(df.abortion.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width=.2), size=2,alpha=0.2) +
 coord_flip()
p1.2 = p1.1 + ggtitle("Abortion")
p1.3 <- p1.2 +
 theme_minimal() +
 theme(
 #panel.grid.minor = element_blank(),
 #panel.grid.major = element_blank(),
 #axis.text.y = element_blank(),
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # no legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p1.4 <- p1.3 + 
 scale_y_continuous(breaks=seq(1, 7, 1)) # Ticks from 1-7, every 1
# final plot
p1.4
# --------------------------------------------------------------------
# unable to set the palette
library("RColorBrewer", lib.loc="/usr/local/lib/R/site-library")
display.brewer.pal(7,"Pastel1")
mypalette<-brewer.pal(7,"Pastel1")
# p7 <- p6 + 
# scale_fill_brewer(mypalette)
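# NOTE (assumption): scale_fill_brewer() expects the palette *name*, not a vector
# of colours, which is probably why the attempt above did not work; either of
# these applies Pastel1 to the Region fill (shown here on p1.4):
p7 <- p1.4 + scale_fill_brewer(palette = "Pastel1")
# p7 <- p1.4 + scale_fill_manual(values = mypalette)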
# removing df not required anymore
rm(df.abortion,df.m.abortion,df.m.abortion.reduced,df.abortion.region)
# --------------------------------------------------------------------
# Constitution
# --------------------------------------------------------------------
df.constitution = read.csv("Constitution.csv")
names(df.constitution) = c("Country","Q1","Q2","Q3")
# reshape to long format: one row per (Country, Question)
df.m.constitution <- df.constitution %>%
 gather(key = "Question", value = "value", -Country)
df.m.constitution.reduced <- df.m.constitution %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.constitution.region <- left_join(df.m.constitution.reduced, df.region)
df.constitution.2 = df.constitution.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.constitution.2$Region = as.factor(df.constitution.2$Region)
# p2.1 <- ggplot(df.constitution.2, aes(x=Region,y=YesCount)) +
# geom_boxplot(aes(fill=Region,colour=Region),alpha=I(0.3)) + 
# geom_jitter(position=position_jitter(width = 0.2, height = 0.01), size=2,alpha=0.5) +
# coord_flip()
p2.1 <- ggplot(df.constitution.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p2.2 = p2.1 + ggtitle("Constitution")
p2.3 <- p2.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # no legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p2.4 <- p2.3 + 
 scale_y_continuous(breaks=seq(1, 3, 1)) # Ticks from 1-3, every 1
# final plot
p2.4
# removing df not required anymore
rm(df.constitution,df.m.constitution,df.m.constitution.reduced,df.constitution.region)
# --------------------------------------------------------------------
# Domestic Violence
# --------------------------------------------------------------------
df.violence = read.csv("Domestic Violence.csv")
names(df.violence) = c("Country","Q1","Q2","Q3","Q4","Q5","Q6","Q7")
# reshape to long format: one row per (Country, Question)
df.m.violence <- df.violence %>%
 gather(key = "Question", value = "value", -Country)
df.m.violence.reduced <- df.m.violence %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.violence.region <- left_join(df.m.violence.reduced, df.region)
df.violence.2 = df.violence.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.violence.2$Region = as.factor(df.violence.2$Region)
p3.1 <- ggplot(df.violence.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p3.2 = p3.1 + ggtitle("Violence")
p3.3 <- p3.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # no legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p3.4 <- p3.3 + 
 scale_y_continuous(breaks=seq(1, 7, 1)) # Ticks from 1-7, every 1
# final plot
p3.4
# removing df not required anymore
rm(df.violence,df.m.violence,df.m.violence.reduced,df.violence.region)
# --------------------------------------------------------------------
# Other Equality
# --------------------------------------------------------------------
df.equality = read.csv("Other Equality.csv")
names(df.equality) = c("Country","Q1","Q2","Q3")
# reshape to long format: one row per (Country, Question)
df.m.equality <- df.equality %>%
 gather(key = "Question", value = "value", -Country)
df.m.equality.reduced <- df.m.equality %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.equality.region <- left_join(df.m.equality.reduced, df.region)
df.equality.2 = df.equality.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.equality.2$Region = as.factor(df.equality.2$Region)
p4.1 <- ggplot(df.equality.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p4.2 = p4.1 + ggtitle("Equality")
p4.3 <- p4.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # no legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p4.4 <- p4.3 + 
 scale_y_continuous(breaks=seq(1, 3, 1)) # Ticks from 1-3, every 1
# final plot
p4.4
# removing df not required anymore
rm(df.equality,df.m.equality,df.m.equality.reduced,df.equality.region)
# --------------------------------------------------------------------
# Property
# --------------------------------------------------------------------
df.property = read.csv("Property.csv")
names(df.property) = c("Country","Q1","Q2","Q3")
# reshape to long format: one row per (Country, Question)
df.m.property <- df.property %>%
 gather(key = "Question", value = "value", -Country)
df.m.property.reduced <- df.m.property %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.property.region <- left_join(df.m.property.reduced, df.region)
df.property.2 = df.property.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.property.2$Region = as.factor(df.property.2$Region)
p5.1 <- ggplot(df.property.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p5.2 = p5.1 + ggtitle("Property")
p5.3 <- p5.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # no legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p5.4 <- p5.3 + 
 scale_y_continuous(breaks=seq(1, 3, 1)) # Ticks from 1-3, every 1
# final plot
p5.4
# removing df not required anymore
rm(df.property,df.m.property,df.m.property.reduced,df.property.region)
# --------------------------------------------------------------------
# Work
# --------------------------------------------------------------------
df.work = read.csv("Work.csv")
names(df.work) = c("Country","Q1","Q2","Q3","Q4","Q5","Q6","Q7")
# reshape to long format: one row per (Country, Question)
df.m.work <- df.work %>%
 gather(key = "Question", value = "value", -Country)
df.m.work.reduced <- df.m.work %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.work.region <- left_join(df.m.work.reduced, df.region)
df.work.2 = df.work.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.work.2$Region = as.factor(df.work.2$Region)
p6.1 <- ggplot(df.work.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p6.2 = p6.1 + ggtitle("Work")
p6.3 <- p6.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # no legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p6.4 <- p6.3 + 
 scale_y_continuous(breaks=seq(1, 7, 1)) # Ticks from 1-7, every 1
# final plot
p6.4
# removing df not required anymore
rm(df.work,df.m.work,df.m.work.reduced,df.work.region)
# --------------------------------------------------------------------
# Combining all
# --------------------------------------------------------------------
# install.packages("cowplot")
# require(cowplot)
# library(cowplot)
require(gridExtra)
library(grid) # for textGrob() and gpar()
# pdf("foo.pdf")
# grid.arrange(p1.4,p2.4, p3.4,p4.4, p5.4,p6.4,ncol=2, nrow=3)
# dev.off()
# hint from https://jonkimanalyze.wordpress.com/2014/03/26/ggplot2-arrangegrob-arrange-ggplots-on-a-page/
grobframe <- arrangeGrob(p1.4,p2.4, p3.4,p4.4, p5.4,p6.4,ncol=2, nrow=3,
main = textGrob("\nWomen's Rights", gp = gpar(fontsize=60, fontface="bold")),
sub = textGrob("*X-Axis depicts Total Number of Yes to Different Parameters*", x=0, hjust=-0.5, vjust=0.1, gp = gpar(fontface = "italic", fontsize = 15)))
print(grobframe)
ggsave(filename="test.png",plot=grobframe,width = 20, height = 25, units = "in",dpi=400)
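
The six metric blocks above are near-identical, so they could be wrapped in a helper function. A sketch (not the code used for the published chart), assuming each CSV has the country in its first column followed by the Yes/No question columns, with df.region and title.color as defined above:

# hypothetical helper: one boxplot panel per metric file
make_panel <- function(csv.file, panel.title, max.q) {
 df <- read.csv(csv.file)
 names(df)[1] <- "Country"
 df.yes <- df %>%
  gather(key = "Question", value = "value", -Country) %>%
  group_by(Country) %>%
  filter(value == "Yes") %>%
  summarise(YesCount = n()) %>%
  left_join(df.region, by = "Country") %>%
  filter(!is.na(Region))
 ggplot(df.yes, aes(x = Region, y = YesCount)) +
  geom_boxplot(aes(fill = Region), alpha = I(0.5)) +
  geom_jitter(position = position_jitter(width = 0.2, height = 0.09), size = 2, alpha = 0.2) +
  coord_flip() +
  ggtitle(panel.title) +
  scale_y_continuous(breaks = seq(1, max.q, 1)) +
  theme_minimal() +
  theme(axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(family = "Times", face = "bold", size = 20, color = title.color))
}

# e.g. p1.4 <- make_panel("Abortion.csv", "Abortion", 7)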

					

2016 January – Dear Data Challenge

The theme for Jan was “Things We Say” … Here is the journey …

NLP

The first idea was to explore “NLP”. Interesting links are as follows:

  1. A toolkit for corpus linguistics (https://github.com/interrogator/corpkit)
  2. Machine Learning for Text Analysis (http://www.monkeylearn.com/)
  3. Segmentation of Twitter Timelines via Topic Modeling (http://alexperrier.github.io/jekyll/update/2015/09/16/segmentation_twitter_timelines_lda_vs_lsa.html)

Beaker

Given "Python" was preferred over "R", I thought of doing a mash-up of the two, and "Jupyter" led me to "Beaker" (http://www.opendatascience.com/blog/jupyter-zeppelin-beaker-the-rise-of-the-notebooks/ and http://blog.dominodatalab.com/interactive-data-science/)

Time Maps

Another idea was to explore “Time Maps” as done here to explore the 2015 Presidential Debates (http://alexperrier.github.io/jekyll/update/2015/11/19/timemaps-presidential-debates-dynamics.html)

Love Actually

But if you do explore any of the recommendations from this blog, go to (http://varianceexplained.org/r/love-actually-network/) – an amazing analysis of the movie dialogues from "Love Actually" by David Robinson.

Kate Winslet mentioned having relatively long dialogues in Steve Jobs (the 2015 movie). Inspired by the "Love Actually" analysis, I thought of comparing the average length of her dialogues there to some of her other movies, but I could not get past the hurdle of parsing the movie script file, which had no delimiters around the dialogues.

Sentiment analysis could have been another option, as done here for State of the Union speeches (http://www.moreorlessnumbers.com/2016/01/state-of-union-speeches-and-data.html).

WhatsApp and Facebook Analysis by Forrester

Reineke Reitsma wrote about the share of different messengers (Viber, Skype, etc.) used to send New Year wishes across Europe. (http://blogs.forrester.com/reineke_reitsma/15-01-05-the_data_digest_whatsapp_and_facebook_messenger_wish_us_a_happy_new_year)

And here is an R package that provides a suite of tools for collecting and constructing networks from social media data (https://t.co/91RkKRTby4)

My Analysis

I stuck to getting a count of the different messages in my email, as follows:

2016 Jan Data Capture

And ended up charting the split across the categories:

Analysis of Email Rcvd

And:

Analysis of Email Sent

This is the hand drawn version:

2016 Jan - Front

and the back:

2016 Jan - Back


2015 December – Dear Data Challenge

ANALYSIS

December’s challenge was “Christmas Music” for Dear Data / Chain Data.

I was confused by the "dear data" instructions where they mention "one week": I took it to mean my personal data collected over a period of one week. Thanks to Bridget (@windscogley) for clarifying it and throwing in some ideas to get me started …

Here is a rundown of my journey during the month:

https://music-api.musikki.com/
http://www.last.fm/api

Spotify Web API

https://developer.musicgraph.com/
https://developer.musicgraph.com/api-docs/v2/playlists

https://imvdb.com/developers/sandbox

Music APIs

http://www.cs.umd.edu/projects/MusicDigger/

https://www.discogs.com/developers/
https://musicbrainz.org/doc/MusicBrainz_Database

WORKING

  • I settled for plotting 2014 stats of music consumption across the various genres. Here it goes.
  • The data set comes from Nielsen's 2014 Music Report for the US:

Nielsen 2014 Music Report

I calculated the index using the min-max difference, as follows:

Index Data
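
For reference, a minimal sketch of that min-max index, assuming v is the numeric vector of 2014 consumption figures for one metric (the name is hypothetical):

# min-max index: rescale a vector to the 0-1 range
minmax = function(v) (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))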

And then created a "Parallel Coordinate Plot":

Parallel Coordinate Plot
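
The parallel coordinate plot itself can be drawn with ggparcoord() from the GGally package. A sketch, assuming a hypothetical data frame df.music with a genre column followed by the numeric consumption metrics (ggparcoord can also apply the min-max scaling itself via scale = "uniminmax"):

library(GGally)
# one line per genre, one vertical axis per metric, each axis min-max scaled
ggparcoord(df.music,
           columns = 2:ncol(df.music), # the numeric metric columns
           groupColumn = "genre",
           scale = "uniminmax") +
 theme_minimal()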

I then hand-drew it, plotting only the points and leaving out the lines to avoid clutter:

Dear Data Dec - Front

2015 November – Dear Data Challenge

Along the lines of dear data (http://www.dear-data.com/), where two courageous women

switched continents, get to know each other through the data they draw and send across the pond

Brittany Fong (@BFongData) took the initiative to take it wide and far. In her words:

Contrary to how we normally think about data, dear data isn’t about the numbers being visualized as accurately as possible.  I mean obviously we don’t want you to just make stuff up but the expectations for dear data are about stepping out of the box, doing something you’re not used to, and being creative

I have joined the bandwagon, and this is what I could muster to draw by hand … yeah it takes a lot of courage to ditch your favorite visualization tool

Dear Data

Here is the same thing, done by the machine

2015 Nov – Food – Ul Haq

(I was only able to recreate the third chart by hand, using a slightly different color scheme)

In summary, we collected food-related data for a week and then analyzed it. I used the Australian Survey Classification System to classify all the food items and then presented them in a stacked bar chart.
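
For the stacked bar itself, a minimal sketch, assuming a hypothetical data frame df.food with Day, FoodGroup, and Count columns produced by that classification:

library(ggplot2)
# one bar per day, stacked by food group
ggplot(df.food, aes(x = Day, y = Count, fill = FoodGroup)) +
 geom_col() +
 theme_minimal()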

Here are the summary stats for the chart:

Summary Data

and the same summary stats, on the back of the chart

2015 Nov - Food