What’s On The Menu (2)

Inspired by Rasmus Bååth’s post “A Fun Gastronomical Dataset: What’s on the Menu?”, I set out to do something similar, starting from the CSV file kindly shared by the maestro.

I thought of categorising the various food items and analysing the trend of healthy eating over time. However, I was stumped fairly quickly: I could not match whole words, even though the documentation suggested this should be possible using “boundary”.

Sadly, “Chateau” and “Steak” both matched the keyword “tea”. My unsuccessful attempt (one of various combinations I tried) looked something like this:

df.master %>%
 filter(str_detect(dish_name, regex(food, ignore_case = TRUE, boundary=type("word"))))
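For what it’s worth, one approach that does match whole words is to build the word-boundary anchors into the pattern itself, rather than passing `boundary` into `regex()`. A minimal sketch; `detect_word` is my own hypothetical helper, not something from the original post:

```r
library(stringr)

# anchor the keyword on word boundaries, so "tea" no longer
# matches inside "Chateau" or "Steak"
detect_word <- function(strings, keyword) {
  pattern <- paste0("\\b", keyword, "\\b")
  str_detect(strings, regex(pattern, ignore_case = TRUE))
}

dishes <- c("Iced Tea", "Chateaubriand", "Steak frites", "Green tea ice cream")
detect_word(dishes, "tea")
# TRUE FALSE FALSE TRUE
```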

Barging ahead, prices seemed a good starting point: how have they fared across the years?

food_over_time <- map_df(food, function(food) {

  df.master %>%
    filter(str_detect(dish_name, regex(food, ignore_case = TRUE, boundary = type("word")))) %>%
    mutate(food = food) %>%
    group_by(year, food) %>%
    summarise(avg_price = mean(price, na.rm = TRUE))

}) # end food_over_time

I ended up with a lot of NaNs, and as stated here and here, they are solely my doing. On a side note, Peter Bashai mentions a few methods for plotting missing values in d3.

Ignoring why the NaNs were produced, I decided to simply replace them with NA:

# is.nan() is provided to check specifically for NaN
food_over_time %>%
  filter(is.nan(avg_price))

# assign the result back, or the replacement is lost
food_over_time <- food_over_time %>%
  mutate(avg_price = ifelse(is.nan(avg_price), NA, avg_price))
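For the record, these NaNs come from taking a mean over zero values: when `na.rm = TRUE` drops every price in a food/year group, `mean()` ends up computing 0/0. A quick illustration:

```r
# mean() over an empty vector returns NaN (0/0), not NA
mean(numeric(0))                           # NaN
mean(c(NA_real_, NA_real_), na.rm = TRUE)  # NaN: na.rm drops both values

# this is exactly what summarise(avg_price = mean(price, na.rm = TRUE))
# yields for groups with no usable prices
is.nan(mean(numeric(0)))                   # TRUE
```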

And then imputing the data, using the defaults of the mice package:

# imputing it
library(mice) # needed for mice() and complete()

imp_food_over_time =
  food_over_time %>%
  mice()

summary(imp_food_over_time)

imp_data = complete(imp_food_over_time, 1)

Relying on the ggplot snippet from the original post:

# A reusable list of ggplot2 directives to produce a lineplot
food_time_plot <- list(
 geom_line(),
 geom_point(),
 facet_wrap(~ food),
 theme_minimal(),
 theme(legend.position = "none"))

And now plotting:

food_over_time %>%
  filter(food %in% c("coffee", "tea")) %>%
  ggplot(aes(year, avg_price, color = food)) + food_time_plot

This is what I get:

[Plot: average price of coffee and tea over time]

Oops.

We can remove that peak value of 2103 in 1999, but something is fundamentally wrong there. Here is the data set for 1999:

# what happened to tea in 1999
df.tea.99 = 
df.master %>%
 filter(year == 1999) %>%
 filter(str_detect(dish_name, regex("tea", ignore_case = TRUE, boundary=type("word")))) %>%
 select(year,dish_name,price)

and the resulting data set has entries like “Chateaubriand tranche a table, aux legumes en sauce bearnaise” that have nothing to do with “tea”.


Emmy Viewership Across Genres

I was wondering how TV viewership varies across genres. Searching for data, I could not find much besides Wikipedia, and settled for US viewership across two genres: Drama and Comedy.

The first step was to scrape the data using R (the htmltab package does the heavy lifting). I should have written a more refined function, but for now this is what it is:


## `````````````````````````````````````````````
#### Scrape Data ####
## `````````````````````````````````````````````

### creating a df of wikipedia URL for all winners ####
 # 1. wd ####
 # read list of winners from 2015 back to 2005
 setwd("") # as per your system
 # 2. df.winners ####
 df.winners = read.csv(
 "2. Data/winnerlist.csv",
 header = TRUE,
 stringsAsFactors = FALSE,
 na.strings = c("", "NA")
 )
 # 3. v.underscore ####
 # replace spaces in series names with underscores
 v.underscore =
  str_trim(df.winners$name) %>%
  str_to_lower() %>%
  str_replace_all(" ","_")
 # 4. special cases ####
 # a. veep
 # capitalize it again
 v.underscore[2] = "Veep"
 # b. the office
 # https://en.wikipedia.org/wiki/List_of_The_Office_(U.S._TV_series)_episodes
 v.underscore[20] = "The_Office_(U.S._TV_series)"
 # c.everybody loves raymond
 # capitalize again
 v.underscore[22] = "Everybody_Loves_Raymond"
 # https://en.wikipedia.org/wiki/List_of_everybody_loves_raymond_episodes
 # https://en.wikipedia.org/wiki/List_of_Everybody_Loves_Raymond_episodes
 # 5. v.underscore.1 ####
 v.underscore.1 = sapply(v.underscore, function(x) {paste0("_",x,"_")}, simplify=TRUE, USE.NAMES=FALSE)
 # 6. base and tail URL ####
 base.url="https://en.wikipedia.org/wiki/List_of"
 tail.url = "episodes"
 # 7. v.url ####
 v.url = sapply(v.underscore.1, function(x) {paste0(base.url,x,tail.url)}, simplify=TRUE, USE.NAMES=FALSE)
 # 8. test valid url ####
 browseURL(sample(v.url,1))
 # 9. clean up ####
 rm(base.url)
 rm(tail.url)
 rm(v.underscore)
 rm(v.underscore.1)

### looping mechanism for going through all urls in df.winners ####
 # 1.vector of means ####
 # src:
 # http://stackoverflow.com/questions/22235809/append-value-to-empty-vector-in-r
 # <<- global assignment
 v.mean <<- numeric(NROW(df.winners))
 # 2.counter var ####
 # not required any more
 # i.count = 0
 # 3.url var ####
 i.url = 1
 # 4.reset i.mean ####
 i.mean = 0
 # 5.download table from wikipedia ####
 ## SRC
 ## http://stackoverflow.com/questions/7407735/importing-wikipedia-tables-in-r
 ## PROBLEM: we do not know in advance which table number
 ## corresponds to the winning season

# helper function
 ## function get.mean() ####
 get.mean = function(i.url, i.table)
 {
 # if(i.url == 21)
 # browser()

df.temp = htmltab(v.url[i.url],i.table)

# remove row with description
 v.t =
  df.temp %>%
  select(contains("viewers"))

# assuming a non-text row has length less than 6 ("10.88", 2 decimal places, is length 5)
 v.t = v.t[(sapply(v.t, str_length)) < 6, ]

# dump v.t for later
 f.name = paste0(i.url,".csv")
 f.name = paste0("4. Scraped Data/",f.name)

# storing Title Name
 s.url = sprintf("The url is %s",v.url[i.url])
 write.table(s.url,file=f.name, row.names=F,na="NA",append=T, quote= FALSE, sep=",", col.names=F)

# Storing Table Number
 s.table = sprintf("The table is %s",i.table)
 write.table(s.table,file=f.name, row.names=F,na="NA",append=T, quote= FALSE, sep=",", col.names=F)

# Storing Table Values
 s.value = sprintf("The values are")
 write.table(s.value,file=f.name, row.names=F,na="NA",append=T, quote= FALSE, sep=",", col.names=F)
 write.table(v.t,file=f.name, row.names=T,na="NA",append=T, quote= FALSE, sep=",", col.names=F)

# calculating the mean of the column containing viewers
 i.mean =
  v.t %>%
  as.matrix() %>%
  as.vector() %>%
  as.numeric() %>%
  mean(na.rm = TRUE) %>%
  round(2)

return(i.mean)

} # end get.mean

### scrape data for each winner, one by one ####

# 2015 d ####
 # for game of thrones, season 5 is the winning season
 i.table = 5
 i.mean = get.mean(i.url,i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2015 c ####
 # for veep, season 4 is the winning season
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2014 d ####
 # for breaking bad
 #browseURL(v.url[3])
 i.table = 6
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2014 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 6
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2013 d ####
 # for breaking bad
 #browseURL(v.url[3])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2013 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2012 d ####
 # for homeland
 #browseURL(v.url[3])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2012 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 4
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2011 d ####
 # for mad men
 #browseURL(v.url[3])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2011 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2010 d ####
 # for mad men
 #browseURL(v.url[3])
 i.table = 4
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2010 c ####
 # for modern family
 # browseURL(v.url[i.url])
 i.table = 2
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2009 d ####
 # for mad men
 #browseURL(v.url[3])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2009 c ####
 # for 30 Rock
 # browseURL(v.url[i.url])
 i.table = 4
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2008 d ####
 # for mad men
 #browseURL(v.url[3])
 i.table = 2
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2008 c ####
 # for 30 Rock
 # browseURL(v.url[i.url])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2007 d ####
 # for the sopranos
 #browseURL(v.url[3])
 i.table = 7
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2007 c ####
 # for 30 Rock
 # browseURL(v.url[i.url])
 i.table = 2
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2006 d ####
 # for 24
 #browseURL(v.url[3])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2006 c ####
 # for The Office
 # browseURL(v.url[i.url])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2005 d ####
 # for Lost
 #browseURL(v.url[3])
 i.table = 3
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2005 c ####
 # for Everybody Loves Raymond
 # browseURL(v.url[i.url])
 i.table = 10
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2004 d ####
 # for The Sopranos
 #browseURL(v.url[3])
 i.table = 5
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

# 2004 c ####
 # for Arrested Development
 # browseURL(v.url[i.url])
 i.table = 2
 i.mean = get.mean(i.url, i.table)

# storing the mean
 v.mean[i.url] = i.mean

# updating counters
 i.url = i.url + 1

## `````````````````````````````````````````````
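The per-year blocks above differ only in the table index, so they could be collapsed into a vector lookup plus a loop. A sketch under stated assumptions: the table numbers are copied from the blocks above, and the `get.mean()` stub here merely stands in for the real helper defined earlier (which needs network access):

```r
# stand-in for the real get.mean() above, so this sketch runs offline:
# pretend every (url, table) pair yields a fixed viewership mean
get.mean <- function(i.url, i.table) round(i.table * 1.1, 2)

# one entry per winner, in the same order as df.winners / v.url;
# these are the Wikipedia table indices from the per-year blocks above
tbl.index <- c(5, 5, 6, 6, 5, 5, 3, 4, 5, 3, 4, 2,
               3, 4, 2, 3, 7, 2, 5, 3, 3, 10, 5, 2)

v.mean <- vapply(seq_along(tbl.index),
                 function(i) get.mean(i, tbl.index[i]),
                 numeric(1))
length(v.mean)  # 24, one mean per winning season
```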

Once we have scraped the data, it’s time to visualize it. Here is how it goes:

## `````````````````````````````````````````````
#### Visualization ####
## `````````````````````````````````````````````

## setup ####

## appending the mean to the df
df.winners$mean.us.viewers.m = v.mean

# making genre a factor
df.winners$genre = as.factor(df.winners$genre)

## clean up
#rm(i.count)
rm(i.mean)
rm(i.table)
rm(i.url)
rm(v.mean)
rm(v.url)

## writing out the csv file
# write.csv(df.winners,"winnersList.csv")

# for plotting setting Year as Date Type (else treated as integer)
df.winners$year2 = as.Date(as.character(df.winners$year),format="%Y")

## color codes ####
col.grey = "#707070"
col.teal = "#368C8C"
col.blue = "#4682B4"
col.mid.green = "#98EFC1"
col.lig.green = "#B8FFD1"
col.dark.red = "darkred"


## g.1 ####

g.1 = ggplot(data=df.winners) + theme_minimal()

g.1 = g.1 + geom_line(data = filter(df.winners, grepl('COMEDY', genre)),
aes(x = year2, y = mean.us.viewers.m),alpha=0.8,size=0.5,col=col.lig.green)

g.1 = g.1 + geom_point(data = filter(df.winners, grepl('COMEDY', genre)),
aes(x = year2, y = mean.us.viewers.m),alpha=0.6,size=2,col=col.blue)

g.1 = g.1 + geom_line(data = filter(df.winners, grepl('DRAMA', genre)),
aes(x = year2, y = mean.us.viewers.m),alpha=0.8,size=0.5,col=col.lig.green)

g.1 = g.1 + geom_point(data = filter(df.winners, grepl('DRAMA', genre)),
aes(x = year2, y = mean.us.viewers.m),alpha=0.4,size=2,col=col.dark.red)

#g.1 = g.1 + geom_point(aes(x=year2,y=mean.us.viewers.m, col=genre),size=2)

g.2 = g.1 + theme (
panel.background = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
#plot.caption=element_text(size=8, margin=margin(t=24),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
#axis.title.x = element_blank(margin=margin(t=-24)),
#axis.title.y = element_blank()
axis.title.y = element_text(size = 8),
#axis.title.x = element_text(margin=margin(t=-20)),
axis.title.x = element_text(margin=margin(0,0,-10,0), vjust=-5),

# src:
# https://rud.is/b/2016/06/16/your-data-vis-spidey-sense-the-need-for-a-robust-utility-belt/
plot.subtitle=element_text(size=9.5, margin=margin(b=10)),
plot.caption=element_text(size=7, margin=margin(t=-10)),
# margin around entire plot (‘unit’ with the sizes of the top, right, bottom, and left
# margins)
plot.margin=margin(10,10,10,10)

)

g.2 = g.2 +
labs(
x = NULL,
y = NULL,
title = "How viewership varies for Emmy Winners",
subtitle = "Mean of \"US Viewers Per Episode\", in Millions across the years (Wikipedia)"#,
# overlaps x axis
#caption = "Source: Wikipedia"
)

g.2 = g.2 +
scale_x_date(
breaks = seq(as.Date("2004/1/1"), as.Date("2016/1/1"), by = "years"),
labels = c("2004","","06","","08","","10","","12","","14","","16")
)

g.2 = g.2 + scale_y_continuous(limits=c(0,20))

# label for genre DRAMA
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2004/2/1"), y=10, label="DRAMA Genre"),col=col.dark.red,alpha=0.4,fontface = "bold")

# label for genre COMEDY
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2004/2/1"), y=5, label="COMEDY Genre"),col=col.blue,alpha=0.6,fontface = "bold")


# label for cross over point 1
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2005/12/1"), y=19, label="Cross-over \n Point 1"))

# highlight circle 1 for cross over point 1
g.2 = g.2 + geom_point(data=data.frame(),aes(x=as.Date("2005/7/1"), y=19),alpha=0.1,size=8,col=col.grey)

# label for cross over point 2
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2007/9/1"), y=4.3, label="Cross-over \n Point 2"))

# highlight circle 1 for cross over point 2
g.2 = g.2 + geom_point(data=data.frame(),aes(x=as.Date("2007/12/1"), y=5.7),alpha=0.1,size=8,col=col.grey)

# label for cross over point 3
g.2 = g.2 + geom_text(data=data.frame(), hjust=0, size=2,
aes(x=as.Date("2014/1/1"), y=5.3, label="Cross-over \n Point 3"))

# highlight circle 1 for cross over point 3
g.2 = g.2 + geom_point(data=data.frame(),aes(x=as.Date("2015/1/1"), y=5.7),alpha=0.1,size=8,col=col.grey)

# Big bold line at y=0
# src
# http://t-redactyl.io/blog/2016/01/creating-plots-in-r-using-ggplot2-part-3-bar-plots.html
g.2 = g.2 + geom_hline(yintercept=0,size=1.2,colour="#535353")

g.2


ggsave(filename="538.png",dpi=600)

And here is the final result:

Plot of Viewership for Emmy Winners

For formatted code: https://github.com/patternproject/r.emmyWinners

IKMC Data Scraping

My son recently appeared for IKMC, a mathematics competition, which starts inter-class and then scales up to the global level.

I was exploring their website, where I came across some global data on participation in the competition. I thought of scraping it out and doing some visualization.

This part explores the scraping bit.

# Note adding four # at the end of comment turns it into a foldable section in Rstudio
## loading libraries ####
library(dplyr)
# install.packages("rvest") # run once, if not already installed
library(rvest)
library(stringr)
## setting up the url ####
url = "http://ikmc.kangaroo.org.pk/global-statistics.html"
css = "#DivData td , strong"
css.country = "td:nth-child(1) strong" # only the country names
# testing valid url
browseURL(url)
## reading in the data ####
# reading the url html
page <- read_html(url)
## country list ####
raw.country <- page %>%
 html_nodes(css.country) %>%
 html_text(trim = TRUE) %>%
 unlist()
# except the top heading "Country" all the rest are valid entries
raw.country <- raw.country[-1]; # without 1st element which is simply "Country"
# total count of countries, there is no repetition here
total.rows = length(raw.country)
## years list ####
# no of years, goes from 1994 to 2014
total.columns = 2014 - 1994 + 1
## all data read in ####
# the complete data matrix
raw.data <- page %>%
 html_nodes(css) %>%
 html_text(trim = TRUE) # %>%
 #str_replace_all("[^[:digit:]]","") # removing extra text
data.1 <- unlist(raw.data)
## generic analysis ####
# match() returns the first location of a value,
# e.g. match('b', v) gives 2 for v = c('a', 'b')
first.country.start = match(raw.country[1],data.1)
first.country.end = match(raw.country[2],data.1)
i.temp = seq(1,first.country.start-1) #not to miss the double "armenia"
value.data <- raw.data[-i.temp]
# each country row holds 23 values
mydf <- data.frame(matrix(value.data, nrow=total.rows, ncol=23, byrow=TRUE))
# subset to remove the 1st column
my.df = mydf[-1]
# setting the names
df.names = as.character(seq(2014,1994))
df.names = c("Country",df.names)
names(my.df) = df.names
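To make my.df plot-ready, the wide year columns can be reshaped into a long (Country, Year, Participants) table. A sketch with tidyr’s gather, run on a toy frame shaped like my.df (the column name “Participants” is my own label, not from the site):

```r
library(tidyr)
library(dplyr)

# toy frame shaped like my.df: one row per country, one column per year
my.df <- data.frame(Country = c("Armenia", "Austria"),
                    `2014` = c("1000", "2000"),
                    `2013` = c("900", "1800"),
                    check.names = FALSE,
                    stringsAsFactors = FALSE)

df.long <- my.df %>%
  gather(key = Year, value = Participants, -Country) %>%
  mutate(Year = as.integer(Year),
         Participants = as.numeric(Participants))

df.long
#   Country Year Participants
# 1 Armenia 2014         1000
# 2 Austria 2014         2000
# 3 Armenia 2013          900
# 4 Austria 2013         1800
```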

 

21 March – MakeOverMonday

Andy suggested a makeover of the Guardian’s dynamic visualization of women’s rights across various countries.

Instead of keeping all the metrics together, I made a separate box plot for each one, such as Abortion, and then combined all six for the final plot.

For the individual charts, I counted the number of Yes answers to the various questions; this “YesCount” is plotted on the x-axis. I mapped the countries to regions using the country index included with the data set. However, 24 countries had no index and were excluded from further analysis.

[Plot: Women’s Rights]

I have overlaid the individual observations as dots on top, with a little jitter to show their density. I am not too sure what happened with “Equality”.

Here is the R Code:

library(dplyr)
library(ggplot2)
library(tidyr)
# --------------------------------------------------------------------
# Global Settings
# --------------------------------------------------------------------
title.color = "darkolivegreen"
# --------------------------------------------------------------------
df.region = read.csv("Region Info.csv")
df.region <- df.region %>%
 select(Region,Country,Country.Code)
# --------------------------------------------------------------------
# Abortion
# --------------------------------------------------------------------
df.abortion = read.csv("Abortion.csv")
names(df.abortion) = c("Country","Q1","Q2","Q3","Q4","Q5","Q6","Q7")
df.m.abortion <- df.abortion %>%
 gather(key = Question, value = value, -Country)
df.m.abortion.reduced <- df.m.abortion %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.abortion.region <- left_join(df.m.abortion.reduced, df.region)
df.abortion.2 = df.abortion.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.abortion.2$Region = as.factor(df.abortion.2$Region)
p1.1 <- ggplot(df.abortion.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width=.2), size=2,alpha=0.2) +
 coord_flip()
p1.2 = p1.1 + ggtitle("Abortion")
p1.3 <- p1.2 +
 theme_minimal() +
 theme(
 #panel.grid.minor = element_blank(),
 #panel.grid.major = element_blank(),
 #axis.text.y = element_blank(),
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # np legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p1.4 <- p1.3 + 
 scale_y_continuous(breaks=seq(1, 7, 1)) # Ticks from 1-7, every 1
# final plot
p1.4
# --------------------------------------------------------------------
# unable to set the palette
library(RColorBrewer)
display.brewer.pal(7,"Pastel1")
mypalette<-brewer.pal(7,"Pastel1")
# p7 <- p6 + 
# scale_fill_brewer(mypalette)
# removing df not required anymore
rm(df.abortion,df.m.abortion,df.m.abortion.reduced,df.abortion.region)
# --------------------------------------------------------------------
# Constitution
# --------------------------------------------------------------------
df.constitution = read.csv("Constitution.csv")
names(df.constitution) = c("Country","Q1","Q2","Q3")
df.m.constitution <- df.constitution %>%
 gather(key = Question, value = value, -Country)
df.m.constitution.reduced <- df.m.constitution %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.constitution.region <- left_join(df.m.constitution.reduced, df.region)
df.constitution.2 = df.constitution.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.constitution.2$Region = as.factor(df.constitution.2$Region)
# p2.1 <- ggplot(df.constitution.2, aes(x=Region,y=YesCount)) +
# geom_boxplot(aes(fill=Region,colour=Region),alpha=I(0.3)) + 
# geom_jitter(position=position_jitter(width = 0.2, height = 0.01), size=2,alpha=0.5) +
# coord_flip()
p2.1 <- ggplot(df.constitution.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p2.2 = p2.1 + ggtitle("Constitution")
p2.3 <- p2.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # np legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p2.4 <- p2.3 + 
 scale_y_continuous(breaks=seq(1, 3, 1)) # Ticks from 1-3, every 1
# final plot
p2.4
# removing df not required anymore
rm(df.constitution,df.m.constitution,df.m.constitution.reduced,df.constitution.region)
# --------------------------------------------------------------------
# Domestic Violence
# --------------------------------------------------------------------
df.violence = read.csv("Domestic Violence.csv")
names(df.violence) = c("Country","Q1","Q2","Q3","Q4","Q5","Q6","Q7")
df.m.violence <- df.violence %>%
 gather(key = Question, value = value, -Country)
df.m.violence.reduced <- df.m.violence %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.violence.region <- left_join(df.m.violence.reduced, df.region)
df.violence.2 = df.violence.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.violence.2$Region = as.factor(df.violence.2$Region)
p3.1 <- ggplot(df.violence.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p3.2 = p3.1 + ggtitle("Violence")
p3.3 <- p3.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # np legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p3.4 <- p3.3 + 
 scale_y_continuous(breaks=seq(1, 7, 1)) # Ticks from 1-7, every 1
# final plot
p3.4
# removing df not required anymore
rm(df.violence,df.m.violence,df.m.violence.reduced,df.violence.region)
# --------------------------------------------------------------------
# Other Equality
# --------------------------------------------------------------------
df.equality = read.csv("Other Equality.csv")
names(df.equality) = c("Country","Q1","Q2","Q3")
df.m.equality <- df.equality %>%
 gather(key = Question, value = value, -Country)
df.m.equality.reduced <- df.m.equality %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.equality.region <- left_join(df.m.equality.reduced, df.region)
df.equality.2 = df.equality.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.equality.2$Region = as.factor(df.equality.2$Region)
p4.1 <- ggplot(df.equality.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p4.2 = p4.1 + ggtitle("Equality")
p4.3 <- p4.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # np legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p4.4 <- p4.3 + 
 scale_y_continuous(breaks=seq(1, 3, 1)) # Ticks from 1-3, every 1
# final plot
p4.4
# removing df not required anymore
rm(df.equality,df.m.equality,df.m.equality.reduced,df.equality.region)
# --------------------------------------------------------------------
# Property
# --------------------------------------------------------------------
df.property = read.csv("Property.csv")
names(df.property) = c("Country","Q1","Q2","Q3")
df.m.property <- df.property %>%
 gather(key = Question, value = value, -Country)
df.m.property.reduced <- df.m.property %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.property.region <- left_join(df.m.property.reduced, df.region)
df.property.2 = df.property.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.property.2$Region = as.factor(df.property.2$Region)
p5.1 <- ggplot(df.property.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p5.2 = p5.1 + ggtitle("Property")
p5.3 <- p5.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # np legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p5.4 <- p5.3 + 
 scale_y_continuous(breaks=seq(1, 3, 1)) # Ticks from 1-3, every 1
# final plot
p5.4
# removing df not required anymore
rm(df.property,df.m.property,df.m.property.reduced,df.property.region)
# --------------------------------------------------------------------
# Work
# --------------------------------------------------------------------
df.work = read.csv("Work.csv")
names(df.work) = c("Country","Q1","Q2","Q3","Q4","Q5","Q6","Q7")
df.m.work <- df.work %>%
 gather(key = Question, value = value, -Country)
df.m.work.reduced <- df.m.work %>%
 group_by(Country) %>%
 filter(value=="Yes") %>%
 summarise(YesCount = n()) %>%
 arrange(YesCount)
df.work.region <- left_join(df.m.work.reduced, df.region)
df.work.2 = df.work.region %>%
 filter(!is.na(Region)) %>% ## excluding those with no Region
 select(Country,YesCount,Region)
df.work.2$Region = as.factor(df.work.2$Region)
p6.1 <- ggplot(df.work.2, aes(x=Region,y=YesCount)) +
 geom_boxplot(aes(fill=Region),alpha=I(0.5)) + 
 geom_jitter(position=position_jitter(width = 0.2, height = 0.09), size=2,alpha=0.2) +
 coord_flip()
p6.2 = p6.1 + ggtitle("Work")
p6.3 <- p6.2 +
 theme_minimal() +
 theme(
 axis.ticks.y = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.y = element_blank(), # Remove y - axis label
 axis.title.x = element_blank(), # Remove x-axis label
 legend.position = "none", # np legend 
 plot.title=element_text(family="Times", face="bold", size=20, color=title.color)
 )
p6.4 <- p6.3 + 
 scale_y_continuous(breaks=seq(1, 7, 1)) # Ticks from 1-7, every 1
# final plot
p6.4
# removing df not required anymore
rm(df.work,df.m.work,df.m.work.reduced,df.work.region)
# --------------------------------------------------------------------
# Combining all
# --------------------------------------------------------------------
# install.packages("cowplot")
# require(cowplot)
# library(cowplot)
require(gridExtra)
library(grid) # for textGrob() and gpar() below
# pdf("foo.pdf")
# grid.arrange(p1.4,p2.4, p3.4,p4.4, p5.4,p6.4,ncol=2, nrow=3)
# dev.off()
# hint from https://jonkimanalyze.wordpress.com/2014/03/26/ggplot2-arrangegrob-arrange-ggplots-on-a-page/
grobframe <- arrangeGrob(p1.4,p2.4, p3.4,p4.4, p5.4,p6.4,ncol=2, nrow=3,
main = textGrob("\nWomen's Rights", gp = gpar(fontsize=60, fontface="bold")),
sub = textGrob("*X-Axis depicts Total Number of Yes to Different Parameters*", x=0, hjust=-0.5, vjust=0.1, gp = gpar(fontface = "italic", fontsize = 15)))
print(grobframe)
ggsave(filename="test.png",plot=grobframe,width = 20, height = 25, units = "in",dpi=400)