Carbon Emissions

Kaggle now hosts some interesting data sets; one of them is Carbon Emissions. It contains carbon dioxide emissions from electricity generation, broken down by fuel type such as coal and natural gas.

I have started with a basic exploratory analysis after cleaning up the data. Here is a box plot of the various fuel types across the years:

facet-category-plot

I will keep adding more to it.

Here is the code:

---
title: "Exploratory Analysis"
author: "Pattern Project"
date: "09 November 2016"
output:
  html_document:
    fig_width: 10
    fig_height: 10
    theme: spacelab
    highlight: kate
---

## `````````````````````````````````````````````
#### Read Me ####
## `````````````````````````````````````````````
## Version Log:
## v 0 0 1 : Read Loop. Basic Graph
## v 0 0 2 : Why 13 observations for a year, and not 12? (They had included a yearly aggregate; also the description col shows the type of emission, which was excluded earlier)
## v 0 0 3 : Basic Graph now for 12 values / year
## v 0 0 4 : Graph 2 with lines connecting the year
## v 0 0 5 : Graph 3, histogram, box plot
## v 0 0 6 : Graph 4, now a loop of box plot for various categories
## v 0 0 7 : Box Plot of all categories
## v 0 0 8 :
## v 0 0 9 :

## TODO:
# http://www.stat.pitt.edu/stoffer/tsa4/R_toot.htm


## `````````````````````````````````````````````

Load Libraries
```{r, message = F, warning = F}
## `````````````````````````````````````````````
#### Load Libraries ####
## `````````````````````````````````````````````
library(dplyr) # for df manipulation
library(readr) # for file I/O
library(purrr) # for map functions
library(stringr) # for string functions
library(tidyr) # for melt functions
library(lubridate) # for date/times.
#library(anytime) # for date/times.
library(ggplot2) # for plots
library(forcats) # for factors
library(viridis) # for color palette
library(ggthemes) # clean theme for ggplot2
library(scales) # for plot label formatting
library(gridExtra) # for arranging individual ggplot objects
library(DT) # for data.frame output
library(knitr)
library(grid)
library(RColorBrewer)
## `````````````````````````````````````````````
```


Helper Functions
```{r}

## `````````````````````````````````````````````
#### Helper Functions ####
## `````````````````````````````````````````````

# src:
# http://stackoverflow.com/questions/8425409/file-path-issues-in-r-using-windows-hex-digits-in-character-string-error

# Simply copy the path to your clipboard (ctrl + c) and then run the function as pathPrep()

pathPrep <- function(path = "clipboard") {
y <- if (path == "clipboard") {
readClipboard()
} else {
cat("Please enter the path:\n\n")
readline()
}
x <- chartr("\\", "/", y)
writeClipboard(x)
return(x)
}

# SRC:
# http://minimaxir.com/2015/02/ggplot-tutorial/

fte_theme <- function() {

# Generate the colors for the chart procedurally with RColorBrewer
palette <- brewer.pal("Greys", n=9)
color.background = palette[2]
color.grid.major = palette[3]
color.axis.text = palette[6]
color.axis.title = palette[7]
color.title = palette[9]

# Begin construction of chart
theme_bw(base_size=9) +

# Set the entire chart region to a light gray color
theme(panel.background=element_rect(fill=color.background, color=color.background)) +
theme(plot.background=element_rect(fill=color.background, color=color.background)) +
theme(panel.border=element_rect(color=color.background)) +

# Format the grid
theme(panel.grid.major=element_line(color=color.grid.major,size=.25)) +
theme(panel.grid.minor=element_blank()) +
theme(axis.ticks=element_blank()) +

# Format the legend, but hide by default
#theme(legend.position="none") +
theme(legend.background = element_rect(fill=color.background)) +
theme(legend.text = element_text(size=7,color=color.axis.title)) +

# Set title and axis labels, and format these and tick marks
theme(plot.title=element_text(color=color.title, size=10, vjust=1.25)) +
theme(axis.text.x=element_text(size=7,color=color.axis.text)) +
theme(axis.text.y=element_text(size=7,color=color.axis.text)) +
theme(axis.title.x=element_text(size=8,color=color.axis.title, vjust=0)) +
theme(axis.title.y=element_text(size=8,color=color.axis.title, vjust=1.25)) +

# Plot margins
theme(plot.margin = unit(c(0.35, 0.2, 0.3, 0.35), "cm"))
}

# Multiple Plot Function
# SRC:
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)

# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)

numPlots = length(plots)

# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}

if (numPlots==1) {
print(plots[[1]])

} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}


## `````````````````````````````````````````````

```

Read Input Data
```{r}
# Read the data
flag_local = 1

# for local use
if(flag_local == 1)
{
#print("if")

# this is for this chunk only
setwd("D:/2. Bianca/1. Perso/13. Kaggle/6. d - Carbon Emissions")
# for notebook we need to use the following
# https://github.com/yihui/knitr/issues/277
opts_knit$set(root.dir = 'D:/2. Bianca/1. Perso/13. Kaggle/6. d - Carbon Emissions')

# creating the file path for zip files
ch.zip.path = file.path(getwd(),"2. Data")

# extracting all csv files
ch.csv.files <-
list.files(path=ch.zip.path,pattern = "\\.csv$", full.names = TRUE)


df.master <- read.csv(
ch.csv.files,
header = TRUE,
stringsAsFactors = FALSE,
# contains a lot of values with "Not Available"
na.strings = c("", "NA", "Not Available")
)


} else {
# else requires indentation
# http://stackoverflow.com/questions/14865435/unexpected-else-in-else-error
#print("else")

df.master <- read.csv(
"../input/MER_T12_06.csv",
header = TRUE,
stringsAsFactors=FALSE,
# contains a lot of values with "Not Available"
na.strings = c("", "NA", "Not Available")
)

}

# converting to tibble
t.master = as_tibble(df.master) # as_data_frame() is deprecated
rm(df.master)

# convert the names to lowercase
names(t.master) <- tolower(names(t.master))

# have a look
t.master %>% glimpse()

# cols to keep
ch.keep = c("yyyymm","value", "description")

t.1 <-
t.master %>%
select(one_of(ch.keep))

t.1

# clean up
#rm(t.master)
rm(ch.csv.files)
rm(ch.keep)
rm(ch.zip.path)
```

Input Conversion Issue
```{r}
##
# debugging the conversion to numeric issue
# the problematic marked rows had value "Not Available" instead of NA
# fixed in the read.csv by adding na.strings

t.1 %>% filter(is.na(value))

t.1$value[1000:2000] %>% as.numeric()
t.1$value[2000:3000] %>% as.numeric()
t.1$value[3000:4000] %>% as.numeric() # problematic
t.1$value[4000:5000] %>% as.numeric()

t.1$value[3000:3500] %>% as.numeric() # problematic
t.1$value[3390:3500] %>% as.numeric() # problematic
```
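
The effect of `na.strings` described above can be reproduced on a tiny inline CSV. This is only a sketch: the three rows below are made up for illustration and are not taken from the Kaggle file.

```r
# Hypothetical three-row extract; only the column names mirror the real file
csv_text <- "YYYYMM,Value,Description
197301,72.076,Coal
197302,Not Available,Coal
197303,64.442,Coal"

# Without na.strings the text marker forces the whole column to character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$Value)

# With na.strings the marker becomes NA and the column is read as numeric
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("", "NA", "Not Available"))
class(clean$Value)
sum(is.na(clean$Value))
```

Once the marker is declared at read time, the later `as.numeric()` call no longer has anything to coerce.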


Basic Manipulation
```{r}
t.1 <- t.1 %>%
# remove any na values
na.omit() %>%
# separate out year and month
mutate(
dummy = as.character(yyyymm),
year = substr(dummy, 1, 4),
year = as.factor(year),
month = substr(dummy, 5, 6),
month = as.factor(month),
value = as.numeric(value)
) %>%
select(year, month, value, description, -dummy, -yyyymm)

#t.1 %>% filter(is.na(value))

# removing the 13th aggregate value
t.1 <-
t.1 %>%
filter(! month == 13)

# drop labels
#t.1$month %>% droplevels.factor() %>% glimpse()
t.1$month <-
t.1$month %>% droplevels.factor()

# convert month numbers to names, using a built-in constant:
levels(t.1$month) <- month.abb

t.1 %>% glimpse()

t.1
```
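
For readers who want to see the year/month split in isolation, here is a minimal base R sketch on a made-up `yyyymm` vector (the values are illustrative only). It looks the month name up by position in `month.abb`, which is a swapped-in alternative to overwriting the factor levels: it does not rely on all twelve months being present.

```r
# Hypothetical codes; 197313 stands for the yearly aggregate row
yyyymm <- c(197301L, 197312L, 197313L, 197401L)

parts <- data.frame(
  year  = as.integer(substr(as.character(yyyymm), 1, 4)),
  month = as.integer(substr(as.character(yyyymm), 5, 6))
)

# month 13 is the yearly aggregate, so it is dropped
monthly <- parts[parts$month != 13, ]

# look up the abbreviation by month number, keeping calendar order as levels
monthly$month_name <- factor(month.abb[monthly$month], levels = month.abb)
monthly
```
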
Digging into the Description col
```{r}
t.1 %>%
filter(year %in% c(1973:1973)) %>%
select(description) %>%
unique()
```
Making the Description Col Shorter
```{r}
t.1 <-
t.1 %>%
mutate(
description2 = recode(
.$description,
"Coal Electric Power Sector CO2 Emissions" = "coal",
"Natural Gas Electric Power Sector CO2 Emissions" = "natural gas",
"Distillate Fuel, Including Kerosene-Type Jet Fuel, Oil Electric Power Sector CO2 Emissions" = "distillate fuel",
"Petroleum Coke Electric Power Sector CO2 Emissions" = "petroleum coke",
"Residual Fuel Oil Electric Power Sector CO2 Emissions" = "residual fuel",
"Petroleum Electric Power Sector CO2 Emissions" = "petroleum electric power",
"Total Energy Electric Power Sector CO2 Emissions" = "total energy"
)
)

t.1 <-
t.1 %>%
select(-description) %>%
rename(description = description2) %>%
mutate(description = as.factor(description))

t.1
```



Adding a date column
```{r}
# adding a date column
t.1 <-
t.1 %>%
mutate(date = paste("01",month,year) %>% as.Date("%d %b %Y"))

```
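
One caveat with the `%b` format used above is that month-name parsing depends on the session locale, so "Jan" may not parse under a non-English locale. A hedged alternative, assuming the numeric month is still available at this point (it is not in the chunk above, where the levels were already renamed), is to build the date from numbers only:

```r
# Hypothetical inputs: year and two-digit month as character vectors
year  <- c("1973", "1973")
month <- c("01", "12")

# ISO-style strings parse without any month-name lookup
date <- as.Date(paste(year, month, "01", sep = "-"))
format(date, "%Y-%m")
```
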

Basic Line Plot Showing Trend Across the Years from 1973 to 1980
```{r}
t.2 <-
t.1 %>%
filter(!month == "13") %>%
filter(year %in% c(1973:1980))

g.1 <- ggplot(t.2, aes(x=month, y=value, group=year))
g.1 <- g.1 + geom_point(aes(colour = year)) + geom_line(aes(colour = year))
g.1 <- g.1 + facet_wrap(~ description, ncol=2)
g.1 <- g.1 + fte_theme()
g.1 <- g.1 + labs(title = "Carbon Emissions from Electricity Generation, Across the Years", x = "Months", y = "Value (in Million Metric Tons of Carbon Dioxide)")
g.1
```

Histogram of Values from 1973 to 1980
```{r}
g.2 <- ggplot(t.2 %>% subset(year %in% c(1973:1980)), aes(x=value))
g.2 <- g.2 + geom_histogram(
aes(fill = year),
#fill = "white",
binwidth = 0.5,
alpha = 0.5,
position = "identity"
)
g.2 <- g.2 + facet_wrap(~ description, ncol=2)
g.2 <- g.2 + fte_theme()
g.2
```

Box Plot of "Coal"
```{r}
g.3 <- ggplot(t.1 %>% subset(description %in% c("coal")))
g.3 <- g.3 + geom_boxplot(
aes(x=year, y=value, fill=year)
)
g.3 <- g.3 + fte_theme()
g.3 <- g.3 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g.3 <- g.3 + theme(legend.position="none")
g.3
```
Looping over different categories / descriptions
```{r}
plots <- t.1 %>%
split(.$description) %>%
map(~ ggplot(.) +
geom_boxplot(aes(x = year, y = value, fill = year)) +
fte_theme() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(legend.position="none") +
labs(title = as.character(unique(.$description))) # one title per plot, not one per row
) #%>%
#walk(print)

# fancy printing
multiplot(plotlist = plots, cols = 2)

```


# Fin

Thanks for reading.

A Story about Persian Apple

The peach was introduced to the world via Persia, as is evident in its ancient appellation, Persian apple or malum persicum. Let's dig into some data.

The data set contains global production across the years for various countries. This first attempt revolves around the newly discovered “purrr” package, which I was bent upon using. I skip the mundane step of loading a few libraries.

## `````````````````````````````````````````````
#### Helper Function ####
## `````````````````````````````````````````````
save_df = function(df,f.name,flag)
{
if(flag)
{
# removing initial x added to the col names
names(df) = gsub("x", "", names(df))

}

q.f.name = file.path("2. Data", f.name)
write.csv(x=df, file=q.f.name,row.names=FALSE)

}

## for subtitles
# http://bayesball.blogspot.com/2016/03/adding-subtitle-to-ggplot2.html

library(grid)
library(gtable)

ggplot_with_subtitle <- function(gg,
                                 label="",
                                 fontfamily=NULL,
                                 fontsize=10,
                                 hjust=0, vjust=0,
                                 bottom_margin=5.5,
                                 newpage=is.null(vp),
                                 vp=NULL,
                                 ...) {

  if (is.null(fontfamily)) {
    gpr <- gpar(fontsize=fontsize, ...)
  } else {
    gpr <- gpar(fontfamily=fontfamily, fontsize=fontsize, ...)
  }

  subtitle <- textGrob(label, x=unit(hjust, "npc"), y=unit(hjust, "npc"),
                       hjust=hjust, vjust=vjust,
                       gp=gpr)

  data <- ggplot_build(gg)

  gt <- ggplot_gtable(data)
  gt <- gtable_add_rows(gt, grobHeight(subtitle), 2)
  gt <- gtable_add_grob(gt, subtitle, 3, 4, 3, 4, 8, "off", "subtitle")
  gt <- gtable_add_rows(gt, grid::unit(bottom_margin, "pt"), 3)

  if (newpage) grid.newpage()

  if (is.null(vp)) {
    grid.draw(gt)
  } else {
    if (is.character(vp)) seekViewport(vp) else pushViewport(vp)
    grid.draw(gt)
    upViewport()
  }

  invisible(data)
}

## `````````````````````````````````````````````

## `````````````````````````````````````````````
#### Read Data ####
## `````````````````````````````````````````````
setwd("") # set as required

## df.master ####
df.master = read.csv(
  "2. Data/Global Peach Index.csv",
  header = TRUE,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA")
)

## secondary data
## N/A

## `````````````````````````````````````````````

## `````````````````````````````````````````````
#### Manipulate Data ####
## `````````````````````````````````````````````

### fixing primary data set (df.master)
names(df.master) = tolower(names(df.master))

# master replica
df.1 = df.master

# removing redundant cols
if ((length(unique(df.1$item))) ==1)
{
  df.1 =
    df.1 %>%
    select(-item)
}

if ((length(unique(df.1$metric))) ==1)
{
  df.1 =
    df.1 %>%
    select(-metric)
}

## workhorse fn
# inputs: numeric vector
# outputs: data frame
# implementation: return df with
# 1. mean
# 2. count of values above and below mean
fn_calc_prop = function(my.vector)
{
  #print("entering ---- ")
  #print(my.vector)
  #print(str(my.vector))

  # removing na
  my.vector = my.vector %>% na.omit()

  i.mean = my.vector %>% mean()
  #print(" mean")
  #print(i.mean)

  df.temp = data.frame(mean=i.mean,gt=0,lt=0)

  df.temp$gt =
    as.data.frame(my.vector) %>%
    filter(. > i.mean) %>%
    summarise(gt = n())

  df.temp$lt =
    as.data.frame(my.vector) %>%
    filter(. <= i.mean) %>%
    summarise(lt = n())

  return (df.temp)
}

df.summary =
  df.1 %>%
  # filter numeric cols
  keep(is.numeric) %>%
  # for each numeric col
  # apply "fn_calc_prop"
  map_df(fn_calc_prop)

# appending years cols back
v.names = names(df.1)[-1] # except the first col containing country
df.summary$year = v.names
# removing initial x added to the col names
df.summary$year = gsub("x", "", df.summary$year)

# re-arranging cols for visibility
df.summary =
  df.summary %>%
  select(year,mean,gt,lt)

# flattening out the list of gt and lt
# don't know why lists are formed
df.summary$gt = unlist(df.summary$gt)
df.summary$lt = unlist(df.summary$lt)

# store df
save_df(df.summary,"df.summary.csv",FALSE)

to.discretise.vars <- c( "mean" )

df.summary.2 =
  df.summary %>%
  lmap_at(to.discretise.vars, cut_categories)

# http://rforpublichealth.blogspot.com/2012/09/from-continuous-to-categorical.html
# df.summary$mean %>%
# cut(breaks=c(100000,200000,300000,400000,500000),labels=c(1:4))

df.summary$cat =
  df.summary$mean %>%
  cut(breaks=c(100000,200000,300000,400000,500000),labels=c("1K+","2K+","3K+","4K+"))

# instead of messing with scales, transform the data
df.summary$mean.k <- df.summary$mean/1000

# proportion of countries above the mean
df.summary.3 =
  df.summary %>%
  #select(gt,lt) %>%
  mutate(my.sum = gt + lt, gt.mean = round((gt / my.sum),3)) %>%
  select(-my.sum)

## `````````````````````````````````````````````
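
The counting step inside `fn_calc_prop` can also be sketched with plain vector comparisons. The helper name `calc_prop` and the sample numbers are assumptions for illustration; this is not the code that produced the plots below.

```r
# Hypothetical helper returning the same three columns as fn_calc_prop
calc_prop <- function(x) {
  x <- x[!is.na(x)]            # drop missing values first
  m <- mean(x)
  data.frame(
    mean = m,
    gt   = sum(x > m),         # count of values above the mean
    lt   = sum(x <= m)         # count of values at or below the mean
  )
}

res <- calc_prop(c(10, 20, 30, 100, NA))
res
```

Because `sum()` over a logical vector returns a plain number, `gt` and `lt` come back as ordinary numeric columns, which should make the later `unlist()` step unnecessary.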

With all the munging done, let's do some ggplot2 magic …


## `````````````````````````````````````````````
#### Plot 1 ####
## `````````````````````````````````````````````

fat.casual.bbq = c("#716065", "#cb8052", "#f9ded0", "#f961a4","#57324e")

# http://stackoverflow.com/questions/25937000/ggplot2-error-discrete-value-supplied-to-continuous-scale
g.1 = ggplot(df.summary, aes(x = year, y = mean.k)) + 
 geom_point(aes(fill=cat),size = 2,shape=21) +
 scale_fill_manual(values=fat.casual.bbq) + 
 xlab("") + ylab("") + 
 theme_minimal() 

# rotating the x-axis labels by 90 degrees so the years do not overlap
g.3 = g.1 + theme(text = element_text(size = 10),
 axis.text.x = element_text(angle = 90, hjust = 1))

g.3 = g.3 + ggtitle("World Peach Production")

palette <- c("#FFFFFF", "#F0F0F0", "#D9D9D9", "#BDBDBD", "#969696", "#737373",
 "#525252", "#252525", "#000000") # = brewer.pal 'greys'
color.background = palette[2]
color.grid.major = palette[3]
color.axis.text = palette[6]
color.axis.title = palette[7]
color.title = palette[9]
#theme_bw(base_size=9) 

g.3 = g.3 + theme(
 panel.background=element_rect(fill=color.background, color=color.background),
 plot.background=element_rect(fill=color.background, color=color.background),
 #panel.border=element_rect(color=color.background)
 plot.title=element_text(color=color.title, size=16, vjust=1.25, hjust=0),
 #plot.title = element_text(hjust=0, size=16)
 plot.margin = unit(c(0.35, 0.2, 0.3, 0.35), "cm"),

 # Position legend in graph, where x,y is 0,0 (bottom left) to 1,1 (top right)
 legend.position=c(.15, .9),

 legend.box = "horizontal",
 legend.direction = "horizontal",
 legend.title= element_text(size=0),
 legend.text=element_text(size=6),
 legend.key.size=unit(0.2, "cm"),
 legend.key.width=unit(0.5, "cm")

)

g.3

g.4 <- g.3 + annotate("text", x = 30, y = 200, label = "1999 >> mean production crosses 200K",
 color="#7a7d7e", size=3, vjust=-1, fontface="bold")

g.5 <- g.4 + annotate("text", x = 35, y = 300, label = "2005 >> mean production crosses 300K",
 color="#7a7d7e", size=3, vjust=-1, fontface="bold")

g.6 <- g.5 + annotate("text", x = 42, y = 390, label = "2011 >> mean production crosses 400K",
 color="#7a7d7e", size=3, vjust=-1, fontface="bold")

g.6

# trying to reduce space between axis and plot
# but does not work

# # http://stackoverflow.com/questions/20220424/ggplot2-bar-plot-no-space-between-bottom-of-geom-and-x-axis-keep-space-above
# g.6 + 
# coord_cartesian(xlim = c(1961,2012), ylim = c(0,450))
# #scale_x_continuous(limits = c(1961,2012), expand = c(0, 0)) +
# #scale_y_continuous(limits = c(0,420), expand = c(0, 0)) +
# 
# g.6 + 
# geom_blank(aes(y=1.1*..count..), stat="count") 

## subtitle
# set the name of the current plot object to `gg`
gg <- g.6

# define the subtitle text
subtitle <-
 "Rapid increase since 2005, taking only 6 years to cross another 100K"

p1 = ggplot_with_subtitle(gg, subtitle,
 bottom_margin=20, lineheight=0.9)

p1

## `````````````````````````````````````````````

This is what it generates:

plot-1

Onto the next one:

## `````````````````````````````````````````````
#### Plot 2 ####
## `````````````````````````````````````````````
g.1 = ggplot(df.summary.3, aes(x = year, y = gt.mean)) +
geom_point(aes(fill=cat),size = 2,shape=21) +
scale_fill_manual(values=fat.casual.bbq) +
xlab("") + ylab("") +
theme_minimal()

# rotating the x-axis labels by 90 degrees so the years do not overlap
g.3 = g.1 + theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 90, hjust = 1))


g.3 = g.3 + ggtitle("A Few Dominate the World Peach Production")

palette <- c("#FFFFFF", "#F0F0F0", "#D9D9D9", "#BDBDBD", "#969696", "#737373",
"#525252", "#252525", "#000000") # = brewer.pal 'greys'
color.background = palette[2]
color.grid.major = palette[3]
color.axis.text = palette[6]
color.axis.title = palette[7]
color.title = palette[9]
#theme_bw(base_size=9)


g.3 = g.3 + theme(
panel.background=element_rect(fill=color.background, color=color.background),
plot.background=element_rect(fill=color.background, color=color.background),
#panel.border=element_rect(color=color.background)
plot.title=element_text(color=color.title, size=16, vjust=1.25, hjust=0),
#plot.title = element_text(hjust=0, size=16)
plot.margin = unit(c(0.35, 0.2, 0.3, 0.35), "cm"),

# Position legend in graph, where x,y is 0,0 (bottom left) to 1,1 (top right)
legend.position=c(.85, .8),

legend.box = "horizontal",
legend.direction = "horizontal",
legend.title= element_text(size=0),
legend.text=element_text(size=6),
legend.key.size=unit(0.2, "cm"),
legend.key.width=unit(0.5, "cm")



)

g.3

# http://stackoverflow.com/questions/24237399/how-to-select-the-rows-with-maximum-values-in-each-group-with-dplyr
df.summary.3 %>%
filter(gt.mean == max(gt.mean))

# max(df.summary.3$gt.mean)

g.4 <- g.3 + annotate("text", x = 28, y = 0.265, label = "1975 >> highest proportion, where mean production is 126K",
color="#7a7d7e", size=3, vjust=-1, fontface="bold")

#g.4

df.summary.3 %>%
filter(gt.mean == min(gt.mean))


g.5 <- g.4 + annotate("text", x = 39, y = 0.08, label = "2010~12 >> lowest proportion, where mean production is ~400K",
color="#7a7d7e", size=3, vjust=-1, fontface="bold")

g.5

## subtitle
# set the name of the current plot object to `gg`
gg <- g.5

# define the subtitle text
subtitle <-
"The proportion of countries above the mean production for a given year, on Y-Axis"

p2 = ggplot_with_subtitle(gg, subtitle,
bottom_margin=20, lineheight=0.9)



## `````````````````````````````````````````````

## `````````````````````````````````````````````
#### Clean up ####
## `````````````````````````````````````````````
# rm(list=ls())

Here is the result:

plot-2-v-1

It would be interesting to compare the first plot with world population growth over the same period, to see whether peach production has kept pace.

What’s On The Menu (2)

Inspired by Rasmus Bååth’s post “A Fun Gastronomical Dataset: What’s on the Menu?”, I set out to do something similar, starting from the CSV file kindly shared by the maestro.

I thought of categorising the various food items and analysing the trend of healthy eating over time. However, I was stumped fairly quickly: I could not get whole-word matches to work, though the documentation suggested it was possible using “boundary”.

Sadly, “Chateau” and “Steak” both matched the keyword “tea”. My unsuccessful attempt looked something like this (along with various other combos):

df.master %>%
 filter(str_detect(dish_name, regex(food, ignore_case = TRUE, boundary=type("word"))))

Barging ahead: prices seemed a good starting point, to see how they have fared across the years.

food_over_time <- map_df(food, function(food)
{

df.master %>%
 filter(str_detect(dish_name, regex(food, ignore_case = TRUE, boundary=type("word")))) %>%
 mutate(food = food) %>%
 group_by(year,food) %>%
 summarise(avg_price = mean(price,na.rm = TRUE))

}) # end food_over_time

I ended up with a lot of NaN(s) and, as stated here and here, they are solely my doing. On a side note, Peter Bashai mentions a few methods to plot the missing values in d3.

Ignoring why the NaNs were produced, I decided to simply replace them with NA.

# is.nan is provided to check specifically for NaN
food_over_time %>% 
 filter(is.nan(avg_price))

# assign the result back, otherwise the replacement is only printed
food_over_time <-
 food_over_time %>%
 mutate(avg_price = ifelse(is.nan(avg_price),NA,avg_price))
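
A likely source of the NaN values, offered here as an assumption rather than a diagnosis of this data set, is a year/food group in which every price is missing: `mean()` with `na.rm = TRUE` then averages an empty vector. A small sketch with made-up values:

```r
# Hypothetical group where every price is missing
prices <- c(NA_real_, NA_real_)

avg <- mean(prices, na.rm = TRUE)   # the mean of zero values is NaN
is.nan(avg)
is.na(avg)                          # NaN also counts as missing for is.na()

# same replacement as above, applied to a single value
cleaned <- ifelse(is.nan(avg), NA_real_, avg)
cleaned
```
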

And then imputing the data, using the default settings of the mice package.

# imputing it
imp_food_over_time = 
 food_over_time %>%
 mice()

summary(imp_food_over_time)

imp_data = complete(imp_food_over_time,1)

Relying on the ggplot snippet from the original post:

# A reusable list of ggplot2 directives to produce a lineplot
food_time_plot <- list(
 geom_line(),
 geom_point(),
 facet_wrap(~ food),
 theme_minimal(),
 theme(legend.position = "none"))

And now plotting:

food_over_time %>% filter(food %in% c("coffee", "tea")) %>%
ggplot(aes(year, avg_price , color = food)) + food_time_plot

this is what I get:

rplot2

ooops …

We can remove that peak value of 2103 in the year 1999, but something is fundamentally wrong there. Here is the data set for 1999:

# what happened to tea in 1999
df.tea.99 = 
df.master %>%
 filter(year == 1999) %>%
 filter(str_detect(dish_name, regex("tea", ignore_case = TRUE, boundary=type("word")))) %>%
 select(year,dish_name,price)

The resulting data set has entries like “Chateaubriand tranche a table, aux legumes en sauce bearnaise”, which have nothing to do with “tea”.
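
One possible way around the false matches, which is not part of the attempt above, is to wrap the keyword in `\\b` word-boundary anchors. The sketch below uses base R `grepl()` on made-up dish names; the same pattern string could presumably be passed to stringr's `regex()` as well.

```r
# Hypothetical dish names, including the two false positives from the post
dish_names <- c("Chateaubriand tranche a table", "Sirloin Steak",
                "Iced Tea", "Tea, per pot")
food <- "tea"

# \\b matches only at a word boundary, so "tea" inside "Steak" is skipped
pattern <- paste0("\\b", food, "\\b")
is_match <- grepl(pattern, dish_names, ignore.case = TRUE, perl = TRUE)

dish_names[is_match]
```

Only the two dishes that contain "tea" as a separate word are kept.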