Spark on Windows 7

Recently I have been yearning to explore the NYC bike data for ridership trends. However, the size of the data and the modest specs of my laptop forced me to look for other options. I will soon be writing in detail on how to set up a CSV-to-SQLite database in R. This post, however, focuses on one of the other alternatives, Apache Spark: specifically, installing Apache Spark on a Windows 7 laptop.

This post is mostly a compilation of various sources from the internet that helped me install it, specifically [1] and [2]. All credit goes to these folks for such wonderful guidance.


1. CAVEATS

  1. You will need Admin rights on the machine.
  2. Apache Hadoop is NOT mandatory to work with Spark or run Spark applications.
  3. Avoid spaces in folder names: “D:\Perso2” is better than “D:\Perso 2 3” and will save you a lot of trouble later.
  4. Go back and read 3.
  5. For setting up environment variables, refer to #10
  6. This guide limits itself to the Spark installation and does not go into how to use Spark once it is set up.

2. SPARK BINARIES

  • Download a pre-built Spark binary for Hadoop from here.

[Figure 1: Spark download]

  • Unzip the *.tar file by using WinRar. (Refer to caveat #3)
  • The benefit of using a pre-built binary is that you will not have to go through the trouble of building the spark binaries from scratch.
  • Set the environment variable SPARK_HOME to whatever path you extracted the binaries to. (Refer to caveat #3). For me it was: D:\Perso2\spark\bin

3. WIN UTILS

  • The official release of Hadoop does not include the Windows binaries (like winutils.exe) which are required to run Hadoop on Windows.
  • One should select the version of Hadoop the Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2 from here.
  • You are free to place it anywhere, but do make a “bin” sub-folder. For me it was: D:\Perso2\winutils\bin.
  • Set the environment variable HADOOP_HOME to whatever path you extracted the binaries to. (Refer to caveat #3).

4. JAVA

  • Download and install the latest version of the Java JDK.
  • The default path will be something like “C:\Program Files\…”, where there is a space between Program and Files; this was a cause of trouble for me. I recommend installing it to another folder, as you would not want your other installed software to have any issues.
  • Set the environment variable JAVA_HOME to the path where you installed the JDK. (Refer to caveat #3). For me it was: C:\Java\jdk1.8.0_131\
  • To check whether Java is already installed, check its version from the command prompt (for example with java -version). [Figure 2: Check Java Version]

5. TEMP DIR

  • Create the C:\tmp\hive directory. It is the default value of the hive.exec.scratchdir configuration property in Hive 0.14.0 and later, and Spark uses a custom build of Hive 1.2.1.
  • To set the write privileges, execute the command shown in the screenshot via the command prompt, where D:\Perso2\winutils\bin was the path where winutils.exe was stored. [Figure 3: Write Privileges]
  • Some sources suggest it with the -R switch, which did not work for me.
  • Do check that the permissions have been granted as required (highlighted in yellow). [Figure 4: Check Privilege]

6. PATH ENVIRONMENT VARIABLE

  • Append these system variables namely SPARK_HOME, HADOOP_HOME and JAVA_HOME to PATH variable.

%JAVA_HOME%\BIN; %HADOOP_HOME%; %SPARK_HOME%

  • It is important to separate these entries with semicolons.
  • Do check that the path is set up correctly. Simply run path > path.txt from the command prompt and inspect the resulting file.
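Since I mostly work in R, the same check can also be sketched from an R session. This is only a minimal sketch: the helper name check_env_paths is my own invention, and it simply reports, for each of the three variables from this post, whether it is set and whether its value contains a space (caveat #3).

```r
# Hypothetical helper (not part of Spark or any package):
# given a named character vector of paths, report which ones are
# unset (empty string) and which ones contain a space.
check_env_paths <- function(paths) {
  data.frame(
    variable  = names(paths),
    path      = unname(paths),
    is_set    = unname(nzchar(paths)),
    has_space = grepl(" ", paths, fixed = TRUE),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

# The three variables set up in this post; Sys.getenv() returns ""
# for any variable that is not defined.
vars <- c("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME")
report <- check_env_paths(Sys.getenv(vars))
print(report)
```

Any row with is_set equal to FALSE or has_space equal to TRUE is worth revisiting before trying to run Spark. Note that R only sees variables that existed when the R session started, so restart R after changing them.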

7. RUNNING SPARK

  • From the command prompt, change to the Spark directory, and then to its bin sub-directory. For me it was D:\Perso2\spark\bin. Refer to #9 for easier command prompt handling.
  • Run the command “spark-shell” and you should see the Spark logo with the Scala prompt. [Figure 5: Spark Shell]

8. SPARK JOBS

  • Fire up your browser. Type “localhost:4040” in the address bar and voila. [Figure 6: Spark Jobs]

 


9. OPENING CMD PROMPT IN SPECIFIC FOLDER

If you’re already in the directory you want, you can:

  • Hold down Shift when opening the Explorer File menu, then click on “Open command window here”. If you can’t see the menu bar, press Alt-Shift-F – Alt-F to open the File menu, plus Shift.
  • Shift-right-click on the background of the Explorer window, then click on “Open command window here”.

10. SETTING ENVIRONMENT VARIABLES

  • Right-click on Computer, then left-click on Properties
  • Click on Advanced System Settings
  • Below the Startup and Recovery section, click on the button labelled “Environment Variables”
  • You will see the window divided into two parts, the upper part will read User variables for username and the lower part will read System variables.
  • As part of this post, we will create new system variables, hence click on “New” button under System variable.

11. REFERENCES

  1. https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
  2. https://hernandezpaul.wordpress.com/2016/01/24/apache-spark-installation-on-windows-10/
  3. http://stackoverflow.com/questions/60904/how-can-i-open-a-cmd-window-in-a-specific-location

Carbon Emissions

Kaggle now hosts some interesting data sets; one of them is Carbon Emissions. It contains carbon dioxide emissions from electricity generation, broken down by fuel type such as coal and natural gas.

I have started with the basic exploratory analysis after cleaning up the data. Here is the box plot of various fuel types spread across the years:

[Figure: facet-category-plot]

I will keep adding more to it.

Here is the code:

---
title: "Exploratory Analysis"
author: "Pattern Project"
date: "09 November 2016"
output:
  html_document:
    fig_width: 10
    fig_height: 10
    theme: spacelab
    highlight: kate
---

## `````````````````````````````````````````````
#### Read Me ####
## `````````````````````````````````````````````
## Version Log:
## v 0 0 1 : Read Loop. Basic Graph
## v 0 0 2 : Why 13 Observations for a year, and not 12 (They had included a yearly aggregate, also description col shows the type of emission, which was excluded earlier)
## v 0 0 3 : Basic Graph now for 12 values / year
## v 0 0 4 : Graph 2 with lines connecting the year
## v 0 0 5 : Graph 3, histogram, box plot
## v 0 0 6 : Graph 4, now a loop of box plot for various categories
## v 0 0 7 : Box Plot of all categories
## v 0 0 8 :
## v 0 0 9 :

## TODO:
# http://www.stat.pitt.edu/stoffer/tsa4/R_toot.htm


## `````````````````````````````````````````````

Load Libraries
```{r, message = F, warning = F}
## `````````````````````````````````````````````
#### Load Libraries ####
## `````````````````````````````````````````````
library(dplyr) # for df manipulation
library(readr) # for file I/O
library(purrr) # for map functions
library(stringr) # for string functions
library(tidyr) # for melt functions
library(lubridate) # for date/times.
#library(anytime) # for date/times.
library(ggplot2) # for plots
library(forcats) # for factors
library(viridis) # for color palette
library(ggthemes) # clean theme for ggplot2
library(scales) # for plot label formatting
library(gridExtra) # for arranging individual ggplot objects
library(DT) # for data.frame output
library(knitr)
library(grid)
library(RColorBrewer)
## `````````````````````````````````````````````
```


Helper Functions
```{r}

## `````````````````````````````````````````````
#### Helper Functions ####
## `````````````````````````````````````````````

# src:
# http://stackoverflow.com/questions/8425409/file-path-issues-in-r-using-windows-hex-digits-in-character-string-error

# Simply copy the path to your clipboard (ctrl + c) and then run the function as pathPrep()

pathPrep <- function(path = "clipboard") {
y <- if (path == "clipboard") {
readClipboard()
} else {
cat("Please enter the path:\n\n")
readline()
}
x <- chartr("\\", "/", y)
writeClipboard(x)
return(x)
}

# SRC:
# http://minimaxir.com/2015/02/ggplot-tutorial/

fte_theme <- function() {

# Generate the colors for the chart procedurally with RColorBrewer
palette <- brewer.pal("Greys", n=9)
color.background = palette[2]
color.grid.major = palette[3]
color.axis.text = palette[6]
color.axis.title = palette[7]
color.title = palette[9]

# Begin construction of chart
theme_bw(base_size=9) +

# Set the entire chart region to a light gray color
theme(panel.background=element_rect(fill=color.background, color=color.background)) +
theme(plot.background=element_rect(fill=color.background, color=color.background)) +
theme(panel.border=element_rect(color=color.background)) +

# Format the grid
theme(panel.grid.major=element_line(color=color.grid.major,size=.25)) +
theme(panel.grid.minor=element_blank()) +
theme(axis.ticks=element_blank()) +

# Format the legend, but hide by default
#theme(legend.position="none") +
theme(legend.background = element_rect(fill=color.background)) +
theme(legend.text = element_text(size=7,color=color.axis.title)) +

# Set title and axis labels, and format these and tick marks
theme(plot.title=element_text(color=color.title, size=10, vjust=1.25)) +
theme(axis.text.x=element_text(size=7,color=color.axis.text)) +
theme(axis.text.y=element_text(size=7,color=color.axis.text)) +
theme(axis.title.x=element_text(size=8,color=color.axis.title, vjust=0)) +
theme(axis.title.y=element_text(size=8,color=color.axis.title, vjust=1.25)) +

# Plot margins
theme(plot.margin = unit(c(0.35, 0.2, 0.3, 0.35), "cm"))
}

# Multiple Plot Function
# SRC:
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)

# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)

numPlots = length(plots)

# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}

if (numPlots==1) {
print(plots[[1]])

} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}


## `````````````````````````````````````````````

```

Read Input Data
```{r}
# Read the data
flag_local = 1

# for local use
if(flag_local == 1)
{
#print("if")

# this is for this chunk only
setwd("D:/2. Bianca/1. Perso/13. Kaggle/6. d - Carbon Emissions")
# for notebook we need to use the following
# https://github.com/yihui/knitr/issues/277
opts_knit$set(root.dir = 'D:/2. Bianca/1. Perso/13. Kaggle/6. d - Carbon Emissions')

# creating the file path for zip files
ch.zip.path = file.path(getwd(),"2. Data")

# extracting all csv files
ch.csv.files <-
list.files(path=ch.zip.path,pattern = "\\.csv$", full.names = TRUE)


df.master <- read.csv(
ch.csv.files,
header = TRUE,
stringsAsFactors = FALSE,
# contains a lot of values with "Not Available"
na.strings = c("", "NA", "Not Available")
)


} else {
# else must be on the same line as the closing brace
# http://stackoverflow.com/questions/14865435/unexpected-else-in-else-error
#print("else")

df.master <- read.csv(
"../input/MER_T12_06.csv",
header = TRUE,
stringsAsFactors=FALSE,
# contains a lot of values with "Not Available"
na.strings = c("", "NA", "Not Available")
)

}

# converting to tibble
t.master = tibble::as_tibble(df.master) # as_data_frame() is deprecated
rm(df.master)

# convert the names to lowercase
names(t.master) <- tolower(names(t.master))

# have a look
t.master %>% glimpse()

# cols to keep
ch.keep = c("yyyymm","value", "description")

t.1 <-
t.master %>%
select(one_of(ch.keep))

t.1

# clean up
#rm(t.master)
rm(ch.csv.files)
rm(ch.keep)
rm(ch.zip.path)
```

Input Conversion Issue
```{r}
##
# debugging the conversion to numeric issue
# the problematic marked rows had value "Not Available" instead of NA
# fixed in the read.csv by adding na.strings

t.1 %>% filter(is.na(value))

t.1$value[1000:2000] %>% as.numeric()
t.1$value[2000:3000] %>% as.numeric()
t.1$value[3000:4000] %>% as.numeric() # problematic
t.1$value[4000:5000] %>% as.numeric()

t.1$value[3000:3500] %>% as.numeric() # problematic
t.1$value[3390:3500] %>% as.numeric() # problematic
```


Basic Manipulation
```{r}
t.1 <- t.1 %>%
# remove any na values
na.omit() %>%
# separate out year and month
mutate(
dummy = as.character(yyyymm),
year = substr(dummy, 0, 4),
year = as.factor(year),
month = substr(dummy, 5, 6),
month = as.factor(month),
value = as.numeric(value)
) %>%
select(year, month, value, description, -dummy, -yyyymm)

#t.1 %>% filter(is.na(value))

# removing the 13th aggregate value
t.1 <-
t.1 %>%
filter(! month == 13)

# drop labels
#t.1$month %>% droplevels.factor() %>% glimpse()
t.1$month <-
t.1$month %>% droplevels.factor()

# convert month numbers to names, using a built-in constant:
levels(t.1$month) <- month.abb

t.1 %>% glimpse()

t.1
```
Digging into the Description col
```{r}
t.1 %>%
filter(year %in% c(1973:1973)) %>%
select(description) %>%
unique()
```
Making the Description Col Shorter
```{r}
t.1 <-
t.1 %>%
mutate(
description2 = recode(
.$description,
"Coal Electric Power Sector CO2 Emissions" = "coal",
"Natural Gas Electric Power Sector CO2 Emissions" = "natural gas",
"Distillate Fuel, Including Kerosene-Type Jet Fuel, Oil Electric Power Sector CO2 Emissions" = "distillate fuel",
"Petroleum Coke Electric Power Sector CO2 Emissions" = "petroleum coke",
"Residual Fuel Oil Electric Power Sector CO2 Emissions" = "residual fuel",
"Petroleum Electric Power Sector CO2 Emissions" = "petroleum electric power",
"Total Energy Electric Power Sector CO2 Emissions" = "total energy"
)
)

t.1 <-
t.1 %>%
select(-description) %>%
rename(description = description2) %>%
mutate(description = as.factor(description))

t.1
```



Adding a date column
```{r}
# adding a date column
t.1 <-
t.1 %>%
mutate(date = paste("01",month,year) %>% as.Date("%d %b %Y"))

```

Basic Line Plot Showing trend across the years from 1973 to 1980
```{r}
t.2 <-
t.1 %>%
filter(!month == "13") %>%
filter(year %in% c(1973:1980))

g.1 <- ggplot(t.2, aes(x=month, y=value, group=year))
g.1 <- g.1 + geom_point(aes(colour = year)) + geom_line(aes(colour = year))
g.1 <- g.1 + facet_wrap(~ description, ncol=2)
g.1 <- g.1 + fte_theme()
g.1 <- g.1 + labs(title = "Carbon Emissions from Electricity Generation, Across the Years", x = "Months", y = "Value (in Million Metric Tons of Carbon Dioxide)")
g.1
```

Histogram of Values from 1973 to 1980
```{r}
g.2 <- ggplot(t.2 %>% subset(year %in% c(1973:1980)), aes(x=value))
g.2 <- g.2 + geom_histogram(
aes(fill = year),
#fill = "white",
binwidth = 0.5,
alpha = 0.5,
position = "identity"
)
g.2 <- g.2 + facet_wrap(~ description, ncol=2)
g.2 <- g.2 + fte_theme()
g.2
```

Box Plot of "Coal"
```{r}
g.3 <- ggplot(t.1 %>% subset(description %in% c("coal")))
g.3 <- g.3 + geom_boxplot(
aes(x=year, y=value, fill=year)
)
g.3 <- g.3 + fte_theme()
g.3 <- g.3 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g.3 <- g.3 + theme(legend.position="none")
g.3
```
Looping over different categories / descriptions
```{r}
plots <- t.1 %>%
split(.$description) %>%
map(~ ggplot(.) +
geom_boxplot(aes(x = year, y = value, fill = year)) +
fte_theme() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(legend.position="none") +
labs(title = as.character(.$description[1])) # one title per plot, not a vector
) #%>%
#walk(print)

# fancy printing
multiplot(plotlist = plots, cols = 2)

```


# Fin

Thanks for reading.

 

 

A Story about Persian Apple

Peach was introduced to the world via Persia, evident in its ancient appellation Persian apple or malum persicum. Let's dig into some data.

The data set contains the global production across the years for various countries. My first attempt revolves around the newly discovered “purrr” package, which I was bent upon using. I skip the mundane step of loading a few libraries.

## `````````````````````````````````````````````
#### Helper Function ####
## `````````````````````````````````````````````
save_df = function(df,f.name,flag)
{
if(flag)
{
# removing only the initial x added to the col names
names(df) = sub("^x", "", names(df))

}

q.f.name = file.path("2. Data", f.name)
write.csv(x=df, file=q.f.name,row.names=FALSE)

}

## for subtitles
# http://bayesball.blogspot.com/2016/03/adding-subtitle-to-ggplot2.html

library(grid)
library(gtable)

ggplot_with_subtitle <- function(gg,
                                 label="",
                                 fontfamily=NULL,
                                 fontsize=10,
                                 hjust=0, vjust=0,
                                 bottom_margin=5.5,
                                 newpage=is.null(vp),
                                 vp=NULL,
                                 ...) {

  if (is.null(fontfamily)) {
    gpr <- gpar(fontsize=fontsize, ...)
  } else {
    gpr <- gpar(fontfamily=fontfamily, fontsize=fontsize, ...)
  }

  subtitle <- textGrob(label, x=unit(hjust, "npc"), y=unit(hjust, "npc"),
                       hjust=hjust, vjust=vjust,
                       gp=gpr)

  data <- ggplot_build(gg)

  gt <- ggplot_gtable(data)
  gt <- gtable_add_rows(gt, grobHeight(subtitle), 2)
  gt <- gtable_add_grob(gt, subtitle, 3, 4, 3, 4, 8, "off", "subtitle")
  gt <- gtable_add_rows(gt, grid::unit(bottom_margin, "pt"), 3)

  if (newpage) grid.newpage()

  if (is.null(vp)) {
    grid.draw(gt)
  } else {
    if (is.character(vp)) seekViewport(vp) else pushViewport(vp)
    grid.draw(gt)
    upViewport()
  }

  invisible(data)
}

## `````````````````````````````````````````````

## `````````````````````````````````````````````
#### Read Data ####
## `````````````````````````````````````````````
setwd("") # set as required

## df.master ####
df.master = read.csv(
  "2. Data/Global Peach Index.csv",
  header = TRUE,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA")
)

## secondary data
## N/A
## `````````````````````````````````````````````

## `````````````````````````````````````````````
#### Manipulate Data ####
## `````````````````````````````````````````````

### fixing primary data set (df.master)
names(df.master) = tolower(names(df.master))

# master replica
df.1 = df.master

# removing redundant cols
if ((length(unique(df.1$item))) ==1)
{
df.1 = df.1 %>%
select(-item)
}

if ((length(unique(df.1$metric))) ==1)
{
df.1 =
df.1 %>%
select(-metric)
}

## workhorse fn
# inputs: numeric vector
# outputs: data frame
# implementation: return df with
# 1. mean
# 2. count of values above and below mean
fn_calc_prop = function(my.vector)
{
#print("entering ---- ")
#print(my.vector)
#print(str(my.vector))

# removing na
my.vector = my.vector %>% na.omit()

i.mean = my.vector %>% mean()
#print(" mean")
#print(i.mean)

df.temp = data.frame(mean=i.mean,gt=0,lt=0)

df.temp$gt =
as.data.frame(my.vector) %>%
filter(. > i.mean) %>%
summarise(gt = n())

df.temp$lt =
as.data.frame(my.vector) %>%
filter(. <= i.mean) %>%
summarise(lt = n())

return (df.temp)
}

df.summary =
df.1 %>%
# filter numeric cols
keep(is.numeric) %>%
# for each numeric col
# apply "fn_calc_prop"
map_df(fn_calc_prop)

# appending years cols back
v.names = names(df.1)[-1] # except the first col containing country
df.summary$year = v.names
# removing only the initial x added to the col names
df.summary$year = sub("^x", "", df.summary$year)

# re-arranging cols for visibility
df.summary =
df.summary %>%
select(year,mean,gt,lt)

# flattening out the list of gt and lt
# the lists are likely formed because summarise() returns a data frame, not a plain number
df.summary$gt = unlist(df.summary$gt)
df.summary$lt = unlist(df.summary$lt)

# store df
save_df(df.summary,"df.summary.csv",FALSE)

to.discretise.vars <- c( "mean" )

df.summary.2 =
df.summary %>%
lmap_at(to.discretise.vars, cut_categories)

# http://rforpublichealth.blogspot.com/2012/09/from-continuous-to-categorical.html
# df.summary$mean %>%
# cut(breaks=c(100000,200000,300000,400000,500000),labels=c(1:4))

df.summary$cat =
df.summary$mean %>%
cut(breaks=c(100000,200000,300000,400000,500000),labels=c("1K+","2K+","3K+","4K+"))

# instead of messing with scales, transform the data
df.summary$mean.k <- df.summary$mean/1000

# proportion of countries above the mean
df.summary.3 = df.summary %>%
#select(gt,lt) %>%
mutate(my.sum = gt + lt, gt.mean = round((gt / my.sum),3)) %>%
select(-my.sum)

## `````````````````````````````````````````````
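To make the munging above easier to follow, here is a minimal sketch of the same idea in base R. The data frame toy and the helper calc_prop are made up for illustration (they are not the peach data set or the function used above): for every yearly column we compute the mean, count the countries above and at-or-below it, and derive the proportion above the mean.

```r
# Toy data with made-up numbers: one row per country, one column per year
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  x1961 = c(10, 20, 30, 100),
  x1962 = c(5, NA, 15, 40)
)

# Hypothetical helper: mean of a numeric vector, plus the count of
# values above the mean (gt) and at or below it (lt); NAs are dropped
calc_prop <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  data.frame(mean = m, gt = sum(v > m), lt = sum(v <= m))
}

# Keep only the numeric (yearly) columns and apply the helper to each
numeric_cols <- toy[sapply(toy, is.numeric)]
summary_df <- do.call(rbind, lapply(numeric_cols, calc_prop))

# Recover the year from the column name and compute the proportion above the mean
summary_df$year <- sub("^x", "", names(numeric_cols))
summary_df$gt.mean <- round(summary_df$gt / (summary_df$gt + summary_df$lt), 3)

print(summary_df)
```

Because sum() on a logical vector returns a plain number, gt and lt come out as ordinary columns here and no flattening step is needed.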

With all the munging done, let's do some ggplot2 magic …


## `````````````````````````````````````````````
#### Plot 1 ####
## `````````````````````````````````````````````

fat.casual.bbq = c("#716065", "#cb8052", "#f9ded0", "#f961a4","#57324e")

# http://stackoverflow.com/questions/25937000/ggplot2-error-discrete-value-supplied-to-continuous-scale
g.1 = ggplot(df.summary, aes(x = year, y = mean.k)) + 
 geom_point(aes(fill=cat),size = 2,shape=21) +
 scale_fill_manual(values=fat.casual.bbq) + 
 xlab("") + ylab("") + 
 theme_minimal() 

# rotating the x-axis labels by 90 degrees
g.3 = g.1 + theme(text = element_text(size = 10),
 axis.text.x = element_text(angle = 90, hjust = 1))

g.3 = g.3 + ggtitle("World Peach Production")

palette <- c("#FFFFFF", "#F0F0F0", "#D9D9D9", "#BDBDBD", "#969696", "#737373",
 "#525252", "#252525", "#000000") # = brewer.pal 'greys'
color.background = palette[2]
color.grid.major = palette[3]
color.axis.text = palette[6]
color.axis.title = palette[7]
color.title = palette[9]
#theme_bw(base_size=9) 

g.3 = g.3 + theme(
 panel.background=element_rect(fill=color.background, color=color.background),
 plot.background=element_rect(fill=color.background, color=color.background),
 #panel.border=element_rect(color=color.background)
 plot.title=element_text(color=color.title, size=16, vjust=1.25, hjust=0),
 #plot.title = element_text(hjust=0, size=16)
 plot.margin = unit(c(0.35, 0.2, 0.3, 0.35), "cm"),

 # Position legend in graph, where x,y is 0,0 (bottom left) to 1,1 (top right)
 legend.position=c(.15, .9),

 legend.box = "horizontal",
 legend.direction = "horizontal",
 legend.title= element_text(size=0),
 legend.text=element_text(size=6),
 legend.key.size=unit(0.2, "cm"),
 legend.key.width=unit(0.5, "cm")

)

g.3

g.4 <- g.3 + annotate("text", x = 30, y = 200, label = "1999 >> mean production crosses 200K",
 color="#7a7d7e", size=3, vjust=-1, fontface="bold")

g.5 <- g.4 + annotate("text", x = 35, y = 300, label = "2005 >> mean production crosses 300K",
 color="#7a7d7e", size=3, vjust=-1, fontface="bold")

g.6 <- g.5 + annotate("text", x = 42, y = 390, label = "2011 >> mean production crosses 400K",
 color="#7a7d7e", size=3, vjust=-1, fontface="bold")

g.6

# trying to reduce space between axis and plot
# but does not work

# # http://stackoverflow.com/questions/20220424/ggplot2-bar-plot-no-space-between-bottom-of-geom-and-x-axis-keep-space-above
# g.6 + 
# coord_cartesian(xlim = c(1961,2012), ylim = c(0,450))
# #scale_x_continuous(limits = c(1961,2012), expand = c(0, 0)) +
# #scale_y_continuous(limits = c(0,420), expand = c(0, 0)) +
# 
# g.6 + 
# geom_blank(aes(y=1.1*..count..), stat="count") 

## subtitle
# set the name of the current plot object to `gg`
gg <- g.6

# define the subtitle text
subtitle <- 
 "Rapid increase since 2005, taking only 6 years to cross another 100K"

p1 = ggplot_with_subtitle(gg, subtitle,
 bottom_margin=20, lineheight=0.9)

p1

## `````````````````````````````````````````````

This is what it generates:

[Figure: plot-1]

Onto the next one:

## `````````````````````````````````````````````
#### Plot 2 ####
## `````````````````````````````````````````````
g.1 = ggplot(df.summary.3, aes(x = year, y = gt.mean)) +
geom_point(aes(fill=cat),size = 2,shape=21) +
scale_fill_manual(values=fat.casual.bbq) +
xlab("") + ylab("") +
theme_minimal()

# rotating the x-axis labels by 90 degrees
g.3 = g.1 + theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 90, hjust = 1))


g.3 = g.3 + ggtitle("A Few Dominate the World Peach Production")

palette <- c("#FFFFFF", "#F0F0F0", "#D9D9D9", "#BDBDBD", "#969696", "#737373",
"#525252", "#252525", "#000000") # = brewer.pal 'greys'
color.background = palette[2]
color.grid.major = palette[3]
color.axis.text = palette[6]
color.axis.title = palette[7]
color.title = palette[9]
#theme_bw(base_size=9)


g.3 = g.3 + theme(
panel.background=element_rect(fill=color.background, color=color.background),
plot.background=element_rect(fill=color.background, color=color.background),
#panel.border=element_rect(color=color.background)
plot.title=element_text(color=color.title, size=16, vjust=1.25, hjust=0),
#plot.title = element_text(hjust=0, size=16)
plot.margin = unit(c(0.35, 0.2, 0.3, 0.35), "cm"),

# Position legend in graph, where x,y is 0,0 (bottom left) to 1,1 (top right)
legend.position=c(.85, .8),

legend.box = "horizontal",
legend.direction = "horizontal",
legend.title= element_text(size=0),
legend.text=element_text(size=6),
legend.key.size=unit(0.2, "cm"),
legend.key.width=unit(0.5, "cm")



)

g.3

# http://stackoverflow.com/questions/24237399/how-to-select-the-rows-with-maximum-values-in-each-group-with-dplyr
df.summary.3 %>%
filter(gt.mean == max(gt.mean))

# max(df.summary.3$gt.mean)

g.4 <- g.3 + annotate("text", x = 28, y = 0.265, label = "1975 >> highest proportion, where mean production is 126K",
color="#7a7d7e", size=3, vjust=-1, fontface="bold")

#g.4

df.summary.3 %>%
filter(gt.mean == min(gt.mean))


g.5 <- g.4 + annotate("text", x = 39, y = 0.08, label = "2010~12 >> lowest proportion, where mean production is ~400K",
color="#7a7d7e", size=3, vjust=-1, fontface="bold")

g.5

## subtitle
# set the name of the current plot object to `gg`
gg <- g.5

# define the subtitle text
subtitle <-
"The proportion of countries above the mean production for a given year, on Y-Axis"

p2 = ggplot_with_subtitle(gg, subtitle,
bottom_margin=20, lineheight=0.9)



## `````````````````````````````````````````````

## `````````````````````````````````````````````
#### Clean up ####
## `````````````````````````````````````````````
# rm(list=ls())

Here is the result:

[Figure: plot-2-v-1]

It would be interesting to compare the first plot with the growth of the world population, to see how peach production has fared relative to it.

What’s On The Menu (2)

Inspired by Rasmus Bååth’s post “A Fun Gastronomical Dataset: What’s on the Menu?” I set out to do something similar, starting from the CSV file kindly shared by the maestro.

I thought of categorising the various food items and analysing the trend of healthy eating over time. However, I was stumped fairly quickly: I failed to match whole words, though the documentation suggested it was possible using “boundary”.

Sadly “Chateau” and “Steak” both matched the keyword “tea”. My unsuccessful attempt looked something like this (along with various other combos):

df.master %>%
 filter(str_detect(dish_name, regex(food, ignore_case = TRUE, boundary=type("word"))))
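For what it is worth, one way that should give whole-word matches is to put the word-boundary anchor \b directly into the pattern rather than passing a boundary argument. The sketch below uses base R's grepl() so that it runs without extra packages; the dish names are made-up examples, and the variable names are mine.

```r
# Made-up dish names for illustration
dishes <- c("Chateaubriand", "Steak", "Iced Tea", "Tea, per pot", "tea")

# Naive substring match: "tea" is found inside Chateaubriand and Steak too
naive <- grepl("tea", dishes, ignore.case = TRUE)

# Whole-word match: wrap the keyword in \b (word boundary) anchors
keyword <- "tea"
pattern <- paste0("\\b", keyword, "\\b")
whole_word <- grepl(pattern, dishes, ignore.case = TRUE, perl = TRUE)

print(data.frame(dish = dishes, naive = naive, whole_word = whole_word))
```

With stringr the same pattern can be passed as str_detect(dish_name, regex(pattern, ignore_case = TRUE)). I have not re-run this against the full menu data, so treat it as a sketch rather than a verified fix.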

Barging ahead, prices seemed a good starting point: how have they fared across the years?

food_over_time <- map_df(food, function(food)
{

df.master %>%
 filter(str_detect(dish_name, regex(food, ignore_case = TRUE, boundary=type("word")))) %>%
 mutate(food = food) %>%
 group_by(year,food) %>%
 summarise(avg_price = mean(price,na.rm = TRUE))

}) # end food_over_time

I ended up with a lot of NaN(s) and, as stated here and here, they are solely my doing. On a side note, Peter Bashai mentions a few methods to plot the missing values in d3.

Ignoring why the NaNs were produced, I decided to simply replace them with NA

# is.nan is provided to check specifically for NaN
food_over_time %>% 
 filter(is.nan(avg_price))

# assign the result back, otherwise the replacement is lost
food_over_time <-
 food_over_time %>% 
 mutate(avg_price = ifelse(is.nan(avg_price), NA_real_, avg_price))
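As an aside, one likely source of these NaN values (an assumption on my part, I did not verify it on the menu data) is a year/food group in which every price is missing: with na.rm = TRUE nothing is left to average, and the mean of an empty vector is NaN. The helper safe_mean below is hypothetical and only illustrates how to get NA directly in that case.

```r
# A group where every price is missing
prices_missing <- c(NA_real_, NA_real_)

# Dropping the NAs leaves an empty vector, whose mean is NaN
avg <- mean(prices_missing, na.rm = TRUE)
print(avg)          # NaN
print(is.nan(avg))  # TRUE

# Hypothetical helper: return NA instead of NaN when nothing is left to average
safe_mean <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

print(safe_mean(prices_missing))    # NA
print(safe_mean(c(1.5, NA, 2.5)))   # 2
```

Using such a helper inside summarise() would make the later NaN-to-NA replacement unnecessary.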

And then imputing the data. Using the default values from mice package.

# imputing it
imp_food_over_time = 
 food_over_time %>%
 mice()

summary(imp_food_over_time)

imp_data = complete(imp_food_over_time,1)

Relying on the ggplot snippet from the original post

# A reusable list of ggplot2 directives to produce a lineplot
food_time_plot <- list(
 geom_line(),
 geom_point(),
 facet_wrap(~ food),
 theme_minimal(),
 theme(legend.position = "none"))

And now plotting:

food_over_time %>% filter(food %in% c("coffee", "tea")) %>%
ggplot(aes(year, avg_price , color = food)) + food_time_plot

this is what I get:

[Figure: rplot2]

ooops …

We can remove that peak value of 2103 in the year 1999, but something is fundamentally wrong there. Here is the data set for 1999:

# what happened to tea in 1999
df.tea.99 = 
df.master %>%
 filter(year == 1999) %>%
 filter(str_detect(dish_name, regex("tea", ignore_case = TRUE, boundary=type("word")))) %>%
 select(year,dish_name,price)

and the resulting data set has entries like “Chateaubriand tranche a table, aux legumes en sauce bearnaise”, which have nothing to do with “tea”.