What’s On The Menu (2)

Inspired by Rasmus Bååth’s post “A Fun Gastronomical Dataset: What’s on the Menu?”, I set out to do something similar, starting from the CSV file kindly shared by the maestro.

I thought of categorising the various food items and analysing the trend of healthy eating over time. However, I was stumped fairly quickly: I failed to match whole words, though the documentation suggested this was possible using “boundary”.

Sadly, “Chateau” and “Steak” both matched the keyword “tea”. My unsuccessful attempt looked something like this (along with various other combos):

df.master %>%
 filter(str_detect(dish_name, regex(food, ignore_case = TRUE, boundary=type("word"))))

Barging ahead, prices seemed a good starting point: how have they fared across the years?

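# Assumes the tidyverse is loaded (purrr, dplyr, stringr) and that `food`
# is a character vector of keywords such as c("coffee", "tea")
food_over_time <- map_df(food, function(food) {
 df.master %>%
  filter(str_detect(dish_name, regex(food, ignore_case = TRUE, boundary=type("word")))) %>%
  mutate(food = food) %>%
  group_by(year, food) %>%
  # a group with no non-missing prices yields NaN here
  summarise(avg_price = mean(price, na.rm = TRUE)) %>%
  ungroup()
}) # end food_over_time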

I ended up with a lot of NaNs and, as stated here and here, they are solely my doing. On a side note, Peter Bashai mentions a few methods to plot the missing values in d3.

Ignoring why the NaNs were produced, I decided to simply replace them with NA:

# is.nan() checks specifically for NaN; list the affected rows
food_over_time %>% 
 filter(is.nan(avg_price))

# replace NaN with NA and keep the result
food_over_time <- food_over_time %>% 
 mutate(avg_price = ifelse(is.nan(avg_price), NA_real_, avg_price))
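
For what it is worth, one likely source of the NaNs (an assumption on my part, not something I verified against the full data) is a year/food group in which every price is missing: with na.rm = TRUE nothing is left to average. A minimal base R sketch with made-up values:

```r
# A made-up group whose prices are all missing
prices <- c(NA_real_, NA_real_)

# na.rm = TRUE drops every value, so the mean is taken over an empty vector
avg <- mean(prices, na.rm = TRUE)
is.nan(avg)   # TRUE

# NaN counts as missing for is.na(), but a plain NA is not NaN
is.na(avg)    # TRUE
is.nan(NA)    # FALSE

# The same replacement as above, on a single value
avg_clean <- ifelse(is.nan(avg), NA_real_, avg)
is.nan(avg_clean)   # FALSE
```

This is also why is.nan(), rather than is.na(), is the right check when only the NaNs should be singled out.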

Then I imputed the missing values, using the default settings of the mice package.

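# imputing it with the mice defaults
library(mice)

imp_food_over_time = 
 food_over_time %>%
 mice()

summary(imp_food_over_time)

# first of the imputed data sets; mice:: avoids a clash with tidyr::complete()
imp_data = mice::complete(imp_food_over_time, 1)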

Relying on the ggplot snippet from the original post:

# A reusable list of ggplot2 directives to produce a lineplot
food_time_plot <- list(
 geom_line(),
 geom_point(),
 facet_wrap(~ food),
 theme_minimal(),
 theme(legend.position = "none"))

And now plotting:

food_over_time %>% filter(food %in% c("coffee", "tea")) %>%
ggplot(aes(year, avg_price , color = food)) + food_time_plot

This is what I get:

[Plot "rplot2": average price by year for coffee and tea]

ooops …

We could remove that peak value of 2103 in 1999, but something is fundamentally wrong there. Here is the data set for tea in 1999:

# what happened to tea in 1999
df.tea.99 = 
df.master %>%
 filter(year == 1999) %>%
 filter(str_detect(dish_name, regex("tea", ignore_case = TRUE, boundary=type("word")))) %>%
 select(year,dish_name,price)

The resulting data set has entries like “Chateaubriand tranche a table, aux legumes en sauce bearnaise”, which have nothing to do with “tea”.
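
One possible way out, sketched here rather than tested on the full menu data, is to wrap each keyword in \\b word-boundary anchors inside the regular expression itself. As far as I can tell, boundary("word") in stringr is a pattern modifier of its own rather than an argument to regex(), which would explain why my attempt changed nothing. The dish names below are a small made-up sample, apart from the Chateaubriand entry.

```r
library(stringr)

# Small illustrative sample of dish names (hypothetical, not from df.master)
dishes <- c("Chateaubriand tranche a table", "Sirloin Steak",
            "Iced Tea", "Tea, per pot")

# Plain substring match: "tea" is found inside "Chateaubriand" and "Steak"
substring_hit <- str_detect(dishes, regex("tea", ignore_case = TRUE))
substring_hit    # TRUE TRUE TRUE TRUE

# Whole-word match: \\b anchors the keyword at word boundaries
whole_word_hit <- str_detect(dishes, regex("\\btea\\b", ignore_case = TRUE))
whole_word_hit   # FALSE FALSE TRUE TRUE

# The same idea applied to a vector of keywords
food <- c("coffee", "tea")
patterns <- paste0("\\b", food, "\\b")
```

Each element of patterns could then take the place of the bare keyword in the str_detect() calls above.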

 

 

 

 
