2016 January – Dear Data Challenge

The theme for Jan was “Things We Say” … Here is the journey …


The first idea was to explore “NLP”. Interesting links are as follows:

  1. A toolkit for corpus linguistics (https://github.com/interrogator/corpkit)
  2. Machine Learning for Text Analysis (http://www.monkeylearn.com/)
  3. Segmentation of Twitter Timelines via Topic Modeling (http://alexperrier.github.io/jekyll/update/2015/09/16/segmentation_twitter_timelines_lda_vs_lsa.html)


Given that “Python” was preferred over “R”, I thought of doing a mash-up of the two, and “Jupyter” led me to “Beaker” (http://www.opendatascience.com/blog/jupyter-zeppelin-beaker-the-rise-of-the-notebooks/) and (http://blog.dominodatalab.com/interactive-data-science/)

Time Maps

Another idea was to try “Time Maps”, as done here for the 2015 Presidential Debates (http://alexperrier.github.io/jekyll/update/2015/11/19/timemaps-presidential-debates-dynamics.html)

Love Actually

But if you explore only one of the recommendations from this blog, go to (http://varianceexplained.org/r/love-actually-network/), an amazing analysis of the movie dialogues from “Love Actually” by David Robinson.

Kate Winslet mentioned the relatively long dialogues in Steve Jobs (the 2015 movie). Inspired by the analysis of “Love Actually”, I thought of comparing the average length of her dialogues in it with some of her other movies. But I could not get past the hurdle of parsing the movie script file, which had no delimiters to mark where each dialogue begins and ends.

Sentiment Analysis could have been another option, as done here for State of the Union speeches (http://www.moreorlessnumbers.com/2016/01/state-of-union-speeches-and-data.html)

WhatsApp and Facebook Analysis by Forrester

Reineke Reitsma mentioned the share of different messengers (Viber, Skype ...) used to share New Year wishes across Europe. (http://blogs.forrester.com/reineke_reitsma/15-01-05-the_data_digest_whatsapp_and_facebook_messenger_wish_us_a_happy_new_year)

And here is an R package that provides a suite of tools for collecting and constructing networks from social media data (https://t.co/91RkKRTby4)

My Analysis

In the end, I stuck to counting the different kinds of messages in my email, as follows:

2016 Jan Data Capture

And ended up charting the split across the categories:

Analysis of Email Rcvd


Analysis of Email Sent
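
For anyone who wants to repeat the tally without counting by hand, here is a minimal sketch in Python of how such a split across categories could be computed. The category names and the sample list are made up for illustration (they are not my actual data), and category_split is just a hypothetical helper name.

```python
from collections import Counter


def category_split(labels):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # most_common() orders categories from most to least frequent
    return {
        category: (count, round(100 * count / total, 1))
        for category, count in counts.most_common()
    }


# Hypothetical sample: one label per email, noted down while going through the inbox
received = ["Work", "Newsletter", "Work", "Personal", "Newsletter", "Work"]

for category, (count, percent) in category_split(received).items():
    print(f"{category}: {count} ({percent}%)")
```

The same function can be run once on the received labels and once on the sent labels to get the numbers behind the two charts.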

This is the hand-drawn version:

2016 Jan - Front

and the back:

2016 Jan - Back