Spark on Windows 7

Recently I have been wanting to explore the NYC bike data for ridership trends. However, the size of the data and the modest specs of my laptop forced me to look for alternatives. I will soon be writing in detail on how to load a CSV into a SQLite database in R. This post, however, focuses on one of the other alternatives: Apache Spark. Specifically, installing “Apache Spark” on a “Windows 7” laptop.

This post is more of a compilation of various sources from the internet that helped me install it, specifically [1] and [2]. All credit goes to these folks for such wonderful guidance.


1. CAVEATS

  1. You will need Admin rights on the machine.
  2. Apache Hadoop is NOT mandatory to work with Spark or run Spark applications.
  3. Avoid spaces in folder names: “D:\Perso2” is better than “D:\Perso 2 3” and will save you a lot of trouble later.
  4. Go back and read 3.
  5. For setting up environment variables, refer to section 10.
  6. This guide limits itself to Spark installation and does not go into how to use Spark once it is set up.
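
Caveat 3 is easy to check mechanically before you start. Here is a minimal Python sketch that flags any candidate install path containing a space. Python is my own choice here, and the helper name paths_with_spaces is made up for illustration.

```python
def paths_with_spaces(paths):
    """Return the subset of paths that contain at least one space."""
    return [p for p in paths if " " in p]


if __name__ == "__main__":
    # Example folders; replace them with the ones you plan to use.
    candidates = [r"D:\Perso2", r"D:\Perso 2 3", r"C:\Program Files\Java"]
    for bad in paths_with_spaces(candidates):
        print("contains a space:", bad)
```

Anything it prints is a folder I would rename or avoid before going further.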

2. SPARK BINARIES

  • Download a pre-built Spark binary for Hadoop from here.

[Figure 1: Spark download]

  • Unzip the *.tar file using WinRAR. (Refer to caveat #3)
  • The benefit of using a pre-built binary is that you do not have to build the Spark binaries from source.
  • Set the SPARK_HOME variable to the folder you extracted the binaries to, i.e. the folder that contains “bin”, not “bin” itself. (Refer to caveat #3). For me it was: D:\Perso2\spark

3. WIN UTILS

  • The official release of Hadoop does not include the Windows binaries (like winutils.exe) that Hadoop needs in order to run on Windows.
  • Pick the build that matches the Hadoop version your Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2, from here.
  • You are free to place it anywhere, but do put winutils.exe in a “bin” sub-folder. For me it was: D:\Perso2\winutils\bin.
  • Set the HADOOP_HOME variable to the folder that contains that “bin” sub-folder (for me: D:\Perso2\winutils), since Hadoop looks for winutils.exe under %HADOOP_HOME%\bin. (Refer to caveat #3).
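
Since Hadoop looks for winutils.exe under %HADOOP_HOME%\bin, it is worth confirming that the file is where it is expected. This is a rough Python sketch, not part of the original setup. The function name expected_winutils is hypothetical, and ntpath is used so the path is built with Windows rules on any machine.

```python
import ntpath  # Windows path rules, importable on any OS
import os


def expected_winutils(hadoop_home):
    """Path where winutils.exe is expected, given the value of HADOOP_HOME."""
    return ntpath.join(hadoop_home, "bin", "winutils.exe")


if __name__ == "__main__":
    home = os.environ.get("HADOOP_HOME")
    if home is None:
        print("HADOOP_HOME is not set")
    else:
        target = expected_winutils(home)
        print(target, "exists" if os.path.isfile(target) else "is missing")
```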

4. JAVA

  • Download and install the Java JDK. I used what was then the latest version, JDK 8 (jdk1.8.0_131).
  • The default path will be something like “C:\Program Files\…”, where there is a space between Program and Files, which was a cause of trouble for me. I recommend installing the JDK to another folder rather than touching the default one, as you would not want your other installed software to have any issues.
  • Set the JAVA_HOME variable to the folder you installed the JDK to. (Refer to caveat #3). For me it was: C:\Java\jdk1.8.0_131\
  • To check whether Java is already installed, run java -version from the command prompt. [Figure 2: Check Java Version]
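
If you would rather script the Java check, the sketch below runs java -version and pulls out the quoted version string. It is only an illustration: the helper names are made up, and it assumes the usual banner of the form java version "1.8.0_131", which the JVM writes to stderr rather than stdout.

```python
import re
import subprocess


def parse_java_version(text):
    """Extract the quoted version string from `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', text)
    return match.group(1) if match else None


def installed_java_version():
    """Run `java -version` and parse its banner, which is printed to stderr."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None  # java is not on PATH
    return parse_java_version(result.stderr)


if __name__ == "__main__":
    print(installed_java_version() or "java not found on PATH")
```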

5. TEMP DIR

  • Create the C:\tmp\hive directory. /tmp/hive is the default value of the hive.exec.scratchdir configuration property in Hive 0.14.0 and later, and Spark uses a custom build of Hive 1.2.1.
  • To set the write privileges, execute the following from the command prompt: D:\Perso2\winutils\bin\winutils.exe chmod 777 C:\tmp\hive [Figure 3: Write privileges], where D:\Perso2\winutils\bin is the path where winutils.exe is stored.
  • Some sources suggest adding the -R switch, which did not work for me.
  • Do check that the permissions have been granted as required (highlighted in yellow in the screenshot). [Figure 4: Check privileges]

6. PATH ENVIRONMENT VARIABLE

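  • Append the “bin” sub-folder of each of these system variables, namely SPARK_HOME, HADOOP_HOME and JAVA_HOME, to the PATH variable. This assumes each variable points at the folder that contains “bin”.

%JAVA_HOME%\bin;%HADOOP_HOME%\bin;%SPARK_HOME%\bin

  • It is important to separate these entries with semicolons, without spaces around them.
  • Do check that the path is set up correctly. Simply run path > path.txt from the command prompt and inspect the file.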
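
Reading a long PATH by eye is error-prone, so here is a small Python sketch that reports which required folders are missing from a PATH string, such as the contents of path.txt. The function names are hypothetical, and the folders in the example are simply the ones from my own machine. Entries are compared case-insensitively and with stray spaces or trailing backslashes ignored.

```python
def _normalise(entry):
    """Strip spaces and a trailing backslash, and lower-case the entry."""
    return entry.strip().rstrip("\\").lower()


def missing_path_entries(path_value, required):
    """Return the required folders that are absent from a Windows PATH string."""
    present = {_normalise(e) for e in path_value.split(";") if e.strip()}
    return [r for r in required if _normalise(r) not in present]


if __name__ == "__main__":
    import os
    needed = [
        r"C:\Java\jdk1.8.0_131\bin",
        r"D:\Perso2\winutils\bin",
        r"D:\Perso2\spark\bin",
    ]
    for folder in missing_path_entries(os.environ.get("PATH", ""), needed):
        print("not on PATH:", folder)
```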

7. RUNNING SPARK

  • From the command prompt, change to the Spark directory, and then to its bin sub-directory. For me it was D:\Perso2\spark\bin. Refer to section 9 for easier command prompt handling.
  • Run the command “spark-shell” and you should see the Spark logo with the Scala prompt. [Figures 5a and 5b: Spark Shell]

8. SPARK JOBS

  • Fire up your browser while spark-shell is still running. Type “localhost:4040” in the address bar and voilà. [Figure 6: Spark Jobs]
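
If the page does not load, a quick way to tell whether anything is listening on that port at all is a plain TCP probe. The Python sketch below is an assumption-laden illustration: port_is_open is a made-up helper, and it only checks that the port accepts connections, not that the Spark UI itself is healthy.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 4040 is the port used above for the Spark jobs page.
    print("port 4040 reachable:", port_is_open("localhost", 4040))
```

A False here while spark-shell is closed is expected, since the page is served by the running shell.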

 


9. OPENING CMD PROMPT IN SPECIFIC FOLDER

If you’re already in the directory you want, you can:

  • Hold down Shift when opening the Explorer File menu, then click on “Open command window here”. If you can’t see the menu bar, press Alt-Shift-F, i.e. Alt-F to open the File menu, plus Shift.
  • Shift-right-click on the background of the Explorer window, then click on “Open command window here”.

10. SETTING ENVIRONMENT VARIABLES

  • Right-click on Computer, then left-click on Properties.
  • Click on Advanced System Settings.
  • Below the Startup and Recovery section, click on the button labelled “Environment Variables”.
  • You will see the window divided into two parts: the upper part reads “User variables for username” and the lower part reads “System variables”.
  • As part of this post we create new system variables, hence click on the “New” button under System variables.
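
As an alternative to clicking through the dialog, Windows also ships the setx command, whose /M switch writes a system (machine-wide) variable and therefore needs an elevated command prompt. This is not what I did above, just a sketch of the same step. The Python helper setx_command is made up, and it only builds the command line for you to review and run yourself. It drops a trailing backslash because a backslash right before the closing quote would be read as an escape.

```python
def setx_command(name, value):
    """Build a `setx NAME "value" /M` command line for a system variable."""
    cleaned = value.rstrip("\\")  # avoid a backslash escaping the closing quote
    return 'setx {} "{}" /M'.format(name, cleaned)


if __name__ == "__main__":
    # Value taken from the JAVA section; substitute your own folder.
    print(setx_command("JAVA_HOME", "C:\\Java\\jdk1.8.0_131\\"))
```

I would keep this to the three *_HOME variables and still edit PATH through the dialog, since setx truncates long values.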

11. REFERENCES

  1. https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
  2. https://hernandezpaul.wordpress.com/2016/01/24/apache-spark-installation-on-windows-10/
  3. http://stackoverflow.com/questions/60904/how-can-i-open-a-cmd-window-in-a-specific-location