Recently I have been yearning to explore NYC bike data for ridership trends. However, the size of the data and the modest specs of my laptop forced me to look for alternatives. I will soon write in detail about how to load a CSV into a SQLite database in R. This post, however, focuses on one of the other alternatives: Apache Spark, and specifically on installing Apache Spark on a Windows 7 laptop.
This post is more of a compilation of various sources from the internet which helped me install it. All credit goes to these folks for such wonderful guidance.
- You will need Admin rights on the machine.
- Apache Hadoop is NOT mandatory to work with Spark or run Spark applications.
- Avoid spaces in folder names: “D:\Perso2” is better than “D:\Perso 2 3” and will save you a lot of trouble later.
- Go back and read caveat 3.
- For setting up environment variables, refer to #10.
- This guide limits itself to Spark installation and does not go into how to use Spark once it is set up.
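Caveat 3 is easy to check mechanically. The following is a minimal Python sketch, not part of any Spark tooling, that lists the folders in a Windows path that contain a space, so an install location can be vetted before anything is extracted into it. The helper name folders_with_spaces is my own invention for illustration.

```python
from pathlib import PureWindowsPath
from typing import List


def folders_with_spaces(path: str) -> List[str]:
    """Return the components of a Windows path that contain a space."""
    # PureWindowsPath parses Windows-style paths on any operating system.
    parts = PureWindowsPath(path).parts
    return [part for part in parts if " " in part]


# The two example paths from caveat 3.
print(folders_with_spaces(r"D:\Perso2"))      # prints []
print(folders_with_spaces(r"D:\Perso 2 3"))   # prints ['Perso 2 3']
```

An empty list means the path is safe to use for SPARK_HOME, HADOOP_HOME or JAVA_HOME as far as this caveat is concerned.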
2. SPARK BINARIES
- Download a pre-built Spark binary for Hadoop from here.
- Extract the *.tar file using WinRAR. (Refer to caveat #3)
- The benefit of using a pre-built binary is that you do not have to go through the trouble of building the Spark binaries from scratch.
- Set the environment variable SPARK_HOME to the path you extracted the binaries to. (Refer to caveat #3). For me it was: D:\Perso2\spark\bin
3. WIN UTILS
- The official release of Hadoop does not include the binaries (like winutils.exe) that are required to run Hadoop on Windows.
- Select the winutils build that matches the version of Hadoop your Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2 from here.
- You are free to place winutils.exe anywhere, but do put it inside a “bin” sub-folder. For me it was: D:\Perso2\winutils\bin.
- Set the environment variable HADOOP_HOME to the path where you placed the binaries. (Refer to caveat #3).
- Download and install the latest version of the Java JDK.
- The default path will be something like “C:\Program Files ….” where there is a space between Program and Files, which was a cause of trouble for me. I recommend installing the JDK into another folder instead of renaming the default one, as you would not want your other installed software to have any issues.
- Set the environment variable JAVA_HOME to the path where you installed the JDK. (Refer to caveat #3). For me it was: C:\Java\jdk1.8.0_131\
- To check whether Java is already installed, query its version from the command prompt.
- Create the C:\tmp\hive directory. It is the default value of the hive.exec.scratchdir configuration property in Hive 0.14.0 and later, and Spark uses a custom build of Hive 1.2.1.
- To set the write privileges on this directory, run winutils.exe with its chmod command from the command prompt. For me, D:\Perso2\winutils\bin was the path where winutils.exe was stored.
- Some sources suggest adding the -R switch, which did not work for me.
- Do check that the permissions have been granted as required.
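The exact command used in this step did not survive in the text, so the following is only a sketch of its commonly cited form, winutils.exe chmod 777 followed by the directory. The mode 777 and the use of winutils.exe ls to inspect the result are assumptions on my part, and the helper names are hypothetical. The snippet only builds and prints the command lines, so it can be read on any machine.

```python
from pathlib import PureWindowsPath
from typing import List

# Folder that holds winutils.exe (the author's location from section 3).
WINUTILS_BIN = r"D:\Perso2\winutils\bin"
# Scratch directory created in the step above.
SCRATCH_DIR = r"C:\tmp\hive"


def winutils_command(action: str, *args: str, winutils_bin: str = WINUTILS_BIN) -> List[str]:
    """Build an argument list that invokes winutils.exe with the given action."""
    exe = str(PureWindowsPath(winutils_bin) / "winutils.exe")
    return [exe, action, *args]


def chmod_command(target: str = SCRATCH_DIR, mode: str = "777") -> List[str]:
    """Assumed form of the permission step: winutils.exe chmod <mode> <dir>."""
    return winutils_command("chmod", mode, target)


def ls_command(target: str = SCRATCH_DIR) -> List[str]:
    """Assumed way to inspect the resulting permissions: winutils.exe ls <dir>."""
    return winutils_command("ls", target)


print(" ".join(chmod_command()))
print(" ".join(ls_command()))

# On the Windows machine itself the commands could then be executed with:
#   import subprocess
#   subprocess.run(chmod_command(), check=True)
```

Building the argument list separately from running it keeps the paths visible, which helps when a stray space in a folder name (caveat #3) is the real culprit.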
6. PATH ENVIRONMENT VARIABLE
- Append these system variables, namely SPARK_HOME, HADOOP_HOME and JAVA_HOME, to the PATH variable:
%JAVA_HOME%\BIN;%HADOOP_HOME%;%SPARK_HOME%
- It is important to put a semicolon between these entries, without any spaces around it.
- Do check that the path is set up correctly. Simply run path > path.txt from the command prompt and inspect the resulting file.
- From the command prompt, change to the Spark directory, and then to its bin sub-directory. For me it was D:\Perso2\spark\bin. Refer to #9 for easier command prompt handling.
- Run the command “spark-shell” and you should see the Spark logo with the Scala prompt.
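If spark-shell does not start, the environment variables from section 6 are the first thing worth re-checking. Below is a minimal Python sketch, under the assumption that the three variables are named exactly as in this post, that reports which of them are missing and whether each one is reachable through PATH. The function names are my own and purely illustrative.

```python
import os
from typing import List, Mapping

REQUIRED = ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME")


def missing_variables(env: Mapping[str, str]) -> List[str]:
    """Names of the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name, "").strip()]


def path_entries(env: Mapping[str, str], sep: str = ";") -> List[str]:
    """PATH split on the Windows separator, with blanks and stray spaces dropped."""
    return [entry.strip() for entry in env.get("PATH", "").split(sep) if entry.strip()]


def is_on_path(name: str, env: Mapping[str, str]) -> bool:
    """True if some PATH entry is the variable's folder or lies inside it."""
    value = env.get(name, "").strip().rstrip("\\/").lower()
    if not value:
        return False
    for entry in path_entries(env):
        entry = entry.rstrip("\\/").lower()
        # Windows paths are case-insensitive, hence the lower-casing.
        if entry == value or entry.startswith(value + "\\"):
            return True
    return False


# Report on the environment of the current process.
for name in REQUIRED:
    is_set = name not in missing_variables(os.environ)
    print(name, "set:", is_set, "on PATH:", is_on_path(name, os.environ))
```

Run it from a freshly opened command prompt, since windows opened before the variables were created will not see them.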
8. SPARK JOBS
- With spark-shell still running, fire up your browser, type “localhost:4040” in the address bar, and voila: the Spark web UI lists your jobs.
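The same check can be scripted. This is a small Python sketch, assuming the UI is served over plain HTTP on port 4040 as described above. The function name spark_ui_is_up is hypothetical, and the call bypasses any system proxy because the target is the local machine.

```python
import http.client
import urllib.request


def spark_ui_is_up(port: int = 4040, host: str = "localhost", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 on host:port."""
    url = f"http://{host}:{port}"
    # An empty ProxyHandler makes the request go straight to the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status or a malformed reply.
        return False


# Example call while spark-shell is running in another window:
#   print(spark_ui_is_up())
```

A False result while spark-shell is open may simply mean that port 4040 was already taken when the shell started, in which case the shell's start-up messages are the place to look for the address actually in use.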
9. OPENING CMD PROMPT IN SPECIFIC FOLDER
If you’re already in the directory you want, you can:
- Hold down Shift when opening the Explorer File menu, then click on “Open command window here”. If you can’t see the menu bar, press Alt-Shift-F: Alt-F opens the File menu, and Shift adds the extra entry.
- Shift-right-click on the background of the Explorer window, then click on “Open command window here”.
10. SETTING ENVIRONMENT VARIABLES
- Right click on Computer, then left click on Properties
- Click on Advanced System Settings
- Below the Startup and Recovery section, click on the button labelled “Environment Variables”
- You will see the window divided into two parts: the upper part will read User variables for username and the lower part will read System variables.
- As part of this post, we will create new system variables, hence click on the “New” button under System variables.