Monday 11 July 2016

Getting started with Apache Spark on Windows in 10 minutes

Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
For more details on Apache Spark, please refer to the official Apache Spark documentation.
In this post I am going to discuss the steps involved in setting up Spark on a Windows machine, and we will be using Python to interact with the system. The best part is that it takes only about 10 minutes to get started, and no additional tools like Cygwin or Git are required.
Pre-requisites:
  1. OS: Windows 7 or later
    1. Minimum 2-core processor
    2. Minimum RAM – 4 GB
    3. Minimum disk space – 10 GB
  2. Apache Spark (latest stable version)
  3. Python – Anaconda distribution (download its installer from here)
    1. I am using Python 3
  4. Spyder IDE (will be installed as part of Anaconda)
  5. Interest in Apache Spark (most important of all)
Installation
For requisite 2, navigate to the Apache Spark download page and download the latest stable version; in my case I selected the package pre-built for Hadoop 2.6, as shown below:

Once downloaded, extract the archive and move its contents to a folder on the C or D drive. I created a folder named ‘Spark’ on the D drive and moved the contents there, as shown below:
And that’s it, Spark is installed now.
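
If you want a quick sanity check that the extraction went fine, the small Python snippet below (just a sketch; it assumes you extracted Spark to D:\Spark as I did, so adjust the path if you used a different drive or folder name) verifies that the key folders are present:

import os

# Folder where the Spark archive was extracted (assumption: D:\Spark, as used above).
spark_home = r"D:\Spark"

# A healthy extraction should contain at least these sub-folders.
for sub in ("bin", "python", os.path.join("python", "lib")):
    path = os.path.join(spark_home, sub)
    print(path, "-> OK" if os.path.isdir(path) else "-> MISSING")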

Now, as mentioned in the previous section for requisite 3, I downloaded the Python 3 version of Anaconda and installed it on my Windows 7 (64-bit) machine, as shown below:

After the installation is complete, the whole package should appear under ‘All Programs’ in the Windows Start menu, as shown below:
We can see that Python and Spyder IDE have been installed.
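
If you want to double-check which interpreter Spyder picked up, you can run the following two lines in its IPython console; they should report a Python 3.x build coming from your Anaconda install:

import sys
print(sys.version)     # should show a Python 3.x (Anaconda) build
print(sys.executable)  # path of the interpreter Spyder is using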

Setting up Spark
Now comes the real part, where we start working with Spark from Python in interactive mode. We will use an enhanced version of the PySpark experience, driven through the IPython console (an enhanced Python interpreter).

Step 1: To begin, open the Spyder IDE, which will look like the screenshot below:
Here, the leftmost pane is the project navigation space, the middle one is the code editor, and the rightmost one has an active IPython console in interactive mode. (All are labeled in the figure.)

Step 2: Create a file setUpSpark.py in your project workspace and paste the code as shown below:
# -*- coding: utf-8 -*-
"""
Ensure the code has execute privileges.
----------------------------------------------------------------------
Execute this script once when Spyder is started on Windows.
----------------------------------------------------------------------
"""

import os
import sys

# Change to your project workspace directory.
os.chdir("<type your workspace directory here>")
os.curdir

# Configure the environment.
# Set this to the directory where Spark is installed.
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '<Path where Spark is installed>'

# Create a variable for our root path.
SPARK_HOME = os.environ['SPARK_HOME']

# Add the following paths to the system path.
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "pyspark.zip"))
# Note: the py4j archive name must match the one shipped under <SPARK_HOME>\python\lib.
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.9-src.zip"))

# Import the Spark classes.
from pyspark import SparkContext
from pyspark import SparkConf

# Configure Spark settings.
conf = SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.cores.max", "2")
conf.setAppName("Shaz Spark")

# Initialize the SparkContext.
sc = SparkContext('local', conf=conf)

# Test with a data file; I used an auto data file.
lines = sc.textFile("data/auto-data.csv")
print(lines.count())
The code is self-explanatory with the comments I have added along with it. Please let me know if any clarification is required.
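
Once the script runs without errors, sc and lines remain available in the IPython console, so you can explore the data interactively. For example (a small sketch that assumes the same data/auto-data.csv file used in the script), standard RDD actions such as first() and take() work right away:

# sc and lines come from running setUpSpark.py in this console session.
print(lines.first())  # first line of the CSV (typically the header row)
print(lines.take(3))  # first three lines, returned as a Python list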

Step 3: We now have our Spark application up and running, with the SparkContext available as sc. You can view the application status in its web UI at http://localhost:4040/jobs/
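
Before (or instead of) opening the web UI, you can also confirm the running context's settings straight from the console; these properties reflect what we set in SparkConf above:

print(sc.master)   # 'local' in this setup
print(sc.appName)  # 'Shaz Spark', as set via conf.setAppName
print(sc.version)  # version of the Spark package you downloaded
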
With this we conclude this post, and I encourage you to try your hand at the different Spark capabilities. I will share some basic Spark operations in my next post. Till then, happy Sparking.

Comments:

  1. Hi,

    I am getting the below error when I run the setUpSpark.py file:

    " from pyspark import SparkContext
    ModuleNotFoundError: No module named 'pyspark' "

    Can anyone kindly help me?
    Thanks in advance.
    Mallesh.K

  2. Hi, I got this error while running this code in Spyder (I have the Anaconda Python 3.5 version):
    raise Exception("Java gateway process exited before sending the driver its port number")

    Exception: Java gateway process exited before sending the driver its port number

    Replies
    1. It is an incompatibility issue with the Py4j version.

      Use the Py4j package that is installed in your Spark package.

      Check under your Spark directory: python -> lib -> py4j-0.10.4-src.zip (or whichever version is shipped there),

      and reference that same package in the code, for example:

      sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.4-src.zip"))

      It should work.

    2. I am getting the same error too. I have tried Ramki's solution, but it did not help. Could you kindly tell what the issue could be?

  3. Using Spyder with Anaconda, I got:

    Exception: Java gateway process exited before sending the driver its port number

  4. Hi, after executing the line sc = SparkContext('local', conf=conf) I am getting an error like:
    "FileNotFoundError: [WinError 2] The system cannot find the file specified"

  5. raise Exception("Java gateway process exited before sending its port number")

    Exception: Java gateway process exited before sending its port number
