Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation that has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
For more details on Apache Spark, please refer to the official Apache Spark documentation.
In this post I am going to discuss the steps involved in setting up Spark on a Windows machine, and we will be using Python to interact with the system. The best part is that it will all take about 10 minutes to get started, and no additional tools like Cygwin or Git are required.
Pre-requisites:
- OS: Windows 7 or later
- Processor: minimum 2 cores
- RAM: minimum 4 GB
- Disk space: minimum 10 GB (a quick way to check these hardware requisites is sketched just after this list)
- Apache Spark (latest stable version)
- Python – Anaconda version (download its installer from here)
- I am using Python 3
- Spyder IDE (will be installed as part of Anaconda)
- Interest in Apache Spark (most important of all)
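Before moving on, here is one way to verify the hardware requisites from any Python console. This is only an optional sketch: it uses the standard library plus psutil (which ships with Anaconda) for the RAM check, and the D:\ drive is just an example path for wherever you plan to place Spark.

# A quick, optional sanity check for the hardware requisites above.
# Standard library only, except psutil (bundled with Anaconda) for the RAM check.
import os
import shutil

print("CPU cores      :", os.cpu_count())             # expect 2 or more

total, used, free = shutil.disk_usage("D:\\")          # drive where Spark will live
print("Free disk (GB) :", free // (1024 ** 3))         # expect 10 or more

try:
    import psutil
    print("RAM (GB)       :", psutil.virtual_memory().total // (1024 ** 3))
except ImportError:
    print("psutil not found; check RAM via Windows Task Manager instead")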
Installation
For the Apache Spark requisite, navigate to the Apache Spark download page and download the latest stable version; in my case, I selected the package pre-built for Hadoop 2.6, as shown below:
Once downloaded, extract the archive and move its contents to a folder on your C or D drive. I created a folder named ‘Spark’ on my D drive and moved the contents there, as shown below:
And that’s it; Spark is now installed.
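If you want to confirm that the extraction went well, a quick check along the lines of the sketch below verifies that the expected sub-folders are in place (it assumes the D:\Spark location used above; change the path to match yours):

# Optional: confirm the Spark folder contains the expected sub-folders.
import os

spark_home = "D:\\Spark"   # the folder created above; adjust if yours differs
for sub in ("bin", "conf", "python"):
    status = "found" if os.path.isdir(os.path.join(spark_home, sub)) else "missing"
    print(sub, "->", status)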
Now, for the Python requisite mentioned in the previous section, I downloaded the Python 3 version of Anaconda and installed it on my Windows 7 (64-bit) machine, as shown below:
After the installation is complete, the whole package should appear under ‘All Programs’ in the Windows Start menu, as shown below:
We can see that Python and Spyder IDE have been installed.
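As a purely optional sanity check, you can type the following into any Python console to confirm that the interpreter in use is the Anaconda Python 3 build:

# Confirm the active interpreter comes from the Anaconda installation.
import sys
print(sys.executable)          # path should point inside your Anaconda folder
print(sys.version_info.major)  # should print 3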
Setting up Spark
Now comes the real part, where we start working with Spark from Python in interactive mode. We will use PySpark through the IPython console (an enhanced Python interpreter) inside Spyder.
Step 1: To begin, open the Spyder IDE, which will look like the screenshot below:
Here, the leftmost pane is the project navigation space, the middle one is the code editor, and the rightmost one hosts an active IPython console in interactive mode (all are labeled in the figure).
Step 2: Create a file setUpSpark.py in your project workspace and paste the code shown below:
# -*- coding: utf-8 -*-
"""
Ensure the code has execute privileges.
----------------------------------------------------------------------
Execute this script once when Spyder is started on Windows.
----------------------------------------------------------------------
"""

import os
import sys

os.chdir("<type your workspace directory here>")
os.curdir

# Configure the environment.
# Set this to the directory where Spark is installed.
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '<Path where Spark is installed>'

# Create a variable for our root path.
SPARK_HOME = os.environ['SPARK_HOME']

# Add the PySpark libraries to the system path.
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "pyspark.zip"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.9-src.zip"))

# Import the Spark classes.
from pyspark import SparkContext
from pyspark import SparkConf

# Configure Spark settings.
conf = SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.cores.max", "2")
conf.setAppName("Shaz Spark")

# Initialize the SparkContext.
sc = SparkContext('local', conf=conf)

# Test with a data file; I used an auto data file.
lines = sc.textFile("data/auto-data.csv")
print(lines.count())
The code should be self-explanatory with the comments I have added along with it. Please let me know if anything needs more clarity.
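Once the script runs without errors, you can try the sc object straight away from the IPython console. Here is a minimal smoke test (the numbers are just an illustration and have nothing to do with the auto data file):

# A quick smoke test of the newly created SparkContext.
nums = sc.parallelize(range(1, 101))             # distribute the numbers 1..100
print(nums.count())                              # 100
print(nums.filter(lambda x: x % 2 == 0).sum())   # sum of the even numbers: 2550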
Step 3: Our Spark application is now up and running, with the SparkContext available as sc. You can view the application status in its web UI at http://localhost:4040/jobs/.
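One thing to keep in mind: only one SparkContext can be active at a time, so if you want to re-run setUpSpark.py (for example, after changing the configuration), stop the existing context first:

# Stop the current SparkContext before re-running the setup script,
# otherwise Spark will raise an error about an existing context.
sc.stop()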
With this we conclude this post, and I encourage you to try your hand at the different Spark capabilities. I will share some basic Spark operations in my next post. Till then, Happy Sparking.