Monday, 11 July 2016

Getting started with Apache Spark on Windows in 10 minutes

Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
For more details on Apache Spark, please refer to the Apache Spark documentation.
In this post I am going to discuss the steps involved in setting up Spark on a Windows machine, using Python to interact with it. The best part is that it takes only about 10 minutes to get started, and no additional tools such as Cygwin or Git are required.
Pre-requisites:
  1. OS: Windows 7 or later
    1. Minimum 2-core processor
    2. Minimum RAM – 4 GB
    3. Minimum Disk space – 10 GB
  2. Apache Spark (latest stable version)
  3. Python: Anaconda distribution (download the installer from the Anaconda website)
    1. I am using Python 3
  4. Spyder IDE (will be installed as part of Anaconda)
  5. Interest in Apache Spark (most important of all)
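A quick way to check a machine against the list above is a short standard-library script. This is only a sketch: the thresholds mirror the numbers in the list, the `drive` argument is an assumption about where you plan to install Spark, and the `java` lookup is an addition of mine, since Spark runs on the JVM and needs a Java runtime even though the list does not mention one. RAM is not checked because the standard library has no portable call for it.

```python
import os
import shutil
import sys


def check_prerequisites(drive=None, min_cores=2, min_free_gb=10):
    """Return a dict of check name -> bool for the prerequisites listed above.

    `drive` is an assumption: pass the drive you plan to install Spark on,
    for example "D:\\". It defaults to the current working directory.
    """
    drive = drive or os.getcwd()
    free_gb = shutil.disk_usage(drive).free / (1024 ** 3)
    return {
        "cores": (os.cpu_count() or 0) >= min_cores,
        "disk": free_gb >= min_free_gb,
        "python3": sys.version_info.major == 3,
        # Not in the list above: Spark needs a Java runtime on the PATH.
        "java": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(name, "OK" if ok else "MISSING")
```

Any line that prints MISSING is worth fixing before moving on to the installation.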
Installation
For prerequisite 2, go to the Apache Spark download page and download the latest stable version. In my case I selected the package pre-built for Hadoop 2.6, as shown below:

Once downloaded, extract the contents into a folder on the C or D drive. I created a folder named ‘Spark’ on the D drive and moved the contents there, as shown below:
And that’s it, Spark is installed now.

Next, for prerequisite 3, I downloaded the Python 3 version of Anaconda and installed it on my Windows 7 (64-bit) machine, as shown below:

Once the installation is complete, the whole package should appear under ‘All Programs’ in the Windows Start menu, as shown below:
We can see that Python and Spyder IDE have been installed.

Setting up Spark
Now comes the real part, where we start using Spark from Python in interactive mode. We will use an enhanced version of the PySpark shell, driven from the IPython console (an enhanced Python interpreter).

Step 1: To begin, open the Spyder IDE, which looks like this:
Here, the leftmost pane is the project navigator, the middle one is the code editor, and the rightmost has an active IPython console in interactive mode (all are labeled in the figure).

Step 2: Create a file setUpSpark.py in your project workspace and paste the code as shown below:
    # -*- coding: utf-8 -*-
    """
    Make sure this script has execute privileges.
    ----------------------------------------------------------------------
    Execute this script once when Spyder is started on Windows
    ----------------------------------------------------------------------
    """

    import os
    import sys

    os.chdir("<type your workspace directory here>")
    print(os.getcwd())  # confirm the working directory

    # Configure the environment.
    # Set this to the directory where Spark is installed.
    if 'SPARK_HOME' not in os.environ:
        os.environ['SPARK_HOME'] = '<Path where Spark is installed>'

    # Create a variable for our root path.
    SPARK_HOME = os.environ['SPARK_HOME']

    # Add the following paths to the system path.
    # The py4j file name must match the one in SPARK_HOME\python\lib.
    sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
    sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib"))
    sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "pyspark.zip"))
    sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.9-src.zip"))

    # Import the Spark classes (only possible after the path setup above).
    from pyspark import SparkContext
    from pyspark import SparkConf

    # Configure Spark settings.
    conf = SparkConf()
    conf.set("spark.executor.memory", "1g")
    conf.set("spark.cores.max", "2")
    conf.setAppName("Shaz Spark")

    # Initialize SparkContext.
    sc = SparkContext('local', conf=conf)

    # Test with a data file; I used an auto data file.
    lines = sc.textFile("data/auto-data.csv")
    print(lines.count())
The comments in the code explain each step. Please let me know if anything needs clarification.
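The Py4J archive name changes between Spark releases (the script above hard-codes py4j-0.9-src.zip), so one option is to look the file up instead of typing its name. The sketch below assumes the usual layout of a pre-built Spark download, with pyspark.zip and a py4j-*-src.zip under python\lib. `spark_python_paths` is a hypothetical helper name of mine, not part of PySpark.

```python
import glob
import os


def spark_python_paths(spark_home):
    """Return the sys.path entries PySpark needs under `spark_home`.

    Raises FileNotFoundError if the expected files are missing, which
    usually means SPARK_HOME points at the wrong folder.
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    # Match whatever Py4J version ships with this Spark release.
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not os.path.isfile(pyspark_zip) or not py4j_zips:
        raise FileNotFoundError("No PySpark archives found in " + lib_dir)
    return [os.path.join(spark_home, "python"), pyspark_zip, py4j_zips[-1]]
```

In setUpSpark.py, the sys.path.insert lines could then be replaced by a loop such as `for p in spark_python_paths(SPARK_HOME): sys.path.insert(0, p)`.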

Step 3: Our Spark application is now up and running, with the Spark context available as sc. You can view the application status in its web UI at http://localhost:4040/jobs/
With this we conclude this post. I encourage you to try your hand at the different Spark capabilities. I will share some basic Spark operations in my next post. Till then, happy Sparking.
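If the page does not load, it can help to check from the IPython console whether anything is answering at that address. Here is a minimal sketch using only the standard library. `ui_is_up` is a hypothetical helper name, and the default URL is the one given above. Note that Spark may pick a different port if 4040 is already in use.

```python
import urllib.error
import urllib.request


def ui_is_up(url="http://localhost:4040/jobs/", timeout=3):
    """Return True if an HTTP server answers at `url` with a success status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, or unknown host.
        return False
```

Calling `ui_is_up()` while sc is alive should return True. After `sc.stop()` the UI shuts down along with the context.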

12 comments:

  1. Hi,

    I am getting the error below when I run the setUpSpark.py file

    " from pyspark import SparkContext
    ModuleNotFoundError: No module named 'pyspark' "

    Can anyone kindly help me?
    Thanks in advance.
    Mallesh.K

  2. Hi, I got this error while running this code in Spyder. I have Anaconda with Python 3.5:
    raise Exception("Java gateway process exited before sending the driver its port number")

    Exception: Java gateway process exited before sending the driver its port number

    Replies
    1. It is an incompatibility issue with the Py4J version.

      Use the Py4J package that ships with your Spark download. Look in:
      Spark version -> python -> lib -> py4j-0.10.4-src.zip

      Use the same file name in your code, for example:
      sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.4-src.zip"))

      It should work.

    2. This comment has been removed by the author.

    3. I am also getting the same error. I tried Ramki's solution, but it did not help. Could you kindly tell me what the issue could be?

  3. This comment has been removed by the author.

  4. This comment has been removed by the author.

  5. Using Spyder with Anaconda, got

    Exception: Java gateway process exited before sending the driver its port number

  6. Hi, after executing the line "sc = SparkContext('local', conf=conf)" I am getting an error like
    "FileNotFoundError: [WinError 2] The system cannot find the file specified"

  7. raise Exception("Java gateway process exited before sending its port number")

    Exception: Java gateway process exited before sending its port number

  8. This comment has been removed by the author.

  9. Needed to compose you a very little word to thank you yet again regarding the nice suggestions you’ve contributed here.
    Surya Informatics
