Wednesday, March 27, 2019

How to use the correct pyspark & Spark version in Python

  • Method 1: Use environment variables
    – The variables can be set with export SPARK_HOME='/Users/donghua/spark-2.4.0-bin-hadoop2.7' before executing the python command
    – The method below sets the variables inside the Python script itself, which is useful for notebook environments

Donghuas-MacBook-Air:spark-2.4.0-bin-hadoop2.7 donghua$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['SPARK_HOME'] = '/Users/donghua/spark-2.4.0-bin-hadoop2.7'
>>> os.environ['PYTHONPATH'] = '/Users/donghua/spark-2.4.0-bin-hadoop2.7/python:/Users/donghua/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip'
>>> os.environ['PYSPARK_PYTHON'] = 'python3'
>>> os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'
>>> print (os.environ['SPARK_HOME'] )
/Users/donghua/spark-2.4.0-bin-hadoop2.7
>>> from pyspark import SparkContext
>>> sc = SparkContext('local','handson Spark')
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> print(sc.version)
2.4.0
>>> exit()
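
For reference, Method 1 can also be written as a short standalone script. A minimal sketch, assuming the same 2.4.0 install path as above. Note that setting os.environ['PYTHONPATH'] only influences child processes (such as executor workers), so the sketch additionally puts the Spark python directories on sys.path so that import pyspark resolves to the 2.4.0 copy (this is essentially what findspark does in Method 2 below):

import os
import sys

# Adjust for your own install location
SPARK_HOME = '/Users/donghua/spark-2.4.0-bin-hadoop2.7'

os.environ['SPARK_HOME'] = SPARK_HOME
os.environ['PYSPARK_PYTHON'] = 'python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'

# PYTHONPATH in os.environ only affects child processes; the current
# interpreter needs the Spark python libraries on sys.path as well
sys.path.insert(0, os.path.join(SPARK_HOME, 'python'))
sys.path.insert(0, os.path.join(SPARK_HOME, 'python/lib/py4j-0.10.7-src.zip'))

from pyspark import SparkContext

sc = SparkContext('local', 'handson Spark')
print(sc.version)   # expected: 2.4.0
sc.stop()
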
  • Method 2: Use the findspark package (recommended)
    Donghuas-MacBook-Air:spark-2.4.0-bin-hadoop2.7 donghua$ python
    Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
    [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import findspark
    >>> findspark.init('/Users/donghua/spark-2.4.0-bin-hadoop2.7')
    >>> from pyspark import SparkContext
    >>> sc = SparkContext('local','handson Spark')
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    >>> print(sc.version)
    2.4.0
    >>> exit()
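
    If SPARK_HOME is already set in the environment, findspark.init() can also be called with no argument; it falls back to the SPARK_HOME environment variable to locate Spark. A minimal sketch, assuming SPARK_HOME points at the 2.4.0 install:

    import os
    import findspark

    # findspark.init() with no path argument falls back to SPARK_HOME
    os.environ['SPARK_HOME'] = '/Users/donghua/spark-2.4.0-bin-hadoop2.7'
    findspark.init()

    from pyspark import SparkContext

    sc = SparkContext('local', 'handson Spark')
    print(sc.version)   # expected: 2.4.0
    sc.stop()
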
    
    Without setting the variable, pyspark falls back to the default Spark home, so the outcome depends on where the pyspark package is installed (in this case, Spark 2.3.2 is used instead of 2.4.0):
    Donghuas-MacBook-Air:spark-2.4.0-bin-hadoop2.7 donghua$ python
    Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
    [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from pyspark import SparkContext
    >>> sc = SparkContext('local','handson Spark')
    2019-03-27 18:22:32 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    >>> 
    >>> print(sc.version)
    2.3.2
    >>> exit()
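
    To confirm which copy of the pyspark Python package the interpreter actually imported (as opposed to the JVM version that sc.version reports), the standard module attributes can be inspected. A quick sketch:

    import pyspark

    # __file__ shows which installation was imported, e.g. a pip-installed
    # site-packages copy vs. the one shipped under SPARK_HOME/python
    print(pyspark.__file__)
    print(pyspark.__version__)   # Python-side package version, e.g. 2.3.2 here
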
    
