Monday, March 18, 2019

RDD lambda function with tuple unpacking raises SyntaxError in Spark 2.4 (Python 3)

Spark 1.6

[donghua@cdh5 ~]$ pyspark
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkContext available as sc, HiveContext available as sqlContext.
>>> rdd1 = sc.textFile('file:///tmp/postal.txt')
>>> rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda (k,v): (k, (v.split('\t')[1],v.split('\t')[2]))).take(2)
[(u'00210', (u'43.00589', u'-71.01320')), (u'01014', (u'42.17073', u'-72.60484'))]
>>> 
[donghua@cdh5 ~]$ 

Spark 2.3
[donghua@cdh5 ~]$ pyspark2
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/03/18 09:55:51 WARN lineage.LineageWriter: Lineage directory /var/log/spark2/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
19/03/18 09:55:52 WARN lineage.LineageWriter: Lineage directory /var/log/spark2/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0.cloudera4
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>> rdd1 = sc.textFile('file:///tmp/postal.txt')
>>> rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda (k,v): (k, (v.split('\t')[1],v.split('\t')[2]))).take(2)
[(u'00210', (u'43.00589', u'-71.01320')), (u'01014', (u'42.17073', u'-72.60484'))]
>>> 

Spark 2.4

Donghuas-MacBook-Air:data donghua$ cd /Users/donghua/spark-2.4.0-bin-hadoop2.7;/Users/donghua/spark-2.4.0-bin-hadoop2.7/bin/pyspark --master spark://Donghuas-MacBook-Air.local:7077
Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-03-18 09:57:59 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.6.8 (default, Dec 29 2018 19:04:46)
SparkSession available as 'spark'.
>>> rdd1 = sc.textFile('file:///Users/donghua/spark-2.4.0-bin-hadoop2.7/data/data/postal.txt')
>>> rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda (k,v): (k, (v.split('\t')[1],v.split('\t')[2]))).take(2)
  File "<stdin>", line 1
    rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda (k,v): (k, (v.split('\t')[1],v.split('\t')[2]))).take(2)
                                                            ^
SyntaxError: invalid syntax
>>> rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda v: (v[0], (v[1].split('\t')[1],v[1].split('\t')[2]))).take(2)
2019-03-18 09:59:23 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[('00210', ('43.00589', '-71.01320')), ('01014', ('42.17073', '-72.60484'))]    
>>> 
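The failure is not a Spark 2.4 regression: the Spark 1.6 and 2.3 shells above run Python 2.7, where `lambda (k,v): ...` is legal, while the Spark 2.4 shell runs Python 3.6, where PEP 3113 removed tuple parameter unpacking. A minimal sketch of the Python 3-safe rewrite, simulating the keyBy/map pipeline on plain lists (hypothetical sample rows in the same tab-separated shape as postal.txt):

```python
# PEP 3113 removed tuple parameter unpacking, so Python 3 rejects
# `lambda (k, v): ...`. Unpack inside the function body instead;
# splitting the line once also avoids calling split() three times.
lines = ["00210\t43.00589\t-71.01320", "01014\t42.17073\t-72.60484"]

# Equivalent of rdd1.keyBy(lambda line: line.split('\t')[0]):
# each element becomes (key, original_line).
keyed = [(line.split('\t')[0], line) for line in lines]

def to_coords(kv):
    k, v = kv                # unpack the pair in the body, not the signature
    fields = v.split('\t')   # split once and reuse the fields
    return (k, (fields[1], fields[2]))

result = list(map(to_coords, keyed))
# result == [('00210', ('43.00589', '-71.01320')),
#            ('01014', ('42.17073', '-72.60484'))]
```

An alternative that sidesteps unpacking entirely is `mapValues`, which transforms only the value of each pair: `rdd1.keyBy(lambda line: line.split('\t')[0]).mapValues(lambda v: tuple(v.split('\t')[1:3]))` runs unchanged on both Python 2 and 3.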

Reference: 
PEP 3113 -- Removal of Tuple Parameter Unpacking
https://www.python.org/dev/peps/pep-3113/

