Friday, November 17, 2017

RDD Lineage and Persistence

Example 1: Without Persistence

>>> hs=sc.textFile('hdfs://cdh-vm/user/donghua/hadoopsecurity.txt')
>>> rdd1=hs.map(lambda line:line.upper())
>>> rdd2=rdd1.filter(lambda line:line.startswith('E'))
>>> rdd2.collect()
[u'ENABLING KERBEROS AUTHENTICATION USING CLOUDERA MANAGER']

>>> print rdd2.toDebugString()
(2) PythonRDD[15] at collect at <stdin>:1 []
  |  hdfs://cdh-vm/user/donghua/hadoopsecurity.txt MapPartitionsRDD[14] at textFile at NativeMethodAccessorImpl.java:-2 []
  |  hdfs://cdh-vm/user/donghua/hadoopsecurity.txt HadoopRDD[13] at textFile at NativeMethodAccessorImpl.java:-2 []

Example 2: With default Persistence for RDD1


>>> hs=sc.textFile('hdfs://cdh-vm/user/donghua/hadoopsecurity.txt')
> >> rdd1=hs.map(lambda line:line.upper())
>>> rdd1.persist()
PythonRDD[18] at RDD at PythonRDD.scala:43
>>> rdd2=rdd1.filter(lambda line:line.startswith('E'))
>>> rdd2.collect()
[u'ENABLING KERBEROS AUTHENTICATION USING CLOUDERA MANAGER']


>>> print rdd2.toDebugString()
(2) PythonRDD[19] at collect at <stdin>:1 []
|  PythonRDD[18] at RDD at PythonRDD.scala:43 []
  |      CachedPartitions: 2; MemorySize: 577.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
  |  hdfs://cdh-vm/user/donghua/hadoopsecurity.txt MapPartitionsRDD[17] at textFile at NativeMethodAccessorImpl.java:-2 []
  |  hdfs://cdh-vm/user/donghua/hadoopsecurity.txt HadoopRDD[16] at textFile at NativeMethodAccessorImpl.java:-2 []

Example 3: With default Persistence for HS, RDD1 and Memory_and_disk Persistence for RDD2

>>> from pyspark import StorageLevel
>>> hs=sc.textFile('hdfs://cdh-vm/user/donghua/hadoopsecurity.txt')
>>> hs.persist()
hdfs://cdh-vm/user/donghua/hadoopsecurity.txt MapPartitionsRDD[25] at textFile at NativeMethodAccessorImpl.java:-2
>>> rdd1=hs.map(lambda line:line.upper())
>>> rdd1.persist()
PythonRDD[26] at RDD at PythonRDD.scala:43
>>> rdd2=rdd1.filter(lambda line:line.startswith('E'))
>>> rdd2.persist(StorageLevel.MEMORY_AND_DISK)
PythonRDD[27] at RDD at PythonRDD.scala:43

>>> rdd2.collect()
[u'ENABLING KERBEROS AUTHENTICATION USING CLOUDERA MANAGER']
>>> print rdd2.toDebugString()
(2) PythonRDD[27] at RDD at PythonRDD.scala:43 [Disk Memory Deserialized 1x Replicated]
  |       CachedPartitions: 2; MemorySize: 128.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
  |  PythonRDD[26] at RDD at PythonRDD.scala:43 [Disk Memory Deserialized 1x Replicated]
  |      CachedPartitions: 2; MemorySize: 577.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
  |  hdfs://cdh-vm/user/donghua/hadoopsecurity.txt MapPartitionsRDD[25] at textFile at NativeMethodAccessorImpl.java:-2 [Disk Memory Deserialized 1x Replicated]
  |      CachedPartitions: 2; MemorySize: 491.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
  |  hdfs://cdh-vm/user/donghua/hadoopsecurity.txt HadoopRDD[24] at textFile at NativeMethodAccessorImpl.java:-2 [Disk Memory Deserialized 1x Replicated]