Spark Scala

RDD Internal

Deep Dive and Notes

Posted by Hanke on December 24, 2020

RDD Five Main Properties

A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- My own note: Spark later checks the partitioner to decide whether a shuffle is needed; an RDD that is already partitioned the required way can skip it
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
- The preferred locations, if there are any, are used later by the scheduler for data locality
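To make the five properties concrete, here is a minimal sketch that inspects them on a few small RDDs through the public `RDD` API (`partitions`, `dependencies`, `partitioner`, `preferredLocations`). It assumes Spark is on the classpath and runs in local mode; the object name `RddPropertiesDemo`, the helper `startsWithShuffle`, and the sample data are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object, not part of Spark.
object RddPropertiesDemo {

  // True when the RDD's first dependency is a shuffle (wide) dependency.
  def startsWithShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.headOption.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 2 threads is enough for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-properties-demo")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)

      // 1. A list of partitions.
      println(s"partitions: ${pairs.partitions.length}")

      // 2. The compute function is invoked by Spark per partition when an action runs.
      println(s"computed: ${pairs.collect().toList}")

      // 3. Dependencies on other RDDs: partitionBy introduces a shuffle dependency.
      val partitioner = new HashPartitioner(4)
      val parted = pairs.partitionBy(partitioner)
      println(s"partitionBy shuffles: ${startsWithShuffle(parted)}")

      // 4. The optional partitioner. A plain parallelized RDD has none.
      println(s"pairs.partitioner:  ${pairs.partitioner}")
      println(s"parted.partitioner: ${parted.partitioner}")

      // The shuffle check: reducing with the same partitioner reuses the
      // existing layout, so the result depends narrowly on its parent.
      val reduced = parted.reduceByKey(partitioner, _ + _)
      println(s"reduceByKey shuffles again: ${startsWithShuffle(reduced)}")

      // mapValues keeps the partitioner; map may change keys, so it drops it.
      println(s"after mapValues: ${parted.mapValues(_ + 1).partitioner}")
      println(s"after map:       ${parted.map(identity).partitioner}")

      // 5. Preferred locations per partition. An in-memory collection has no
      // locality preference, so expect an empty list here; an HDFS-backed RDD
      // would report the hosts that hold each block.
      println(s"preferred: ${pairs.preferredLocations(pairs.partitions(0))}")
    } finally {
      sc.stop()
    }
  }
}
```

The part worth noticing is property 4: because `parted` already carries the `HashPartitioner(4)` that `reduceByKey` is asked to use, the second stage does not need another shuffle, which is exactly the check mentioned in the note above.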