PySpark SparkContext | parallelize method
PySpark SparkContext's parallelize(~) method creates an RDD (resilient distributed dataset) from the given dataset.
Parameters
1. c | any
The data you want to convert into an RDD. Typically, you would pass a list of values.
2. numSlices | int | optional
The number of partitions to use. By default, the parallelism level set in the Spark configuration will be used for the number of partitions:
sc.defaultParallelism # For my configs, this is set to 8
8
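This default comes from the spark.default.parallelism setting, which you can override when building the SparkContext. A minimal sketch (assuming no SparkContext is running yet in the session; the value 4 is arbitrary):
from pyspark import SparkConf, SparkContext
conf = SparkConf().set("spark.default.parallelism", "4")
sc = SparkContext(conf=conf)
sc.defaultParallelism  # Now returns 4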
Return Value
A PySpark RDD (pyspark.rdd.RDD).
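As a quick sanity check that parallelize(~) indeed returns this type:
rdd = sc.parallelize([1, 2, 3])
type(rdd)
<class 'pyspark.rdd.RDD'>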
Examples
Creating an RDD with a list of values
To create an RDD, use the parallelize(~) method:
rdd = sc.parallelize(["A","B","C","A"])
rdd.collect()
['A', 'B', 'C', 'A']
The default number of partitions, as specified by my Spark configuration, is:
rdd.getNumPartitions()
8
Creating an RDD with a specific number of partitions
To create an RDD with 3 partitions from a list:
rdd = sc.parallelize(["A","B","C","A"], numSlices=3)
rdd.collect()
['A', 'B', 'C', 'A']
Here, Spark is partitioning our list into 3 sub-datasets. We can see the content of each partition using the glom() method:
rdd.glom().collect()
[['A'], ['B'], ['C', 'A']]
We can indeed see that there are 3 partitions: the first holds 'A', the second holds 'B', and the third holds 'C' and 'A'.
Notice how the two occurrences of the value 'A' do not necessarily end up in the same partition; the partitioning is done naively, based on the ordering of the list.
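To make the positional slicing concrete, here is a small sketch that mirrors how Spark chops a list of length n into numSlices index ranges (a simplified model, not PySpark's actual implementation):
def slice_ranges(n, num_slices):
    # Each partition i covers the half-open index range [start, end),
    # computed purely from positions; values are never inspected.
    return [(int(i * n / num_slices), int((i + 1) * n / num_slices))
            for i in range(num_slices)]

data = ["A", "B", "C", "A"]
[data[start:end] for start, end in slice_ranges(len(data), 3)]
[['A'], ['B'], ['C', 'A']]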
Creating a pair RDD
To create a pair RDD, pass a list of tuples like so:
rdd = sc.parallelize([("A",1),("B",1),("C",1),("A",1)])
rdd.collect()
[('A', 1), ('B', 1), ('C', 1), ('A', 1)]
Note that parallelize will not perform partitioning based on the key, as shown here:
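As a sketch, we can rebuild the same pair RDD with numSlices=3 and inspect the partitions with glom():
rdd = sc.parallelize([("A",1),("B",1),("C",1),("A",1)], numSlices=3)
rdd.glom().collect()
[[('A', 1)], [('B', 1)], [('C', 1), ('A', 1)]]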
We can see that, just like in the previous case, the partitioning is done based on the ordering of the list.
What makes pair RDDs special is that we can call additional methods such as reduceByKey(~), which performs a group-by on the key and applies a custom reduction function:
rdd = sc.parallelize([("A",1),("B",1),("C",1),("A",1)], numSlices=3)
rdd.reduceByKey(lambda a, b: a + b).collect()
[('B', 1), ('C', 1), ('A', 2)]
Here, the reduction function that we used is a simple summation.
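Note that the reduction function must be commutative and associative, since Spark may apply it in any order across partitions. As a hypothetical variant, we could keep the maximum value per key instead (the ordering of the output may differ from run to run):
rdd = sc.parallelize([("A",1),("B",7),("A",5)], numSlices=3)
rdd.reduceByKey(lambda a, b: max(a, b)).collect()
[('B', 7), ('A', 5)]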
Creating an RDD from a Pandas DataFrame
Consider the following Pandas DataFrame:
   A  B
0  3  5
1  4  6
To create an RDD that contains the values of this Pandas DataFrame, one way is to convert it into a PySpark DataFrame first and then grab the underlying RDD:
rdd = spark.createDataFrame(df).rdd
rdd.collect()
[Row(A=3, B=5), Row(A=4, B=6)]
Notice how each row of the DataFrame becomes a Row object; the column labels survive only as Row field names, and the DataFrame's index is not included in the RDD.
Even though parallelize(~) can accept a Pandas DataFrame directly, this does not give us the desired RDD:
rdd = sc.parallelize(df)
rdd.collect()
['A', 'B']
As you can see, the rdd only contains the column labels but not the data itself. This is because iterating over a Pandas DataFrame yields the column names rather than the rows.
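If you want an RDD of plain value lists rather than Row objects, one sketch (assuming the same df as above) is to parallelize the DataFrame's values:
rdd = sc.parallelize(df.values.tolist())
rdd.collect()
[[3, 5], [4, 6]]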