PySpark RDD | repartition method
Start your free 7-days trial now!
PySpark RDD's repartition(~)
method splits the RDD into the specified number of partitions.
When we first create RDDs, they will already be partitioned under the hood, which means that all RDDs are already partitioned. This method is called repartition(~)
(emphasis on the re
) because we are changing the existing partitioning.
Parameters
1. numPartitions
| int
The number of partitions in which to split the RDD.
Return Value
A PySpark RDD (pyspark.rdd.RDD
).
Examples
Re-partitioning a RDD with certain number of partitions
Consider the following RDD:
['A', 'B', 'C', 'A', 'A', 'B']
Here, we are using the parallelize(~)
method to create a RDD with 3 partitions.
We can use the glom()
method to see the actual content of the partitions:
To repartition our RDD into 2 partitions:
Notice how even if we repartition our RDD:
the same values do not necessarily end up in the same partition (
'A'
can be found in both partitions)the number of elements in each partition may also not be balanced - here we have 4 elements in the first partition, while only 2 elements in the second partition.
The repartition(~)
method involves shufflinglink, even when reducing the number of partitions. To avoid shuffling when reducing the number of partitions, use RDD's coalesce(~)
method instead.