PySpark RDD | coalesce method
Start your free 7-days trial now!
PySpark RDD's coalesce(~)
method returns a new RDD with the number of partitions reduced.
Parameters
1. numPartitions
| int
The number of partitions to reduce to.
2. shuffle
| boolean
| optional
Whether or not to shuffle the data such that they end up in different partitions. By default, shuffle=False
.
Return Value
A PySpark RDD (pyspark.rdd.RDD
).
Examples
Consider the following RDD with 3 partitions:
[['A'], ['B', 'C'], ['D', 'A']]
Here:
parallelize(~)
creates a RDD with 3 partitionsglom()
shows the actual content of each partition.
Reducing the number of partitions of RDD
To reduce the number of partitions to 2:
We can see that the 2nd partition merged with the 3rd partition.
Balanced partitioning of RDD using shuffle
Instead of merging partitions to reduce the number partitions, we can also shuffle the data:
As you can see, this results in a partitioning that is more balanced. The downside to shuffling, however, is that this is a costly process when your data size is large since data must be transferred from one worker node to another.