PySpark DataFrame | coalesce method
PySpark DataFrame's coalesce(~) method reduces the number of partitions of the PySpark DataFrame without shuffling.
Parameters
1. num_partitions | int
The target number of partitions for the PySpark DataFrame's data. Note that coalesce(~) can only reduce the partition count: if a value larger than the current number of partitions is given, the partition count is left unchanged.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
The default number of partitions is governed by your PySpark configuration. In my case, the default number of partitions is:
8
We can see the actual content of each partition of the PySpark DataFrame by using the underlying RDD's glom() method:
We can see that we indeed have 8 partitions, 3 of which contain a Row.
Reducing the number of partitions of a PySpark DataFrame without shuffling
To reduce the number of partitions of the DataFrame without shuffling, use coalesce(~):
Here, we can see that we now only have 2 partitions!
Both the methods repartition(~) and coalesce(~) are used to change the number of partitions, but here are some notable differences:
- repartition(~) generally results in a shuffling operation while coalesce(~) does not. This means that coalesce(~) is less costly than repartition(~) because the data does not have to travel across the worker nodes as much.
- coalesce(~) is used specifically for reducing the number of partitions.