PySpark DataFrame | coalesce method
PySpark DataFrame's coalesce(~)
method reduces the number of partitions of the PySpark DataFrame without shuffling.
Parameters
1. num_partitions
| int
The target number of partitions for the PySpark DataFrame's data. Since coalesce(~) can only reduce the partition count, passing a value larger than the current number of partitions leaves the DataFrame unchanged.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
The default number of partitions is governed by your PySpark configuration. In my case, the default number of partitions is:
8
We can see the actual content of each partition of the PySpark DataFrame by using the underlying RDD's glom()
method:
We can see that we indeed have 8 partitions, 3 of which contain a Row.
Reducing the number of partitions of a PySpark DataFrame without shuffling
To reduce the number of partitions of the DataFrame without shuffling, use coalesce(~):
Here, we can see that we now only have 2 partitions!
Both the methods repartition(~) and coalesce(~) are used to change the number of partitions, but here are some notable differences:

repartition(~) generally results in a shuffling operation while coalesce(~) does not. This means that coalesce(~) is less costly than repartition(~) because the data does not have to travel across the worker nodes as much.

coalesce(~) is used specifically for reducing the number of partitions.