PySpark DataFrame | coalesce method
PySpark DataFrame's coalesce(~) method reduces the number of partitions of the PySpark DataFrame without shuffling.
Parameters
1. num_partitions | int
The target number of partitions for the PySpark DataFrame's data. Note that coalesce(~) can only reduce the partition count: if a value larger than the current number of partitions is given, the partition count is left unchanged.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
The default number of partitions is governed by your PySpark configuration. In my case, the default number of partitions is:
8
We can see the actual content of each partition of the PySpark DataFrame by using the underlying RDD's glom() method:
We can see that we indeed have 8 partitions, 3 of which contain a Row.
Reducing the number of partitions of a PySpark DataFrame without shuffling
To reduce the number of partitions of the DataFrame without shuffling, use coalesce(~):
Here, we can see that we now only have 2 partitions!
Both the methods repartition(~) and coalesce(~) are used to change the number of partitions, but here are some notable differences:
- repartition(~) generally results in a shuffling operation while coalesce(~) does not. This means that coalesce(~) is less costly than repartition(~) because the data does not have to travel across the worker nodes as much.
- coalesce(~) is used specifically for reducing the number of partitions.