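from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse the active context if one exists
rdd = sc.parallelize(["A","B","C","D","A"], numSlices=3)
rdd.glom().collect()

[['A'], ['B', 'C'], ['D', 'A']]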

Here:

parallelize(~) creates an RDD with 3 partitions.
glom() groups the elements of each partition into a list, so collect() shows the actual content of each partition.

Reducing the number of partitions of an RDD

To reduce the number of partitions to 2:

new_rdd = rdd.coalesce(numPartitions=2)
new_rdd.glom().collect()

[['A'], ['B', 'C', 'D', 'A']]

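We can see that the 2nd and 3rd partitions were merged into one, while the 1st partition was left unchanged. Because no shuffle is involved, whole partitions are combined and individual elements are not redistributed.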
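
The merge above can be imitated in plain Python to make the grouping explicit. This is only a toy model, not Spark's implementation: the function name merge_partitions and the index formula are assumptions chosen for illustration, and on a real cluster Spark may also take data locality into account when deciding which partitions to group.

```python
def merge_partitions(partitions, num_partitions):
    """Toy model of coalesce without shuffle: combine whole, adjacent partitions.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    n = len(partitions)
    # Like coalesce without shuffle, never increase the partition count.
    num_partitions = min(num_partitions, n)
    merged = []
    for i in range(num_partitions):
        # Assumed grouping rule: contiguous ranges of partition indices.
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        group = []
        for part in partitions[start:end]:
            group.extend(part)  # elements keep their order; none are split up
        merged.append(group)
    return merged


before = [['A'], ['B', 'C'], ['D', 'A']]
after = merge_partitions(before, 2)
print(after)
```

With the three partitions from the first example, this sketch produces the same grouping as the coalesce(~) output shown above.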

Balanced partitioning of an RDD using shuffle

Instead of merging partitions to reduce the number of partitions, we can also shuffle the data:

new_rdd = rdd.coalesce(numPartitions=2, shuffle=True)
new_rdd.glom().collect()

[['A', 'D', 'A'], ['B', 'C']]

As you can see, this results in a more balanced partitioning. The downside is that shuffling is costly when the data is large, because records must be transferred from one worker node to another.
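
One way to make "more balanced" concrete is to compare partition sizes. The sketch below uses the two glom() outputs shown above as plain Python lists, so it runs without a Spark session. partition_sizes is a hypothetical helper, not a PySpark function.

```python
def partition_sizes(partitions):
    """Return the number of elements in each partition (hypothetical helper)."""
    return [len(part) for part in partitions]


# Outputs of glom().collect() copied from the two examples above.
merged = [['A'], ['B', 'C', 'D', 'A']]      # coalesce(numPartitions=2)
shuffled = [['A', 'D', 'A'], ['B', 'C']]    # coalesce(numPartitions=2, shuffle=True)

for name, parts in [("merged", merged), ("shuffled", shuffled)]:
    sizes = partition_sizes(parts)
    # A smaller gap between the largest and smallest partition means better balance.
    print(name, sizes, "gap:", max(sizes) - min(sizes))
```

On a live RDD, the same sizes can be obtained with rdd.glom().map(len).collect().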

Published by Isshin Inada

Official PySpark Documentation

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.coalesce.html
