PySpark RDD | countByKey method
Last updated: Aug 12, 2023
Tags: PySpark
PySpark RDD's countByKey(~) method groups the elements of a pair RDD by key and counts the number of elements in each group.
Parameters
This method does not take in any parameters.
Return Value
A DefaultDict[key, int] mapping each key to its count.
Examples
Consider the following PySpark pair RDD:

rdd = sc.parallelize([('a', 5), ('a', 1), ('b', 2), ('c', 4)])

Here, we are using the parallelize(~) method to create an RDD from a Python list of key-value tuples.
Getting the count of each group in PySpark Pair RDD
To group by the key, and get the count of each group:
rdd.countByKey()
defaultdict(int, {'a': 2, 'b': 1, 'c': 1})
Here, the returned value is a DefaultDict, which is essentially a dictionary in which accessing a key that does not exist returns 0 instead of throwing a KeyError.
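To see what countByKey(~) computes, the same counting logic can be sketched in plain Python using only the standard library (a minimal illustration of the semantics; the actual Spark execution is distributed across partitions):

```python
from collections import defaultdict

pairs = [('a', 5), ('a', 1), ('b', 2), ('c', 4)]

# Equivalent logic to rdd.countByKey(): count how many times each key occurs,
# ignoring the values entirely
counts = defaultdict(int)
for key, _value in pairs:
    counts[key] += 1

print(counts)  # defaultdict(<class 'int'>, {'a': 2, 'b': 1, 'c': 1})
```

Note that only the keys are counted; the values (5, 1, 2, 4) play no role in the result.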
You can access the count of a key just as you would for an ordinary dictionary:
counts = rdd.countByKey()
counts["a"]
2
Accessing the count of a key that does not exist will return 0:
counts = rdd.countByKey()
counts["z"]
0
Published by Isshin Inada
Official PySpark Documentation
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.countByKey.html