PySpark RDD | countByKey method
Last updated: Aug 12, 2023
Tags: PySpark
PySpark RDD's countByKey(~) method groups the elements of a pair RDD by key and counts the number of elements in each group.
Parameters
This method does not take in any parameters.
Return Value
A DefaultDict[key, int] mapping each key to its count.
Examples
Consider the following PySpark pair RDD:

rdd = sc.parallelize([('a', 5), ('a', 1), ('b', 2), ('c', 4)])

Here, we are using the parallelize(~) method to create an RDD from a Python list of key-value tuples.
Getting the count of each group in PySpark Pair RDD
To group by the key, and get the count of each group:
rdd.countByKey()
defaultdict(int, {'a': 2, 'b': 1, 'c': 1})
Here, the returned value is a DefaultDict, which is essentially a dictionary in which accessing a key that does not exist returns 0 instead of throwing a KeyError.
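To see what countByKey(~) computes, the same counting logic can be sketched in plain Python using only the standard library (a minimal illustration of the semantics; the actual Spark execution is distributed across partitions):

```python
from collections import defaultdict

pairs = [('a', 5), ('a', 1), ('b', 2), ('c', 4)]

# Equivalent logic to rdd.countByKey(): count how many times each key occurs,
# ignoring the values entirely
counts = defaultdict(int)
for key, _value in pairs:
    counts[key] += 1

print(counts)  # defaultdict(<class 'int'>, {'a': 2, 'b': 1, 'c': 1})
```

Note that only the keys are counted; the values (5, 1, 2, 4) play no role in the result.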
You can access the count of a key just as you would for an ordinary dictionary:
counts = rdd.countByKey()
counts["a"]
2
Accessing the count of a key that does not exist will return 0:
counts = rdd.countByKey()
counts["z"]
0
Published by Isshin Inada
Official PySpark Documentation
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.countByKey.html