PySpark RDD | collectAsMap method
Last updated: Aug 12, 2023
Tags: PySpark
PySpark RDD's collectAsMap(~) method collects all the elements of a pair RDD to the driver node and converts the RDD into a dictionary.
NOTE
A pair RDD is an RDD whose elements are key-value tuples.
Parameters
This method does not take in any parameters.
Return Value
A dictionary.
Examples
Consider the following PySpark pair RDD:
rdd = sc.parallelize([("a", 5), ("b", 2), ("c", 3)])
rdd.collect()
[('a', 5), ('b', 2), ('c', 3)]
Here, we use the parallelize(~) method to create a pair RDD.
Converting a pair RDD into a dictionary in PySpark
To convert a pair RDD into a dictionary in PySpark, use the collectAsMap()
method:
rdd.collectAsMap()
{'a': 5, 'b': 2, 'c': 3}
WARNING
Since all the underlying data in the RDD is sent to the driver node, you may encounter an OutOfMemoryError if the data is too large to fit in the driver's memory.
In case of duplicate keys
When the pair RDD contains duplicate keys, later key-value pairs overwrite earlier ones:
rdd = sc.parallelize([("a", 5), ("b", 2), ("a", 6)])
rdd.collectAsMap()
{'a': 6, 'b': 2}
Here, the tuple ("a", 6) has overwritten ("a", 5).
Published by Isshin Inada
Official PySpark Documentation
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.collectAsMap.html