PySpark RDD | collect method
Last updated: Aug 12, 2023
Tags: PySpark
PySpark RDD's collect(~)
method returns a list containing all the elements of the RDD.
Parameters
This method does not take in any parameters.
Return Value
A Python standard list.
Examples
Converting a PySpark RDD into a list of values
Consider the following RDD:
rdd
ParallelCollectionRDD[7] at readRDDFromInputStream at PythonRDD.scala:413
This RDD is partitioned into 8 subsets:
rdd.getNumPartitions()
8
Depending on your configuration, these 8 partitions may reside on multiple machines (worker nodes). The collect(~)
method sends all of the RDD's data to the driver node and packs it into a single list:
rdd.collect()
[4, 2, 5, 7]
WARNING
All the data from the worker nodes is sent to the driver node, so make sure the driver has enough memory - otherwise you will end up with an OutOfMemoryError!
Published by Isshin Inada
Official PySpark Documentation
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.collect.html