PySpark RDD | zip method
Last updated: Aug 12, 2023
Tags: PySpark
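PySpark RDD's zip(~) method combines the elements of two RDDs into a single RDD of tuples, pairing the first element of one RDD with the first element of the other, the second with the second, and so on.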
Parameters
1. other | RDD
The other RDD to combine with.
Return Value
A new PySpark RDD of tuples.
Examples
Combining two PySpark RDDs into a single RDD of tuples
Consider the following two PySpark RDDs:
Here, we are using the parallelize(~) method to create two RDDs, each having 3 partitions.
We can see the actual values in each partition using the glom(~) method:
We see that RDD x indeed has 3 partitions, and we have 2 elements in each partition. The same can be said for RDD y:
We can combine the two RDDs x and y into a single RDD of tuples using the zip(~) method:
zipped_rdd = x.zip(y)
zipped_rdd.collect()
[(0, 10), (1, 11), (2, 12), (3, 13), (4, 14), (5, 15)]
WARNING
To use the zip(~) method, the two RDDs must have exactly the same number of partitions and exactly the same number of elements in each partition.
Published by Isshin Inada
Official PySpark Documentation
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.zip.html