PySpark RDD | zipWithIndex method
PySpark RDD's zipWithIndex(~) method returns an RDD of tuples where the first element of each tuple is the value and the second element is its index. The first value of the first partition is given an index of 0.
Parameters
This method does not take in any parameters.
Return Value
A new PySpark RDD.
Examples
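Consider the following PySpark RDD with 2 partitions (sc is assumed to be an existing SparkContext):
rdd = sc.parallelize(['A', 'B', 'C'], numSlices=2)
rdd.collect()
['A', 'B', 'C']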
We can see the content of each partition using the glom() method:
rdd.glom().collect()
[['A'], ['B', 'C']]
We see that we indeed have 2 partitions, with the first partition containing the value 'A' and the second containing the values 'B' and 'C'.
We can create a new RDD of tuples containing positional index information using zipWithIndex(~):
new_rdd = rdd.zipWithIndex()
new_rdd.collect()
[('A', 0), ('B', 1), ('C', 2)]
We see that indexes are assigned by partition order first, and then by the order of the values within each partition. The first element of the first partition is assigned index 0, and the numbering continues into the next partition.
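To make the ordering rule concrete, here is a minimal pure-Python sketch of how such indexes could be derived from per-partition sizes. It does not use Spark at all: the function name zip_with_index and the list-of-lists input (standing in for the output of glom()) are assumptions made for illustration, not part of the PySpark API.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair every value with a global index, partition by partition.

    `partitions` is a plain list of lists standing in for the
    per-partition contents of an RDD; no Spark is involved.
    """
    # Number of values in each partition, e.g. [1, 2].
    sizes = [len(part) for part in partitions]
    # Starting index of each partition: the total size of all partitions before it.
    starts = [0] + list(accumulate(sizes))[:-1]
    # Within a partition, count upwards from that partition's starting index.
    return [
        [(value, index) for index, value in enumerate(part, start)]
        for part, start in zip(partitions, starts)
    ]


partitions = [['A'], ['B', 'C']]
indexed = zip_with_index(partitions)
print(indexed)  # [[('A', 0)], [('B', 1), ('C', 2)]]

# Flatten the partitions into a single list of (value, index) tuples.
flat = [pair for part in indexed for pair in part]
print(flat)  # [('A', 0), ('B', 1), ('C', 2)]
```

The sketch also shows why the starting index of a partition depends on the sizes of all partitions before it. This is consistent with the PySpark documentation's note that zipWithIndex(~) needs to trigger a Spark job when the RDD has more than one partition.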