rdd = sc.parallelize(['A','B','C'], 2)
rdd.collect()
['A', 'B', 'C']

We can see the content of each partition using the glom() method:

rdd.glom().collect()
[['A'], ['B', 'C']]

We see that we indeed have 2 partitions with the first partition containing the value 'A', and the second containing the values 'B' and 'C'.

We can create a new RDD of tuples containing positional index information using zipWithIndex(~):

new_rdd = rdd.zipWithIndex()
new_rdd.collect()
[('A', 0), ('B', 1), ('C', 2)]

We see that indices are assigned in partition order, and then by each element's position within its partition. The first element of the first partition gets index 0, and the last element of the last partition gets the largest index.
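
To make the ordering rule concrete, here is a minimal plain-Python sketch that mimics how indices are assigned across partitions. It does not use Spark and is not Spark's actual implementation. The helper name zip_with_index_local is hypothetical, and the list of lists stands in for the output of glom().

```python
from itertools import accumulate

def zip_with_index_local(partitions):
    """Mimic the index assignment of zipWithIndex on a list of partitions.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    # The starting index of each partition is the total number of
    # elements held by all the partitions that come before it.
    sizes = [len(part) for part in partitions]
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]

# Same layout as the glom() output above
partitions = [['A'], ['B', 'C']]
result = zip_with_index_local(partitions)
print(result)  # [[('A', 0)], [('B', 1), ('C', 2)]]
```

The sketch needs every partition's size before it can index the later partitions. This is consistent with the PySpark documentation's note that zipWithIndex has to trigger a Spark job when the RDD has more than one partition.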

