PySpark SparkContext | parallelize method
PySpark SparkContext's parallelize(~) method creates an RDD (resilient distributed dataset) from the given dataset.
Parameters
1. c | any
The data you want to convert into an RDD. Typically, you would pass a list of values.
2. numSlices | int | optional
The number of partitions to use. By default, the parallelism level set in the Spark configuration will be used for the number of partitions:
sc.defaultParallelism # For my configs, this is set to 8
8
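This default comes from the spark.default.parallelism setting, which you can override when building the SparkContext. A minimal sketch (assuming no SparkContext is running yet in the session; the value 4 is arbitrary):
from pyspark import SparkConf, SparkContext
conf = SparkConf().set("spark.default.parallelism", "4")
sc = SparkContext(conf=conf)
sc.defaultParallelism  # Now returns 4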
Return Value
A PySpark RDD (pyspark.rdd.RDD).
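As a quick sanity check that parallelize(~) indeed returns this type:
rdd = sc.parallelize([1, 2, 3])
type(rdd)
<class 'pyspark.rdd.RDD'>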
Examples
Creating an RDD with a list of values
To create an RDD, use the parallelize(~) method:
rdd = sc.parallelize(["A","B","C","A"])
rdd.collect()
['A', 'B', 'C', 'A']
The default number of partitions, as specified by my Spark configuration, is:
rdd.getNumPartitions()
8
Creating an RDD with a specific number of partitions
To create an RDD with 3 partitions from a list:
rdd = sc.parallelize(["A","B","C","A"], numSlices=3)
rdd.collect()
['A', 'B', 'C', 'A']
Here, Spark is partitioning our list into 3 sub-datasets. We can see the content of each partition using the glom() method:
rdd.glom().collect()
[['A'], ['B'], ['C', 'A']]
We can indeed see that there are 3 partitions: the first holds 'A', the second holds 'B', and the third holds 'C' and 'A'.
Notice how the two occurrences of the value 'A' do not necessarily end up in the same partition; the partitioning is done naively, based on the ordering of the list.
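To make the positional slicing concrete, here is a small sketch that mirrors how Spark chops a list of length n into numSlices index ranges (a simplified model, not PySpark's actual implementation):
def slice_ranges(n, num_slices):
    # Each partition i covers the half-open index range [start, end),
    # computed purely from positions; values are never inspected.
    return [(int(i * n / num_slices), int((i + 1) * n / num_slices))
            for i in range(num_slices)]

data = ["A", "B", "C", "A"]
[data[start:end] for start, end in slice_ranges(len(data), 3)]
[['A'], ['B'], ['C', 'A']]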
Creating a pair RDD
To create a pair RDD, pass a list of tuples like so:
rdd = sc.parallelize([("A",1),("B",1),("C",1),("A",1)])
rdd.collect()
[('A', 1), ('B', 1), ('C', 1), ('A', 1)]
Note that parallelize will not perform partitioning based on the key, as shown here:
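As a sketch, we can rebuild the same pair RDD with numSlices=3 and inspect the partitions with glom():
rdd = sc.parallelize([("A",1),("B",1),("C",1),("A",1)], numSlices=3)
rdd.glom().collect()
[[('A', 1)], [('B', 1)], [('C', 1), ('A', 1)]]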
We can see that, just like in the previous case, the partitioning is done based on the ordering of the list.
What makes pair RDDs special is that we can call additional methods such as reduceByKey(~), which performs a group-by on the key and applies a custom reduction function:
rdd = sc.parallelize([("A",1),("B",1),("C",1),("A",1)], numSlices=3)
rdd.reduceByKey(lambda a, b: a + b).collect()
[('B', 1), ('C', 1), ('A', 2)]
Here, the reduction function that we used is a simple summation.
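Note that the reduction function must be commutative and associative, since Spark may apply it in any order across partitions. As a hypothetical variant, we could keep the maximum value per key instead (the ordering of the output may differ from run to run):
rdd = sc.parallelize([("A",1),("B",7),("A",5)], numSlices=3)
rdd.reduceByKey(lambda a, b: max(a, b)).collect()
[('B', 7), ('A', 5)]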
Creating an RDD from a Pandas DataFrame
Consider the following Pandas DataFrame:
   A  B
0  3  5
1  4  6
To create an RDD that contains the values of this Pandas DataFrame, one way is to convert it into a PySpark DataFrame first and then grab the underlying RDD:
rdd = spark.createDataFrame(df).rdd
rdd.collect()
[Row(A=3, B=5), Row(A=4, B=6)]
Notice how each row of the DataFrame becomes a Row object; the column labels survive only as Row field names, and the DataFrame's index is not included in the RDD.
Even though parallelize(~) can accept a Pandas DataFrame directly, this does not give us the desired RDD:
rdd = sc.parallelize(df)
rdd.collect()
['A', 'B']
As you can see, the rdd only contains the column labels but not the data itself. This is because iterating over a Pandas DataFrame yields the column names rather than the rows.
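If you want an RDD of plain value lists rather than Row objects, one sketch (assuming the same df as above) is to parallelize the DataFrame's values:
rdd = sc.parallelize(df.values.tolist())
rdd.collect()
[[3, 5], [4, 6]]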