PySpark DataFrame | repartition method
PySpark DataFrame's repartition(~)
method returns a new PySpark DataFrame with the data split into the specified number of partitions. This method also allows partitioning by column values.
Parameters
1. numPartitions
| int
The number of partitions into which to split the DataFrame.
2. cols
| str
or Column
The columns by which to partition the DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Partitioning a PySpark DataFrame
Consider the following PySpark DataFrame:
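Such a DataFrame can be built as follows (a sketch mirroring the construction shown in the second example below):
df = spark.createDataFrame([("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()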
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
By default, the number of partitions depends on the parallelism level of your PySpark configuration:
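The count below can be read off the underlying RDD (the exact call used here is assumed):
df.rdd.getNumPartitions()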
8
In this case, our PySpark DataFrame is split into 8 partitions by default.
We can see how the rows of our DataFrame are partitioned using the glom()
method of the underlying RDD:
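A minimal sketch of that call:
df.rdd.glom().collect()   # returns one list of Rows per partition; here, 8 lists with only 3 of them non-empty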
Here, we can see that we indeed have 8 partitions, but only 3 of them contain a Row.
Now, let's repartition our DataFrame such that the Rows are divided into only 2 partitions:
df_new = df.repartition(2)
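The 2 below is presumably the partition count of the new DataFrame, read via:
df_new.rdd.getNumPartitions()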
2
The distribution of the rows in our repartitioned DataFrame is now:
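Inspecting the partitions with glom() again (an assumed call):
df_new.rdd.glom().collect()   # two lists, one per partition; the rows need not be spread evenly between them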
As demonstrated here, there is no guarantee that the rows will be evenly distributed in the partitions.
Partitioning a PySpark DataFrame by column values
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 20), ("Bob", 30), ("Cathy", 40), ("Alex", 50)], ["name", "age"])
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
| Alex| 50|
+-----+---+
To repartition this PySpark DataFrame by the column name
into 2 partitions:
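A sketch of the calls presumably behind the output below (repartition(~) accepts the target partition count followed by column labels):
df_new = df.repartition(2, "name")
df_new.rdd.glom().collect()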
[[Row(name='Alex', age=20), Row(name='Cathy', age=40), Row(name='Alex', age=50)], [Row(name='Bob', age=30)]]
Here, notice how the rows with the same value for name
('Alex'
in this case) end up in the same partition.
We can also repartition by multiple column values:
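For instance (a sketch of such a call):
df_new = df.repartition(4, "name", "age")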
Here, we are repartitioning by the name
and age
columns into 4
partitions.
We can also use the default number of partitions by specifying column labels only:
df_new = df.repartition("name")
1