PySpark RDD | partitionBy method
PySpark RDD's partitionBy(~) method re-partitions a pair RDD (an RDD of key-value tuples) into the desired number of partitions, assigning each element to a partition based on its key.
Parameters
1. numPartitions | int
The desired number of partitions of the resulting RDD.
2. partitionFunc | function | optional
The partitioning function: it takes a key and returns an integer, and the element is assigned to the partition given by that integer modulo numPartitions. By default, a hash partitioner (portable_hash) is used.
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Repartitioning a pair RDD
Consider the following RDD:
# Create an RDD with 3 partitions
[('A', 1), ('B', 1), ('C', 1), ('A', 1)]
To see how this RDD is partitioned, use the glom() method:
We can indeed see that there are 3 partitions:
Partition one: ('A',1) and ('B',1)
Partition two: ('C',1)
Partition three: ('A',1)
To re-partition into 2 partitions:
[('C', 1), ('A', 1), ('B', 1), ('A', 1)]
To see the contents of the new partitions:
We can indeed see that there are 2 partitions:
Partition one: ('C',1)
Partition two: ('A',1), ('B',1), ('A',1)
Notice how the tuples with the key A have ended up in the same partition. This is guaranteed to happen because the hash partitioner performs bucketing based on the tuple's key: equal keys produce equal hash values, and therefore map to the same partition.
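The bucketing logic can be illustrated in plain Python (a sketch using the built-in hash as a stand-in for PySpark's default portable_hash; within a single interpreter session, equal keys always hash to the same value):

```python
def assign_partition(key, num_partitions):
    # Stand-in for the hash partitioner: the partition index is
    # the key's hash modulo the number of partitions
    return hash(key) % num_partitions

pairs = [('A', 1), ('B', 1), ('C', 1), ('A', 1)]

# Both ('A', 1) tuples share the key 'A', so they are always
# assigned the same partition index
partitions = [assign_partition(k, 2) for k, _ in pairs]
print(partitions)
```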