PySpark SparkSession | range method
PySpark SparkSession's range(~) method creates a new PySpark DataFrame containing a series of values. This method is analogous to Python's built-in range(~) function.
Parameters
1. start | int
The starting value (inclusive).
2. end | int | optional
The ending value (exclusive).
3. step | int | optional
The value by which to increment. By default, step=1.
4. numPartitions | int | optional
The number of partitions into which to divide the values.
Return Value
A PySpark DataFrame.
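The returned DataFrame holds a single column named id of type long. A minimal check of the schema, assuming the same spark session used throughout this page:
df = spark.range(3)
df.printSchema()
root
 |-- id: long (nullable = false)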
Examples
Creating a PySpark DataFrame using range (series of values)
To create a PySpark DataFrame that holds a series of values, use the range(~) method:
df = spark.range(1, 4)
df.show()
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
Notice how the starting value is included while the ending value is not.
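As a quick sanity check, collecting the rows of the same df confirms that only the values 1, 2 and 3 are present:
df.collect()
[Row(id=1), Row(id=2), Row(id=3)]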
Note that if only one argument is supplied, then the range will start from 0 (inclusive) and the argument will represent the end-value (exclusive):
df = spark.range(3)
df.show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
+---+
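In other words, the single-argument form behaves as if start=0 had been passed explicitly. A small sketch comparing the two forms, assuming the same spark session:
spark.range(3).collect() == spark.range(0, 3).collect()
True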
Setting an incremental value
Instead of the default incremental value of step=1, we can choose a specific incremental value using the third argument:
df = spark.range(1, 6, 2)
df.show()
+---+
| id|
+---+
|  1|
|  3|
|  5|
+---+
Series of values in descending order
We can also get a series of values in descending order:
df = spark.range(4, 1, -1)
df.show()
+---+
| id|
+---+
|  4|
|  3|
|  2|
+---+
Note the following:
- the starting value must be larger than the ending value
- the incremental value must be negative
If either condition is not met, the result is simply an empty DataFrame (see the sketch below).
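A minimal sketch of this behaviour, assuming the same spark session: a positive step combined with a start value larger than the end value produces no rows at all:
df = spark.range(4, 1)
df.count()
0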
Specifying the number of partitions
By default, the number of partitions into which the resulting PySpark DataFrame is split is governed by our PySpark configuration. In my case, the default number of partitions is 8:
df = spark.range(1, 4)
df.rdd.getNumPartitions()
8
We can override our configuration by specifying the numPartitions parameter:
df = spark.range(1, 4, numPartitions=2)
df.rdd.getNumPartitions()
2
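On many setups, the default partition count for range(~) follows the Spark context's default parallelism (for example, the number of local cores). As a rough check, and assuming a local session like the one used above:
spark.sparkContext.defaultParallelism
8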