Getting earliest and latest date in PySpark DataFrame
Getting earliest and latest date for date columns
Consider the following PySpark DataFrame:
import datetime

df = spark.createDataFrame([['Alex', datetime.date(1998,12,16)], ['Bob', datetime.date(1995,5,9)]], ['name', 'birthday'])
df.show()
+----+----------+
|name|  birthday|
+----+----------+
|Alex|1998-12-16|
| Bob|1995-05-09|
+----+----------+
Here, birthday is of type date:
root
 |-- name: string (nullable = true)
 |-- birthday: date (nullable = true)
Use the F.min(~) method to get the earliest date, and the F.max(~) method to get the latest date:
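For example (a minimal sketch, assuming the functions module is imported as F):

from pyspark.sql import functions as F

# Compute the earliest and latest birthday in a single aggregation
df.select(F.min('birthday').alias('earliest'), F.max('birthday').alias('latest')).show()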
+----------+----------+
|  earliest|    latest|
+----------+----------+
|1995-05-09|1998-12-16|
+----------+----------+
Here, we are using the alias(~) method to assign labels to the PySpark columns returned by F.min(~) and F.max(~).
To extract the earliest and latest dates as variables instead of a PySpark DataFrame:
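One way to do this is to first collect the aggregated row into the driver (a sketch, reusing F.min(~) and F.max(~) from above):

# Returns a list containing a single Row with both aggregates
list_rows = df.select(F.min('birthday').alias('earliest'), F.max('birthday').alias('latest')).collect()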
print(f'Earliest date: {list_rows[0][0]}')  # type is datetime.date
print(f'Latest date: {list_rows[0][1]}')
Earliest date: 1995-05-09
Latest date: 1998-12-16
Here, we are using the PySpark DataFrame's collect() method to convert the row into a list of Row objects in the driver node:
list_rows
[Row(earliest=datetime.date(1995, 5, 9), latest=datetime.date(1998, 12, 16))]
Getting earliest and latest date for date string columns
The above solution works when the column is of type date. If you have date strings, then you must first convert the date strings into native dates using the to_date(~) method.
For example, consider the following PySpark DataFrame with some date strings:
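Such a DataFrame could be built like so (a sketch; note that Bob's date string has an unpadded month and day):

df = spark.createDataFrame([['Alex', '1998-12-16'], ['Bob', '1995-5-9']], ['name', 'birthday'])
df.show()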
+----+----------+
|name|  birthday|
+----+----------+
|Alex|1998-12-16|
| Bob|  1995-5-9|
+----+----------+
We can convert the date strings to native dates using to_date(~):
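A minimal sketch, assuming the format string 'yyyy-M-d' (single-letter M and d accept both padded and unpadded months and days):

from pyspark.sql import functions as F

# Replace the string column with a parsed date column
df = df.withColumn('birthday', F.to_date('birthday', 'yyyy-M-d'))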
Here, the second argument of to_date(~) specifies the format of the date string.
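Once the column is of type date, the same F.min(~) and F.max(~) aggregation from the previous section applies:

df.select(F.min('birthday').alias('earliest'), F.max('birthday').alias('latest')).show()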