PySpark SQL Functions | first method
PySpark SQL's first(~) method returns the first value of the specified column of a PySpark DataFrame.
Parameters
1. col | string or Column
The column label or Column object of interest.
2. ignorenulls | boolean | optional
Whether or not to ignore null values. By default, ignorenulls=False.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
columns = ["name", "age"]data = [("Alex", 15), ("Bob", 20), ("Cathy", 25)]
+-----+---+| name|age|+-----+---+| Alex| 15|| Bob| 20||Cathy| 25|+-----+---+
Getting the first value of a column in PySpark DataFrame
To get the first value of the name column:
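A minimal sketch using pyspark.sql.functions.first(~), assuming the df defined above and an active SparkSession (the auto-generated result column header may vary slightly between Spark versions):

import pyspark.sql.functions as F

df.select(F.first("name")).show()

+-----------+
|first(name)|
+-----------+
|       Alex|
+-----------+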
Getting the first non-null value of a column in PySpark DataFrame
Consider the following PySpark DataFrame with null values:
columns = ["name", "age"]data = [("Alex", None), ("Bob", 20), ("Cathy", 25)]
+-----+----+| name| age|+-----+----+| Alex|null|| Bob| 20||Cathy| 25|+-----+----+
By default, ignorenulls=False, which means that the first value is returned regardless of whether it is null or not:
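For instance, a sketch under the same assumptions (the df above and the F alias for pyspark.sql.functions):

import pyspark.sql.functions as F

df.select(F.first("age")).show()

+----------+
|first(age)|
+----------+
|      null|
+----------+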
To return the first non-null value instead:
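One possible sketch; the alias(~) call is not required and is only added here to give the result column a readable name:

import pyspark.sql.functions as F

df.select(F.first("age", ignorenulls=True).alias("first_age")).show()

+---------+
|first_age|
+---------+
|       20|
+---------+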
Getting the first value of each group in PySpark
The first(~) method is also useful in aggregations. Consider the following PySpark DataFrame:
columns = ["name", "class"]data = [("Alex", "A"), ("Alex", "B"), ("Bob", None), ("Bob", "A"), ("Cathy", "C")]
+-----+-----+| name|class|+-----+-----+| Alex| A|| Alex| B|| Bob| null|| Bob| A||Cathy| C|+-----+-----+
To get the first value of each aggregate:
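A sketch under the same assumptions; note that Spark does not guarantee the row order of the result, and without an explicit ordering the value first(~) picks per group can depend on how the data is partitioned:

import pyspark.sql.functions as F

df.groupBy("name").agg(F.first("class")).show()

+-----+------------+
| name|first(class)|
+-----+------------+
| Alex|           A|
|  Bob|        null|
|Cathy|           C|
+-----+------------+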
Here, we are grouping by name, and then for each of these groups, we are obtaining the first value that occurs in the class column.