PySpark DataFrame | where method
PySpark DataFrame's where(~) method returns the rows of the DataFrame that satisfy the given condition. The where(~) method is an alias for the filter(~) method.
Parameters
1. condition | Column or string
A boolean mask (Column) or a SQL string expression.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
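One way to construct this DataFrame (a minimal sketch assuming an active SparkSession named spark):

```python
df = spark.createDataFrame([("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()
```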
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Basic usage
To get rows where age is greater than 25:
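Here is a sketch using a SQL string expression, assuming the df defined above:

```python
df.where("age > 25").show()
```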
+-----+---+
| name|age|
+-----+---+
|  Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can pass a Column object that represents a boolean mask:
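For instance, using the DataFrame's column attribute (same df as above):

```python
df.where(df.age > 25).show()
```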
+-----+---+
| name|age|
+-----+---+
|  Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can use the col(~) function of pyspark.sql.functions to refer to the column:
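A sketch with the col(~) function:

```python
from pyspark.sql.functions import col

# Same result as the previous two queries (Bob and Cathy)
df.where(col("age") > 25).show()
```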
Compound queries
The where(~) method supports compound conditions using the AND and OR operators, like so:
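For instance, a compound condition that reproduces the output below (the exact predicate is an assumption):

```python
# SQL string form; the equivalent Column form is (df.age > 25) & (df.age < 40)
df.where("age > 25 AND age < 40").show()
```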
+----+---+
|name|age|
+----+---+
| Bob| 30|
+----+---+
Dealing with null values
Consider the following PySpark DataFrame:
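This DataFrame can be constructed like so (again a sketch assuming the same spark session):

```python
df = spark.createDataFrame([("Alex", 20), (None, None), ("Cathy", None)], ["name", "age"])
df.show()
```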
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
| null|null|
|Cathy|null|
+-----+----+
Let's query for rows where age!=10 like so:
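A sketch of the query:

```python
df.where("age != 10").show()
```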
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Notice how only Alex's row is returned even though the other two rows technically do not have an age of 10. This happens because comparisons against null evaluate to null rather than true, and PySpark's where(~) method filters out such rows by default.
To prevent rows with null values from getting filtered out, we can perform the query like so:
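One way is to explicitly let null values through with isNull() (the exact predicate used here is an assumption):

```python
# Keep rows where age != 10 OR age is null
df.where((df.age != 10) | (df.age.isNull())).show()
```

All three rows are now returned, since the rows with a null age satisfy the isNull() branch.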
Note that PySpark's treatment of null values differs from that of Pandas, which retains rows with missing values, as demonstrated below:
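For instance (a sketch; the column values are an assumption consistent with the output below):

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "b", None]})
df[df["col"] != "a"]
```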
    col
1     b
2  None
Notice how the row with col=None is not left out!