Checking if value exists in PySpark DataFrame column
Start your free 7-days trial now!
Consider the following PySpark DataFrame:
+----+|vals|+----+| A|| A|| B|+----+
Checking if value exists using selectExpr method
To check if value exists in PySpark DataFrame column, use the selectExpr(~)
method like so:
from pyspark.sql import functions as F
+---------------+|any((vals = A))|+---------------+| true|+---------------+
The selectExpr(~)
takes in as argument a SQL expression, and returns a PySpark DataFrame. Here, the SQL expression uses the any(~)
method which returns a True
when the specified condition (vals == "A"
in this case) is satisfied for at least one row and False
otherwise.
The logic is similar to Pandas' any(~)
method - you can think of vals == "A"
returning a boolean mask, and the method any(~)
returning True
if there exists at least one True
in the mask.
The PySpark DataFrame's selectExpr(~)
can be rewritten using PySpark SQL Functions' expr(~)
method:
We recommend using selectExpr(~)
whenever possible because this saves you from having to import the pyspark.sql.functions
library, and the syntax is shorter.
Getting a boolean instead of PySpark DataFrame
In the above solution, the output was a PySpark DataFrame. To extract the result as a boolean indicating whether a value exists or not:
True
Here, selectExpr(~)
returns a PySpark DataFrame. We then call the collect(~)
method which converts the rows of the DataFrame into a list of Row
objects in the driver node:
[Row(any((vals = A))=True)]
We then access the Row
object in the list using [0]
, and then access the value of the Row
using another [0]
to obtain the boolean value.
Checking if values exist given a list
To check if values exist in a PySpark Column given a list:
from pyspark.sql import functions as Flist_vals = ['A','D']
+---------------------+|any((vals IN (A, D)))|+---------------------+| true|+---------------------+
Note the following:
we are checking whether any value in the
vals
column is equal to'A'
or'D'
- we have the value'A'
in the column and so the result is aTrue
.we convert the list into a string tuple (
"('A', 'B')"
) to align with the SQL syntax usingstr(tuple(~))
Checking if values exist using a OR query
The fact that selectExpr(~)
accepts a SQL expression means that we can check for the existence of values flexibly. To check if values exist using an OR
operator:
+-----------+|bool_exists|+-----------+| true|+-----------+
Here, note the following:
we are checking whether the value
B
orC
exists in thevals
column.we assign the label to the column returned by the SQL expression using the alias clause
AS
.
Checking if values exist using a AND query
To check if all the given values exist in a PySpark Column:
+-----------+|bool_exists|+-----------+| true|+-----------+
Here, we are checking whether both the values A
and B
exist in the PySpark column.
Related
selectExpr(~)
method returns a new DataFrame based on the specified SQL expression.expr(~)
method parses the given SQL expression and returns a PySpark Column.