PySpark DataFrame | selectExpr method
PySpark DataFrame's selectExpr(~)
method returns a new DataFrame based on the specified SQL expression.
Parameters
1. *expr | string
The SQL expression(s). Since this parameter is variadic, multiple expressions can be passed as separate arguments.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
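One minimal way to construct it (a sketch, assuming an active SparkSession bound to the variable spark):

df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()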
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Selecting data using SQL expressions in PySpark DataFrame
To get a new DataFrame where the name values are uppercased and the age values are doubled:
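A minimal sketch that reproduces the output below (the upper_name alias is inferred from the column header):

# upper(name) uppercases the name column; AS renames the result.
# age * 2 is left unaliased, so Spark names the column (age * 2).
df.selectExpr("upper(name) AS upper_name", "age * 2").show()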
+----------+---------+
|upper_name|(age * 2)|
+----------+---------+
|      ALEX|       40|
|       BOB|       60|
|     CATHY|       80|
+----------+---------+
We should use selectExpr(~) rather than select(~) when we want to extract columns while performing simple transformations on them, just as we have done here.
There exists a similar function, expr(~), in the pyspark.sql.functions library. expr(~) also takes a SQL expression as argument, but the difference is that it returns a PySpark Column rather than a DataFrame. The following usages of selectExpr(~) and expr(~) are equivalent:
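For instance, assuming the df from above, the following two statements return the same result:

from pyspark.sql.functions import expr

# expr(~) parses the SQL expression into a Column, which we pass to select(~)
df.select(expr("upper(name) AS upper_name")).show()

# selectExpr(~) parses the SQL expression directly
df.selectExpr("upper(name) AS upper_name").show()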
In general, you should use selectExpr(~)
rather than expr(~)
because:
- you won't have to import the pyspark.sql.functions library
- the syntax is shorter and clearer
Parsing more complex SQL expressions
Consider the following PySpark DataFrame:
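Again, a minimal sketch to construct it (assuming the same SparkSession):

df = spark.createDataFrame([["Alex", 20], ["Bob", 60]], ["name", "age"])
df.show()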
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 60|
+----+---+
We can use classic SQL clauses like AND
and LIKE
to formulate more complicated expressions:
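For example, a sketch consistent with the output below (the result alias is inferred from the column header):

# true only when both conditions hold for the row
df.selectExpr("(age < 30) AND (name LIKE 'A%') AS result").show()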
+------+
|result|
+------+
|  true|
| false|
+------+
Here, we are checking for rows where age is less than 30 and name starts with the letter A.
Note that we can implement the same logic like so:
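For instance, here is a sketch using expr(~) and alias(~):

from pyspark.sql.functions import expr

# build the boolean Column with expr(~), then rename it via alias(~)
df.select(expr("(age < 30) AND (name LIKE 'A%')").alias("result")).show()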
+------+
|result|
+------+
|  true|
| false|
+------+
I personally prefer using selectExpr(~)
because the syntax is cleaner and the meaning is intuitive for those who are familiar with SQL.
Checking for the existence of values in a PySpark column
Another application of selectExpr(~)
is to check for the existence of values in a PySpark column. Please check out the recipe here.
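For instance, a quick sketch (any(~) is a boolean aggregate function available in Spark SQL 3.0+; has_alex is a hypothetical alias):

# returns a single-row DataFrame that is true if at least one name equals 'Alex'
df.selectExpr("any(name = 'Alex') AS has_alex").show()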