Using SQL against a PySpark DataFrame
Consider the following PySpark DataFrame:
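One way to build it (a minimal sketch, assuming a standard SparkSession; the column names and values are taken from the output shown below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the DataFrame whose contents are shown below
df = spark.createDataFrame([("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()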
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Registering PySpark DataFrame as a SQL table
Before we can run SQL queries against a PySpark DataFrame, we must first register the DataFrame as a SQL table:
df.createOrReplaceTempView("users")
Here, we have registered the DataFrame as a SQL table called users. This temporary view will be dropped whenever the Spark session ends. In contrast, a view created with createGlobalTempView(~) is shared across Spark sessions and is only dropped when the Spark application ends.
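As a quick sketch of the global variant (the view name users_global is hypothetical), note that global temporary views live in the global_temp database and must be referenced with that qualifier:

# Register a global temporary view, visible to all sessions in this application
df.createGlobalTempView("users_global")

# Global temp views must be qualified with the global_temp database
spark.sql("SELECT * FROM global_temp.users_global").show()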
Running SQL queries against a PySpark DataFrame
We can now run SQL queries against our PySpark DataFrame:
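For example, a simple SELECT against the registered users view (a minimal query consistent with the output below):

spark.sql("SELECT * FROM users").show()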
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Only read-only SQL statements are allowed. Data manipulation language (DML) statements such as UPDATE and DELETE are not supported, since PySpark has no notion of transactions.
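For instance, attempting a DELETE against the temporary view raises an AnalysisException (a sketch; the exact error message varies by Spark version):

from pyspark.sql.utils import AnalysisException

try:
    spark.sql("DELETE FROM users WHERE age = 20")
except AnalysisException as e:
    # Spark rejects the statement at analysis time, e.g. complaining
    # that DELETE is not supported for this kind of table
    print(e)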
Using variables in SQL queries
The sql(~) method takes in a SQL query expression (string), so variables can be incorporated using f-strings:
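As a sketch consistent with the output below (min_age is a hypothetical variable, chosen so that all rows match):

min_age = 15  # hypothetical threshold; all three rows satisfy it
spark.sql(f"SELECT * FROM users WHERE age > {min_age}").show()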
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+