Getting shortest and longest strings in PySpark DataFrame
Start your free 7-days trial now!
Consider the following PySpark DataFrame:
+----+|vals|+----+| aaa|| bb||cccc|| dd||eeee|+----+
In order to run SQL queries against our PySpark DataFrame, we first need to register the DataFrame like so:
df.createOrReplaceTempView('my_vals')
Here, we are assigning the label my_vals
, which becomes the SQL table name.
Getting the shortest string
To get the shortest string in the vals
column of our PySpark DataFrame, we can run the following SQL query:
+----+|vals|+----+| bb|+----+
Note the following:
we are ordering the
vals
column by the string length in ascending order, and then fetching the first row viaLIMIT 1
.even though we have the string
'dd'
is also just as short, the query only fetches a single shortest string. See the latter section to get all shortest strings.
Getting the longest string
Similarly, to get the longest string in a PySpark column, use the following SQL query:
+----+|vals|+----+|eeee|+----+
Here, the query is the same as the case for getting the shortest string except that we now order by descending order (DESC
).
Getting all shortest strings
To get all the shortest strings in the vals
column:
q = 'SELECT * FROM my_vals WHERE length(vals) = (SELECT min(length(vals)) FROM my_vals)'
+----+|vals|+----+| bb|| dd|+----+
Here, we are using the subquery to first fetch the minimum length of strings in the column:
+-----------------+|min(length(vals))|+-----------------+| 2|+-----------------+
Getting all longest strings
To get all the longest strings in the vals
column:
q = 'SELECT * FROM my_vals WHERE length(vals) = (SELECT max(length(vals)) FROM my_vals)'
+----+|vals|+----+|cccc||eeee|+----+
Again, our subquery first fetches the maximum length of the strings in the vals
column:
+-----------------+|max(length(vals))|+-----------------+| 4|+-----------------+