Removing rows that contain a specific substring in a PySpark DataFrame
Consider the following PySpark DataFrame:
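It can be constructed with a snippet along these lines (a sketch; an active SparkSession named spark is assumed):

df = spark.createDataFrame([('A', 'a'), ('B#A', 'b'), ('C##', 'c')], ['col1', 'col2'])
df.show()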
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
| C##|   c|
+----+----+
Using the contains method to remove rows with certain substrings
To remove rows that contain a specific substring (e.g. '#') in a PySpark DataFrame, use the contains(~) method:
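A minimal sketch of this approach, assuming df is the DataFrame above and that F refers to pyspark.sql.functions:

from pyspark.sql import functions as F

# Keep only the rows whose col1 does NOT contain the substring '#'
df.filter(~F.col('col1').contains('#')).show()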
Here, we first obtain a boolean mask using F.col('col1').contains('#'):
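Selecting this expression on its own shows the mask (the alias has_hash is added here just for readability and is not part of the original example):

df.select(F.col('col1').contains('#').alias('has_hash')).show()

+--------+
|has_hash|
+--------+
|   false|
|    true|
|    true|
+--------+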
We then negate the booleans using the ~ operator:
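For example (again with an illustrative alias):

df.select((~F.col('col1').contains('#')).alias('keep')).show()

+-----+
| keep|
+-----+
| true|
|false|
|false|
+-----+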
Finally, we use the filter(~) method to extract the rows that correspond to True in this boolean mask:
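Running the snippet above on the sample data leaves only the first row:

+----+----+
|col1|col2|
+----+----+
|   A|   a|
+----+----+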
Using the rlike method to remove rows with values that match some regular expression
Once again, consider the same PySpark DataFrame as above:
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
| C##|   c|
+----+----+
To remove rows whose string values match a regular expression, use the rlike(~) method:
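A sketch, using the regex '#$' discussed below to drop rows whose col1 ends with '#':

df.filter(~F.col('col1').rlike('#$')).show()

+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
+----+----+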
Here, the rlike(~) method takes a regular expression (regex) as its argument. The $ in the regex #$ is a special character that matches the end of the string; that is, #$ matches a # that occurs at the end of the string.
Note that, just like the contains(~) method, rlike(~) also returns a boolean mask:
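For instance (with an illustrative alias for readability):

df.select(F.col('col1').rlike('#$').alias('ends_with_hash')).show()

+--------------+
|ends_with_hash|
+--------------+
|         false|
|         false|
|          true|
+--------------+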
The rlike(~) method is equivalent to SQL's RLIKE clause.
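As a sketch, the same filter could be written in Spark SQL (the view name my_table is just an illustrative choice):

df.createOrReplaceTempView('my_table')
spark.sql("SELECT * FROM my_table WHERE NOT (col1 RLIKE '#$')").show()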
Using the like method to remove rows that contain string values matching some pattern
Again, consider the same PySpark DataFrame as before:
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
| C##|   c|
+----+----+
We could use the like(~) method to remove rows that contain string values matching some pattern:
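A sketch using like(~) with the pattern '%#' explained below:

# Drop rows whose col1 matches the SQL LIKE pattern '%#' (i.e. ends with '#')
df.filter(~F.col('col1').like('%#')).show()

+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
+----+----+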
Here, the special character % is a wildcard that matches zero or more characters. The pattern %# therefore matches all strings that end with #.
The like(~) method is equivalent to SQL's LIKE clause.
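Equivalently, as a sketch, the pattern can be embedded in a SQL expression via F.expr(~):

df.filter(~F.expr("col1 LIKE '%#'")).show()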