Removing rows that contain a specific substring in a PySpark DataFrame
Consider the following PySpark DataFrame:
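One way to construct such a DataFrame is sketched below (this snippet assumes an active SparkSession named spark; the variable name df is our own):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample DataFrame used throughout this article
df = spark.createDataFrame(
    [('A', 'a'), ('B#A', 'b'), ('C##', 'c')],
    ['col1', 'col2'],
)
df.show()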
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
| C##|   c|
+----+----+
Using the contains method to remove rows with certain substrings
To remove rows that contain a specific substring (e.g. '#') in a PySpark DataFrame, use the contains(~) method:
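A minimal sketch, using the df defined above:

```python
from pyspark.sql import functions as F

# Keep only the rows whose col1 does NOT contain '#'
df.filter(~F.col('col1').contains('#')).show()
```

+----+----+
|col1|col2|
+----+----+
|   A|   a|
+----+----+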
Here, we first obtain a boolean mask using F.col('col1').contains('#'):
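For instance (the alias has_hash is ours, added only to give the output column a readable name):

```python
df.select(F.col('col1').contains('#').alias('has_hash')).show()
```

+--------+
|has_hash|
+--------+
|   false|
|    true|
|    true|
+--------+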
We then invert this boolean mask using the ~ operator:
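Negating the same column expression flips each boolean (again, the alias keep is ours):

```python
mask = ~F.col('col1').contains('#')   # True for rows whose col1 has no '#'
df.select(mask.alias('keep')).show()
```

+-----+
| keep|
+-----+
| true|
|false|
|false|
+-----+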
Finally, we use the filter(~) method to extract rows that correspond to True in this boolean mask:
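```python
df.filter(mask).show()   # same result as df.filter(~F.col('col1').contains('#'))
```

This yields the same single-row result shown above.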
Using the rlike method to remove rows with values that match some regular expression
Once again, consider the same PySpark DataFrame as above:
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
| C##|   c|
+----+----+
To remove rows where some string values match a regular expression, use the rlike(~) method:
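For example, to drop rows whose col1 ends with '#' (a sketch, again assuming the df and the F import from above):

```python
df.filter(~F.col('col1').rlike('#$')).show()
```

+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
+----+----+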
Here, the rlike(~) method takes a regular expression (regex) as its argument. The $ in the regex #$ is a special character that matches the end of the string; that is, #$ matches a # character that occurs at the end of the string.
Note that just like the contains(~) method, rlike(~) also returns a boolean mask:
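For instance (the alias ends_with_hash is ours):

```python
df.select(F.col('col1').rlike('#$').alias('ends_with_hash')).show()
```

+--------------+
|ends_with_hash|
+--------------+
|         false|
|         false|
|          true|
+--------------+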
The rlike(~) method is equivalent to SQL's RLIKE operator.
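As a rough illustration of that equivalence, the same filter can be phrased as a SQL expression via F.expr(~) (a sketch, not the only way to write it):

```python
df.filter(~F.expr("col1 RLIKE '#$'")).show()
```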
Using the like method to remove rows that contain string values matching some pattern
Again, consider the same PySpark DataFrame as before:
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
| C##|   c|
+----+----+
We can use the like(~) method to remove rows whose string values match some pattern:
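A minimal sketch, using the same df as before:

```python
df.filter(~F.col('col1').like('%#')).show()
```

+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
+----+----+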
Here, the special character % is a wildcard that matches any sequence of characters (including the empty sequence). %# therefore matches all strings that end with #.
The like(~) method is equivalent to SQL's LIKE operator.
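To illustrate, this filter could also be written as a SQL expression (a sketch using F.expr(~)):

```python
df.filter(~F.expr("col1 LIKE '%#'")).show()
```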