Removing substring in column values of PySpark DataFrame
Start your free 7-days trial now!
Consider the following PySpark DataFrame:
+----+---+|name|age|+----+---+|Alex| 25|| Bob| 30|+----+---+
Removing substring using the regexp_replace method
To remove the substring "le" from the name column in our PySpark DataFrame, use the regexp_replace(~) method:
from pyspark.sql import functions as F
+----+---+|name|age|+----+---+| Ax| 25|| Bob| 30|+----+---+
Here, note the following:
we are using the PySpark SQL function
regexp_replace(~)to replace the substring"le"with an empty string, which is equivalent to removing the substring"le".the second argument of
regexp_replace(~)method is a regular expression, which means that certain regex characters such as[and(will be treated differently. For instance, the following will throw an error:from pyspark.sql import functions as Fjava.util.regex.PatternSyntaxException: Unclosed character class near index 2To avoid special treatment of regex characters, escape them using backslash
\:Finally, we use the PySpark DataFrame's
withColumn(~)method to return a new DataFrame with the updatednamecolumn.
Using a regular expression to drop substrings
The fact that the regexp_replace(~) method allows you to match substrings using regular expression gives you a lot of flexibility in which substrings are to be dropped. For instance, consider the following PySpark DataFrame:
+----+---+|name|age|+----+---+|Alex| 10||Mile| 30|+----+---+
To drop the substring 'le' that only occurs at the end of the string:
+--------+|new_name|+--------+| Alex|| Mi|+--------+
Here, the regular expression character $ matches only trailing occurrences of 'le'.
Removing a list of substrings using regexp_replace method
Again, consider the same PySpark DataFrame as above:
+----+---+|name|age|+----+---+|Alex| 25|| Bob| 30|+----+---+
To remove a list of substrings, we can again take advantage of the fact that regexp_replace() uses regular expression to match substrings that will be replaced:
from pyspark.sql import functions as Fsubstr_to_remove = ["le","B"]
+----+---+|name|age|+----+---+| Ax| 25|| ob| 30|+----+---+
Here, we are constructing a regex string using the OR operator (|):
'le|B'
The regexp_replace(~) method will then replace either the substring "le" or "B" with an empty string:
+----+---+|name|age|+----+---+| Ax| 25|| ob| 30|+----+---+