Removing substring in column values of PySpark DataFrame
Start your free 7-days trial now!
Consider the following PySpark DataFrame:
+----+---+|name|age|+----+---+|Alex| 25|| Bob| 30|+----+---+
Removing substring using the regexp_replace method
To remove the substring "le"
from the name
column in our PySpark DataFrame, use the regexp_replace(~)
method:
from pyspark.sql import functions as F
+----+---+|name|age|+----+---+| Ax| 25|| Bob| 30|+----+---+
Here, note the following:
we are using the PySpark SQL function
regexp_replace(~)
to replace the substring"le"
with an empty string, which is equivalent to removing the substring"le"
.the second argument of
regexp_replace(~)
method is a regular expression, which means that certain regex characters such as[
and(
will be treated differently. For instance, the following will throw an error:from pyspark.sql import functions as Fjava.util.regex.PatternSyntaxException: Unclosed character class near index 2To avoid special treatment of regex characters, escape them using backslash
\
:Finally, we use the PySpark DataFrame's
withColumn(~)
method to return a new DataFrame with the updatedname
column.
Using a regular expression to drop substrings
The fact that the regexp_replace(~)
method allows you to match substrings using regular expression gives you a lot of flexibility in which substrings are to be dropped. For instance, consider the following PySpark DataFrame:
+----+---+|name|age|+----+---+|Alex| 10||Mile| 30|+----+---+
To drop the substring 'le'
that only occurs at the end of the string:
+--------+|new_name|+--------+| Alex|| Mi|+--------+
Here, the regular expression character $
matches only trailing occurrences of 'le'
.
Removing a list of substrings using regexp_replace method
Again, consider the same PySpark DataFrame as above:
+----+---+|name|age|+----+---+|Alex| 25|| Bob| 30|+----+---+
To remove a list of substrings, we can again take advantage of the fact that regexp_replace()
uses regular expression to match substrings that will be replaced:
from pyspark.sql import functions as Fsubstr_to_remove = ["le","B"]
+----+---+|name|age|+----+---+| Ax| 25|| ob| 30|+----+---+
Here, we are constructing a regex string using the OR
operator (|
):
'le|B'
The regexp_replace(~)
method will then replace either the substring "le"
or "B"
with an empty string:
+----+---+|name|age|+----+---+| Ax| 25|| ob| 30|+----+---+