from pyspark.sql import functions as F
df_new = df.withColumn("name", F.regexp_replace("name", "le", ""))
df_new.show()

+----+---+
|name|age|
+----+---+
|  Ax| 25|
| Bob| 30|
+----+---+

Here, note the following:

we are using the PySpark SQL function regexp_replace(~) to replace the substring "le" with an empty string, which is equivalent to removing the substring "le".
the second argument of regexp_replace(~) is a regular expression, so regex characters such as [ and ( are interpreted as pattern syntax rather than matched literally. For instance, the following will throw an error:
from pyspark.sql import functions as F
df_new = df.withColumn("name", F.regexp_replace("name", "[le", ""))
df_new.show()

java.util.regex.PatternSyntaxException: Unclosed character class near index 2
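To match regex characters literally, escape them with a backslash \. Writing the pattern as a raw string (r"...") stops Python from treating the backslash as the start of a string escape:
df_new = df.withColumn("name", F.regexp_replace("name", r"\[le", ""))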
Finally, we use the PySpark DataFrame's withColumn(~) method to return a new DataFrame with the updated name column.
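When the substring to remove comes from a variable, escaping it by hand is easy to get wrong. The sketch below uses re.escape(~) from Python's standard library to do the escaping. It assumes that the escaped pattern is also valid for the Java regex engine behind regexp_replace(~), which should hold for common punctuation such as [ and ( but is not guaranteed for every input. The value of substr is a hypothetical example.

```python
import re

# Hypothetical substring that contains a regex character.
substr = "[le"

# re.escape(~) puts a backslash in front of characters that have a
# special meaning in a regular expression.
pattern = re.escape(substr)
print(pattern)  # \[le

# Sanity check using Python's own regex engine.
print(re.sub(pattern, "", "A[lex"))  # Ax
```

The resulting pattern string can then be passed as the second argument, as in F.regexp_replace("name", pattern, "").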

Using a regular expression to drop substrings

Because regexp_replace(~) matches substrings using a regular expression, you have a lot of flexibility in choosing which substrings to drop. For instance, consider the following PySpark DataFrame:

df = spark.createDataFrame([['Alex', 10], ['Mile', 30]], ['name', 'age'])
df.show()

+----+---+
|name|age|
+----+---+
|Alex| 10|
|Mile| 30|
+----+---+

To drop the substring 'le' only when it occurs at the end of the string:

df.select(F.regexp_replace(df.name, 'le$', '').alias('new_name')).show()

+--------+
|new_name|
+--------+
|    Alex|
|      Mi|
+--------+

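Here, the regular expression character $ anchors the match to the end of the string, so 'le$' matches only a trailing 'le'. This is why the 'le' in the middle of 'Alex' is left untouched.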
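To see how anchors change what gets removed, the sketch below applies the same pattern with Python's built-in re module, which treats ^ and $ the same way as regexp_replace(~) for simple single-line strings like these. The value 'lele' is a hypothetical addition, included to show that only one occurrence is removed.

```python
import re

# 'Alex' and 'Mile' come from the DataFrame above; 'lele' is a made-up value.
names = ["Alex", "Mile", "lele"]

# 'le$' removes "le" only at the end of each string.
trailing = [re.sub("le$", "", n) for n in names]
print(trailing)  # ['Alex', 'Mi', 'le']

# '^le' removes "le" only at the start of each string.
leading = [re.sub("^le", "", n) for n in names]
print(leading)  # ['Alex', 'Mile', 'le']
```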

Removing a list of substrings using the regexp_replace method

Again, consider the following PySpark DataFrame, the same one used in the first example:

df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ['name', 'age'])
df.show()

+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

To remove a list of substrings, we can again take advantage of the fact that regexp_replace(~) uses a regular expression to match the substrings to be replaced:

from pyspark.sql import functions as F
substr_to_remove = ["le","B"]
regex = "|".join(substr_to_remove)
df_new = df.withColumn("name", F.regexp_replace("name", regex, ""))
df_new.show()

+----+---+
|name|age|
+----+---+
|  Ax| 25|
|  ob| 30|
+----+---+

Here, we are constructing a regex string using the OR operator (|):

substr_to_remove = ["le","B"]
regex = "|".join(substr_to_remove)
regex

'le|B'

The regexp_replace(~) method will then replace every occurrence of either "le" or "B" with an empty string:

df_new = df.withColumn("name", F.regexp_replace("name", regex, ""))
df_new.show()

+----+---+
|name|age|
+----+---+
|  Ax| 25|
|  ob| 30|
+----+---+
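Joining the substrings with | works as long as none of them contains a regex character. If that cannot be guaranteed, one option is to escape each substring before joining. The sketch below is one possible way to do this; build_removal_regex is a hypothetical helper, not part of PySpark, and "(x)" is a made-up substring. It also tries longer substrings first, since a regex alternation picks the first alternative that matches. As before, this assumes the output of re.escape(~) is accepted by the regex engine behind regexp_replace(~).

```python
import re

def build_removal_regex(substrings):
    # Hypothetical helper: escape each substring so it is matched literally,
    # and put longer substrings first so a shorter one cannot pre-empt them.
    ordered = sorted(substrings, key=len, reverse=True)
    return "|".join(re.escape(s) for s in ordered)

regex = build_removal_regex(["le", "B", "(x)"])
print(regex)  # \(x\)|le|B

# Sanity check using Python's own regex engine.
print(re.sub(regex, "", "Alex(x)"))  # Ax
```

The resulting string can be passed to regexp_replace(~) in the same way as the regex variable above.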