Trimming specific characters in PySpark DataFrame
To trim specific leading and trailing characters in a PySpark DataFrame, use the regexp_replace(~) function.
As an example, consider the following PySpark DataFrame:
+----+
|vals|
+----+
| ##A|
| B##|
| #C#|
+----+
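For reference, this DataFrame can be created as follows (a minimal sketch; the column name vals and the values are taken from the example above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('##A',), ('B##',), ('#C#',)], ['vals'])
df.show()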
Trimming specific leading characters
To remove the leading # characters, use the regexp_replace(~) function:
from pyspark.sql import functions as F
df.select(F.regexp_replace('vals', '^#+', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|       A|
|     B##|
|      C#|
+--------+
The arguments of regexp_replace(~) are as follows (in order):
- the label of the column on which to perform the replacement
- the regular expression (regex) that matches the substrings to be replaced
- the string that replaces each matched substring (an empty string '' effectively means removing the match)
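For illustration, here is the call from the snippet above with each argument annotated:
F.regexp_replace(
    'vals',  # the column to operate on
    '^#+',   # the regex matching the substrings to replace
    ''       # the replacement (an empty string means removal)
)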
In this case, the regex we match is ^#+. The ^ is a special character in regex that matches the beginning of the string, so ^# matches leading # characters. The + is another special character that matches one or more occurrences of the preceding character (#).
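The regex itself behaves the same outside Spark; as a quick sanity check, here is what it does with Python's built-in re module:
import re
re.sub('^#+', '', '##A')  # 'A'   (leading # characters removed)
re.sub('^#+', '', 'B##')  # 'B##' (no leading #, so nothing is replaced)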
Note that we are using the alias(~) method here to assign a label to the column returned by regexp_replace(~).
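Without alias(~), Spark assigns an auto-generated label based on the expression. A quick way to see this (the exact label depends on your Spark version) is:
# The resulting column gets an auto-generated label derived from the expression
# (exact form varies by Spark version)
df.select(F.regexp_replace('vals', '^#+', '')).printSchema()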
Trimming specific trailing characters
Similarly, to remove specific trailing characters, use the regexp_replace(~) function with the regex #+$:
# Replace the substrings matched by the regex #+$ with an empty string '' in the vals column
df.select(F.regexp_replace('vals', '#+$', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|     ##A|
|       B|
|      #C|
+--------+
Here, the $ in #+$ matches the end of the string.
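Again, the same pattern can be checked with Python's re module:
import re
re.sub('#+$', '', 'B##')  # 'B'   (trailing # characters removed)
re.sub('#+$', '', '##A')  # '##A' (no trailing #, so nothing is replaced)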
Trimming specific leading and trailing characters
Consider the same PySpark DataFrame as before:
+----+
|vals|
+----+
| ##A|
| B##|
| #C#|
+----+
Again, to remove specific leading and trailing characters, use regexp_replace(~):
from pyspark.sql import functions as F
df.select(F.regexp_replace('vals', '^#+|#+$', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|       A|
|       B|
|       C|
+--------+
Here, the pipe character | in the regex ^#+|#+$ represents an OR. This means that we are matching either the leading # characters (^#+) or the trailing # characters (#+$).
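As before, you can verify the combined pattern with Python's re module:
import re
re.sub('^#+|#+$', '', '#C#')  # 'C' (both the leading and trailing # are removed)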
Trimming specific substrings
Consider the following PySpark DataFrame:
+-----+
| vals|
+-----+
|  #@A|
|  B#@|
|#C#@D|
+-----+
To trim substrings from a PySpark DataFrame, again use the regexp_replace(~) function:
from pyspark.sql import functions as F
df.select(F.regexp_replace('vals', '^(#@)|(#@)$', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|       A|
|       B|
|   #C#@D|
+--------+
Here, the parentheses in our regex ^(#@)|(#@)$ allow us to group characters together to form a substring. For instance, the regex ^(#@) matches the leading substring #@. Note that the value #C#@D is left unchanged because its #@ occurs neither at the very start nor at the very end of the string.
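A quick check with Python's re module illustrates this behaviour:
import re
re.sub('^(#@)|(#@)$', '', '#@A')    # 'A'     (leading #@ removed)
re.sub('^(#@)|(#@)$', '', '#C#@D')  # '#C#@D' (#@ is neither leading nor trailing)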