Trimming specific characters in PySpark DataFrame
To trim specific leading and trailing characters in a PySpark DataFrame, use the regexp_replace(~) function.
As an example, consider the following PySpark DataFrame:
+----+
|vals|
+----+
| ##A|
| B##|
| #C#|
+----+
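For reference, this DataFrame can be created as follows (a minimal sketch; the column name vals and the values are taken from the example above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('##A',), ('B##',), ('#C#',)], ['vals'])
df.show()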
Trimming specific leading characters
To remove the leading # characters, use the regexp_replace(~) function:
from pyspark.sql import functions as F
df.select(F.regexp_replace('vals', '^#+', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|       A|
|     B##|
|      C#|
+--------+
The arguments of regexp_replace(~) are as follows (in order):
- the label of the column on which to perform the replacement
- the regular expression (regex) that matches the substrings to be replaced
- the string that replaces each matched substring (an empty string '' effectively means removing the match)
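For illustration, here is the call from the snippet above with each argument annotated:
F.regexp_replace(
    'vals',  # the column to operate on
    '^#+',   # the regex matching the substrings to replace
    ''       # the replacement (an empty string means removal)
)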
In this case, the regex we match is ^#+. The ^ is a special character in regex that matches the beginning of the string, so ^# matches leading # characters. The + is another special character that matches one or more occurrences of the preceding character (#).
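The regex itself behaves the same outside Spark; as a quick sanity check, here is what it does with Python's built-in re module:
import re
re.sub('^#+', '', '##A')  # 'A'   (leading # characters removed)
re.sub('^#+', '', 'B##')  # 'B##' (no leading #, so nothing is replaced)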
Note that we are using the alias(~) method here to assign a label to the column returned by regexp_replace(~).
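Without alias(~), Spark assigns an auto-generated label based on the expression. A quick way to see this (the exact label depends on your Spark version) is:
# The resulting column gets an auto-generated label derived from the expression
# (exact form varies by Spark version)
df.select(F.regexp_replace('vals', '^#+', '')).printSchema()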
Trimming specific trailing characters
Similarly, to remove specific trailing characters, use the regexp_replace(~) function with the regex #+$:
# Replace the substrings matched by the regex #+$ with an empty string '' in the vals column
df.select(F.regexp_replace('vals', '#+$', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|     ##A|
|       B|
|      #C|
+--------+
Here, the $ in #+$ matches the end of the string.
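Again, the same pattern can be checked with Python's re module:
import re
re.sub('#+$', '', 'B##')  # 'B'   (trailing # characters removed)
re.sub('#+$', '', '##A')  # '##A' (no trailing #, so nothing is replaced)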
Trimming specific leading and trailing characters
Consider the same PySpark DataFrame as before:
+----+
|vals|
+----+
| ##A|
| B##|
| #C#|
+----+
Again, to remove specific leading and trailing characters, use regexp_replace(~):
from pyspark.sql import functions as F
df.select(F.regexp_replace('vals', '^#+|#+$', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|       A|
|       B|
|       C|
+--------+
Here, the pipe character | in the regex ^#+|#+$ represents an OR. This means that we are matching either the leading # characters (^#+) or the trailing # characters (#+$).
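As before, you can verify the combined pattern with Python's re module:
import re
re.sub('^#+|#+$', '', '#C#')  # 'C' (both the leading and trailing # are removed)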
Trimming specific substrings
Consider the following PySpark DataFrame:
+-----+
| vals|
+-----+
|  #@A|
|  B#@|
|#C#@D|
+-----+
To trim substrings from a PySpark DataFrame, again use the regexp_replace(~) function:
from pyspark.sql import functions as F
df.select(F.regexp_replace('vals', '^(#@)|(#@)$', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|       A|
|       B|
|   #C#@D|
+--------+
Here, the parentheses in our regex ^(#@)|(#@)$ allow us to group characters together to form a substring. For instance, the regex ^(#@) matches the leading substring #@. Note that the value #C#@D is left unchanged because its #@ occurs neither at the very start nor at the very end of the string.
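A quick check with Python's re module illustrates this behaviour:
import re
re.sub('^(#@)|(#@)$', '', '#@A')    # 'A'     (leading #@ removed)
re.sub('^(#@)|(#@)$', '', '#C#@D')  # '#C#@D' (#@ is neither leading nor trailing)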