Replacing certain substrings in PySpark DataFrame column
To replace certain substrings in column values of a PySpark DataFrame, use either the translate(~) method or the regexp_replace(~) method of PySpark SQL Functions.
As an example, consider the following PySpark DataFrame:
+------+
|  name|
+------+
|!A@lex|
|  B#ob|
+------+
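For readers who want to follow along, the example DataFrame above could be built roughly as follows. This is a minimal sketch: the names ROWS, COLUMNS and make_example_df are illustrative and do not come from the original guide, and spark is assumed to be an existing SparkSession.

```python
# Rows and column names matching the example table above.
ROWS = [("!A@lex",), ("B#ob",)]
COLUMNS = ["name"]


def make_example_df(spark):
    """Build the example DataFrame from an existing SparkSession."""
    # createDataFrame accepts a list of tuples plus a list of column names.
    return spark.createDataFrame(ROWS, COLUMNS)


# Typical usage, assuming PySpark is installed:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = make_example_df(spark)
#   df.show()
```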
Replacing certain characters
Suppose we wanted to make the following character replacements:
'!' replaced by '3'
'@' replaced by '4'
'#' replaced by '5'
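We can use the translate(~) method like so:
from pyspark.sql import functions as F
df.withColumn("name", F.translate("name", "!@#", "345")).show()
+------+
|  name|
+------+
|3A4lex|
|  B5ob|
+------+
Here, each character in "!@#" is replaced by the character at the same position in "345". The withColumn(~) method is used to replace the name column with our new column.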
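The positional mapping performed by translate(~) can be previewed without a Spark session, because Python's built-in str.translate applies the same character-for-character substitution when both strings have equal length. The sketch below is an illustration only; preview_translate and translate_column are hypothetical helper names, not part of PySpark.

```python
def preview_translate(value: str, matching: str, replace: str) -> str:
    """Plain-Python preview of a positional character mapping."""
    # str.maketrans pairs matching[i] with replace[i]; it requires equal lengths.
    return value.translate(str.maketrans(matching, replace))


def translate_column(df, column: str, matching: str, replace: str):
    """Apply the same mapping to a DataFrame column with PySpark."""
    # Imported here so the preview above works even without PySpark installed.
    from pyspark.sql import functions as F

    return df.withColumn(column, F.translate(column, matching, replace))


print(preview_translate("!A@lex", "!@#", "345"))  # 3A4lex
print(preview_translate("B#ob", "!@#", "345"))    # B5ob
```

With a DataFrame df at hand, translate_column(df, "name", "!@#", "345") would overwrite the name column in the same way.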
Replacing certain substrings
Consider the following PySpark DataFrame:
+-----+
| name|
+-----+
|A@@ex|
| @Bob|
+-----+
To replace certain substrings, use the regexp_replace(~) method:
from pyspark.sql import functions as F
df.withColumn("name", F.regexp_replace("name", "@@", "l")).show()
+----+
|name|
+----+
|Alex|
|@Bob|
+----+
Here, note the following:
we are replacing the substring "@@" with the letter "l".
the second argument of regexp_replace(~) is a regular expression. This means that certain characters such as $ and [ carry special meaning. To replace literal substrings, escape special regex characters using a backslash \ (e.g. \[).
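When the substring to replace comes from user input or contains several special characters, escaping by hand is error-prone. One possible approach, sketched below, is to build the pattern with Python's re.escape before passing it to regexp_replace(~). The helper names literal_pattern and replace_literal are hypothetical, and the sketch assumes that backslash-escaped punctuation is read the same way by Spark's Java regex engine as by Python's.

```python
import re


def literal_pattern(substring: str) -> str:
    """Return a regex pattern that matches `substring` literally."""
    # re.escape puts a backslash in front of regex metacharacters such as [ and $.
    return re.escape(substring)


def replace_literal(df, column: str, old: str, new: str):
    """Replace every literal occurrence of `old` in `column` with `new`."""
    # Imported here so literal_pattern can be tried without PySpark installed.
    from pyspark.sql import functions as F

    # Only the pattern is escaped; the replacement string is passed through unchanged.
    return df.withColumn(column, F.regexp_replace(column, literal_pattern(old), new))


print(literal_pattern("[x]"))  # \[x\]
```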
Replacing certain substrings using Regex
Consider the following PySpark DataFrame:
+----+
|name|
+----+
|A@ex|
|@Bob|
+----+
To replace @ with another string only when it is at the beginning of the string, use regexp_replace(~):
from pyspark.sql import functions as F
df.withColumn("name", F.regexp_replace("name", "^@", "*")).show()
+----+
|name|
+----+
|A@ex|
|*Bob|
+----+
Here, the regex ^@ matches an @ that is at the start of the string.
Replacing certain substrings in multiple columns
The regexp_replace(~) method can only be applied to one column at a time. To replace substrings in multiple columns, apply it to each column in turn.
For example, consider the following PySpark DataFrame:
+---+---+
|  A|  B|
+---+---+
| @a| @b|
| @c| @d|
+---+---+
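To replace the substring '@' with '#' for columns A and B:
from pyspark.sql import functions as F
str_before = '@'
str_after = '#'
df.select([F.regexp_replace(c, str_before, str_after).alias(c) for c in df.columns]).show()
+---+---+
|  A|  B|
+---+---+
| #a| #b|
| #c| #d|
+---+---+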
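In the example above every column holds strings. If a DataFrame also has numeric or date columns, one option is to restrict the replacement to string columns using df.dtypes, which lists (column name, type name) pairs. The following is a sketch under that assumption; string_columns and replace_in_string_columns are hypothetical helper names, not PySpark functions.

```python
def string_columns(dtypes):
    """Pick the names of string-typed columns from a df.dtypes-style list."""
    return [name for name, dtype in dtypes if dtype == "string"]


def replace_in_string_columns(df, pattern: str, replacement: str):
    """Apply regexp_replace to string columns and leave other columns untouched."""
    # Imported here so string_columns can be tried without PySpark installed.
    from pyspark.sql import functions as F

    targets = set(string_columns(df.dtypes))
    return df.select([
        F.regexp_replace(c, pattern, replacement).alias(c) if c in targets else F.col(c)
        for c in df.columns
    ])


print(string_columns([("A", "string"), ("B", "string"), ("n", "int")]))  # ['A', 'B']
```

Calling replace_in_string_columns(df, "@", "#") on the DataFrame above would be expected to give the same result as the list comprehension, since both A and B are string columns.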
Related
translate(~) method replaces the specified characters by the desired characters.
regexp_replace(~) method replaces the matched regular expression with the specified string.