PySpark DataFrame | replace method
PySpark DataFrame's replace(~) method returns a new DataFrame with certain values replaced. We can also specify which columns to perform the replacement in.
Parameters
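1. to_replace | boolean, number, string, list or dict
The value or values to be replaced. If a dict is passed, its keys are the values to replace and its values are the replacements, and value is ignored.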
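2. value | boolean, number, string, list or None | optional
The new value that replaces to_replace. If a list is passed, it must have the same length and type as to_replace. This argument is required unless to_replace is a dict.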
3. subset | list | optional
The columns to focus on. By default, all columns will be checked for replacement.
Return Value
PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Replacing values for a single column
To replace the value "Alex" with "ALEX" in the name column:
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Note that a new PySpark DataFrame is returned, and the original DataFrame is kept intact.
Replacing multiple values for a single column
To replace the value "Alex" with "ALEX" and "Bob" with "BOB" in the name column:
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  BOB| 30|
|Cathy| 40|
+-----+---+
Replacing multiple values with a single value
To replace the values "Alex" and "Bob" with "SkyTowner" in the name column:
+---------+---+
|     name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
|    Cathy| 40|
+---------+---+
Replacing values in the entire DataFrame
To replace the values "Alex" and "Bob" with "SkyTowner" in the entire DataFrame:
+---------+---+
|     name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
|    Cathy| 40|
+---------+---+
Here, notice how we did not specify the subset option.
Replacing values using a dictionary
To replace "Alex" with "ALEX" and "Bob" with "BOB" in the name column using a dictionary:
Mixed-type replacements are not allowed. For instance, the following is not allowed:
df.replace({"Alex": "ALEX", 30: 99}, subset=["name", "age"]).show()
ValueError: Mixed type replacements are not supported
Here, we are performing one string replacement and one integer replacement. Since this is a mixed-type replacement, PySpark raises an error. To avoid this error, perform the two replacements separately.
Replacing multiple values in multiple columns
Consider the following DataFrame:
+----+----+
|col1|col2|
+----+----+
|  aa|  AA|
|  bb|  BB|
+----+----+
To replace certain values in col1 and col2:
+----+----+
|col1|col2|
+----+----+
|  aa| @@@|
| ###|  BB|
+----+----+