PySpark Column | dropFields method
PySpark Column's dropFields(~) method returns a new PySpark Column object with the specified nested fields removed.
Parameters
1. *fieldNames | string
The names of the nested fields to remove. One or more names can be passed.
Return Value
A PySpark Column.
Examples
Consider the following PySpark DataFrame with some nested Rows:
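from pyspark.sql import Row

# Assumes an active SparkSession named spark
data = [
    Row(name="Alex", age=20, friend=Row(name="Bob", age=30, height=150)),
    Row(name="Cathy", age=40, friend=Row(name="Doge", age=40, height=180)),
]
df = spark.createDataFrame(data)
df.show()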
+-----+---+---------------+
| name|age|         friend|
+-----+---+---------------+
| Alex| 20| {Bob, 30, 150}|
|Cathy| 40|{Doge, 40, 180}|
+-----+---+---------------+
The schema of this PySpark DataFrame is as follows:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- height: long (nullable = true)
Dropping certain nested fields in PySpark Column
To remove the age and height fields under friend, use the dropFields(~) method:
+-----+---+------+
| name|age|friend|
+-----+---+------+
| Alex| 20| {Bob}|
|Cathy| 40|{Doge}|
+-----+---+------+
Here, we use the withColumn(~) method to replace the friend column with the new column returned by dropFields(~).
The schema of this updated PySpark DataFrame is as follows:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
Notice how the age and height fields are no longer present under friend.
Even if the nested field you wish to delete does not exist, no error will be thrown:
updated_col = df["friend"].dropFields("ZZZZZZZZZ")
df.withColumn("friend", updated_col).show()
+-----+---+---------------+
| name|age|         friend|
+-----+---+---------------+
| Alex| 20| {Bob, 30, 150}|
|Cathy| 40|{Doge, 40, 180}|
+-----+---+---------------+
Here, the nested field "ZZZZZZZZZ" does not exist, yet no error is thrown and the friend column is returned unchanged.