PySpark Column | dropFields method
PySpark Column's dropFields(~) method returns a new PySpark Column object with the specified nested fields removed.
Parameters
1. *fieldNames | string
The nested fields to remove.
Return Value
A PySpark Column.
Examples
Consider the following PySpark DataFrame with some nested Rows:
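from pyspark.sql import Row
data = [
    Row(name="Alex", age=20, friend=Row(name="Bob", age=30, height=150)),
    Row(name="Cathy", age=40, friend=Row(name="Doge", age=40, height=180)),
]
# spark is assumed to be an active SparkSession
df = spark.createDataFrame(data)
df.show()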
+-----+---+---------------+
| name|age|         friend|
+-----+---+---------------+
| Alex| 20| {Bob, 30, 150}|
|Cathy| 40|{Doge, 40, 180}|
+-----+---+---------------+
The schema of this PySpark DataFrame is as follows:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- height: long (nullable = true)
Dropping certain nested fields in PySpark Column
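To remove the age and height fields under friend, use the dropFields(~) method:
updated_col = df["friend"].dropFields("age", "height")
df_new = df.withColumn("friend", updated_col)
df_new.show()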
+-----+---+------+
| name|age|friend|
+-----+---+------+
| Alex| 20| {Bob}|
|Cathy| 40|{Doge}|
+-----+---+------+
Here, we use the withColumn(~) method to replace the friend column with the new column returned by dropFields(~).
The schema of this updated PySpark DataFrame is as follows:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
Notice how the age and height fields are no longer present under friend.
Even if the nested field you wish to delete does not exist, no error will be thrown:
updated_col = df["friend"].dropFields("ZZZZZZZZZ")
df.withColumn("friend", updated_col).show()
+-----+---+---------------+
| name|age|         friend|
+-----+---+---------------+
| Alex| 20| {Bob, 30, 150}|
|Cathy| 40|{Doge, 40, 180}|
+-----+---+---------------+
Here, the nested field "ZZZZZZZZZ" does not exist, so dropFields(~) leaves the friend column unchanged and no error is thrown.