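from pyspark.sql import Row

# Assumes an active SparkSession named spark (as in the pyspark shell or a notebook)
data = [
    Row(name="Alex", age=20, friend=Row(name="Bob", age=30, height=150)),
    Row(name="Cathy", age=40, friend=Row(name="Doge", age=40, height=180))
]
df = spark.createDataFrame(data)
df.show()

+-----+---+---------------+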
| name|age|         friend|
+-----+---+---------------+
| Alex| 20| {Bob, 30, 150}|
|Cathy| 40|{Doge, 40, 180}|
+-----+---+---------------+

The schema of this PySpark DataFrame is as follows:

df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- height: long (nullable = true)

Dropping certain nested fields in PySpark Column

To remove the age and height fields under friend, use the dropFields(~) method:

updated_col = df["friend"].dropFields("age", "height")
df_new = df.withColumn("friend", updated_col)
df_new.show()

+-----+---+------+
| name|age|friend|
+-----+---+------+
| Alex| 20| {Bob}|
|Cathy| 40|{Doge}|
+-----+---+------+

Here, we use the withColumn(~) method to replace the friend column with the new column returned by dropFields(~). The dropFields(~) method returns a new Column and does not modify df itself.

The schema of this updated PySpark DataFrame is as follows:

df_new.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)

Notice how the age and height fields are no longer present under friend.

NOTE

Even if the nested field you wish to delete does not exist, no error will be thrown:

updated_col = df["friend"].dropFields("ZZZZZZZZZ")
df_new = df.withColumn("friend", updated_col)
df_new.show()

+-----+---+---------------+
| name|age|         friend|
+-----+---+---------------+
| Alex| 20| {Bob, 30, 150}|
|Cathy| 40|{Doge, 40, 180}|
+-----+---+---------------+

Here, the nested field "ZZZZZZZZZ" does not exist under friend, yet no error is thrown and the friend column is returned unchanged.

Published by Isshin Inada

Official PySpark Documentation

https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.Column.dropFields.html
