PySpark Column | withField method
PySpark Column's withField(~) method either adds a new nested field or updates an existing nested field's value.
Parameters

1. fieldName | string

The name of the nested field.

2. col | Column

The new column value to add or update with.

Return Value

A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame with nested rows:
from pyspark.sql import Row

data = [Row(name="Alex", age=20, friend=Row(name="Bob", age=30)),
        Row(name="Cathy", age=40, friend=Row(name="Doge", age=40))]
df = spark.createDataFrame(data)
df.show()
+-----+---+----------+
| name|age|    friend|
+-----+---+----------+
| Alex| 20| {Bob, 30}|
|Cathy| 40|{Doge, 40}|
+-----+---+----------+
Here, the friend column contains nested Row objects, which we can confirm by printing out the schema:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: long (nullable = true)
Updating nested rows in PySpark
To update nested rows, use the withField(~)
method like so:
import pyspark.sql.functions as F
df.withColumn("friend", F.col("friend").withField("name", F.lit("BOB"))).show()
+-----+---+---------+
| name|age|   friend|
+-----+---+---------+
| Alex| 20|{BOB, 30}|
|Cathy| 40|{BOB, 40}|
+-----+---+---------+
Note the following:

- we are updating the name field of the friend column with the constant string "BOB".
- F.lit("BOB") returns a Column object whose values are filled with the string "BOB".
- the withColumn(~) method replaces the friend column of our DataFrame with the updated column returned by withField(~).
Updating nested rows using original values in PySpark
To update nested rows using their original values, use the withField(~) method like so:
+-----+---+----------+
| name|age|    friend|
+-----+---+----------+
| Alex| 20| {BOB, 30}|
|Cathy| 40|{DOGE, 40}|
+-----+---+----------+
Here, we are uppercasing the name field of the friend column using F.upper("friend.name"), which returns a Column object.
Adding new field values in nested rows in PySpark
The withField(~) method can also be used to add new field values to nested rows:
+-----+---+----------------+
| name|age|          friend|
+-----+---+----------------+
| Alex| 20|  {Bob, 30, BOB}|
|Cathy| 40|{Doge, 40, DOGE}|
+-----+---+----------------+
Now, checking the schema of our new PySpark DataFrame:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- upper_name: string (nullable = true)
We can see that the new nested field upper_name has been added!