PySpark DataFrame | foreach method
PySpark DataFrame's foreach(~) method loops over each row of the DataFrame as a Row object and applies the given function to the row.
The following are some limitations of foreach(~):
- the foreach(~) method in Spark is invoked on the worker nodes instead of the driver program. This means that if we perform a print(~) inside our function, we will not be able to see the printed results in our session or notebook, because the results are printed on the worker nodes instead.
- rows are read-only, so you cannot update the values of the rows.
Given these limitations, the foreach(~) method is mainly used for logging some information about each row to the local machine or to an external database.
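As a rough sketch of this logging use case, assuming df is a PySpark DataFrame with name and age columns (like the one in the Examples section below); the logger name and message format here are illustrative assumptions:

import logging

def log_row(row):
    # Executed on the worker nodes, so the log records end up in the
    # workers' local logs rather than in the driver's session
    logging.getLogger("row-logger").info("name=%s age=%s", row.name, row.age)

df.foreach(log_row)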
Parameters
1. f | function
The function to apply to each row (Row) of the DataFrame.
Return Value
Nothing is returned.
Examples
Consider the following PySpark DataFrame:
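A minimal sketch of how this DataFrame could be constructed, assuming an existing SparkSession named spark:

df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()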
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
To iterate over each row and apply some custom function:
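A minimal sketch of such a function, which simply prints the name field of each row:

def f(row):
    print(row.name)

df.foreach(f)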
Here, row.name is printed on the worker nodes, so you would not see any output in the driver program.