PySpark DataFrame | union method
PySpark DataFrame's union(~) method concatenates two DataFrames vertically based on column positions.
Note the following:
the two DataFrames must have the same number of columns
the DataFrames will be vertically concatenated based on the column position rather than the labels. See examples below for clarification.
Parameters
1. other | PySpark DataFrame
The other DataFrame to vertically concatenate with.
Return Value
A PySpark DataFrame (pyspark.sql.dataframe.DataFrame).
Examples
Concatenating PySpark DataFrames vertically based on column position
Consider the following two PySpark DataFrames:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
+-----+---+
The other DataFrame:
+----+---+
|name|age|
+----+---+
|Alex| 25|
|Doge| 30|
|Eric| 50|
+----+---+
To concatenate the two DataFrames:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Alex| 25|
| Doge| 30|
| Eric| 50|
+-----+---+
Union is based on column position
Consider the following PySpark DataFrames:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
+-----+---+
The other PySpark DataFrame has a different column called salary:
+----+------+
|name|salary|
+----+------+
|Alex|   250|
|Doge|   200|
|Eric|   100|
+----+------+
Concatenating the two DataFrames using union(~) yields:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Alex|250|
| Doge|200|
| Eric|100|
+-----+---+
Notice how the method still concatenated the two DataFrames even though their column labels differ, so the salary values ended up under the age column. This is because the concatenation is based on column positions, and the labels play no role here. You should be wary of this behaviour because the union(~) method may yield incorrect DataFrames like the one above without throwing an error!
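One way to guard against this is to compare the column labels before calling union(~). The helper below, union_checked, is a hypothetical function written for illustration and is not part of PySpark; it only relies on the columns property and the union(~) method of the DataFrames passed in.

```python
def union_checked(df, other):
    """Union two DataFrames by position, but only if their column labels match.

    Hypothetical helper for illustration; not part of the PySpark API.
    """
    if df.columns != other.columns:
        raise ValueError(
            f"Column labels differ: {df.columns} vs {other.columns}"
        )
    return df.union(other)
```

With the two DataFrames above, whose columns are (name, age) and (name, salary), this helper would raise a ValueError instead of silently placing salaries under age. PySpark also provides unionByName(~), which matches columns by label rather than by position.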