df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()

+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
+-----+---+

The other DataFrame:

df2 = spark.createDataFrame([["Alex", 25], ["Doge", 30], ["Eric", 50]], ["name", "age"])
df2.show()

+----+---+
|name|age|
+----+---+
|Alex| 25|
|Doge| 30|
|Eric| 50|
+----+---+

To concatenate the two DataFrames:

df1.union(df2).show()

+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Alex| 25|
| Doge| 30|
| Eric| 50|
+-----+---+

Union is based on column position

Consider the following PySpark DataFrame:

df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()

+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
+-----+---+

The other PySpark DataFrame has a salary column instead of age:

df2 = spark.createDataFrame([["Alex", 250], ["Doge", 200], ["Eric", 100]], ["name", "salary"])
df2.show()

+----+------+
|name|salary|
+----+------+
|Alex|   250|
|Doge|   200|
|Eric|   100|
+----+------+

Concatenating the two DataFrames using union(~) yields:

df1.union(df2).show()

+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Alex|250|
| Doge|200|
| Eric|100|
+-----+---+

Notice that even though the two DataFrames have different column labels, the method still concatenated them. This is because union(~) matches columns by position and ignores the labels. The result takes its column labels from the first DataFrame, so the salary values end up under age. Be wary of this behaviour, because union(~) can silently produce an incorrect DataFrame like the one above without throwing an error!
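One way to guard against this silent mismatch is to compare the column labels before calling union(~). The following is only a minimal sketch: union_checked is a hypothetical helper name, not part of PySpark, and it assumes that a strict label-by-label match in the same order is what you want.

```python
def union_checked(left, right):
    """Union two DataFrames by position, but only if their labels match.

    Hypothetical helper (not part of PySpark). It relies only on the
    `columns` attribute and the `union` method of the DataFrames passed in.
    """
    # DataFrame.columns is a list of column labels in positional order,
    # so list equality checks both the labels and their order.
    if left.columns != right.columns:
        raise ValueError(
            f"Column labels differ: {left.columns} vs {right.columns}"
        )
    return left.union(right)
```

With the two DataFrames above, union_checked(df1, df2) raises a ValueError because ["name", "age"] and ["name", "salary"] differ, rather than returning a DataFrame that mixes ages and salaries.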

PySpark DataFrame | unionByName method

PySpark DataFrame's unionByName(~) method concatenates PySpark DataFrames vertically by aligning the column labels.
