PySpark DataFrame | join method
PySpark DataFrame's join(~) method joins two DataFrames using the given join method.
Parameters
1. other | DataFrame
The other PySpark DataFrame with which to join.
2. on | string or list or Column | optional
The columns to perform the join on.
3. how | string | optional
The type of join to perform. By default, how="inner". See the examples below for the types of joins implemented.
Return Value
A PySpark DataFrame (pyspark.sql.dataframe.DataFrame).
Examples
Performing inner, left and right joins
Consider the following PySpark DataFrames:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
+-----+---+
The other PySpark DataFrame:
+----+------+
|name|salary|
+----+------+
|Alex|   250|
| Bob|   200|
|Doge|   100|
+----+------+
Inner join
For an inner join, all rows that have matching values in both the left and right DataFrames will be present in the resulting DataFrame:
+----+---+------+
|name|age|salary|
+----+---+------+
|Alex| 20|   250|
| Bob| 24|   200|
+----+---+------+
Left join and left-outer join
For a left join (or left-outer join), all rows in the left DataFrame and matching rows in the right DataFrame will be present in the resulting DataFrame; unmatched left rows get null for the right DataFrame's columns:
+-----+---+------+
| name|age|salary|
+-----+---+------+
| Alex| 20|   250|
|  Bob| 24|   200|
|Cathy| 22|  null|
+-----+---+------+
Right join and right-outer join
For a right join (or right-outer join), all rows in the right DataFrame and matching rows in the left DataFrame will be present in the resulting DataFrame; unmatched right rows get null for the left DataFrame's columns:
+----+----+------+
|name| age|salary|
+----+----+------+
|Alex|  20|   250|
| Bob|  24|   200|
|Doge|null|   100|
+----+----+------+
Performing outer join
Consider the same PySpark DataFrames as before:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
+-----+---+
This is the other PySpark DataFrame:
+----+------+
|name|salary|
+----+------+
|Alex|   250|
| Bob|   200|
|Doge|   100|
+----+------+
For an outer join (full join), all rows from both the left and right DataFrames will be present in the resulting DataFrame, with null filled in for missing values:
+-----+----+------+
| name| age|salary|
+-----+----+------+
| Alex|  20|   250|
|  Bob|  24|   200|
|Cathy|  22|  null|
| Doge|null|   100|
+-----+----+------+
Performing left-anti and left-semi joins
Consider the same PySpark DataFrames as before:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
+-----+---+
This is the other DataFrame:
+----+------+
|name|salary|
+----+------+
|Alex|   250|
| Bob|   200|
|Doge|   100|
+----+------+
Left anti-join
For a left anti-join, all rows in the left DataFrame with no matching row in the right DataFrame will be in the resulting DataFrame:
+-----+---+
| name|age|
+-----+---+
|Cathy| 22|
+-----+---+
Left semi-join
A left semi-join is the opposite of a left anti-join: all rows in the left DataFrame that have a matching row in the right DataFrame will be in the resulting DataFrame. Unlike an inner join, only the left DataFrame's columns are kept:
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 24|
+----+---+
Performing join on different column names
Up to now, we have specified the join key using the on parameter. Let's now consider the case when the join keys have different labels. Suppose one DataFrame is as follows:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
+-----+---+
Suppose the other DataFrame is as follows:
+----+------+
|NAME|salary|
+----+------+
|Alex|   250|
| Bob|   200|
|Doge|   100|
+----+------+
We can join using the name column of df1 and the NAME column of df2 like so:

cond = [df1["name"] == df2["NAME"]]
df1.join(df2, on=cond, how="inner").show()
+----+---+----+------+
|name|age|NAME|salary|
+----+---+----+------+
|Alex| 20|Alex|   250|
| Bob| 24| Bob|   200|
+----+---+----+------+
Here, we can supply multiple join keys since on accepts a list.