df = spark.createDataFrame([("Alex", 20), ("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()

+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+

Suppose the other PySpark DataFrame is:

df_other = spark.createDataFrame([("Alex", 20), ("Alex", 20), ("David", 80), ("Eric", 80)], ["name", "age"])
df_other.show()

+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Alex| 20|
|David| 80|
| Eric| 80|
+-----+---+

Here, note the following:

The only row that appears in both DataFrames is Alex's row.
Alex's row appears twice in both df and df_other.

Getting rows that also exist in another PySpark DataFrame while preserving duplicates

To get the rows that also exist in another PySpark DataFrame while preserving duplicates, use the intersectAll(~) method:

df_res = df.intersectAll(df_other)
df_res.show()

+----+---+
|name|age|
+----+---+
|Alex| 20|
|Alex| 20|
+----+---+

Note the following:

Alex's row appears twice in the result because it appears twice in both df and df_other.
If Alex's row appeared only once in one DataFrame but multiple times in the other, it would be included only once in the result. In general, a row is kept as many times as the smaller of its two counts.
If you want each matching row to appear only once regardless of duplicates, use the intersect(~) method instead.
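
The counting rule in the notes above can be checked without a Spark session by treating each DataFrame as a multiset of rows. The following is a plain-Python sketch of that rule using collections.Counter, not Spark itself. The helper names intersect_all_model and intersect_model are hypothetical and are not part of PySpark, and the results are sorted only to make them easy to compare (Spark does not guarantee row order).

```python
from collections import Counter

def intersect_all_model(rows, other_rows):
    """Model of intersectAll(~): keep each row min(left count, right count) times."""
    # Counter & Counter keeps the minimum count for every key present in both
    common = Counter(rows) & Counter(other_rows)
    return sorted(common.elements())

def intersect_model(rows, other_rows):
    """Model of intersect(~): keep each common row exactly once."""
    return sorted(set(rows) & set(other_rows))

# Same rows as df and df_other in this guide
rows = [("Alex", 20), ("Alex", 20), ("Bob", 30), ("Cathy", 40)]
other_rows = [("Alex", 20), ("Alex", 20), ("David", 80), ("Eric", 80)]

# Alex appears twice on both sides, so both copies are kept
print(intersect_all_model(rows, other_rows))      # [('Alex', 20), ('Alex', 20)]

# Drop one copy of Alex from the other side: only one copy survives
print(intersect_all_model(rows, other_rows[1:]))  # [('Alex', 20)]

# The intersect model removes duplicates entirely
print(intersect_model(rows, other_rows))          # [('Alex', 20)]
```

The second call shows the asymmetric case from the notes: with two copies of Alex's row on one side and one on the other, only one copy is kept.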

Published by Isshin Inada

Official PySpark Documentation

https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.DataFrame.intersectAll.html
