PySpark DataFrame | intersectAll method
PySpark DataFrame's intersectAll(~) method returns a new PySpark DataFrame with rows that also exist in the other PySpark DataFrame. Unlike intersect(~), the intersectAll(~) method preserves duplicates.
The intersectAll(~) method is equivalent to the INTERSECT ALL statement in SQL.
Parameters
1. other | PySpark DataFrame
The other PySpark DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 20), ("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Suppose the other PySpark DataFrame is:
df_other = spark.createDataFrame([("Alex", 20), ("Alex", 20), ("David", 80), ("Eric", 80)], ["name", "age"])
df_other.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Alex| 20|
|David| 80|
| Eric| 80|
+-----+---+
Here, note the following:
The only matching row is Alex's row.
Alex's row appears twice in both df and df_other.
Getting rows that also exist in the other PySpark DataFrame while preserving duplicates
To get rows that also exist in the other PySpark DataFrame while preserving duplicates:
df_res = df.intersectAll(df_other)
df_res.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
|Alex| 20|
+----+---+
Note the following:
Alex's row is duplicated because it appears twice in each of df and df_other.
If Alex's row appeared only once in one DataFrame but multiple times in the other, it would be included only once in the resulting DataFrame. In general, a row that appears m times in one DataFrame and n times in the other appears min(m, n) times in the result.
If you want each matching row to be included only once, use the intersect(~) method instead.
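The duplicate-handling rule described in these notes can be modelled in plain Python without a Spark session. The sketch below is not how Spark implements the methods; it only mirrors their counting behaviour using collections.Counter. The helper names intersect_all and intersect and the once_rows data are hypothetical, introduced for illustration.

```python
from collections import Counter

def intersect_all(rows, other_rows):
    """Model of intersectAll(~): keep each row min(count in rows, count in other_rows) times."""
    common = Counter(rows) & Counter(other_rows)  # & keeps the minimum count per row
    return sorted(common.elements())

def intersect(rows, other_rows):
    """Model of intersect(~): keep each matching row exactly once."""
    return sorted(set(rows) & set(other_rows))

df_rows = [("Alex", 20), ("Alex", 20), ("Bob", 30), ("Cathy", 40)]
other_rows = [("Alex", 20), ("Alex", 20), ("David", 80), ("Eric", 80)]
# Hypothetical variant in which Alex's row appears only once.
once_rows = [("Alex", 20), ("David", 80)]

both_twice = intersect_all(df_rows, other_rows)  # Alex appears twice on both sides
uneven = intersect_all(df_rows, once_rows)       # Alex appears twice vs. once
distinct = intersect(df_rows, other_rows)        # duplicates dropped

print(both_twice)
print(uneven)
print(distinct)
```

The results are sorted only to make the printed output deterministic; a real Spark DataFrame carries no guaranteed row order.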