Getting rows that are not in other DataFrame in Pandas
Start your free 7-days trial now!
Consider the following DataFrames:
df1 = pd.DataFrame({"A":[3,4],"B":[5,6]})df2 = pd.DataFrame({"C":[3,8],"D":[5,9]})
A B | C D0 3 5 | 0 3 51 4 6 | 1 8 9
Here, the first row of each DataFrame has the same entries.
Solution
To fetch all the rows in df1
that do not exist in df2
:
df_merged = df1.merge(df2, how="left", left_on=["A","B"], right_on=["C","D"], indicator=True)df_merged.query("_merge == 'left_only'")[["A","B"]]
A B1 4 6
Explanation
Here, we are are first performing a left join on all columns of df1
and df2
:
df_merged = df1.merge(df2, how="left", left_on=["A","B"], right_on=["C","D"], indicator=True)df_merged
A B C D _merge0 3 5 3.0 5.0 both1 4 6 NaN NaN left_only
The indicate=True
means that we want to append the _merge
column, which tells us the type of join performed; both
indicates that a match was found, whereas left_only
means that no match was found.
We then use the query(~)
method to select rows where _merge=left_only
:
df_merged.query("_merge == 'left_only'")
A B C D _merge1 4 6 NaN NaN left_only
Since we are interested in just the original columns of df1
, we simply extract them using []
syntax:
df_merged.query("_merge == 'left_only'")[["A","B"]]
A B1 4 6
Generalising the solution
As explained above, the solution to get rows that are not in another DataFrame is as follows:
df_merged = df1.merge(df2, how="left", left_on=["A","B"], right_on=["C","D"], indicator=True)df_merged.query("_merge == 'left_only'")[["A","B"]]
A B1 4 6
Instead of explicitly specifying the column labels (e.g. ["A","B"]
), you can pass in a list of columns like so:
df_merged = df1.merge(df2, how="left", left_on=df1.columns.tolist(), right_on=df2.columns.tolist(), indicator=True)df_merged.query("_merge == 'left_only'")[df1.columns]
A B1 4 6