Removing duplicate rows in Pandas DataFrame
To remove duplicate rows from a Pandas DataFrame, use the drop_duplicates(~) method.
Removing duplicate rows where a single column value is duplicate
Consider the following DataFrame:
import pandas as pd
df = pd.DataFrame({"A":[3,4,3,3],"B":[6,7,6,9]})
df
   A  B
0  3  6
1  4  7
2  3  6
3  3  9
Keeping the first occurrence
To remove rows whose value in column A has already appeared in an earlier row:
df.drop_duplicates(subset=["A"]) # keep="first"
   A  B
0  3  6
1  4  7
By default, keep="first", which means that the first row in each group of duplicates is kept. This is why row 0 was kept while rows 2 and 3, which share its value of 3 in column A, were removed.
By default, inplace=False, which means that the method returns a new DataFrame and the original DataFrame is kept intact. To directly modify the original DataFrame, set inplace=True.
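The difference between the two inplace settings can be checked with a short sketch. It reuses the DataFrame from this section; the names deduped, df_copy and result are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 3, 3], "B": [6, 7, 6, 9]})

# Default (inplace=False): a new DataFrame is returned and df is untouched.
deduped = df.drop_duplicates(subset=["A"])
print(len(df), len(deduped))  # 4 2

# inplace=True: the DataFrame itself is modified and the method returns None.
df_copy = df.copy()  # work on a copy so that df stays available
result = df_copy.drop_duplicates(subset=["A"], inplace=True)
print(result)        # None
print(len(df_copy))  # 2
```

Because the in-place call returns None, assigning its result back to a variable would discard the DataFrame.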
Keeping the last occurrence
To keep only the last occurrence of duplicate rows, set keep="last":
df.drop_duplicates(subset=["A"], keep="last")
   A  B
1  4  7
3  3  9
Removing all occurrences
To remove all occurrences of duplicate rows, set keep=False:
df.drop_duplicates(subset=["A"], keep=False)
   A  B
1  4  7
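As a quick way to compare the three keep settings side by side, the following sketch collects the index labels that survive under each one when duplicates are judged on column A alone. The dictionary name kept_by_a is a placeholder introduced here.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 3, 3], "B": [6, 7, 6, 9]})

# Collect the index labels that survive under each keep setting.
kept_by_a = {}
for keep in ("first", "last", False):
    kept_by_a[keep] = df.drop_duplicates(subset=["A"], keep=keep).index.tolist()

print(kept_by_a)
# {'first': [0, 1], 'last': [1, 3], False: [1]}
```

In every case the surviving rows keep their original index labels and their original order.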
Removing duplicate rows where all column values are duplicate
Consider the same DataFrame as before:
df = pd.DataFrame({"A":[3,4,3,3],"B":[6,7,6,9]})
df
   A  B
0  3  6
1  4  7
2  3  6
3  3  9
Keeping the first occurrence
To remove duplicate rows where the values in all columns match:
df.drop_duplicates() # keep="first"
   A  B
0  3  6
1  4  7
3  3  9
By default, keep="first", which means that the first occurrence of the duplicate row will be kept. This is why row 0 was kept while row 2 was removed. Row 3 was kept because its value in column B differs from that of rows 0 and 2.
Keeping the last occurrence
To remove all occurrences of duplicate rows except the last, set keep="last":
df.drop_duplicates(keep="last")
   A  B
1  4  7
2  3  6
3  3  9
Removing all occurrences
To remove all occurrences of duplicate rows, set keep=False:
df.drop_duplicates(keep=False)
   A  B
1  4  7
3  3  9
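The whole-row behaviour can be summarised in one sketch. It assumes the same DataFrame as above and first checks that omitting subset gives the same result as listing every column, then records which index labels survive under each keep setting. The names no_subset, all_columns and kept_full_row are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 3, 3], "B": [6, 7, 6, 9]})

# With no subset, rows are compared on every column, so naming
# all columns explicitly produces the same result.
no_subset = df.drop_duplicates()
all_columns = df.drop_duplicates(subset=["A", "B"])
print(no_subset.equals(all_columns))  # True

# Index labels that survive under each keep setting.
kept_full_row = {
    keep: df.drop_duplicates(keep=keep).index.tolist()
    for keep in ("first", "last", False)
}
print(kept_full_row)
# {'first': [0, 1, 3], 'last': [1, 2, 3], False: [1, 3]}
```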