Removing duplicate columns in Pandas DataFrame
Start your free 7-days trial now!
Dropping columns with the same label
Consider the following DataFrame:
df = pd.DataFrame({"A":[3,4],"B":[5,6],"C":[7,8]})df = df.rename(columns={"C":"A"})df
A B A0 3 5 71 4 6 8
Here, we had to use the rename(~)
method since the DataFrame(~)
constructor automatically eliminates columns with duplicate labels.
Keeping first occurrence of duplicates
To drop columns with duplicate labels except the first occurrence, use duplicated(~)
like so:
df.loc[:, ~df.columns.duplicated()]
A B0 3 51 4 6
Explanation
Here, we first fetch the column labels (Index
object) using the columns
property. We then call duplicated(~)
, which returns a NumPy array of booleans where True
indicates the presence of a duplicate label:
df.columns.duplicated()
array([False, False, True])
By default, keep="first"
for duplicated(~)
and so the first occurrence of a non-unique value will be marked as False
.
Dropping duplicate columns is equivalent to keeping all non-duplicate columns, and so we invert the booleans using ~
:
~df.columns.duplicated()
array([ True, True, False])
Finally, we pass this boolean mask into loc
to get the columns that correspond to True
in the mask:
df.loc[:, ~df.columns.duplicated()]
A B0 3 51 4 6
Here, the :
before the comma indicates that we want to fetch all the rows.
Dropping all occurrences of duplicates
Consider the same df
as above:
df
A B A0 3 5 71 4 6 8
To drop all occurrences of duplicates:
df.loc[:, ~df.columns.duplicated(keep=False)]
B0 51 6
Explanation
The only difference between this and the previous case is that we set the parameter keep=False
, and so duplicated(~)
here returns True
for all occurrences of non-unique values (as opposed to the default behaviour of returning False
for the first occurrence):
df.columns.duplicated(keep=False)
array([ True, False, True])
Dropping columns with same values
Consider the following DataFrame:
df = pd.DataFrame({"A":[3,4],"B":[5,6],"C":[3,4]})df
A B C0 3 5 31 4 6 4
Here, columns A
and C
contain the same values.
To drop duplicate columns:
df.T.drop_duplicates().T
A B0 3 51 4 6
By default, keep="first"
for drop_duplicates(~)
, which means that the first occurrence of the duplicates (column A
) is kept. To remove all occurrences instead, set keep=False
.
Explanation
There is no direct way of removing duplicate columns, but Pandas does offer the method drop_duplicates()
, which removes duplicate rows. Therefore, we take the transpose of df
using the T
property:
df.T
0 1A 3 4B 5 6C 3 4
We then call drop_duplicates()
to remove the duplicate rows:
df.T.drop_duplicates()
0 1A 3 4B 5 6
Finally, we take the transpose again to get back the original shape.