Removing duplicate rows in a Pandas DataFrame
To remove duplicate rows from a Pandas DataFrame, use the drop_duplicates(~) method.
Removing duplicate rows where a single column value is duplicated
Consider the following DataFrame:

    import pandas as pd

    df = pd.DataFrame({"A": [3, 4, 3, 3], "B": [6, 7, 6, 9]})
    df

       A  B
    0  3  6
    1  4  7
    2  3  6
    3  3  9

Keeping the first occurrence
To remove rows whose value in column A duplicates that of an earlier row:

    df.drop_duplicates(subset=["A"])   # keep="first"

       A  B
    0  3  6
    1  4  7

By default, keep="first", which means that the first occurrence of the duplicate row will be kept. This is why row 0 was kept while rows 2 and 3 were removed.
By default, inplace=False, which means that the method returns a new DataFrame and the original DataFrame is kept intact. To directly modify the original DataFrame, set inplace=True.
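As a minimal sketch of the inplace behaviour described above, the following compares the default call with inplace=True. The names deduped, df_copy and result are illustrative and do not come from the original example. Note that with inplace=True the method returns None, so its return value should not be assigned back to the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 3, 3], "B": [6, 7, 6, 9]})

# Default (inplace=False): a new DataFrame is returned and df is untouched.
deduped = df.drop_duplicates(subset=["A"])
print(len(df), len(deduped))

# inplace=True: the DataFrame itself is modified and the call returns None.
df_copy = df.copy()  # work on a copy so the original df stays available
result = df_copy.drop_duplicates(subset=["A"], inplace=True)
print(result, len(df_copy))
```

Reassigning the returned DataFrame (df = df.drop_duplicates(...)) is a common alternative to inplace=True and makes the modification explicit.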
Keeping the last occurrence
To keep only the last occurrence of duplicate rows, set keep="last":

    df.drop_duplicates(subset=["A"], keep="last")

       A  B
    1  4  7
    3  3  9

Removing all occurrences
To remove all occurrences of duplicate rows, set keep=False:

    df.drop_duplicates(subset=["A"], keep=False)

       A  B
    1  4  7
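
To compare the three keep settings side by side, the sketch below loops over them and records the index labels of the rows that survive de-duplication on column A. The dictionary name kept is an illustrative choice, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 3, 3], "B": [6, 7, 6, 9]})

# Map each keep option to the index labels that remain after de-duplicating on A.
kept = {}
for option in ("first", "last", False):
    kept[option] = df.drop_duplicates(subset=["A"], keep=option).index.tolist()

print(kept)  # {'first': [0, 1], 'last': [1, 3], False: [1]}
```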
        
Removing duplicate rows where all column values are duplicated
Consider the same DataFrame as before:

    df = pd.DataFrame({"A": [3, 4, 3, 3], "B": [6, 7, 6, 9]})
    df

       A  B
    0  3  6
    1  4  7
    2  3  6
    3  3  9

Keeping the first occurrence
To remove duplicate rows where the values in all columns match:

    df.drop_duplicates()   # keep="first"

       A  B
    0  3  6
    1  4  7
    3  3  9

By default, keep="first", which means that the first occurrence of the duplicate row will be kept. This is why row 0 was kept while row 2 was removed.
Keeping the last occurrence
To remove all occurrences of duplicate rows except the last, set keep="last":

    df.drop_duplicates(keep="last")

       A  B
    1  4  7
    2  3  6
    3  3  9

Removing all occurrences
To remove all occurrences of duplicate rows, set keep=False:

    df.drop_duplicates(keep=False)

       A  B
    1  4  7
    3  3  9
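
One way to check which rows a given keep setting would drop is the related duplicated() method, which the text above does not cover. It returns a Boolean Series marking the rows treated as duplicates. As a sketch on the same DataFrame, filtering with the inverted mask should give the same rows as drop_duplicates(); the names mask and filtered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 3, 3], "B": [6, 7, 6, 9]})

# True marks rows that repeat an earlier row across all columns (keep="first").
mask = df.duplicated(keep="first")
print(mask.tolist())  # [False, False, True, False]

# Keeping only the unmarked rows matches drop_duplicates() on this DataFrame.
filtered = df[~mask]
print(filtered.equals(df.drop_duplicates()))  # True
```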
        
      