import pandas as pd

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", "A"]})
df
   name   group
0  alex   A
1  bob    B
2  cathy  A

Here, the group column holds a categorical variable. By default, however, pd.get_dummies one-hot encodes every string column. This is undesirable here because we know that name is not a categorical variable:

pd.get_dummies(df)
   name_alex  name_bob  name_cathy  group_A  group_B
0  1          0         0           1        0
1  0          1         0           0        1
2  0          0         1           1        0
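As a minimal sketch of this default behavior, the snippet below adds a numeric column to show that only the string columns are expanded. The age column and the people variable are illustrative assumptions, not part of the original example. Passing dtype=int keeps the 0/1 display, since recent pandas versions return boolean dummy columns by default.

```python
import pandas as pd

# Hypothetical frame: "age" is an illustrative numeric column, not in the original df.
people = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "group": ["A", "B", "A"],
    "age": [30, 25, 41],
})

# dtype=int keeps a 0/1 display regardless of the pandas version in use.
encoded = pd.get_dummies(people, dtype=int)

# Both string columns are expanded; the numeric column passes through unchanged.
print(list(encoded.columns))
print(encoded["age"].tolist())
```

Columns that are not encoded are placed first, followed by the dummy columns for each encoded column.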

In order to specify that the group column is the categorical variable to one-hot encode, we just need to set the columns parameter, like so:

pd.get_dummies(df, columns=["group"])
   name   group_A  group_B
0  alex   1        0
1  bob    0        1
2  cathy  1        0

Here, notice how the name column is not one-hot encoded.
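The columns parameter also works in the other direction: a column listed there is encoded even if it is numeric. The sketch below assumes a hypothetical integer level column to illustrate this. dtype=int is passed only to keep the 0/1 display across pandas versions.

```python
import pandas as pd

# Hypothetical frame: "level" is an illustrative integer-coded category.
df_levels = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "group": ["A", "B", "A"],
    "level": [1, 2, 1],
})

# Only the listed columns are encoded; "name" is left as-is.
encoded = pd.get_dummies(df_levels, columns=["group", "level"], dtype=int)
print(list(encoded.columns))
```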

One-hot encoding using a list

To build a one-hot encoded DataFrame from a list:

pd.get_dummies(["A","B","C","B"])
   A  B  C
0  1  0  0
1  0  1  0
2  0  0  1
3  0  1  0
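A quick way to sanity-check an encoding like the one above is to confirm that each row contains exactly one 1, and that the original labels can be read back from the position of that 1. The following is a small sketch using a Series in place of the list; the variable names are illustrative.

```python
import pandas as pd

labels = pd.Series(["A", "B", "C", "B"])

# dtype=int keeps a 0/1 display regardless of the pandas version in use.
dummies = pd.get_dummies(labels, dtype=int)

# Each row should contain exactly one 1.
ones_per_row = dummies.sum(axis=1)

# The column holding the 1 in each row is the original label.
recovered = dummies.idxmax(axis=1)

print(ones_per_row.tolist())
print(recovered.tolist())
```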

We show df here again for your reference:

df
   name   group
0  alex   A
1  bob    B
2  cathy  A

Specifying prefix

By default, the column label of the categorical variables becomes the prefix of the new column labels:

pd.get_dummies(df, columns=["group"])
   name   group_A  group_B
0  alex   1        0
1  bob    0        1
2  cathy  1        0

We can specify a custom prefix by setting the prefix parameter:

pd.get_dummies(df, columns=["group"], prefix="Group")
   name   Group_A  Group_B
0  alex   1        0
1  bob    0        1
2  cathy  1        0
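When more than one column is encoded, prefix can also be given as a dictionary that maps each column label to its own prefix. The sketch below assumes a hypothetical city column to show this. dtype=int is passed only to keep the 0/1 display across pandas versions.

```python
import pandas as pd

# Hypothetical frame: "city" is an illustrative second categorical column.
df_cities = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "group": ["A", "B", "A"],
    "city": ["Oslo", "Rome", "Oslo"],
})

# One prefix per encoded column, keyed by the original column label.
encoded = pd.get_dummies(
    df_cities,
    columns=["group", "city"],
    prefix={"group": "G", "city": "C"},
    dtype=int,
)
print(list(encoded.columns))
```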

Specifying prefix_sep

By default, the separator between the prefix and value of the categorical variable is "_". We can change this to whatever we wish:

pd.get_dummies(df, columns=["group"], prefix_sep="@")
   name   group@A  group@B
0  alex   1        0
1  bob    0        1
2  cathy  1        0
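One situation where a custom separator may help is when the category values themselves contain underscores, which makes the default labels awkward to split back into prefix and value. The sketch below assumes hypothetical values team_a and team_b and an arbitrary "::" separator.

```python
import pandas as pd

# Hypothetical values containing underscores, chosen for illustration.
df_teams = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "group": ["team_a", "team_b", "team_a"],
})

encoded = pd.get_dummies(df_teams, columns=["group"], prefix_sep="::", dtype=int)

# With a separator that never appears in the data, each dummy label
# splits cleanly into (prefix, value).
dummy_cols = [c for c in encoded.columns if "::" in c]
parsed = [tuple(c.split("::", 1)) for c in dummy_cols]
print(parsed)
```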

Specifying dummy_na

Consider the following DataFrame:

import numpy as np

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", np.nan]})
df
   name   group
0  alex   A
1  bob    B
2  cathy  NaN

Here, we've got a missing value (NaN) for Cathy's group.

By default, dummy_na=False, which means that a missing value will result in all 0s for that row:

pd.get_dummies(df, columns=["group"])
   name   group_A  group_B
0  alex   1        0
1  bob    0        1
2  cathy  0        0

A missing value can be treated as a category of its own if we set dummy_na=True, like so:

pd.get_dummies(df, columns=["group"], dummy_na=True)
   name   group_A  group_B  group_nan
0  alex   1        0        0
1  bob    0        1        0
2  cathy  0        0        1

Notice how we have a new column called group_nan.
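The difference between the two settings can be checked by summing the dummy columns per row. The following sketch compares both calls on the same data; the variable names are illustrative, and dtype=int is passed only to keep the 0/1 display across pandas versions.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "group": ["A", "B", np.nan],
})

without_na = pd.get_dummies(df_missing, columns=["group"], dtype=int)
with_na = pd.get_dummies(df_missing, columns=["group"], dummy_na=True, dtype=int)

# Sum the dummy columns per row (everything except "name").
sums_without = without_na.drop(columns="name").sum(axis=1)
sums_with = with_na.drop(columns="name").sum(axis=1)

# Without dummy_na the row with the missing value sums to 0;
# with dummy_na=True every row sums to 1.
print(sums_without.tolist())
print(sums_with.tolist())
```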

Specifying sparse

One-hot encoding, by nature, results in a sparse set of columns (i.e. many 0s). To reduce memory usage, we can choose to store the one-hot encoded columns in a SparseArray instead of the conventional NumPy arrays.

The caveat is that SparseArray does not support as many operations as NumPy arrays, so only set sparse=True when you are dealing with a large DataFrame that causes memory issues.

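Consider the original df, which has no missing values:

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", "A"]})
df
   name   group
0  alex   A
1  bob    B
2  cathy  A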

Here's the default dtype of the dummy-encoded columns (uint8 in pandas versions before 2.0; from pandas 2.0 onward the default is bool):

pd.get_dummies(df, columns=["group"]).dtypes
name       object
group_A    uint8
group_B    uint8
dtype: object

Here's the dtype when we set sparse=True:

pd.get_dummies(df, columns=["group"], sparse=True).dtypes
name       object
group_A    Sparse[uint8, 0]
group_B    Sparse[uint8, 0]
dtype: object

We see that the internal representation of the dummy-encoded columns has changed.
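To get a rough sense of the memory saving, the sketch below encodes a larger synthetic column both ways and compares the reported memory usage. The data size (100,000 rows, 50 categories) and the random seed are arbitrary assumptions; actual savings depend on how many distinct categories the data has.

```python
import numpy as np
import pandas as pd

# Synthetic data: 100,000 rows drawn from 50 hypothetical categories.
rng = np.random.default_rng(0)
labels = pd.Series(rng.integers(0, 50, size=100_000).astype(str), name="group")

# Same encoding, stored densely and sparsely.
dense = pd.get_dummies(labels, dtype="uint8")
sparse = pd.get_dummies(labels, sparse=True, dtype="uint8")

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()

print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes")
```

With many categories, most entries in each dummy column are 0, which is the case where the sparse representation is expected to pay off.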

Specifying drop_first

Consider the following DataFrame:

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", "A"]})
df
   name   group
0  alex   A
1  bob    B
2  cathy  A

By default, drop_first=False, which means that each category of the variable gets a column of its own:

pd.get_dummies(df, columns=["group"])   # drop_first=False
   name   group_A  group_B
0  alex   1        0
1  bob    0        1
2  cathy  1        0

By setting drop_first=True, we drop the first dummy-encoded column:

pd.get_dummies(df, columns=["group"], drop_first=True)
   name   group_B
0  alex   0
1  bob    1
2  cathy  0

The key here is that, even if we drop a single dummy-encoded column, we can still figure out what group a person belongs to: a 0 in group_B can only mean that the person is in group A.
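As a small sketch of that reasoning, the original group labels can be rebuilt from the single remaining column. The recovered variable name is illustrative, and dtype=int is passed only to keep the 0/1 display across pandas versions.

```python
import numpy as np
import pandas as pd

df_groups = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "group": ["A", "B", "A"],
})

encoded = pd.get_dummies(df_groups, columns=["group"], drop_first=True, dtype=int)

# A 1 in group_B means "B"; a 0 can only mean the dropped category "A".
recovered = np.where(encoded["group_B"] == 1, "B", "A")
print(recovered.tolist())
```

Dropping one column this way is commonly done before fitting linear models, where keeping all dummy columns alongside an intercept makes the columns perfectly collinear.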

Published by Isshin Inada


Official Pandas Documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
