Pandas | get_dummies method
Pandas' get_dummies(~) method performs one-hot encoding or dummy coding on categorical variables.
Parameters
1. data | array-like or DataFrame
The source data whose categorical variables will be one-hot encoded.
2. prefix | string or list<string> or dict | optional
The prefix to prepend to the labels of the dummy-encoded columns. By default, prefix=None.
3. prefix_sep | string | optional
The separator to place between the prefix and the category value in the new column labels. For DataFrame columns, the original column label serves as the prefix when prefix is not specified. By default, prefix_sep="_".
4. dummy_na | boolean | optional
Whether or not to append a new column that indicates a missing value. By default, dummy_na=False.
5. columns | array-like | optional
The labels of the columns that will be one-hot encoded. By default, columns=None, which encodes every column with object, string, or category dtype.
6. sparse | boolean | optional
Whether or not to use a SparseArray to represent the dummy-encoded columns. By default, sparse=False.
7. drop_first | boolean | optional
Whether or not to remove the first dummy-encoded column of each categorical variable. By default, drop_first=False.
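8. dtype | dtype | optional
The data type of the new dummy columns. By default, dtype=np.uint8 in pandas versions before 2.0; from pandas 2.0 onward, the default is bool. The outputs on this page show the pre-2.0 behavior (0s and 1s).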
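As a quick illustration of the dtype parameter, the following sketch requests float dummy columns explicitly, so the result does not depend on the default of the installed pandas version. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", "A"]})

# Ask for float dummy columns instead of relying on the default dtype
encoded = pd.get_dummies(df, columns=["group"], dtype=float)
print(encoded.dtypes)
```

Passing dtype=int in the same way should give integer 0s and 1s.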
Return Value
A DataFrame whose categorical variables have been one-hot encoded.
Examples
Basic usage
Consider the following DataFrame:
df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B","A"]})
df
    name  group
0   alex      A
1    bob      B
2  cathy      A
Here, the column group holds categorical variables. However, by default, all string columns are treated as categorical variables. This is undesirable in this case since we know that name is not a categorical variable:
pd.get_dummies(df)
   name_alex  name_bob  name_cathy  group_A  group_B
0          1         0           0        1        0
1          0         1           0        0        1
2          0         0           1        1        0
In order to specify that the group column is the categorical variable to one-hot encode, we just need to set the columns parameter, like so:
pd.get_dummies(df, columns=["group"])
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
Here, notice how the name column is not one-hot encoded.
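The columns parameter matters in the opposite direction too. In the sketch below, a hypothetical DataFrame (df3, with a made-up level column) stores its categories as integers. Numeric columns are left untouched by default, so they have to be listed in columns to be encoded. dtype=int is passed only to keep the output as 0s and 1s regardless of pandas version.

```python
import pandas as pd

# Hypothetical DataFrame whose categorical variable is stored as integers
df3 = pd.DataFrame({"name": ["alex", "bob", "cathy"], "level": [1, 2, 1]})

# Default: only the string column (name) is encoded; level stays numeric
default = pd.get_dummies(df3, dtype=int)

# Explicit: only level is encoded; name is left as-is
explicit = pd.get_dummies(df3, columns=["level"], dtype=int)

print(default.columns.tolist())
print(explicit.columns.tolist())
```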
One-hot encoding using a list
To build a one-hot encoded DataFrame from a list:
pd.get_dummies(["A","B","C","B"])
   A  B  C
0  1  0  0
1  0  1  0
2  0  0  1
3  0  1  0
We show df here again for your reference:
df
    name  group
0   alex      A
1    bob      B
2  cathy      A
Specifying prefix
By default, the column label of the categorical variables becomes the prefix of the new column labels:
pd.get_dummies(df, columns=["group"])
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
We can specify a custom prefix by setting the prefix parameter:
pd.get_dummies(df, columns=["group"], prefix="Group")
    name  Group_A  Group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
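The prefix parameter also accepts a dict, which is handy when several columns are encoded at once. The sketch below uses a hypothetical DataFrame (df2, with a made-up gender column) and arbitrary prefixes. dtype=int is passed only so that the values are 0s and 1s regardless of pandas version.

```python
import pandas as pd

# Hypothetical DataFrame with two categorical columns
df2 = pd.DataFrame({"group": ["A", "B", "A"], "gender": ["M", "F", "F"]})

# Map each encoded column to its own prefix
encoded = pd.get_dummies(
    df2,
    columns=["group", "gender"],
    prefix={"group": "G", "gender": "S"},
    dtype=int,
)
print(encoded.columns.tolist())
```

A list such as prefix=["G", "S"] should work the same way, provided it has one entry per encoded column.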
Specifying prefix_sep
By default, the separator between the prefix and value of the categorical variable is "_". We can change this to whatever we wish:
pd.get_dummies(df, columns=["group"], prefix_sep="@")
    name  group@A  group@B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
Specifying dummy_na
Consider the following DataFrame:
df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B",np.nan]})
df
    name  group
0   alex      A
1    bob      B
2  cathy    NaN
Here, we've got a missing value (NaN) for Cathy's group.
By default, dummy_na=False, which means that a missing value will result in all 0s for that row:
pd.get_dummies(df, columns=["group"])
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        0        0
A missing value can be treated as a category of its own if we set dummy_na=True like so:
pd.get_dummies(df, columns=["group"], dummy_na=True)
    name  group_A  group_B  group_nan
0   alex        1        0          0
1    bob        0        1          0
2  cathy        0        0          1
Notice how we have a new column called group_nan.
Specifying sparse
One-hot encoding, by nature, results in a sparse set of columns (i.e. many 0s). In order to reduce memory usage, we can choose to use SparseArray to store the one-hot encoded columns instead of conventional NumPy arrays.
The caveat is that SparseArray does not support as many operations as NumPy arrays, so only set sparse=True when you are dealing with a large DataFrame that causes memory issues.
Consider the same df as in the basic usage example:
df
    name  group
0   alex      A
1    bob      B
2  cathy      A
Here's the default dtype of the dummy-encoded columns:
pd.get_dummies(df, columns=["group"]).dtypes
name       object
group_A     uint8
group_B     uint8
dtype: object
Here's the dtype when we set sparse=True:
pd.get_dummies(df, columns=["group"], sparse=True).dtypes
name                 object
group_A    Sparse[uint8, 0]
group_B    Sparse[uint8, 0]
dtype: object
We see that the internal representation of the dummy-encoded columns has changed.
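To get a rough feel for the memory savings, the following sketch compares the bytes used by dense and sparse dummy columns. The dataset (big) is hypothetical, with 10,000 rows spread evenly over 50 made-up categories. The exact numbers depend on the pandas version and default dtype.

```python
import pandas as pd

# Hypothetical larger dataset: 10,000 rows spread over 50 categories
big = pd.DataFrame({"group": [f"c{i % 50}" for i in range(10_000)]})

dense = pd.get_dummies(big, columns=["group"])
sparse = pd.get_dummies(big, columns=["group"], sparse=True)

# Compare the bytes used by the dummy columns only (index excluded)
dense_bytes = dense.memory_usage(index=False).sum()
sparse_bytes = sparse.memory_usage(index=False).sum()
print(dense_bytes, sparse_bytes)
```

The gap grows with the number of categories, since each additional category adds a column that is almost entirely 0s.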
Specifying drop_first
Consider the following DataFrame:
df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B","A"]})
df
    name  group
0   alex      A
1    bob      B
2  cathy      A
By default, drop_first=False, which means that each category gets a column of its own:
pd.get_dummies(df, columns=["group"]) # drop_first=False
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
By setting drop_first=True, we drop one dummy-encoded column:
pd.get_dummies(df, columns=["group"], drop_first=True)
    name  group_B
0   alex        0
1    bob        1
2  cathy        0
The key here is that, even if we drop a single dummy-encoded column, we can still figure out what group a person belongs to.
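To make this concrete, here is a minimal sketch that recovers the original group labels from the single remaining column. It assumes there are exactly two groups (A and B), as in the example above. np.where is just one convenient way to do the mapping, and dtype=int is passed only so that the comparison with 1 reads naturally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", "A"]})
encoded = pd.get_dummies(df, columns=["group"], drop_first=True, dtype=int)

# With only two groups, a 0 in group_B can only mean group A
recovered = np.where(encoded["group_B"] == 1, "B", "A")
print(recovered.tolist())  # ['A', 'B', 'A']
```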