One-hot encoding in PySpark
To perform one-hot encoding in PySpark, we must:

- convert the categorical column into a numeric column (`0`, `1`, ...) using `StringIndexer`
- convert the numeric column into one-hot encoded columns using `OneHotEncoder`
One-hot encoding categorical columns as sparse vectors
Consider the following PySpark DataFrame:
```python
rows = [['Alex','B'], ['Bob','A'], ['Cathy','B'], ['Dave','C'], ['Eric','D']]
df = spark.createDataFrame(rows, ['name', 'class'])
df.show()
```
```
+-----+-----+
| name|class|
+-----+-----+
| Alex|    B|
|  Bob|    A|
|Cathy|    B|
| Dave|    C|
| Eric|    D|
+-----+-----+
```
Our goal is to one-hot encode the categorical column `class`.

The first step is to convert the `class` column into a numeric column using `StringIndexer`:
```python
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='class', outputCol='class_numeric')
indexer_fitted = indexer.fit(df)
df_indexed = indexer_fitted.transform(df)
```
```
+-----+-----+-------------+
| name|class|class_numeric|
+-----+-----+-------------+
| Alex|    B|          0.0|
|  Bob|    A|          1.0|
|Cathy|    B|          0.0|
| Dave|    C|          2.0|
| Eric|    D|          3.0|
+-----+-----+-------------+
```
Here, note the following:

- the `inputCol` argument is the label of the categorical column, while `outputCol` is the label of the new numerically encoded column.
- we need to call both the methods `fit(~)` and `transform(~)` on our PySpark DataFrame.
- the numeric category index that is assigned depends on the frequency of the category. By default `stringOrderType='frequencyDesc'`, which means that the class that occurs most often will be assigned the category index of `0`. In this case, class `B` occurs the most and so it is assigned a category index of `0`. You can reverse this by setting `stringOrderType='frequencyAsc'`.
- the `indexer_fitted` object has a `labels` property holding the mapped column labels: `indexer_fitted.labels` returns `['B', 'A', 'C', 'D']`.
Now that we have converted the categorical strings into category indexes, we can use PySpark's `OneHotEncoder` module to perform one-hot encoding:
```python
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=['class_numeric'], outputCols=['class_onehot'])
df_onehot = encoder.fit(df_indexed).transform(df_indexed)
```
```
+-----+-----+-------------+-------------+
| name|class|class_numeric| class_onehot|
+-----+-----+-------------+-------------+
| Alex|    B|          0.0|(3,[0],[1.0])|
|  Bob|    A|          1.0|(3,[1],[1.0])|
|Cathy|    B|          0.0|(3,[0],[1.0])|
| Dave|    C|          2.0|(3,[2],[1.0])|
| Eric|    D|          3.0|    (3,[],[])|
+-----+-----+-------------+-------------+
```
Here, after performing `OneHotEncoder`'s `fit(~)` and `transform(~)` on our PySpark DataFrame, we end up with a new column as specified by the `outputCols` argument. Since one-hot encoded vectors typically have a large number of zeroes, PySpark uses the (sparse) `vector` column type for one-hot encoding:

```
root
 |-- name: string (nullable = true)
 |-- class: string (nullable = true)
 |-- class_numeric: double (nullable = false)
 |-- class_onehot: vector (nullable = true)
```
A sparse vector is defined by three values (in order):

- `size`: the size of the vector (here, the number of categories minus one)
- `index`: the index in the vector that holds `value`
- `value`: the value at `index`
Let's take the vector `(3,[0],[1.0])` as an example. The size of the vector is 3 even though we have 4 unique categories (`A`, `B`, `C`, `D`) because one category is used as the base category - we will explain this part in a bit. The middle value `[0]` and the third value `[1.0]` mean that index position `0` in the vector should be filled with a `1.0`. All other values in the sparse vector are filled with zeroes. Since the vectors in this column represent one-hot encoded vectors, the third value will always be `1.0`.
Now, let's take a look at the last one-hot encoded vector, `(3,[],[])`. The second and third values are both empty (`[]`). This means that the vector is filled entirely with zeroes, that is, category `D` is treated as the base category. This is the reason why we can represent 4 unique categories with a vector of size 3.
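To make the sparse format concrete, here is a small plain-Python sketch (not part of PySpark - the helper name `to_dense` is our own) that expands a `(size, indices, values)` triple into the dense vector it represents:

```python
def to_dense(size, indices, values):
    # Start with a vector of zeroes, then fill in the non-zero entries.
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

to_dense(3, [0], [1.0])  # [1.0, 0.0, 0.0], i.e. class B
to_dense(3, [], [])      # [0.0, 0.0, 0.0], i.e. the base category D
```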
Note that we can still choose to represent our unique categories without using a base category by supplying the argument `dropLast=False`:
```python
encoder = OneHotEncoder(inputCols=['class_numeric'],
                        outputCols=['class_onehot'],
                        dropLast=False)
df_onehot_no_base = encoder.fit(df_indexed).transform(df_indexed)
```
```
+-----+-----+-------------+-------------+
| name|class|class_numeric| class_onehot|
+-----+-----+-------------+-------------+
| Alex|    B|          0.0|(4,[0],[1.0])|
|  Bob|    A|          1.0|(4,[1],[1.0])|
|Cathy|    B|          0.0|(4,[0],[1.0])|
| Dave|    C|          2.0|(4,[2],[1.0])|
| Eric|    D|          3.0|(4,[3],[1.0])|
+-----+-----+-------------+-------------+
```
Here, notice how the size of our vectors is `4` instead of `3`, and also how category `D` is now assigned its own index of `3`.
One-hot encoding categorical columns as a set of binary columns (dummy encoding)
The `OneHotEncoder` module encodes a numeric categorical column as a sparse vector, which is useful as input to PySpark's machine learning models such as decision trees (`DecisionTreeClassifier`).

However, you may want the one-hot encoding to be done in a similar way to Pandas' `get_dummies(~)` method, which produces a set of binary columns instead. In this section, we will convert the sparse vector into binary one-hot encoded columns.
We begin by converting the sparse vectors into arrays using the `vector_to_array(~)` method:
+-----+-----+-------------+-------------+---------------+| name|class|class_numeric| class_onehot| col_onehot|+-----+-----+-------------+-------------+---------------+| Alex| B| 0.0|(3,[0],[1.0])|[1.0, 0.0, 0.0]|| Bob| A| 1.0|(3,[1],[1.0])|[0.0, 1.0, 0.0]||Cathy| B| 0.0|(3,[0],[1.0])|[1.0, 0.0, 0.0]|| Dave| C| 2.0|(3,[2],[1.0])|[0.0, 0.0, 1.0]|| Eric| D| 3.0| (3,[],[])|[0.0, 0.0, 0.0]|+-----+-----+-------------+-------------+---------------+
Here, note the following:

- `'*'` refers to all columns in `df_onehot`.
- the `alias(~)` method assigns a label to the column returned by `vector_to_array(~)`.
Next, we will unpack this column of arrays into a set of columns:
```
+-----+-----+-------------+-------------+-------------+
| name|class|col_onehot[0]|col_onehot[1]|col_onehot[2]|
+-----+-----+-------------+-------------+-------------+
| Alex|    B|          1.0|          0.0|          0.0|
|  Bob|    A|          0.0|          1.0|          0.0|
|Cathy|    B|          1.0|          0.0|          0.0|
| Dave|    C|          0.0|          0.0|          1.0|
| Eric|    D|          0.0|          0.0|          0.0|
+-----+-----+-------------+-------------+-------------+
```
Here, note the following:

- we first fetch the number of categories. The `first(~)` method returns the first row as a `Row` object, and the length of an array in the `col_onehot` column represents the number of categories (minus one, since one category serves as the base category).
- we then use a list comprehension to obtain a list of binary columns. `F.col('col_onehot')[2]`, for instance, returns a `Column` holding the 3rd value of each array.
- the `*` in `*cols_expanded` unpacks the list of `Column` objects into positional arguments.
Finally, notice how the encoded binary columns have awkward labels like `col_onehot[0]` by default. We can convert their labels to the corresponding categorical labels by slightly tweaking the line of the previous snippet that builds the expanded columns:
```
+-----+-----+---+---+---+
| name|class|  B|  A|  C|
+-----+-----+---+---+---+
| Alex|    B|1.0|0.0|0.0|
|  Bob|    A|0.0|1.0|0.0|
|Cathy|    B|1.0|0.0|0.0|
| Dave|    C|0.0|0.0|1.0|
| Eric|    D|0.0|0.0|0.0|
+-----+-----+---+---+---+
```
Here, we are using the PySpark column's `alias(~)` method to assign the original categorical labels given by `indexer_fitted.labels`:

```python
indexer_fitted.labels   # ['B', 'A', 'C', 'D']
```