One-hot encoding in PySpark
To perform one-hot encoding in PySpark, we must:

- convert the categorical column into a numeric column (`0`, `1`, ...) using `StringIndexer`
- convert the numeric column into one-hot encoded columns using `OneHotEncoder`
One-hot encoding categorical columns as sparse vectors
Consider the following PySpark DataFrame:
```python
rows = [['Alex','B'], ['Bob','A'], ['Cathy','B'], ['Dave','C'], ['Eric','D']]
df = spark.createDataFrame(rows, ['name', 'class'])
df.show()
```
```
+-----+-----+
| name|class|
+-----+-----+
| Alex|    B|
|  Bob|    A|
|Cathy|    B|
| Dave|    C|
| Eric|    D|
+-----+-----+
```
Our goal is to one-hot encode the categorical column `class`.

The first step is to convert the `class` column into a numeric column using `StringIndexer`:
```python
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='class', outputCol='class_numeric')
indexer_fitted = indexer.fit(df)
df_indexed = indexer_fitted.transform(df)
```
```
+-----+-----+-------------+
| name|class|class_numeric|
+-----+-----+-------------+
| Alex|    B|          0.0|
|  Bob|    A|          1.0|
|Cathy|    B|          0.0|
| Dave|    C|          2.0|
| Eric|    D|          3.0|
+-----+-----+-------------+
```
Here, note the following:

- the `inputCol` argument is the label of the categorical column, while `outputCol` is the label of the new numerically encoded column.
- we need to call both the methods `fit(~)` and `transform(~)` on our PySpark DataFrame.
- the numeric category index that is assigned depends on the frequency of the category. By default `stringOrderType='frequencyDesc'`, which means that the class that occurs most often will be assigned the category index of `0`. In this case, class `B` occurs the most and so it is assigned a category index of `0`. You can reverse this by setting `stringOrderType='frequencyAsc'`.
- the `indexer_fitted` object has a `labels` property holding the mapped column labels: `indexer_fitted.labels` returns `['B', 'A', 'C', 'D']`.
Now that we have converted the categorical strings into category indexes, we can use PySpark's `OneHotEncoder` module to perform one-hot encoding:
```python
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=['class_numeric'], outputCols=['class_onehot'])
df_onehot = encoder.fit(df_indexed).transform(df_indexed)
```
```
+-----+-----+-------------+-------------+
| name|class|class_numeric| class_onehot|
+-----+-----+-------------+-------------+
| Alex|    B|          0.0|(3,[0],[1.0])|
|  Bob|    A|          1.0|(3,[1],[1.0])|
|Cathy|    B|          0.0|(3,[0],[1.0])|
| Dave|    C|          2.0|(3,[2],[1.0])|
| Eric|    D|          3.0|    (3,[],[])|
+-----+-----+-------------+-------------+
```
Here, after performing `OneHotEncoder`'s `fit(~)` and `transform(~)` on our PySpark DataFrame, we end up with a new column as specified by the `outputCols` argument. Since one-hot encoded vectors typically have a large number of zeroes, PySpark uses the (sparse) `vector` column type for one-hot encoding:

```
root
 |-- name: string (nullable = true)
 |-- class: string (nullable = true)
 |-- class_numeric: double (nullable = false)
 |-- class_onehot: vector (nullable = true)
```
A sparse vector is defined by three values (in order):

- `size`: the size of the vector (here, the number of categories minus one)
- `index`: the index in the vector that holds `value`
- `value`: the value at `index`
Let's take the vector `(3,[0],[1.0])` as an example. The size of the vector is 3 even though we have 4 unique categories (`A`, `B`, `C`, `D`) because one category is used as the base category - we will explain this part in a bit. The middle value `[0]` and the third value `[1.0]` mean that index position `0` in the vector should be filled with a `1.0`. All other values in the sparse vector are filled with zeroes. Since the vectors in this column represent one-hot encoded vectors, the third value will always be `1.0`.
Now, let's take a look at the last one-hot encoded vector, `(3,[],[])`. The second and third values are both empty (`[]`). This means that the vector is filled entirely with zeroes, that is, category `D` is treated as the base category. This is the reason why we can represent 4 unique categories with a vector of size 3.
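To make the sparse format concrete, here is a small plain-Python sketch (not part of PySpark - the helper name `to_dense` is our own) that expands a `(size, indices, values)` triple into the dense vector it represents:

```python
def to_dense(size, indices, values):
    # Start with a vector of zeroes, then fill in the non-zero entries.
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

to_dense(3, [0], [1.0])  # [1.0, 0.0, 0.0], i.e. class B
to_dense(3, [], [])      # [0.0, 0.0, 0.0], i.e. the base category D
```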
Note that we can still choose to represent our unique categories without using a base category by supplying the argument `dropLast=False`:
```python
encoder = OneHotEncoder(inputCols=['class_numeric'],
                        outputCols=['class_onehot'],
                        dropLast=False)
df_onehot_no_base = encoder.fit(df_indexed).transform(df_indexed)
```
```
+-----+-----+-------------+-------------+
| name|class|class_numeric| class_onehot|
+-----+-----+-------------+-------------+
| Alex|    B|          0.0|(4,[0],[1.0])|
|  Bob|    A|          1.0|(4,[1],[1.0])|
|Cathy|    B|          0.0|(4,[0],[1.0])|
| Dave|    C|          2.0|(4,[2],[1.0])|
| Eric|    D|          3.0|(4,[3],[1.0])|
+-----+-----+-------------+-------------+
```
Here, notice how the size of our vectors is `4` instead of `3`, and also how category `D` is now assigned its own index of `3`.
One-hot encoding categorical columns as a set of binary columns (dummy encoding)
The `OneHotEncoder` module encodes a numeric categorical column as a sparse vector, which is useful as input to PySpark's machine learning models such as decision trees (`DecisionTreeClassifier`).

However, you may want the one-hot encoding to be done in a similar way to Pandas' `get_dummies(~)` method, which produces a set of binary columns instead. In this section, we will convert the sparse vector into binary one-hot encoded columns.
We begin by converting the sparse vectors into arrays using the `vector_to_array(~)` method:
+-----+-----+-------------+-------------+---------------+| name|class|class_numeric| class_onehot| col_onehot|+-----+-----+-------------+-------------+---------------+| Alex| B| 0.0|(3,[0],[1.0])|[1.0, 0.0, 0.0]|| Bob| A| 1.0|(3,[1],[1.0])|[0.0, 1.0, 0.0]||Cathy| B| 0.0|(3,[0],[1.0])|[1.0, 0.0, 0.0]|| Dave| C| 2.0|(3,[2],[1.0])|[0.0, 0.0, 1.0]|| Eric| D| 3.0| (3,[],[])|[0.0, 0.0, 0.0]|+-----+-----+-------------+-------------+---------------+
Here, note the following:

- `'*'` refers to all columns in `df_onehot`.
- the `alias(~)` method assigns a label to the column returned by `vector_to_array(~)`.
Next, we will unpack this column of arrays into a set of columns:
```
+-----+-----+-------------+-------------+-------------+
| name|class|col_onehot[0]|col_onehot[1]|col_onehot[2]|
+-----+-----+-------------+-------------+-------------+
| Alex|    B|          1.0|          0.0|          0.0|
|  Bob|    A|          0.0|          1.0|          0.0|
|Cathy|    B|          1.0|          0.0|          0.0|
| Dave|    C|          0.0|          0.0|          1.0|
| Eric|    D|          0.0|          0.0|          0.0|
+-----+-----+-------------+-------------+-------------+
```
Here, note the following:

- we first fetch the number of categories. The `first(~)` method returns the first row as a `Row` object, and the length of an array in the `col_onehot` column represents the number of categories (minus one, since one category serves as the base category).
- we then use a list comprehension to obtain a list of binary columns. `F.col('col_onehot')[2]`, for instance, returns a `Column` holding the 3rd value of each array.
- the `*` in `*cols_expanded` unpacks the list of `Column` objects into positional arguments.
Finally, notice how the encoded binary columns have awkward labels like `col_onehot[0]` by default. We can convert their labels to the corresponding categorical labels by slightly tweaking the line of the previous snippet that builds the expanded columns:
```
+-----+-----+---+---+---+
| name|class|  B|  A|  C|
+-----+-----+---+---+---+
| Alex|    B|1.0|0.0|0.0|
|  Bob|    A|0.0|1.0|0.0|
|Cathy|    B|1.0|0.0|0.0|
| Dave|    C|0.0|0.0|1.0|
| Eric|    D|0.0|0.0|0.0|
+-----+-----+---+---+---+
```
Here, we are using the PySpark column's `alias(~)` method to assign the original categorical labels given by `indexer_fitted.labels`:

```python
indexer_fitted.labels   # ['B', 'A', 'C', 'D']
```