Combining columns into a single column of arrays in PySpark DataFrame
To combine multiple columns into a single column of arrays in PySpark DataFrame:
use the array(~) method in the pyspark.sql.functions library to combine non-array columns.
use the concat(~) method to combine multiple columns of type array together.
Combining columns of non-array values into a single column
Consider the following PySpark DataFrame:
+-----+-----+
|fname|lname|
+-----+-----+
| Alex| Jobs|
|  Bob|Miley|
|Cathy|  Lee|
+-----+-----+
To combine the columns fname and lname into a single column of arrays, use the array(~) method:
Here:
we are using the alias(~) method to assign a label to the combined column returned by array(~).
we convert the PySpark Column returned by array(~) into a PySpark DataFrame using the select(~) method so that we can display the new column content via the show() method.
The array(~) method accepts a variable number of arguments. This means that we can pass as many columns as we wish to merge:
F.array(col1, col2, col3)
We can see the data type of the merged column using the printSchema() method:
root
 |-- merged: array (nullable = false)
 |    |-- element: string (containsNull = true)
The output tells us that the merged column is of type array of strings.
Combining columns of arrays into a single column
Consider the following PySpark DataFrame containing two array-type columns:
+---+------+
|  A|     B|
+---+------+
|[a]|   [b]|
|[c]|[d, e]|
+---+------+
To combine columns A and B into a single column of arrays, use the concat(~) method: