Combining columns into a single column of arrays in PySpark DataFrame
To combine multiple columns into a single column of arrays in a PySpark DataFrame:
use the array(~) method in the pyspark.sql.functions library to combine non-array columns.
use the concat(~) method to combine multiple columns of type array together.
Combining columns of non-array values into a single column
Consider the following PySpark DataFrame:
+-----+-----+
|fname|lname|
+-----+-----+
| Alex| Jobs|
|  Bob|Miley|
|Cathy|  Lee|
+-----+-----+
To combine the columns fname and lname into a single column of arrays, use the array(~) method:
Here:
we are using the alias(~) method to assign a label to the combined column returned by array(~).
we convert the PySpark Column returned by array(~) into a PySpark DataFrame using the select(~) method so that we can display the new column content via the show() method.
The array(~) method accepts a variable number of arguments. This means that we can specify as many columns as we wish for merging:
F.array(col1, col2, col3)
We can see the data type of the merged column using the printSchema() method:
root
 |-- merged: array (nullable = false)
 |    |-- element: string (containsNull = true)
The output tells us that the merged column is of type array of strings.
Combining columns of arrays into a single column
Consider the following PySpark DataFrame containing two array-type columns:
+---+------+
|  A|     B|
+---+------+
|[a]|   [b]|
|[c]|[d, e]|
+---+------+
To combine columns A and B as a single column of arrays: