PySpark SQL Functions | collect_list method
PySpark SQL functions' collect_list(~) method returns a list of values in a column. Unlike collect_set(~), the returned list can contain duplicate values. Null values are ignored.
Parameters
1. col | string or Column object
The column label or a Column object.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Assume that the order of the returned list is non-deterministic, since it is affected by shuffle operations.
Examples
Consider the following PySpark DataFrame:
data = [("Alex", "A"), ("Alex", "B"), ("Bob", "A"), ("Cathy", "C"), ("Dave", None)]
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
Getting a list of column values in PySpark
To get a list of values in the group column:
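The following is a minimal sketch; it assumes the DataFrame df created above and the conventional import of the SQL functions module as F:

import pyspark.sql.functions as F

df.select(F.collect_list("group")).show()

The output should look roughly like the following (the order of values inside the list may vary because of shuffling):

+-------------------+
|collect_list(group)|
+-------------------+
|       [A, B, A, C]|
+-------------------+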
Notice the following:
- we have duplicate values (A)
- null values are ignored
Equivalently, you can pass in a Column object to collect_list(~) as well:
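For example, under the same assumptions (df and F as above):

df.select(F.collect_list(df.group)).show()

This produces the same result as passing the column label "group".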
Obtaining a standard list
To obtain a standard Python list instead:
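A sketch under the same assumptions (df and F as above); the exact ordering of the returned list may vary:

vals = df.select(F.collect_list("group")).collect()[0][0]
print(vals)   # e.g. ['A', 'B', 'A', 'C']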
Here, the collect() method returns the content of the PySpark DataFrame returned by select(~) as a list of Row objects. This list is guaranteed to be of length one because collect_list(~) collects all the values into a single list. The first [0] therefore extracts that single Row, and the second [0] extracts the list stored inside it.
Getting a list of column values for each group in PySpark
The method collect_list(~) is often used in the context of aggregation. Consider the same PySpark DataFrame as above:
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
To flatten the group column into a single list for each name:
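A minimal sketch, again assuming df and F from earlier:

df.groupBy("name").agg(F.collect_list("group")).show()

The output should look roughly like the following (the row order and the order of values inside each list may vary):

+-----+-------------------+
| name|collect_list(group)|
+-----+-------------------+
| Alex|             [A, B]|
|  Bob|                [A]|
|Cathy|                [C]|
| Dave|                 []|
+-----+-------------------+

Note that Dave ends up with an empty list because his only group value is null, and null values are ignored by collect_list(~).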