PySpark SQL Functions | count_distinct method
PySpark SQL Functions' count_distinct(~) method counts the number of distinct values in the specified columns.
Parameters
1. *cols | string or Column
The columns in which to count the number of distinct values.
Return Value
A PySpark Column holding an integer.
Examples
Consider the following PySpark DataFrame:
+-----+-----+
| name|class|
+-----+-----+
| Alex|    A|
|  Bob|    A|
|Cathy|    B|
+-----+-----+
Counting the number of distinct values in a single column in PySpark
To count the number of distinct values in the class column:
Here, we give the name "c" to the Column returned by count_distinct(~) via alias(~).
Note that we could also supply a Column object to count_distinct(~) instead:
Obtaining an integer count
By default, count_distinct(~) returns a PySpark Column. To get an integer count instead:
Here, we use the select(~) method to convert the Column into a PySpark DataFrame. We then use the collect(~) method to convert the DataFrame into a list of Row objects. Since there is only one Row in this list and only one value in that Row, we use [0][0] to access the integer count.
Counting the number of distinct values in a set of columns in PySpark
To count the number of distinct combinations of values across the columns name and class: