PySpark SQL Functions | count_distinct method
PySpark SQL Functions' count_distinct(~) method counts the number of distinct values in the specified columns.
Parameters
1. *cols | string or Column
The columns in which to count the number of distinct values.
Return Value
A PySpark Column holding an integer.
Examples
Consider the following PySpark DataFrame:
+-----+-----+
| name|class|
+-----+-----+
| Alex|    A|
|  Bob|    A|
|Cathy|    B|
+-----+-----+
Counting the number of distinct values in a single column in PySpark
To count the number of distinct values in the class column:
Here, we are giving the name "c" to the Column returned by count_distinct(~) via alias(~).
Note that we could also supply a Column object to count_distinct(~) instead:
Obtaining an integer count
By default, count_distinct(~) returns a PySpark Column. To get an integer count instead:
Here, we use the select(~) method to convert the Column into a PySpark DataFrame. We then use the collect(~) method to convert the DataFrame into a list of Row objects. Since this list contains only one Row, and that Row holds only one value, we use [0][0] to access the integer count.
Counting the number of distinct values in a set of columns in PySpark
To count the number of distinct combinations of values across the name and class columns: