df = spark.createDataFrame([["Alex", "A"], ["Bob", "A"], ["Cathy", "B"]], ["name", "class"])
df.show()

+-----+-----+
| name|class|
+-----+-----+
| Alex|    A|
|  Bob|    A|
|Cathy|    B|
+-----+-----+

Counting the number of distinct values in a single column in PySpark

To count the number of distinct values in the class column:

from pyspark.sql import functions as F
df.select(F.count_distinct("class").alias("c")).show()

+---+
|  c|
+---+
|  2|
+---+

Here, alias(~) assigns the name "c" to the Column returned by count_distinct(~).

Note that we could also supply a Column object to count_distinct(~) instead:

df.select(F.count_distinct(df["class"]).alias("c")).show()

+---+
|  c|
+---+
|  2|
+---+

Obtaining an integer count

By default, count_distinct(~) returns a PySpark Column. To get an integer count instead:

df.select(F.count_distinct(df["class"])).collect()[0][0]

2

Here, we use the select(~) method to convert the Column into a PySpark DataFrame. We then use the collect(~) method to convert the DataFrame into a list of Row objects. Since this list contains only one Row, and that Row holds only one value, we use [0][0] to access the integer count.

Counting the number of distinct values in a set of columns in PySpark

To count the number of distinct combinations of values in the columns name and class:

df.select(F.count_distinct("name", "class").alias("c")).show()

+---+
|  c|
+---+
|  3|
+---+
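Since every name in the DataFrame above is unique, the result of 3 does not show by itself that pairs are being counted. The plain-Python sketch below mirrors the logic on a small hypothetical list of rows (not the DataFrame above) that contains repeated names. The helper count_distinct_rows is made up for illustration; it skips rows containing None in the selected positions, following Spark's documented behavior of counting only combinations that are all non-null.

```python
# Hypothetical (name, class) rows, chosen to include duplicates and a missing value
rows = [("Alex", "A"), ("Alex", "A"), ("Alex", "B"), ("Bob", "A"), ("Cathy", None)]

def count_distinct_rows(rows, *indexes):
    """Count distinct combinations of the selected positions, skipping rows with a None in them."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in indexes)
        if any(value is None for value in key):
            continue  # a combination with a missing value is not counted
        seen.add(key)
    return len(seen)

names = count_distinct_rows(rows, 0)     # Alex, Bob, Cathy
classes = count_distinct_rows(rows, 1)   # A, B
pairs = count_distinct_rows(rows, 0, 1)  # (Alex, A), (Alex, B), (Bob, A)

print(names, classes, pairs)
```

Note that the pair count is not the product or the sum of the single-column counts; it depends on which combinations actually occur together in the data.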

Published by Isshin Inada


Official PySpark Documentation

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.count_distinct.html
