Counting frequency of values in PySpark DataFrame Column
Consider the following PySpark DataFrame:
+----+
|col1|
+----+
|   A|
|   A|
|   B|
+----+
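One way to create such a DataFrame is sketched below; the SparkSession named spark is an assumption and not part of the original snippet:

from pyspark.sql import SparkSession

# Reuses an existing session if one is already running
spark = SparkSession.builder.getOrCreate()

# Each tuple is one row; the single column is named col1
df = spark.createDataFrame([("A",), ("A",), ("B",)], ["col1"])
df.show()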
Counting frequency of values using aggregation (groupBy and count)
To count the frequency of values in column col1:
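A minimal sketch of this approach, using the df defined above:

# Frequency of each distinct value in col1
df.groupBy("col1").count().show()

# +----+-----+
# |col1|count|
# +----+-----+
# |   A|    2|
# |   B|    1|
# +----+-----+
# (row order may differ)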
Here, we are first grouping by the values in col1, and then for each group, we are counting the number of rows.
Sorting PySpark DataFrame by frequency counts
The resulting PySpark DataFrame is not sorted in any particular order by default. We can sort the DataFrame by the count column using the orderBy(~) method:
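For example, sorting in descending order of frequency (a sketch, reusing the grouped counts from above):

# Sort the aggregated counts so the most frequent value comes first
df.groupBy("col1").count().orderBy("count", ascending=False).show()

# +----+-----+
# |col1|count|
# +----+-----+
# |   A|    2|
# |   B|    1|
# +----+-----+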
Here, the output is similar to that of Pandas' value_counts(~) method, which returns the frequency counts in descending order.
Assigning label to count aggregate column
As an alternative to the groupBy(~) and count() approach above, we can use the agg(~) method, which takes as input an aggregate function:
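A sketch of this approach; the choice of F.count("*") as the aggregate function is an assumption, and my_count is the label discussed below:

from pyspark.sql import functions as F

# Count the rows of each group and label the aggregate column my_count
df.groupBy("col1").agg(F.count("*").alias("my_count")).show()

# +----+--------+
# |col1|my_count|
# +----+--------+
# |   A|       2|
# |   B|       1|
# +----+--------+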
This is more verbose than the solution using groupBy(~) and count(), but the advantage is that we can use the alias(~) method to assign a name to the resulting aggregate column - here the label is my_count instead of the default count.