PySpark DataFrame | groupBy method
PySpark DataFrame's groupBy(~) method groups the rows of the DataFrame by the specified columns. We can then compute statistics, such as the mean, for each of these groups.
Parameters
1. cols | string or Column or list | optional

The columns to group by. By default, all rows will be grouped together.
Return Value
The GroupedData object (pyspark.sql.group.GroupedData).
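Note that groupBy(~) returns a GroupedData object rather than a DataFrame, so an aggregation must be applied before rows can be retrieved. As a minimal sketch of the accepted argument forms (using a hypothetical two-column DataFrame), a string label, a Column object and a list all work:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["Alex", "IT"], ["Cathy", "HR"]], ["name", "department"])

g1 = df.groupBy("department")         # column label (string)
g2 = df.groupBy(F.col("department"))  # Column object
g3 = df.groupBy(["department"])       # list of labels

# All three return GroupedData, not a DataFrame - an aggregation
# such as count() or max() must be applied to get rows back:
print(type(g1))  # <class 'pyspark.sql.group.GroupedData'>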
Examples
Consider the following PySpark DataFrame:
["Bob", "IT", 24, 400],\ ["Cathy", "HR", 22, 600]],\ ["name", "department", "age", "salary"])
+-----+----------+---+------+
| name|department|age|salary|
+-----+----------+---+------+
| Alex|        IT| 20|   500|
|  Bob|        IT| 24|   400|
|Cathy|        HR| 22|   600|
+-----+----------+---+------+
Basic usage
By default, groupBy() without any arguments groups all rows together; aggregate methods like max() then compute a statistic over each numeric column:
df.groupby().max().show()
+--------+-----------+
|max(age)|max(salary)|
+--------+-----------+
|      24|        600|
+--------+-----------+
Grouping by a single column and computing a statistic over all columns of each group
To get the highest age and salary in each department:
df.groupBy("department").max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
|        IT|      24|        500|
|        HR|      22|        600|
+----------+--------+-----------+
Instead of referring to the column by its label (string), we can also pass a Column object created with pyspark.sql.functions.col(~):
from pyspark.sql import functions as F
df.groupBy(F.col("department")).max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
|        IT|      24|        500|
|        HR|      22|        600|
+----------+--------+-----------+
Grouping by a single column and computing a statistic over specific columns of each group
To get the highest age only, instead of all numeric columns:
df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      24|
|        HR|      22|
+----------+--------+
Equivalently, we can use the agg(~) method with one of the aggregate functions from pyspark.sql.functions:
df.groupby("department").agg(F.max("age")).show()
+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      24|
|        HR|      22|
+----------+--------+
The following aggregation methods are available on GroupedData:
agg, avg, count, max, mean, min, pivot, sum
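For example, count(~) (one of the methods listed above) returns the number of rows in each group; using the df defined earlier:

df.groupby("department").count().show()

This would print something like the following (row order may vary):

+----------+-----+
|department|count|
+----------+-----+
|        IT|    2|
|        HR|    1|
+----------+-----+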
Grouping by with aliases for the aggregated column
By default, computing the max age of each group will result in the column label max(age):
df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      24|
|        HR|      22|
+----------+--------+
To use an alias, we need to use the agg(~) method instead:
import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max_age")).show()
+----------+-------+
|department|max_age|
+----------+-------+
|        IT|     24|
|        HR|     22|
+----------+-------+
Grouping by and computing multiple statistics
To compute multiple statistics at once:
import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max"), F.min("age"), F.avg("salary")).show()
+----------+---+--------+-----------+
|department|max|min(age)|avg(salary)|
+----------+---+--------+-----------+
|        IT| 24|      20|      450.0|
|        HR| 22|      22|      600.0|
+----------+---+--------+-----------+
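As an aside, agg(~) also accepts a dict mapping column labels to aggregate function names, which is more compact when no aliases are needed (the dict form does not support alias(~)); a minimal sketch:

df.groupby("department").agg({"age": "max", "salary": "avg"}).show()

The resulting column labels take the same max(age)-style form shown above.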
Grouping by multiple columns and computing a statistic
Consider the following PySpark DataFrame:
["Bob", "junior", "IT", 24, 400],\ ["Cathy", "junior", "HR", 22, 600],\ ["Doge", "senior", "IT", 26, 800]],\ ["name", "position", "department", "age", "salary"])
+-----+--------+----------+---+------+
| name|position|department|age|salary|
+-----+--------+----------+---+------+
| Alex|  junior|        IT| 20|   500|
|  Bob|  junior|        IT| 24|   400|
|Cathy|  junior|        HR| 22|   600|
| Doge|  senior|        IT| 26|   800|
+-----+--------+----------+---+------+
To group by position and department, and then compute the max age of each of these groups:
df.groupby(["position", "department"]).max("age").show()
+--------+----------+--------+
|position|department|max(age)|
+--------+----------+--------+
|  junior|        IT|      24|
|  junior|        HR|      22|
|  senior|        IT|      26|
+--------+----------+--------+
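Multi-column grouping composes with agg(~) just as before, so aliases and multiple statistics per group remain available; a sketch combining the two:

import pyspark.sql.functions as F
df.groupby(["position", "department"]).agg(
    F.max("age").alias("max_age"),
    F.avg("salary").alias("avg_salary")
).show()

This would produce output along these lines (row order may vary):

+--------+----------+-------+----------+
|position|department|max_age|avg_salary|
+--------+----------+-------+----------+
|  junior|        IT|     24|     450.0|
|  junior|        HR|     22|     600.0|
|  senior|        IT|     26|     800.0|
+--------+----------+-------+----------+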