Pandas DataFrame | groupby method
Start your free 7-days trial now!
Pandas's DataFrame.groupby(~)
divides up your DataFrame into groups based on the specified criteria. The returned value is useful because it allows you to compute statistics (e.g. mean and min) and apply transformations group-wise.
Parameters
1. by
| scalar
or array-like
or dict
The criteria by which to divide up the DataFrame.
2. axis
| int
or string
| optional
Whether to divide the DataFrame into columns or rows:
Axis | Description |
---|---|
| DataFrame will be divided into columns. |
| DataFrame will be divided into rows. |
By default, axis=0
.
3. level
| int
or string
| optional
The level to target. This is only relevant if the source DataFrame is multi-index. By default, level=None
.
4. as_index
link | boolean
| optional
Whether or not the group labels is to be used as the index of the resulting DataFrame. By default, as_index=True
.
5. sort
link | boolean
| optional
Whether or not to sort the groups by group labels. By default, sort=True
. For performance gains, consider passing False
when this behaviour is not needed.
6. group_keys
link | boolean
| optional
Whether or not to include the group labels in the index if the aggregation function we apply changes the index. Check examples below for clarification. By default, group_keys=True
.
7. squeeze
link | boolean
| optional
Whether or not to return a simplified type if possible. Consult examples below for clarification, By default, squeeze=False
.
Return Value
A DataFrameGroupBy
object.
Examples
Consider the following DataFrame:
df = pd.DataFrame({"price":[200,300,700,900], "brand": ["apple","google","apple","google"], "device":["phone","phone","computer","phone"]})df
price brand device0 200 apple phone1 300 google phone2 700 apple computer3 900 google phone
Grouping by a single column
To divide up the DataFrame using the categories of the brand
column:
groups_brand = df.groupby("brand") # Returns a groupby object
We can look at the partitions using the property groups
:
df.groupby("brand").groups
{'apple': Int64Index([0, 2], dtype='int64'), 'google': Int64Index([1, 3], dtype='int64')}
As we can see, the DataFrame was divided up into two groups: apple
brand and google
brand. The rows corresponding to apple
brand are index 0
and index 2
, whereas those corresponding to google
brand are index 1
and 3
.
There are many things you can do with this returned groupby
object, such as computing the mean price of each brand:
pricebrandapple 450google 600
Notice how the mean
of the price column was computed, while that of the device
column was not, even though we did not explicitly specify the columns. This is because the aggregate functions only apply to numeric columns only.
Grouping by multiple columns
To determine the mean price of apple phones, apple computers, google phones and google computers:
pricebrand device apple computer 700 phone 200google phone 600
Note that google
computers do not exist in df
, so that's why you don't see them in the output.
Just for your reference, we show df
here again:
df
price brand device0 200 apple phone1 300 google phone2 700 apple computer3 900 google phone
Iterating through the groupby object
To iterate through all groups of a groupby
object:
Using aggregate functions
To compute the mean price of apple
and google
devices:
pricebrandapple 450google 600
Note that the groupby
object has instance methods for common aggregates:
pricebrand apple 450google 600
You can also pass in a list of functions for agg
to compute multiple aggregates:
Click here for more information about aggregate functions.
Applying transformation group-wise
Here is df
again for your reference:
df
price brand device0 200 apple phone1 300 google phone2 700 apple computer3 900 google phone
To apply a transformation group-wise, use transform(~)
:
Note the following:
this is a completely arbitrary example where we are shifting each value by the maximum of the group it belongs in.
we obtain
-500
in index0
because the maximum price of the groupbrand=apple
is700
, and so200-700=-500
.the argument passed into our function (
col
) is of typeSeries
, and it represents a single column of a group.the return type of the entire code snippet is a
DataFrame
.
On a more practical note, we often apply transformations group-wise for standardisation.
Including only a subset of columns in the returned result
Consider the following DataFrame:
df = pd.DataFrame({"price":[200,300,700,900], "rating":[3,4,5,3], "brand":["apple","google","apple","google"]})df
pricing rating brand0 200 3 apple1 300 4 google2 700 5 apple3 900 3 google
We have two numerical columns here: price
and rating
. By default, calling aggregates like mean()
results in computing the aggregates for all numeric columns. For instance:
price ratingbrand apple 450.0 4.0google 600.0 3.5
To compute aggregates for only the price
column, you can use []
notation directly on the groupby
object:
brandapple 450google 600Name: price, dtype: int64
Using keyword arguments to name columns
When using the agg
method, you can assign column labels to the resulting DataFrame by providing a keyword argument:
mean_pricebrand apple 450google 600
For this to work, you must specify which column to aggregate ("price"
in this case).
Specifying as_index
Consider the following DataFrame:
df
price branda 200 Cb 300 Bc 700 Ad 900 B
By default, as_index=True
, which means that the group labels will be used as the index of the resulting DataFrame:
pricebrand A 700B 600C 200
Setting as_index=False
would set the group labels as a column instead:
brand price0 A 7001 B 6002 C 200
Notice how the index is the default integer index ([0,1,2]
).
Specifying sort
Consider the following DataFrame:
df
price branda 200 Cb 300 Bc 700 Ad 900 B
By default, sort=True
, which means that the group labels will be sorted:
pricebrand A 700B 600C 200
Notice how the index of the resulting DataFrame is sorted in ascending order.
If we don't need this behaviour, then, set sort=False
like so:
pricebrand C 200B 600A 700
Specifying group_keys
Consider the following DataFrame:
df = pd.DataFrame({"price":[200,300,700, 900], "brand":["apple","google","apple","google"]}, index=["a","b","c","d"])df
price branda 200 appleb 300 googlec 700 appled 900 google
By default, group_keys=True
, which means that group names will be in the index of the resulting DataFrame:
index price brandbrand apple 0 a 200 apple 1 c 700 applegoogle 0 b 300 google 1 d 900 google
Note that group_keys
takes effect only when the function we apply changes the index of the resulting DataFrame, just as with reset_index(~)
.
Setting group_keys=False
would remove the group names from the resulting DataFrame:
index price brand0 a 200 apple1 c 700 apple0 b 300 google1 d 900 google
Notice how the brands are no longer present in the index.
Specifying squeeze
By default, squeeze=False
, which means that the return type will not be simplified even when possible.
Consider the following DataFrame:
df
A0 21 3
This is a completely arbitrary example, but suppose we group by column A
, and then for each group, apply a function that literally just returns a Series containing "b"
:
Here, the return type is DataFrame
, which could be simplified down to a Series
.
To simplify our return type to Series
, set squeeze=True
in groupby(~)
like so:
Here, the return type is Series
.