search
Search
Login
Unlock 100+ guides
menu
menu
web
search toc
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
What does this mean?
Why is this true?
Give me some examples!
search
keyboard_voice
close
Searching Tips
Search for a recipe:
"Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to

Pandas DataFrame | groupby method

schedule Aug 10, 2023
Last updated
local_offer
PythonPandas
Tags
mode_heat
Master the mathematics behind data science with 100+ top-tier guides
Start your free 7-days trial now!

Pandas's DataFrame.groupby(~) divides up your DataFrame into groups based on the specified criteria. The returned value is useful because it allows you to compute statistics (e.g. mean and min) and apply transformations group-wise.

Parameters

1. by | scalar or array-like or dict

The criteria by which to divide up the DataFrame.

2. axis | int or string | optional

Whether to divide the DataFrame into columns or rows:

Axis

Description

0 or "index"

DataFrame will be divided into columns.

1 or "columns"

DataFrame will be divided into rows.

By default, axis=0.

3. level | int or string | optional

The level to target. This is only relevant if the source DataFrame is multi-index. By default, level=None.

4. as_indexlink | boolean | optional

Whether or not the group labels is to be used as the index of the resulting DataFrame. By default, as_index=True.

5. sortlink | boolean | optional

Whether or not to sort the groups by group labels. By default, sort=True. For performance gains, consider passing False when this behaviour is not needed.

6. group_keyslink | boolean | optional

Whether or not to include the group labels in the index if the aggregation function we apply changes the index. Check examples below for clarification. By default, group_keys=True.

7. squeezelink | boolean | optional

Whether or not to return a simplified type if possible. Consult examples below for clarification, By default, squeeze=False.

Return Value

A DataFrameGroupBy object.

Examples

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand": ["apple","google","apple","google"], "device":["phone","phone","computer","phone"]})
df
price brand device
0 200 apple phone
1 300 google phone
2 700 apple computer
3 900 google phone

Grouping by a single column

To divide up the DataFrame using the categories of the brand column:

groups_brand = df.groupby("brand") # Returns a groupby object

We can look at the partitions using the property groups:

df.groupby("brand").groups
{'apple': Int64Index([0, 2], dtype='int64'),
'google': Int64Index([1, 3], dtype='int64')}

As we can see, the DataFrame was divided up into two groups: apple brand and google brand. The rows corresponding to apple brand are index 0 and index 2, whereas those corresponding to google brand are index 1 and 3.

There are many things you can do with this returned groupby object, such as computing the mean price of each brand:

groups_brand.mean() # Returns a Series
price
brand
apple 450
google 600

Notice how the mean of the price column was computed, while that of the device column was not, even though we did not explicitly specify the columns. This is because the aggregate functions only apply to numeric columns only.

Grouping by multiple columns

To determine the mean price of apple phones, apple computers, google phones and google computers:

df.groupby(["brand","device"]).mean()
price
brand device
apple computer 700
phone 200
google phone 600

Note that google computers do not exist in df, so that's why you don't see them in the output.

Just for your reference, we show df here again:

df
price brand device
0 200 apple phone
1 300 google phone
2 700 apple computer
3 900 google phone

Iterating through the groupby object

To iterate through all groups of a groupby object:

for group_name, group in df.groupby("brand"):
print("group_name:", group_name)
print(group) # DataFrame
group_name: apple
price brand device
0 200 apple phone
2 700 apple computer
group_name: google
price brand device
1 300 google phone
3 900 google phone

Using aggregate functions

To compute the mean price of apple and google devices:

df.groupby("brand").agg("mean") # Returns a DataFrame
price
brand
apple 450
google 600

Note that the groupby object has instance methods for common aggregates:

df.groupby("brand").mean()
price
brand
apple 450
google 600

You can also pass in a list of functions for agg to compute multiple aggregates:

df.groupby("brand").agg(["mean", np.max])
price
mean amax
brand
apple 450 700
google 600 900

Click here for more information about aggregate functions.

Applying transformation group-wise

Here is df again for your reference:

df
price brand device
0 200 apple phone
1 300 google phone
2 700 apple computer
3 900 google phone

To apply a transformation group-wise, use transform(~):

df.groupby("brand").transform(lambda col: col - col.max())
price
0 -500
1 -600
2 0
3 0

Note the following:

  • this is a completely arbitrary example where we are shifting each value by the maximum of the group it belongs in.

  • we obtain -500 in index 0 because the maximum price of the group brand=apple is 700, and so 200-700=-500.

  • the argument passed into our function (col) is of type Series, and it represents a single column of a group.

  • the return type of the entire code snippet is a DataFrame.

On a more practical note, we often apply transformations group-wise for standardisation.

Including only a subset of columns in the returned result

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "rating":[3,4,5,3], "brand":["apple","google","apple","google"]})
df
pricing rating brand
0 200 3 apple
1 300 4 google
2 700 5 apple
3 900 3 google

We have two numerical columns here: price and rating. By default, calling aggregates like mean() results in computing the aggregates for all numeric columns. For instance:

df.groupby("brand").mean() # Returns a DataFrame
price rating
brand
apple 450.0 4.0
google 600.0 3.5

To compute aggregates for only the price column, you can use [] notation directly on the groupby object:

df.groupby("brand")["price"].mean() # Returns a Series
brand
apple 450
google 600
Name: price, dtype: int64

Using keyword arguments to name columns

When using the agg method, you can assign column labels to the resulting DataFrame by providing a keyword argument:

df.groupby("brand")["price"].agg(mean_price="mean")
mean_price
brand
apple 450
google 600

For this to work, you must specify which column to aggregate ("price" in this case).

Specifying as_index

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand": ["C","B","A","B"]}, index=["a","b","c","d"])
df
price brand
a 200 C
b 300 B
c 700 A
d 900 B

By default, as_index=True, which means that the group labels will be used as the index of the resulting DataFrame:

df.groupby("brand", as_index=True).mean()
price
brand
A 700
B 600
C 200

Setting as_index=False would set the group labels as a column instead:

df.groupby("brand", as_index=False).mean()
brand price
0 A 700
1 B 600
2 C 200

Notice how the index is the default integer index ([0,1,2]).

Specifying sort

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand":["C","B","A","B"]}, index=["a","b","c","d"])
df
price brand
a 200 C
b 300 B
c 700 A
d 900 B

By default, sort=True, which means that the group labels will be sorted:

df.groupby("brand", sort=True).mean()
price
brand
A 700
B 600
C 200

Notice how the index of the resulting DataFrame is sorted in ascending order.

If we don't need this behaviour, then, set sort=False like so:

df.groupby("brand", sort=False).mean()
price
brand
C 200
B 600
A 700

Specifying group_keys

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700, 900], "brand":["apple","google","apple","google"]}, index=["a","b","c","d"])
df
price brand
a 200 apple
b 300 google
c 700 apple
d 900 google

By default, group_keys=True, which means that group names will be in the index of the resulting DataFrame:

df.groupby("brand", group_keys=True).apply(lambda x: x.reset_index())
index price brand
brand
apple 0 a 200 apple
1 c 700 apple
google 0 b 300 google
1 d 900 google

Note that group_keys takes effect only when the function we apply changes the index of the resulting DataFrame, just as with reset_index(~).

Setting group_keys=False would remove the group names from the resulting DataFrame:

df.groupby("brand", group_keys=False).apply(lambda x: x.reset_index())
index price brand
0 a 200 apple
1 c 700 apple
0 b 300 google
1 d 900 google

Notice how the brands are no longer present in the index.

Specifying squeeze

By default, squeeze=False, which means that the return type will not be simplified even when possible.

Consider the following DataFrame:

df = pd.DataFrame({"A":[2,3]})
df
A
0 2
1 3

This is a completely arbitrary example, but suppose we group by column A, and then for each group, apply a function that literally just returns a Series containing "b":

df.groupby("A").apply(lambda x: pd.Series(["b"]))
0
A
2 b
3 b

Here, the return type is DataFrame, which could be simplified down to a Series.

To simplify our return type to Series, set squeeze=True in groupby(~) like so:

df.groupby("A", squeeze=True).apply(lambda x: pd.Series(["b"]))
0 b
0 b
dtype: object

Here, the return type is Series.

robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
thumb_up
thumb_down
Comment
Citation
Ask a question or leave a feedback...