Whether or not to sort the groups by group labels. By default, sort=True. For performance gains, consider passing False when this behaviour is not needed.

6. group_keyslink | boolean | optional

Whether or not to include the group labels in the index if the aggregation function we apply changes the index. Check examples below for clarification. By default, group_keys=True.

7. squeezelink | boolean | optional

Whether or not to return a simplified type if possible. Consult examples below for clarification, By default, squeeze=False.

Return Value

A DataFrameGroupBy object.

Examples

Consider the following DataFrame:


        
        
            
                
                
                    df = pd.DataFrame({"price":[200,300,700,900], "brand": ["apple","google","apple","google"], "device":["phone","phone","computer","phone"]})
df
                
            
               price  brand   device
0  200    apple   phone
1  300    google  phone
2  700    apple   computer
3  900    google  phone

Grouping by a single column

To divide up the DataFrame using the categories of the brand column:


        
        
            
                
                
                    groups_brand = df.groupby("brand")      # Returns a groupby object

We can look at the partitions using the property groups:


        
        
            
                
                
                    df.groupby("brand").groups
                
            
            {'apple': Int64Index([0, 2], dtype='int64'),
 'google': Int64Index([1, 3], dtype='int64')}

As we can see, the DataFrame was divided up into two groups: apple brand and google brand. The rows corresponding to apple brand are index 0 and index 2, whereas those corresponding to google brand are index 1 and 3.

There are many things you can do with this returned groupby object, such as computing the mean price of each brand:


        
        
            
                
                
                    groups_brand.mean()   # Returns a Series
                
            
                    price
brand
apple   450
google  600

Notice how the mean of the price column was computed, while that of the device column was not, even though we did not explicitly specify the columns. This is because the aggregate functions only apply to numeric columns only.

Grouping by multiple columns

To determine the mean price of apple phones, apple computers, google phones and google computers:


        
        
            
                
                
                    df.groupby(["brand","device"]).mean()
                
            
                                price
brand    device	
apple    computer   700
         phone      200
google   phone      600

Note that google computers do not exist in df, so that's why you don't see them in the output.

Just for your reference, we show df here again:


        
        
            
                
                
                    df
                
            
               price  brand   device
0   200   apple   phone
1   300   google  phone
2   700   apple   computer
3   900   google  phone

Iterating through the groupby object

To iterate through all groups of a groupby object:


        
        
            
                
                
                    for group_name, group in df.groupby("brand"):
   print("group_name:", group_name)
   print(group)   # DataFrame
                
            
            group_name: apple
   price  brand  device
0   200   apple  phone
2   700   apple  computer
group_name: google
   price  brand   device
1   300   google  phone
3   900   google  phone

Using aggregate functions

To compute the mean price of apple and google devices:


        
        
            
                
                
                    df.groupby("brand").agg("mean")   # Returns a DataFrame
                
            
                    price
brand
apple   450
google  600

Note that the groupby object has instance methods for common aggregates:


        
        
            
                
                
                    df.groupby("brand").mean()
                
            
                    price
brand	
apple   450
google  600

You can also pass in a list of functions for agg to compute multiple aggregates:


        
        
            
                
                
                    df.groupby("brand").agg(["mean", np.max])
                
            
                    price
        mean  amax
brand      
apple   450   700
google  600   900

Click here for more information about aggregate functions.

Applying transformation group-wise

Here is df again for your reference:


        
        
            
                
                
                    df
                
            
               price  brand   device
0   200   apple   phone
1   300   google  phone
2   700   apple   computer
3   900   google  phone

To apply a transformation group-wise, use transform(~):


        
        
            
                
                
                    df.groupby("brand").transform(lambda col: col - col.max())
                
            
               price
0  -500
1  -600
2   0
3   0

Note the following:

this is a completely arbitrary example where we are shifting each value by the maximum of the group it belongs in.
we obtain -500 in index 0 because the maximum price of the group brand=apple is 700, and so 200-700=-500.
the argument passed into our function (col) is of type Series, and it represents a single column of a group.
the return type of the entire code snippet is a DataFrame.

On a more practical note, we often apply transformations group-wise for standardisation.

Including only a subset of columns in the returned result

Consider the following DataFrame:


        
        
            
                
                
                    df = pd.DataFrame({"price":[200,300,700,900], "rating":[3,4,5,3], "brand":["apple","google","apple","google"]})
df
                
            
               pricing  rating  brand
0   200       3     apple
1   300       4     google
2   700       5     apple
3   900       3     google

We have two numerical columns here: price and rating. By default, calling aggregates like mean() results in computing the aggregates for all numeric columns. For instance:


        
        
            
                
                
                    df.groupby("brand").mean()     # Returns a DataFrame
                
            
                     price   rating
brand		
apple    450.0   4.0
google   600.0   3.5

To compute aggregates for only the price column, you can use [] notation directly on the groupby object:


        
        
            
                
                
                    df.groupby("brand")["price"].mean()   # Returns a Series
                
            
            brand
apple     450
google    600
Name: price, dtype: int64

Using keyword arguments to name columns

When using the agg method, you can assign column labels to the resulting DataFrame by providing a keyword argument:


        
        
            
                
                
                    df.groupby("brand")["price"].agg(mean_price="mean")
                
            
                    mean_price
brand   
apple      450
google     600

For this to work, you must specify which column to aggregate ("price" in this case).

Specifying as_index

Consider the following DataFrame:


        
        
            
                
                
                    df = pd.DataFrame({"price":[200,300,700,900], "brand": ["C","B","A","B"]}, index=["a","b","c","d"])
df
                
            
                price   brand
a    200     C
b    300     B
c    700     A
d    900     B

By default, as_index=True, which means that the group labels will be used as the index of the resulting DataFrame:


        
        
            
                
                
                    df.groupby("brand", as_index=True).mean()
                
            
                   price
brand   
A       700
B       600
C       200

Setting as_index=False would set the group labels as a column instead:


        
        
            
                
                
                    df.groupby("brand", as_index=False).mean()
                
            
                brand   price
0     A     700
1     B     600
2     C     200

Notice how the index is the default integer index ([0,1,2]).

Specifying sort

Consider the following DataFrame:


        
        
            
                
                
                    df = pd.DataFrame({"price":[200,300,700,900], "brand":["C","B","A","B"]}, index=["a","b","c","d"])
df
                
            
               price  brand
a  200      C
b  300      B
c  700      A
d  900      B

By default, sort=True, which means that the group labels will be sorted:


        
        
            
                
                
                    df.groupby("brand", sort=True).mean()
                
            
                    price
brand   
A       700
B       600
C       200

Notice how the index of the resulting DataFrame is sorted in ascending order.

If we don't need this behaviour, then, set sort=False like so:


        
        
            
                
                
                    df.groupby("brand", sort=False).mean()
                
            
                   price
brand   
C       200
B       600
A       700

Specifying group_keys

Consider the following DataFrame:


        
        
            
                
                
                    df = pd.DataFrame({"price":[200,300,700, 900], "brand":["apple","google","apple","google"]}, index=["a","b","c","d"])
df
                
            
               price  brand
a  200    apple
b  300    google
c  700    apple
d  900    google

By default, group_keys=True, which means that group names will be in the index of the resulting DataFrame:


        
        
            
                
                
                    df.groupby("brand", group_keys=True).apply(lambda x: x.reset_index())
                
            
                       index   price   brand
brand            
apple   0    a     200     apple
        1    c     700     apple
google  0    b     300     google
        1    d     900     google

Note that group_keys takes effect only when the function we apply changes the index of the resulting DataFrame, just as with reset_index(~).

Setting group_keys=False would remove the group names from the resulting DataFrame:


        
        
            
                
                
                    df.groupby("brand", group_keys=False).apply(lambda x: x.reset_index())
                
            
                index   price   brand
0     a     200     apple
1     c     700     apple
0     b     300     google
1     d     900     google

Notice how the brands are no longer present in the index.

Specifying squeeze

By default, squeeze=False, which means that the return type will not be simplified even when possible.

Consider the following DataFrame:


        
        
            
                
                
                    df = pd.DataFrame({"A":[2,3]})
df
                
            
               A
0  2
1  3

This is a completely arbitrary example, but suppose we group by column A, and then for each group, apply a function that literally just returns a Series containing "b":


        
        
            
                
                
                    df.groupby("A").apply(lambda x: pd.Series(["b"]))
                
            
               0
A
2  b
3  b

Here, the return type is DataFrame, which could be simplified down to a Series.

To simplify our return type to Series, set squeeze=True in groupby(~) like so:


        
        
            
                
                
                    df.groupby("A", squeeze=True).apply(lambda x: pd.Series(["b"]))
                
            
            0    b
0    b
dtype: object

Here, the return type is Series.

Published by Isshin Inada

Edited by 0 others

Did you find this page useful?

thumb_up

thumb_down

Comment

Citation

Ask a question or leave a feedback...

Official Pandas Documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

thumb_up

thumb_down

chat_bubble_outline

settings

Enjoy our search

Hit / to insta-search docs and recipes!