The function that defined the statistic to compute. The input of this function is values, and return value must be the aggregated statistic (e.g. the mean, max and so on).

7. marginslink | boolean | optional

Whether to include an additional row and column that indicates the sum of cells row-wise and column-wise. By default, margins=False.

8. margin_nameslink | string | optional

If margins is set to True, then margin_names is the name of the newly added row and column.

9. dropna | boolean | optional

Whether or not to drop columns that contain just NaN. By default, dropna=False.

10. normalizelink | boolean or string | optional

Whether or not to normalize the computed values by dividing by their sum:

Value	Description
`"all"` or `True`	Normalize over all values.
`"index"`	Normalize row-wise.
`"columns"`	Normalize column-wise
`False`	Do not normalize.

By default, noramlize=False.

Return Value

A DataFrame that represents the cross tabulation of multiple factors.

Examples

Consider the following DataFrame:


        
        
            
                
                
                    df = pd.DataFrame({"name":["alex","bob","cathy","doge"], "gender":["male","male","female","male"], "age":[50,20,40,20]})
df
                
            
               name   gender  age
0  alex   male    50
1  bob    male    20
2  cathy  female  40
3  doge   male    20

Basic usage

To get the count of the males and females of different ages:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"])
                
            
                gender  female   male
age		
20            0       2
40            1       0
50            0       1

Here, note the following:

we see a breakdown of the counts of each profile. For instance, we have 2 males who are aged 20.
the label gender is the name assigned to the columns.

Specifying rownames and colnames

To give specific names to the resulting rows and columns:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"], rownames=["AGE"], colnames=["GENDER"])
                
            
            GENDER  female  male
AGE		
20        0      2
40        1      0
50        0      1

Here, we show the same df for your reference:


        
        
            
                
                
                    df
                
            
               name   gender  age
0  alex   male    50
1  bob    male    20
2  cathy  female  40
3  doge   male    20

Specifying aggfunc

Instead of the count, we could compute other statistics as well. For instance, to compute the mean age of each gender:


        
        
            
                
                
                    pd.crosstab(index=df["gender"], columns=["AGE"], values=df["age"], aggfunc=np.mean)
                
            
            col_0   AGE
gender	
female  40
male    30

Notes the following:

we are using Numpy's mean(~) function.
if you specify aggfunc, then values must also be specified, and vice versa.

Specifying margins and margins_name

We can compute the total sum of the statistics row-wise and column-wise by specifying margins=True:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"], margins=True)
                
            
            gender  female  male  All
age      
20       0      2     2
40       1      0     1
50       0      1     1
All      1      3     4

You can also rename the newly introduced column and row by setting margins_name:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"], margins=True, margins_name="TOTAL")
                
            
            gender  female  male  TOTAL
age			
20        0      2      2
40        1      0      1
50        0      1      1
TOTAL     1      3      4

Specifying dropna

Consider the following DataFrame with a missing value:


        
        
            
                
                
                    df = pd.DataFrame({"name":["alex","bob","cathy","doge"], "gender":["male","male","female","male"], "age":[30,20,np.NaN,30]})
df
                
            
               name   gender  age
0  alex   male    30.0
1  bob    male    20.0
2  cathy  female  NaN
3  doge   male    30.0

Computing the cross tabulation using age and gender:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"])
                
            
            gender  male
age	
20.0     1
30.0     2

Notice how the column for female, whose values should be filled with NaN, is not there. This is the default behaviour of dropna=True.

We can choose to keep the column by setting dropna=False like so:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"], dropna=False)
                
            
            gender  female  male
age		
20.0      0      1
30.0      0      2

Notice how the NaN have been filled using 0s.

Specifying normalize

Consider the following DataFrame:


        
        
            
                
                
                    df = pd.DataFrame({"name":["alex","bob","cathy","doge"], "gender":["male","male","female","male"], "age":[50,20,40,20]})
df
                
            
               name   gender  age
0  alex   male    50
1  bob    male    20
2  cathy  female  40
3  doge   male    20

By default, normalize=False, which means the resulting values are not normalized:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"])
                
            
            gender  female  male
age		
20        0      2
40        1      0
50        0      1

In this context, to normalize is to divide each value by the sum of the values.

True or "all"

To normalize using all values, simply set normalize=True like so:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"], normalize=True)   # or normalize="all"
                
            
            gender  female  male
age		
20      0.00    0.50
40      0.25    0.00
50      0.00    0.25

Here, we get a 0.50 there because 2/(2+1+1)=0.5.

"rows"

To normalize row-wise, set normalize="index" like so:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"], normalize="index")
                
            
            gender  female  male
age		
20       0.0    1.0
40       1.0    0.0
50       0.0    1.0

"columns"

To normalize column-wise, set normalize="columns" like so:


        
        
            
                
                
                    pd.crosstab(index=df["age"], columns=df["gender"], normalize="columns")
                
            
            gender  female  male
age		
20       0.0    0.666667
40       1.0    0.000000
50       0.0    0.333333

Published by Isshin Inada

Edited by 0 others

Did you find this page useful?

thumb_up

thumb_down

Comment

Citation

Ask a question or leave a feedback...

Official Pandas Documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html

thumb_up

thumb_down

chat_bubble_outline

settings

Enjoy our search

Hit / to insta-search docs and recipes!