search
Search
Join our weekly DS/ML newsletter layers DS/ML Guides
menu
menu search toc more_vert
Robocat
Guest 0reps
Thanks for the thanks!
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
help Ask a question
Share on Twitter
search
keyboard_voice
close
Searching Tips
Search for a recipe:
"Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to
A
A
brightness_medium
share
arrow_backShare
Twitter
Facebook

Pandas | crosstab method

Pandas
chevron_right
Documentation
chevron_right
General Functions
schedule Jul 1, 2022
Last updated
local_offer PandasPython
Tags

Pandas crosstab(~) method computes the cross tabulation of multiple column variables. By default, the computed statistic is the count. See examples for clarification.

Parameters

1. indexlink | array-like

The values used as the index of the resulting DataFrame.

2. columnslink | array-like

The columns of the resulting DataFrame.

3. valueslink | array-like | optional

The values used to compute the statistic specified by agg_func.

4. rownameslink | sequence | optional

The names(s) assigned to the rows in the resulting DataFrame. By default, rownames=None.

5. colnameslink | sequence | optional

The name(s) assigned to the columns in the resulting DataFrame. By default, colnames=None.

6. aggfunclink | function | optional

The function that defined the statistic to compute. The input of this function is values, and return value must be the aggregated statistic (e.g. the mean, max and so on).

7. marginslink | boolean | optional

Whether to include an additional row and column that indicates the sum of cells row-wise and column-wise. By default, margins=False.

8. margin_nameslink | string | optional

If margins is set to True, then margin_names is the name of the newly added row and column.

9. dropna | boolean | optional

Whether or not to drop columns that contain just NaN. By default, dropna=False.

10. normalizelink | boolean or string | optional

Whether or not to normalize the computed values by dividing by their sum:

Value

Description

Normalize over all values.

"all" or True

"index"

Normalize row-wise.

"columns"

Normalize column-wise

False

Do not normalize.

By default, noramlize=False.

Return Value

A DataFrame that represents the cross tabulation of multiple factors.

Examples

Consider the following DataFrame:

df = pd.DataFrame({"name":["alex","bob","cathy","doge"], "gender":["male","male","female","male"], "age":[50,20,40,20]})
df
name gender age
0 alex male 50
1 bob male 20
2 cathy female 40
3 doge male 20

Basic usage

To get the count of the males and females of different ages:

pd.crosstab(index=df["age"], columns=df["gender"])
gender female male
age
20 0 2
40 1 0
50 0 1

Here, note the following:

  • we see a breakdown of the counts of each profile. For instance, we have 2 males who are aged 20.

  • the label gender is the name assigned to the columns.

Specifying rownames and colnames

To give specific names to the resulting rows and columns:

pd.crosstab(index=df["age"], columns=df["gender"], rownames=["AGE"], colnames=["GENDER"])
GENDER female male
AGE
20 0 2
40 1 0
50 0 1

Here, we show the same df for your reference:

df
name gender age
0 alex male 50
1 bob male 20
2 cathy female 40
3 doge male 20

Specifying aggfunc

Instead of the count, we could compute other statistics as well. For instance, to compute the mean age of each gender:

pd.crosstab(index=df["gender"], columns=["AGE"], values=df["age"], aggfunc=np.mean)
col_0 AGE
gender
female 40
male 30

Notes the following:

  • we are using Numpy's mean(~) function.

  • if you specify aggfunc, then values must also be specified, and vice versa.

Specifying margins and margins_name

We can compute the total sum of the statistics row-wise and column-wise by specifying margins=True:

pd.crosstab(index=df["age"], columns=df["gender"], margins=True)
gender female male All
age
20 0 2 2
40 1 0 1
50 0 1 1
All 1 3 4

You can also rename the newly introduced column and row by setting margins_name:

pd.crosstab(index=df["age"], columns=df["gender"], margins=True, margins_name="TOTAL")
gender female male TOTAL
age
20 0 2 2
40 1 0 1
50 0 1 1
TOTAL 1 3 4

Specifying dropna

Consider the following DataFrame with a missing value:

df = pd.DataFrame({"name":["alex","bob","cathy","doge"], "gender":["male","male","female","male"], "age":[30,20,np.NaN,30]})
df
name gender age
0 alex male 30.0
1 bob male 20.0
2 cathy female NaN
3 doge male 30.0

Computing the cross tabulation using age and gender:

pd.crosstab(index=df["age"], columns=df["gender"])
gender male
age
20.0 1
30.0 2

Notice how the column for female, whose values should be filled with NaN, is not there. This is the default behaviour of dropna=True.

We can choose to keep the column by setting dropna=False like so:

pd.crosstab(index=df["age"], columns=df["gender"], dropna=False)
gender female male
age
20.0 0 1
30.0 0 2

Notice how the NaN have been filled using 0s.

Specifying normalize

Consider the following DataFrame:

df = pd.DataFrame({"name":["alex","bob","cathy","doge"], "gender":["male","male","female","male"], "age":[50,20,40,20]})
df
name gender age
0 alex male 50
1 bob male 20
2 cathy female 40
3 doge male 20

By default, normalize=False, which means the resulting values are not normalized:

pd.crosstab(index=df["age"], columns=df["gender"])
gender female male
age
20 0 2
40 1 0
50 0 1

In this context, to normalize is to divide each value by the sum of the values.

True or "all"

To normalize using all values, simply set normalize=True like so:

pd.crosstab(index=df["age"], columns=df["gender"], normalize=True) # or normalize="all"
gender female male
age
20 0.00 0.50
40 0.25 0.00
50 0.00 0.25

Here, we get a 0.50 there because 2/(2+1+1)=0.5.

"rows"

To normalize row-wise, set normalize="index" like so:

pd.crosstab(index=df["age"], columns=df["gender"], normalize="index")
gender female male
age
20 0.0 1.0
40 1.0 0.0
50 0.0 1.0

"columns"

To normalize column-wise, set normalize="columns" like so:

pd.crosstab(index=df["age"], columns=df["gender"], normalize="columns")
gender female male
age
20 0.0 0.666667
40 1.0 0.000000
50 0.0 0.333333
mail
Join our newsletter for updates on new DS/ML comprehensive guides (spam-free)
robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
Ask a question or leave a feedback...