search
Search
Login
Unlock 100+ guides
menu
menu
web
search toc
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
What does this mean?
Why is this true?
Give me some examples!
search
keyboard_voice
close
Searching Tips
Search for a recipe:
"Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to

Pandas | crosstab method

schedule Aug 12, 2023
Last updated
local_offer
PandasPython
Tags
mode_heat
Master the mathematics behind data science with 100+ top-tier guides
Start your free 7-days trial now!

Pandas crosstab(~) method computes the cross tabulation of multiple column variables. By default, the computed statistic is the count. See examples for clarification.

Parameters

1. indexlink | array-like

The values used as the index of the resulting DataFrame.

2. columnslink | array-like

The columns of the resulting DataFrame.

3. valueslink | array-like | optional

The values used to compute the statistic specified by agg_func.

4. rownameslink | sequence | optional

The names(s) assigned to the rows in the resulting DataFrame. By default, rownames=None.

5. colnameslink | sequence | optional

The name(s) assigned to the columns in the resulting DataFrame. By default, colnames=None.

6. aggfunclink | function | optional

The function that defined the statistic to compute. The input of this function is values, and return value must be the aggregated statistic (e.g. the mean, max and so on).

7. marginslink | boolean | optional

Whether to include an additional row and column that indicates the sum of cells row-wise and column-wise. By default, margins=False.

8. margin_nameslink | string | optional

If margins is set to True, then margin_names is the name of the newly added row and column.

9. dropna | boolean | optional

Whether or not to drop columns that contain just NaN. By default, dropna=False.

10. normalizelink | boolean or string | optional

Whether or not to normalize the computed values by dividing by their sum:

Value

Description

"all" or True

Normalize over all values.

"index"

Normalize row-wise.

"columns"

Normalize column-wise

False

Do not normalize.

By default, noramlize=False.

Return Value

A DataFrame that represents the cross tabulation of multiple factors.

Examples

Consider the following DataFrame:

df = pd.DataFrame({"name":["alex","bob","cathy","doge"], "gender":["male","male","female","male"], "age":[50,20,40,20]})
df
name gender age
0 alex male 50
1 bob male 20
2 cathy female 40
3 doge male 20

Basic usage

To get the count of the males and females of different ages:

pd.crosstab(index=df["age"], columns=df["gender"])
gender female male
age
20 0 2
40 1 0
50 0 1

Here, note the following:

  • we see a breakdown of the counts of each profile. For instance, we have 2 males who are aged 20.

  • the label gender is the name assigned to the columns.

Specifying rownames and colnames

To give specific names to the resulting rows and columns:

pd.crosstab(index=df["age"], columns=df["gender"], rownames=["AGE"], colnames=["GENDER"])
GENDER female male
AGE
20 0 2
40 1 0
50 0 1

Here, we show the same df for your reference:

df
name gender age
0 alex male 50
1 bob male 20
2 cathy female 40
3 doge male 20

Specifying aggfunc

Instead of the count, we could compute other statistics as well. For instance, to compute the mean age of each gender:

pd.crosstab(index=df["gender"], columns=["AGE"], values=df["age"], aggfunc=np.mean)
col_0 AGE
gender
female 40
male 30

Notes the following:

  • we are using Numpy's mean(~) function.

  • if you specify aggfunc, then values must also be specified, and vice versa.

Specifying margins and margins_name

We can compute the total sum of the statistics row-wise and column-wise by specifying margins=True:

pd.crosstab(index=df["age"], columns=df["gender"], margins=True)
gender female male All
age
20 0 2 2
40 1 0 1
50 0 1 1
All 1 3 4

You can also rename the newly introduced column and row by setting margins_name:

pd.crosstab(index=df["age"], columns=df["gender"], margins=True, margins_name="TOTAL")
gender female male TOTAL
age
20 0 2 2
40 1 0 1
50 0 1 1
TOTAL 1 3 4

Specifying dropna

Consider the following DataFrame with a missing value:

df = pd.DataFrame({"name":["alex","bob","cathy","doge"], "gender":["male","male","female","male"], "age":[30,20,np.NaN,30]})
df
name gender age
0 alex male 30.0
1 bob male 20.0
2 cathy female NaN
3 doge male 30.0

Computing the cross tabulation using age and gender:

pd.crosstab(index=df["age"], columns=df["gender"])
gender male
age
20.0 1
30.0 2

Notice how the column for female, whose values should be filled with NaN, is not there. This is the default behaviour of dropna=True.

We can choose to keep the column by setting dropna=False like so:

pd.crosstab(index=df["age"], columns=df["gender"], dropna=False)
gender female male
age
20.0 0 1
30.0 0 2

Notice how the NaN have been filled using 0s.

Specifying normalize

Consider the following DataFrame:

df = pd.DataFrame({"name":["alex","bob","cathy","doge"], "gender":["male","male","female","male"], "age":[50,20,40,20]})
df
name gender age
0 alex male 50
1 bob male 20
2 cathy female 40
3 doge male 20

By default, normalize=False, which means the resulting values are not normalized:

pd.crosstab(index=df["age"], columns=df["gender"])
gender female male
age
20 0 2
40 1 0
50 0 1

In this context, to normalize is to divide each value by the sum of the values.

True or "all"

To normalize using all values, simply set normalize=True like so:

pd.crosstab(index=df["age"], columns=df["gender"], normalize=True) # or normalize="all"
gender female male
age
20 0.00 0.50
40 0.25 0.00
50 0.00 0.25

Here, we get a 0.50 there because 2/(2+1+1)=0.5.

"rows"

To normalize row-wise, set normalize="index" like so:

pd.crosstab(index=df["age"], columns=df["gender"], normalize="index")
gender female male
age
20 0.0 1.0
40 1.0 0.0
50 0.0 1.0

"columns"

To normalize column-wise, set normalize="columns" like so:

pd.crosstab(index=df["age"], columns=df["gender"], normalize="columns")
gender female male
age
20 0.0 0.666667
40 1.0 0.000000
50 0.0 0.333333
robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
thumb_up
thumb_down
Comment
Citation
Ask a question or leave a feedback...