Pandas | pivot_table method
Start your free 7-days trial now!
Pandas pivot_table(~)
method converts the input DataFrame into what is called a pivot table. The concept of pivot tables is best explained using examples, so check them out down below.
Parameters
1. data
link | DataFrame
The DataFrame from which to create the pivot table.
2. values
link | string
or list<string>
| optional
The label of the columns whose values will fill the resulting pivot table. These values will be aggregated using the specified aggfunc
, which just means that we compute a statistic (e.g. sum, average, max and so on) to summarise these values.
3. index
link | string
or list<string>
| optional
The label of the columns that will become the index of the resulting pivot table.
4. aggfunc
link | function
or list<function>
| optional
The function used to aggregate the values. By default, aggfunc=np.mean
.
5. fill_value
link | scalar
| optional
The value to replace NaN
. By default, fill_value=None
.
6. margins
link | boolean
| optional
Whether or not to add a new row and column that shows the sum of the values. By default, margins=False
.
7. dropna
link | boolean
| optional
Whether or not to remove entries with NaN
. By default, dropna=True
.
8. margins_name
link | string
| optional
The label assigned to the new row and column that contain the sum of the values. This is only relevant when margins=True
. By default, margins_name="All"
.
Return Value
A new DataFrame that represents a pivot table.
Examples
Basic usage
Consider the following DataFrame about the bonus received by employees:
df = pd.DataFrame({"name":["alice","alice","bob","alice","bob","bob"], "year":[2012,2012,2012,2013,2013,2013], "bonus":[10,15,15,10,10,20]})df
name year bonus0 alice 2012 101 alice 2012 152 bob 2012 153 alice 2013 104 bob 2013 105 bob 2013 20
Our goal is to compute the total bonus of each employee per year. We can do this by creating what is called a pivot table using the pivot_table(~)
method:
pd.pivot_table(df, values="bonus", index="name", columns="year", aggfunc=np.sum)
year 2012 2013name alice 25 10bob 15 30
Note the following:
the values of column
bonus
is used to fill the resulting DataFrame.the
name
column is assigned as the new index.the values of the
year
column (2012
and2013
) became the new column labels.the
aggfunc
is summation, as indicated by NumPy'ssum(~)
method.
Multiple aggregates
We could also compute multiple aggregates at once. In addition to the sum, let's also compute the average bonus of each employee per year:
pd.pivot_table(df, values="bonus", index="name", columns="year", aggfunc=[np.sum, np.mean])
sum meanyear 2012 2013 2012 2013name alice 25 10 12.5 10.0bob 15 30 15.0 15.0
Notice how we had to pass in a list of aggregate functions for aggfunc
here.
Multiple values
Suppose we want to aggregate two columns this time. To demonstrate this, here's a DataFrame with a new column indicating the number of days absent of each employee per year:
df = pd.DataFrame({"name":["alice","alice","bob","alice","bob","bob"], "year":[2012,2012,2012,2013,2013,2013], "bonus":[10,15,15,10,10,20], "days_absent":[1,2,5,2,3,5]})df
name year bonus days_absent0 alice 2012 10 11 alice 2012 15 22 bob 2012 15 53 alice 2013 10 24 bob 2013 10 35 bob 2013 20 5
To compute the total bonus as well as the days absent per year for each employee:
pd.pivot_table(df, values=["bonus","days_absent"], index="name", columns="year", aggfunc=np.sum)
bonus days_absentyear 2012 2013 2012 2013name alice 25 10 3 2bob 15 30 5 8
Notice how we passed in an array of column labels for values
here.
Multiple indices
Suppose we want to use two columns as the index of the resulting DataFrame. To demonstrate this, here's a DataFrame with employees now identified by their first and last names:
df = pd.DataFrame({"first_name":["alice","alice","bob","alice","bob","bob"], "last_name":["A","B","A","A","A","B"], "year":[2012,2012,2012,2013,2013,2013], "bonus":[10,15,15,10,10,20]})df
first_name last_name year bonus0 alice A 2012 101 alice B 2012 152 bob A 2012 153 alice A 2013 104 bob A 2013 105 bob B 2013 20
Now, alice A
and alice B
represent two different employees, so it would make sense to differentiate them when computing the total bonus per year for each employee. We do this by passing in a list for the index
parameter:
pd.pivot_table(df, values=["bonus"], index=["first_name", "last_name"], columns="year", aggfunc=np.sum)
bonus year 2012 2013first_name last_name alice A 10.0 10.0 B 15.0 NaNbob A 15.0 10.0 B NaN 20.0
We get NaN
for the total bonus of alice B
in 2013
since it does not exist in df
.
Filling missing values
Consider the following DataFrame:
df = pd.DataFrame({"name":["alice","bob","bob"], "year":[2012,2012,2013], "bonus":[10,15,20]})
name year bonus0 alice 2012 101 bob 2012 152 bob 2013 20
Notice how we don't have data about Alice's bonus in 2013
.
By default, missing values will be indicated by NaN
in the resulting pivot table:
pd.pivot_table(df, values="bonus", index="name", columns="year", aggfunc=np.sum)
year 2012 2013name alice 10.0 NaNbob 15.0 20.0
We can choose to fill these NaN
with our own values by passing in the fill_value
parameter:
pd.pivot_table(df, values="bonus", index="name", columns="year", aggfunc=np.sum, fill_value=-1)
year 2012 2013name alice 10 -1bob 15 20
Specifying margin and margins_name
Consider the following DataFrame:
df = pd.DataFrame({"name":["alice","bob","alice","bob"], "year":[2012,2012,2013,2013], "bonus":[10,15,20,30]})df
name year bonus0 alice 2012 101 bob 2012 152 alice 2013 203 bob 2013 30
By setting margins=True
, we obtain an additional row and column that holds the total sum of the values:
pd.pivot_table(df, values="bonus", index="name", columns="year", aggfunc=np.sum, margins=True)
year 2012 2013 Allname alice 10 20 30bob 15 30 45All 25 50 75
Note the following:
the row sum tells us the total bonus earned by each employee
the column sum is the total bonus earned per year of all employees
the bottom-right entry tells us the total bonus earned over the 2 years of all employees
By default, these newly introduced row and column are assigned the label "All"
. We can assign our own label using the margins_name
parameter:
pd.pivot_table(df, ~~~, margins=True, margins_name="Total")
year 2012 2013 Totalname alice 10 20 30bob 15 30 45Total 25 50 75
Specifying dropna
Consider the following DataFrame:
df = pd.DataFrame({"name":[np.NaN,np.NaN,"alice","bob"], "year":[2012,2012,2013,2013], "bonus":[10,15,20,25]})df
name year bonus0 NaN 2012 101 NaN 2012 152 alice 2013 203 bob 2013 25
Here, we're missing the names of the employee for the year 2012
. This means that the resulting pivot table would have column corresponding to 2012
filled with NaN
.
By default, dropna=True
, which means that columns whose entries are all NaN
in the resulting pivot table will be removed:
pd.pivot_table(df, values="bonus", index="name", columns="year", aggfunc=np.sum)
year 2013name alice 20bob 25
We can choose to keep the NaN
by setting dropna=False
like so:
pd.pivot_table(df, values="bonus", index="name", columns="year", aggfunc=np.sum, dropna=False)
year 2012 2013name alice NaN 20bob NaN 25