Pandas | cut method
Pandas cut(~) method categorises numerical values into bins (intervals).
Parameters
1. x | array-like
A 1D input array whose numerical values will be segmented into bins.
2. bins | int or sequence<scalar> or IntervalIndex
The type of bins determines how the bins are computed:
Type | Description |
---|---|
int | The number of equal-width bins. The range of x is extended by 0.1% so that the minimum and maximum values of x are included. |
sequence<scalar> | The desired bin edges. Values that do not fall in a bin will be set to NaN. |
IntervalIndex | The exact bins to use. |
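As a rough illustration of the three accepted types, the sketch below bins the same values with an int, a list of edges, and an IntervalIndex. The variable names (values, by_count, by_edges, by_intervals) are made up for this example.

```python
import pandas as pd

values = [3, 6, 8, 7, 4, 6]

# int: four equal-width bins spanning the range of the data
by_count = pd.cut(values, bins=4)

# sequence of scalars: explicit bin edges
by_edges = pd.cut(values, bins=[0, 4, 6, 10])

# IntervalIndex: the exact intervals to use (right-closed by default)
intervals = pd.IntervalIndex.from_tuples([(0, 4), (4, 6), (6, 10)])
by_intervals = pd.cut(values, bins=intervals)

print(len(by_count.categories))   # number of bins that were created
print(by_edges.categories)
print(by_intervals.categories)
```

With these inputs, the explicit edges and the IntervalIndex describe the same three intervals, so both calls should assign each value to the same bin.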
3. right | boolean | optional
Whether to make the left bin edge exclusive and the right bin edge inclusive. By default, right=True.
4. labels | array or False | optional
The desired labels of the bins. By default, labels=None.
5. retbins | boolean | optional
Whether or not to return the bins. By default, retbins=False.
6. precision | int | optional
The number of decimal places to use for the bin labels. By default, precision=3.
7. include_lowest | boolean | optional
Whether to make the left edge of the first bin inclusive. By default, include_lowest=False.
8. duplicates | string | optional
How to deal with duplicate bin edges:
Value | Description |
---|---|
"raise" | Throw an error if any duplicate bin edges are set. |
"drop" | Remove the duplicate bin edge and just keep one. |
By default, duplicates="raise".
9. ordered | boolean | optional | v1.1.0~
Whether or not to embed ordering information. This is only relevant if the return type is Categorical or Series of data-type Categorical. ordered can only be set to False if labels is provided. By default, ordered=True.
Return Value
The return type depends on the type of the labels parameter:
- if labels is unspecified:
  - if x is a Series, then a Series is returned.
  - else, a Categorical is returned.
  In both cases, the stored values are intervals.
- if labels is an array of scalars:
  - if x is a Series, then a Series is returned. The type of the values stored within this Series matches the type of the values stored in labels.
  - else, a Categorical is returned. The type of the values stored within the Categorical matches the type of the values stored in labels.
- if labels is a boolean False, then a NumPy array of integers is returned.
If retbins=True, then in addition to the above, the bins are returned as a NumPy array. If bins is an IntervalIndex, then bins is returned instead.
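The rules above can be checked with a short sketch. The names raw, from_list, from_series, codes, binned and edges are illustrative only; the example assumes list and Series inputs.

```python
import numpy as np
import pandas as pd

raw = [3, 6, 8, 7, 4, 6]

# labels unspecified: a list gives a Categorical, a Series gives a Series
from_list = pd.cut(raw, bins=3)
from_series = pd.cut(pd.Series(raw), bins=3)

# labels=False with a list input: integer bin indices in a NumPy array
codes = pd.cut(raw, bins=3, labels=False)

# retbins=True: a tuple of (binned result, bin edges as a NumPy array)
binned, edges = pd.cut(raw, bins=3, retbins=True)

print(type(from_list).__name__)
print(type(from_series).__name__)
print(type(codes).__name__, type(edges).__name__)
```

Three bins need four edges, so edges should hold four values here.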
Examples
Consider the following DataFrame about students and their grades:
import pandas as pd

raw_grades = [3,6,8,7,4,6]
students = ["alex", "bob", "cathy", "doge", "eric", "fred"]
df = pd.DataFrame({"name":students,"raw_grade":raw_grades})
df
    name  raw_grade
0   alex          3
1    bob          6
2  cathy          8
3   doge          7
4   eric          4
5   fred          6
Basic Usage
To categorise the raw grades into four bins (segments):
df["grade"] = pd.cut(df["raw_grade"], bins=4)   # returns a Series
df
    name  raw_grade          grade
0   alex          3  (2.995, 4.25]
1    bob          6    (5.5, 6.75]
2  cathy          8    (6.75, 8.0]
3   doge          7    (6.75, 8.0]
4   eric          4  (2.995, 4.25]
5   fred          6    (5.5, 6.75]
The grade column now contains the bins. Four bins are created in total, although no raw grade falls in (4.25, 5.5], so only three of them appear here. Note that (2.995, 4.25] just means that 2.995 < raw_grade <= 4.25.
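To see where edges such as 2.995 and 4.25 come from, the following sketch rebuilds the equal-width edges by hand and compares them with what cut(~) reports through retbins=True. It reflects our understanding of the behaviour for right=True and is not taken from the library's source; n_bins and edges are names chosen for this example.

```python
import numpy as np
import pandas as pd

raw = [3, 6, 8, 7, 4, 6]
n_bins = 4

# Equal-width edges across [min, max]: 3, 4.25, 5.5, 6.75, 8
edges = np.linspace(min(raw), max(raw), n_bins + 1)

# Assumption: with right=True, the first edge is lowered by 0.1% of the
# range so that the minimum value falls inside the first (left, right] bin
edges[0] -= 0.001 * (max(raw) - min(raw))

# Ask pandas for the edges it actually used
_, pandas_edges = pd.cut(raw, bins=n_bins, retbins=True)

print(edges)
print(pandas_edges)
```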
Specifying custom bin edges
To specify custom bin edges, we can pass in an array of bin edges instead of an int:
df["grade"] = pd.cut(df["raw_grade"], bins=[0,4,6,10])
df
    name  raw_grade    grade
0   alex          3   (0, 4]
1    bob          6   (4, 6]
2  cathy          8  (6, 10]
3   doge          7  (6, 10]
4   eric          4   (0, 4]
5   fred          6   (4, 6]
We show the original df here for your reference:
df
    name  raw_grade
0   alex          3
1    bob          6
2  cathy          8
3   doge          7
4   eric          4
5   fred          6
Specifying right
To make the left bin edge inclusive and the right bin edge exclusive, set right=False:
df["grade"] = pd.cut(df["raw_grade"], bins=[0,4,6,10], right=False)
df
    name  raw_grade    grade
0   alex          3   [0, 4)
1    bob          6  [6, 10)
2  cathy          8  [6, 10)
3   doge          7  [6, 10)
4   eric          4   [4, 6)
5   fred          6  [6, 10)
Notice how we have [0, 4) instead of the default (0, 4].
Specifying labels
We can give labels to our bins by setting the labels parameter:
df["grade"] = pd.cut(df["raw_grade"], bins=3, labels=["C","B","A"])
df
    name  raw_grade grade
0   alex          3     C
1    bob          6     B
2  cathy          8     A
3   doge          7     A
4   eric          4     C
5   fred          6     B
This is an extremely practical feature of the cut(~) method. The length of the labels array must equal the specified number of bins.
By setting labels=False, a NumPy array of int is returned:
raw_grades = [3,6,8,7,4,5]
pd.cut(raw_grades, bins=3, labels=False)
array([0, 1, 2, 2, 0, 1])
Here, the output tells us that:
- the raw grade 3 belongs to bin 0 (first bin).
- the raw grade 6 belongs to bin 1 (second bin).
- and so on.
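Integer bin indices are handy when the result feeds into NumPy code. As a small sketch (the names array and the use of np.bincount are our own choices, not part of cut(~)), the indices can be used to look up labels or to count members per bin:

```python
import numpy as np
import pandas as pd

raw_grades = [3, 6, 8, 7, 4, 5]
codes = pd.cut(raw_grades, bins=3, labels=False)

# Use each bin index to pick a label from a NumPy array
names = np.array(["C", "B", "A"])
letters = names[codes]

# Count how many values landed in each of the three bins
counts = np.bincount(codes, minlength=3)

print(letters)
print(counts)
```

This only works as written when every value falls in some bin; a value outside all bins would produce NaN instead of an integer index.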
Specifying retbins
To get the computed bin edges as well, set retbins=True:
raw_grades = [3,6,8,7,4,5]
res = pd.cut(raw_grades, bins=2, retbins=True)
print("Categories: ", res[0])
print("Bin edges: ", res[1])
Categories: [(2.995, 5.5], (5.5, 8.0], (5.5, 8.0], (5.5, 8.0], (2.995, 5.5], (2.995, 5.5]]
Categories (2, interval[float64]): [(2.995, 5.5] < (5.5, 8.0]]
Bin edges: [2.995 5.5 8. ]
We show the original df here for your reference:
df
    name  raw_grade
0   alex          3
1    bob          6
2  cathy          8
3   doge          7
4   eric          4
5   fred          6
Specifying precision
To control how many decimal places are displayed, set the precision parameter:
res = pd.cut(df["raw_grade"], bins=[0,4.33333,6.6,10], precision=2)
print(res)
0    (0.0, 4.33]
1    (4.33, 6.6]
2    (6.6, 10.0]
3    (6.6, 10.0]
4    (0.0, 4.33]
5    (4.33, 6.6]
Name: raw_grade, dtype: category
Categories (3, interval[float64]): [(0.0, 4.33] < (4.33, 6.6] < (6.6, 10.0]]
Here, notice how 4.33333 got rounded to 4.33, as specified by the precision value of 2.
Specifying include_lowest
Consider the following:
df["grade"] = pd.cut(df["raw_grade"], bins=[3,6,10])
df
    name  raw_grade       grade
0   alex          3         NaN
1    bob          6  (3.0, 6.0]
2 ...
By default, include_lowest=False, which means that the first bin interval is left-exclusive. This is why the raw_grade of 3 does not fall in any bin here.
We can make the first bin interval left-inclusive by setting include_lowest=True:
df["grade"] = pd.cut(df["raw_grade"], bins=[3,6,10], include_lowest=True)
df
    name  raw_grade         grade
0   alex          3  (2.999, 6.0]
1    bob          6  (2.999, 6.0]
...
We now see that the raw_grade of 3 has been included in the first bin.
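A quick way to confirm the difference is to check which values end up missing. This is a minimal sketch on a standalone Series; the names raw, default and inclusive are illustrative.

```python
import pandas as pd

raw = pd.Series([3, 6, 8, 7, 4, 6])

# The first bin is (3, 6], so the value 3 sits on the excluded left edge
default = pd.cut(raw, bins=[3, 6, 10])

# include_lowest=True makes the first bin include its left edge
inclusive = pd.cut(raw, bins=[3, 6, 10], include_lowest=True)

print(default.isna().tolist())    # which values fell outside every bin
print(inclusive.isna().tolist())
```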
Specifying duplicates
By default, the bin edges must be unique, otherwise an error will be thrown. For instance:
x = [3,7,8,7,4,5]
pd.cut(x, bins=[2,6,6,10])   # duplicates="raise"
ValueError: Bin edges must be unique: array([ 2, 6, 6, 10]).
Here, we have two bin edges of value 6, which is why we get an error.
In order to drop (remove) redundant bin edges, set duplicates="drop", like so:
x = [3,7,8,7,4,5]
pd.cut(x, bins=[2,6,6,10], duplicates="drop")
[(2, 6], (6, 10], (6, 10], (6, 10], (2, 6], (2, 6]]
Categories (2, interval[int64]): [(2, 6] < (6, 10]]
We see that one of the two bin edges of value 6 got dropped.
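Both behaviours can be observed side by side by catching the error and then inspecting the edges that survive with duplicates="drop". This is a sketch only; the raised flag and the kept variable are names introduced for the example.

```python
import pandas as pd

x = [3, 7, 8, 7, 4, 5]
edges = [2, 6, 6, 10]

# Default duplicates="raise": repeated edges are rejected
try:
    pd.cut(x, bins=edges)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# duplicates="drop": the repeated edge is removed before binning
_, kept = pd.cut(x, bins=edges, duplicates="drop", retbins=True)
print(kept)
```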
Specifying ordered
By default, ordered=True, which means that the resulting Categorical will be ordered:
grades = [3,6,8,7,4,5]
pd.cut(grades, bins=2, labels=["B","A"])   # ordered=True
['B', 'A', 'A', 'A', 'B', 'B']
Categories (2, object): ['B' < 'A']
Notice how the information about ordering is embedded as ['B' < 'A'].
By setting ordered=False, such ordering information is omitted:
grades = [3,6,8,7,4,5]
pd.cut(grades, bins=2, labels=["B","A"], ordered=False)
['B', 'A', 'A', 'A', 'B', 'B']
Categories (2, object): ['B', 'A']
To set ordered=False, make sure to have specified labels.
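The ordering matters in practice because an ordered Categorical supports comparisons and min/max that follow the category order rather than alphabetical order. A brief sketch, with variable names of our own choosing:

```python
import pandas as pd

grades = [3, 6, 8, 7, 4, 5]

ordered = pd.cut(grades, bins=2, labels=["B", "A"])
unordered = pd.cut(grades, bins=2, labels=["B", "A"], ordered=False)

print(ordered.ordered, unordered.ordered)

# The order follows the labels as given (B < A), not alphabetical order
best = ordered.max()
above_b = (ordered > "B").tolist()

print(best)
print(above_b)
```

The same comparison and max() on the unordered result would be expected to fail, since no order is defined between its categories.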