Pandas DataFrame | interpolate method
The Pandas DataFrame.interpolate(~) method fills NaN values using interpolated values.
Parameters
1. method
| string
| linear
The algorithm used for interpolation:
"linear": simple linear interpolation. The index is ignored and the values are treated as equally spaced.
"time": interpolation using a DatetimeIndex, weighted by the length of time between rows.
"index" or "values": use the numerical values of the index to perform interpolation. See example below.
"pad": use the previous non-NaN value to fill (forward fill).
In addition, you can also use the interpolation methods available for scipy.interpolate.interp1d:
nearest, zero, slinear, quadratic, cubic, spline, barycentric, polynomial
Some of these methods require an additional argument (for example, "polynomial" and "spline" require order), which you can pass using **kwargs like so:
df.interpolate(method="polynomial", order=5)
2. axis
| int
or string
| optional
Whether to interpolate each row or column:
Axis | Description |
---|---|
0 or "index" | Interpolate each column |
1 or "columns" | Interpolate each row |
By default, axis=0.
3. limit
| int
| optional
The maximum number (inclusive) of consecutive NaN to fill. For instance, if limit=2 and there are 3 consecutive NaNs, then the first two NaNs will be filled and the third will be left as is.
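As a minimal sketch of how limit behaves, the snippet below builds a small illustrative DataFrame (the data is made up for this example) with three consecutive NaN and caps the fill at two:

```python
import numpy as np
import pandas as pd

# Three consecutive NaN between two known values (illustrative data).
df = pd.DataFrame({"A": [1.0, np.nan, np.nan, np.nan, 5.0]})

# With limit=2, at most two consecutive NaN are filled, counting forward
# from the previous non-NaN value.
limited = df.interpolate(limit=2)
print(limited)
```

The first two gaps become 2.0 and 3.0, while the third NaN is left as is.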
4. inplace
| boolean
| optional
If True, then the method will directly modify the source DataFrame instead of creating a new DataFrame. If False, then a new DataFrame will be created and returned.
By default, inplace=False.
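The difference between the two settings can be sketched as follows, using a small made-up DataFrame. The return value of the inplace=True call is deliberately ignored here, since only the effect on the source DataFrame is of interest:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [3.0, np.nan, 5.0]})

# Default (inplace=False): a new DataFrame is returned, the source is untouched.
filled = df.interpolate()
original_still_has_nan = bool(df["A"].isna().any())

# inplace=True: the source DataFrame itself is modified.
df.interpolate(inplace=True)
print(original_still_has_nan, df["A"].tolist())
```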
5. limit_direction
| string
| optional
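The fill direction of NaN:
"forward": use the previous non-NaN value to fill.
"backward": use the next non-NaN value to fill.
"both": use the next non-NaN value to fill if the previous non-NaN value is unavailable, and vice versa.
The direction matters most when limit is specified, since it decides from which end of a run of consecutive NaN the limit is counted, but it also decides whether leading or trailing NaN are filled. By default, limit_direction="forward".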
6. limit_area
| None
or string
| optional
The restriction imposed on filling:
None: no restriction.
"inside": only perform interpolation (i.e. when both the lower and upper bounds of the interval are defined).
"outside": only perform extrapolation (i.e. when only one bound of the interval is defined).
By default, limit_area=None.
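To make the two restrictions concrete, here is a small sketch on made-up data with one leading NaN, one NaN enclosed by valid values, and one trailing NaN. Passing limit_direction="both" in the second call is a choice made for this example so that the leading NaN is also eligible for filling:

```python
import numpy as np
import pandas as pd

# Leading NaN, an enclosed NaN, and a trailing NaN (illustrative data).
df = pd.DataFrame({"A": [np.nan, 1.0, np.nan, 3.0, np.nan]})

# "inside": only the NaN surrounded by valid values is filled.
inside = df.interpolate(limit_area="inside")

# "outside": only the leading and trailing NaN are filled.
outside = df.interpolate(limit_area="outside", limit_direction="both")

print(inside)
print(outside)
```

With "inside", only the middle NaN is replaced (by 2.0). With "outside", the leading and trailing NaN take the nearest valid value, while the middle NaN stays.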
7. downcast
| "infer"
or None
| optional
Whether or not to downcast the resulting dtypes. By default, downcast=None
.
8. **kwargs
The keyword arguments to pass on to method
.
Return value
A DataFrame with the NaN
filled with interpolated values.
Examples
Basic usage
Consider the following DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A":[3,np.nan,5,6],"B":[1,5,np.nan,9],"C":[1,5,np.nan,np.nan]})
df
     A    B    C
0  3.0  1.0  1.0
1  NaN  5.0  5.0
2  5.0  NaN  NaN
3  6.0  9.0  NaN
To fill NaN
using linear interpolation:
df.interpolate() # method="linear"
     A    B    C
0  3.0  1.0  1.0
1  4.0  5.0  5.0
2  5.0  7.0  5.0
3  6.0  9.0  5.0
Notice how the two NaN in column C were filled with the last valid value (5.0) instead, since linear interpolation cannot be performed without an upper bound. This forward-fill of trailing NaN is the default behavior.
Interpolating row-wise
To interpolate row-wise, pass in axis=1
like so:
df.interpolate(axis=1)
     A    B    C
0  3.0  1.0  1.0
1  NaN  5.0  5.0
2  5.0  5.0  5.0
3  6.0  9.0  9.0
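One way to build intuition for axis=1 is to compare it with transposing the DataFrame, interpolating column-wise, and transposing back. The sketch below assumes an all-numeric DataFrame, as in this example; with mixed column types the transpose would change the dtypes and the comparison may not hold:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [3, np.nan, 5, 6],
                   "B": [1, 5, np.nan, 9],
                   "C": [1, 5, np.nan, np.nan]})

# Interpolate along each row directly.
row_wise = df.interpolate(axis=1)

# Same idea expressed as: transpose, interpolate each column, transpose back.
via_transpose = df.T.interpolate().T

print(np.allclose(row_wise.to_numpy(), via_transpose.to_numpy(), equal_nan=True))
```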
Interpolating using method=index
Consider the following DataFrame:
df = pd.DataFrame({"B":[5,np.nan,9]}, index=[5,10,30])
df
      B
5   5.0
10  NaN
30  9.0
Performing simple linear interpolation yields:
df.interpolate() # method="linear"
      B
5   5.0
10  7.0
30  9.0
Here, we get a 7
as the interpolated value because the difference between the lower and upper bound (4
) is split up into 2 equally-distanced intervals.
In contrast, interpolating using method="index"
instead gives:
df.interpolate(method="index")
      B
5   5.0
10  5.8
30  9.0
Here, the difference between the lower and upper bound (4) is divided not by the number of intervals, but by the difference of the index values (30-5=25). The missing value sits at index 10, which is 5 index units past the lower bound, so we end up with 5.8 because:
(4/25 * 5) + 5 = 5.8
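The arithmetic above can be checked with a short sketch that recomputes the value by hand and compares it with the result of method="index". The variable names are chosen for this illustration only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [5, np.nan, 9]}, index=[5, 10, 30])

lower_idx, missing_idx, upper_idx = df.index
lower_val, upper_val = df["B"].iloc[0], df["B"].iloc[2]

# Change in value per unit of index: 4 / 25.
slope = (upper_val - lower_val) / (upper_idx - lower_idx)

# Move 5 index units (10 - 5) away from the lower bound.
by_hand = lower_val + slope * (missing_idx - lower_idx)

from_pandas = df.interpolate(method="index").loc[10, "B"]
print(by_hand, from_pandas)
```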
Interpolation using method=time
Consider the following DataFrame with a DatetimeIndex
:
index_date = pd.to_datetime(["2020-12-01", "2020-12-02", "2020-12-15", "2020-12-31"])
df = pd.DataFrame({"A":[1,np.nan,np.nan,31]}, index=index_date)
df
               A
2020-12-01   1.0
2020-12-02   NaN
2020-12-15   NaN
2020-12-31  31.0
If we perform linear interpolation on df
:
df.interpolate()
               A
2020-12-01   1.0
2020-12-02  11.0
2020-12-15  21.0
2020-12-31  31.0
Here, the index is not taken into account - the lower bound is 1
and upper bound is 31
, and the difference is evenly spaced out in 3 intervals.
To take into account the DatetimeIndex, pass in method="time":
df.interpolate(method="time")
               A
2020-12-01   1.0
2020-12-02   2.0
2020-12-15  15.0
2020-12-31  31.0
Here, the bounds are still the same: the lower bound is 1 and the upper bound is 31. Instead of dividing the difference of 30 by the number of intervals, we divide it by the length of time, which in this case is 30 days. Each elapsed day therefore adds 1 to the value. This is why, for instance, December 15 (14 days after the lower bound) gets an interpolated value of 15.
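As a rough check of this reasoning, the sketch below recomputes the values from the number of elapsed days and compares them with the result of method="time". The helper names (start, total_days, elapsed_days) are introduced for this illustration only:

```python
import numpy as np
import pandas as pd

index_date = pd.to_datetime(["2020-12-01", "2020-12-02", "2020-12-15", "2020-12-31"])
df = pd.DataFrame({"A": [1, np.nan, np.nan, 31]}, index=index_date)

start, end = index_date[0], index_date[-1]
total_days = (end - start).days                          # 30 days in total
elapsed_days = [(ts - start).days for ts in index_date]  # 0, 1, 14, 30

# Lower bound plus the share of the total difference proportional to elapsed time.
by_hand = [1 + (31 - 1) * d / total_days for d in elapsed_days]

from_pandas = df.interpolate(method="time")["A"].tolist()
print(by_hand)
print(from_pandas)
```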
Specifying limit direction
Consider the following DataFrame:
df = pd.DataFrame({"A":[np.nan,np.nan,5], "B":[5,np.nan,9], "C":[5,np.nan,np.nan]})
df
     A    B    C
0  NaN  5.0  5.0
1  NaN  NaN  NaN
2  5.0  9.0  NaN
By default, limit_direction="forward"
, which means that we use the previous non-NaN
value to fill NaN
:
df.interpolate() # limit_direction="forward"
     A    B    C
0  NaN  5.0  5.0
1  NaN  7.0  5.0
2  5.0  9.0  5.0
To use the next non-NaN
value to fill NaN
, pass in limit_direction="backward"
:
df.interpolate(limit_direction="backward")
     A    B    C
0  5.0  5.0  5.0
1  5.0  7.0  NaN
2  5.0  9.0  NaN
Notice how for both forward
and backward
, we may still end up with NaN
values when there are no previous/next non-NaN
values. We can prevent this by setting limit_direction="both"
, which ensures that if the previous non-NaN
value is unavailable, then the next non-NaN value is used, and vice versa:
df.interpolate(limit_direction="both")
     A    B    C
0  5.0  5.0  5.0
1  5.0  7.0  5.0
2  5.0  9.0  5.0
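The examples above do not set limit. As a supplementary sketch on made-up data, the following shows how limit_direction decides which end of a run of consecutive NaN is filled when limit=1:

```python
import numpy as np
import pandas as pd

# A run of three NaN between two known values (illustrative data).
df = pd.DataFrame({"A": [1.0, np.nan, np.nan, np.nan, 5.0]})

# Fill one NaN counting forward from the previous non-NaN value.
forward = df.interpolate(limit=1)

# Fill one NaN counting backward from the next non-NaN value.
backward = df.interpolate(limit=1, limit_direction="backward")

# Fill one NaN from each end of the run.
both = df.interpolate(limit=1, limit_direction="both")

print(pd.concat({"forward": forward["A"],
                 "backward": backward["A"],
                 "both": both["A"]}, axis=1))
```

In all three cases the filled values come from the same straight line between 1 and 5. Only the positions that get filled differ.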
Downcasting the resulting DataFrame
By default, downcast=None, which means that no casting is performed even if a column could be cast to a more specific type.
For example, consider the following DataFrame:
df = pd.DataFrame({"A":[np.nan,5], "B":[5,np.nan]})
df
     A    B
0  NaN  5.0
1  5.0  NaN
Performing interpolation yields:
df.interpolate() # downcast=None
     A    B
0  NaN  5.0
1  5.0  5.0
Checking the column types of the resulting DataFrame:
df.interpolate().dtypes
A    float64
B    float64
dtype: object
In this scenario, it is possible to use a more specific type, namely int
, as the column type of B
. To perform this downcast, set downcast="infer"
:
df.interpolate(downcast="infer").dtypes
A    float64
B      int64
dtype: object
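Depending on the installed Pandas version, the downcast parameter may emit a deprecation warning or be unavailable, so it is worth checking the documentation for your version. As an alternative sketch (a swapped-in technique, not what downcast does internally), the same column type can be obtained with an explicit cast after interpolation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 5], "B": [5, np.nan]})

filled = df.interpolate()

# Column B has no NaN left and holds only whole numbers, so casting it to an
# integer type is safe. Column A still contains NaN and must stay float.
filled = filled.astype({"B": "int64"})
print(filled.dtypes)
```

Unlike downcast="infer", this requires naming the columns to cast, which makes the conversion explicit.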