Splitting a Pandas DataFrame into training and testing sets
To split a DataFrame into training and testing sets, use Scikit-learn's train_test_split(~) function.
Example
Basic usage
Suppose we wanted to split the following DataFrame into training and testing sets:
df
   A  B   C
0  3  6  10
1  4  7  11
2  5  8  12
3  6  9  13
We first need to divide df into two parts - one for the features (X), and one for the target (y):
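A minimal sketch of this step, assuming that columns A and B are the features and column C is the target, and assuming label-based selection with df.loc (df.iloc with positional slices would work the same way):

```python
import pandas as pd

# Rebuild the example DataFrame shown above
df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})

# Assumption: A and B are the features, C is the target
X = df.loc[:, ["A", "B"]]   # all rows, columns A and B (a DataFrame)
y = df.loc[:, "C"]          # all rows, column C (a Series)
```

Selecting a single column label gives a Series, which is consistent with y_test being displayed as a Series later on.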
Here, the : before the , indicates that we want to fetch all rows, and whatever comes after the , specifies the columns to fetch.
We then import the train_test_split(~) function and use it to split our X and y into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Here, note the following:
- the splitting process involves random shuffling. You can turn this off by setting shuffle=False.
- the split is 75% training and 25% testing by default.
- random_state=1 is needed for reproducibility; despite the random nature of the split, you end up with the same split every time as long as you use the same random_state.
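The points above can be checked with a short sketch. It assumes the same example data, with A and B as features and C as the target; the variable names ending in _ns are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})
X = df.loc[:, ["A", "B"]]
y = df.loc[:, "C"]

# Same random_state: two separate calls return identical splits
first = train_test_split(X, y, random_state=1)
second = train_test_split(X, y, random_state=1)
same_split = all(a.equals(b) for a, b in zip(first, second))

# shuffle=False: rows keep their original order, and the
# test set is taken from the end of the data
X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(X, y, shuffle=False)

print(same_split)
print(list(X_train_ns.index), list(X_test_ns.index))
```

With four rows and the default 75:25 split, the unshuffled version keeps the first three rows for training and the last row for testing.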
Just for your reference, here's X_train:
X_train   # DataFrame
   A  B
2  5  8
0  3  6
1  4  7
Here's y_test:
y_test   # Series
3    13
Name: C, dtype: int64
Changing training and test size
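By default, the split is 75% training and 25% testing. We can change this by specifying the parameters train_size and/or test_size. When given as floats, both must be between 0 and 1 and are interpreted as proportions of the data (integers are instead treated as absolute numbers of rows). As you would expect, you only need to specify one of these, and the other is set to its complement.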
To do a 50:50 split:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=1)
X_train
   A  B
0  3  6
1  4  7
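As a rough illustration of the size parameters, the sketch below compares train_size=0.5 with test_size=0.5 and also passes an integer size. It assumes the same example data; the suffixed variable names are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"A": [3, 4, 5, 6], "B": [6, 7, 8, 9], "C": [10, 11, 12, 13]})
X = df.loc[:, ["A", "B"]]
y = df.loc[:, "C"]

# Specifying either side of a 50:50 split yields the same partition
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, train_size=0.5, random_state=1
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.5, random_state=1
)
print(X_train_a.equals(X_train_b))

# An integer is read as an absolute number of rows rather than a proportion
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y, test_size=1, random_state=1
)
print(len(X_train_c), len(X_test_c))
```

Using a fraction keeps the ratio stable as the dataset grows, while an integer is handy when a fixed number of held-out rows is needed.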