Gentle Introduction to Feature Scaling
Why is scaling needed?
Many machine learning algorithms can benefit from feature scaling. Here are some of the cases when scaling can help:
- algorithms that make use of distance metrics are heavily skewed by the magnitude of the features. For instance, the k-nearest neighbours algorithm utilises Euclidean distance, and whether a feature value is in grams (5000 grams) or in kilograms (5 kg) will largely dictate the distance (see the sketch after this list).
- algorithms that compute variance. For instance, principal component analysis seeks the directions of highest variance, and hence features with a higher order of magnitude will erroneously dominate the principal components.
- algorithms that make use of gradient descent. In practice, gradient descent converges much faster when feature values are small and on a similar scale. This means that feature scaling is beneficial for algorithms such as linear regression that may use gradient descent for optimisation.
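To make the first point concrete, here is a minimal NumPy sketch (the weight and height values are made up for illustration) showing how the choice of unit alone can dictate the Euclidean distance:

```python
import numpy as np

# Two samples described by weight and height (made-up values)
a_g = np.array([5000.0, 1.80])  # weight in grams, height in metres
b_g = np.array([6000.0, 1.60])

a_kg = np.array([5.0, 1.80])    # the same samples, weight now in kilograms
b_kg = np.array([6.0, 1.60])

# With grams, the weight difference completely dominates the distance
print(np.linalg.norm(a_g - b_g))    # ~1000.0
# With kilograms, the height difference actually contributes
print(np.linalg.norm(a_kg - b_kg))  # ~1.02
```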
Scaling techniques
There are several ways to perform feature scaling. Some of the common ways are as follows:
- Standardisation
- Mean normalisation
- Min-max scaling
Standardisation
The formula for standardisation, which is also known as Z-score normalisation, is as follows:

$$x' = \frac{x - \bar{x}}{\sigma}$$
Where:
- $x'$ is the scaled value of the feature
- $x$ is the original value of the feature
- $\bar{x}$ is the mean of all values in the feature
- $\sigma$ is the standard deviation of all values in the feature. In machine learning, we often just stick with the biased estimate (dividing by $n$) - check out the example below for clarification.
The standardised features have a mean of $0$ and a standard deviation of $1$. Let us now prove this claim.
Mathematical proof that the mean is zero and standard deviation is one
Suppose we standardise each raw feature value $x_i$ to get $x'_i=(x_i-\bar{x})/\sigma$. The mean of the standardised feature values $\bar{x}'$ is:

$$\bar{x}' = \frac{1}{n}\sum^n_{i=1}x'_i = \frac{1}{n}\sum^n_{i=1}\frac{x_i-\bar{x}}{\sigma} = \frac{1}{\sigma}\left(\frac{1}{n}\sum^n_{i=1}x_i-\bar{x}\right) = \frac{1}{\sigma}\left(\bar{x}-\bar{x}\right) = 0$$
Next, using the fact that $\bar{x}'=0$, the variance of the standardised feature values $\sigma^2_{x'}$ is:

$$\sigma^2_{x'} = \frac{1}{n}\sum^n_{i=1}\left(x'_i-\bar{x}'\right)^2 = \frac{1}{n}\sum^n_{i=1}\left(\frac{x_i-\bar{x}}{\sigma}\right)^2 = \frac{1}{\sigma^2}\cdot\frac{1}{n}\sum^n_{i=1}\left(x_i-\bar{x}\right)^2 = \frac{\sigma^2}{\sigma^2} = 1$$
Since the variance $\sigma^2_{x'}$ is $1$, the standard deviation $\sigma_{x'}$ is of course also $1$.
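As a quick empirical check of this proof, we can standardise some randomly generated values and inspect the result (a sketch using NumPy):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=1000)  # an arbitrary raw feature

# Standardise using the biased standard deviation (divide by n)
x_scaled = (x - x.mean()) / x.std()

print(x_scaled.mean())  # ~0 (up to floating point error)
print(x_scaled.std())   # 1.0
```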
Simple example of standardising
Suppose we had the following dataset with one feature:
| $i$ | $x_1$ |
| --- | --- |
| 1 | 5 |
| 2 | 3 |
| 3 | 7 |
Let's standardise feature $x_1$. To do so, we need to compute the mean and standard deviation of $x_1$ - let's start with the mean:

$$\bar{x}_1 = \frac{5+3+7}{3} = 5$$
Next, let's compute the standard deviation of $x_1$:

$$\sigma_{x_1} = \sqrt{\frac{(5-5)^2+(3-5)^2+(7-5)^2}{3}} = \sqrt{\frac{8}{3}} \approx 1.63$$
Great, we now have everything we need to perform standardisation on $x_1$!
In statistics, we often compute the unbiased estimate of the standard deviation, that is, we divide by $n-1$ instead of $n$:

$$s_{x_1} = \sqrt{\frac{(5-5)^2+(3-5)^2+(7-5)^2}{3-1}} = \sqrt{\frac{8}{2}} = 2$$
When performing standardisation, we almost never use this version because we are only interested in making the mean of the feature $0$ and the standard deviation $1$ - the biased estimate does exactly that, and it is also what scikit-learn's StandardScaler uses.
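As a quick check, NumPy exposes both estimates through the `ddof` argument of `np.std` (the default `ddof=0` gives the biased estimate):

```python
import numpy as np

x1 = np.array([5, 3, 7])

print(np.std(x1))          # 1.6329... - biased estimate (divide by n)
print(np.std(x1, ddof=1))  # 2.0       - unbiased estimate (divide by n-1)
```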
For notational convenience, let's express the scaled feature as $d$ instead of $x'$. For each value in $x_1$, we need to perform:

$$d^{(i)}_1 = \frac{x^{(i)}_1 - \bar{x}_1}{\sigma_{x_1}}$$
Where:
- $d^{(i)}_1$ is the scaled value of the $i$-th value in feature $x_1$
- $x^{(i)}_1$ is the original $i$-th value in feature $x_1$
For instance, the first scaled feature value is:

$$d^{(1)}_1 = \frac{5-5}{1.63} = 0$$
And the second is:

$$d^{(2)}_1 = \frac{3-5}{1.63} \approx -1.23$$
And so on.
The scaled values of $x_1$ are summarised below:
| $i$ | $d_1$ |
| --- | --- |
| 1 | 0 |
| 2 | -1.23 |
| 3 | 1.23 |
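We can verify these hand computations with a short NumPy snippet:

```python
import numpy as np

x1 = np.array([5, 3, 7])

# Standardise using the biased standard deviation (divide by n)
d1 = (x1 - x1.mean()) / x1.std()
print(d1)  # [ 0.         -1.22474487  1.22474487]
```

The exact values are $\pm 1.2247$; the table shows $\pm 1.23$ because we rounded $\sigma$ to $1.63$ in the hand computation.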
If there were other features, then you would need to perform these exact same steps for every single one of those features.
The pattern of the data points is preserved
The overall layout of our data points should look the same even after performing standardisation. To demonstrate, here is a side-by-side comparison of a before and after:
(Figure: side-by-side scatter plots of the data points - left: before standardisation, right: after standardisation.)
Can you see how the overall pattern of our data points is preserved? The key difference though is that the standardised data points are centred around the origin with an overall spread of one.
Mean normalisation
The formula for mean normalisation is as follows:

$$x' = \frac{x - \bar{x}}{x_{max} - x_{min}}$$
Where:
- $x'$ is the scaled value of the feature
- $x$ is the original value of the feature
- $\bar{x}$ is the mean of all values in the feature
- $x_{min}$ is the smallest value of the feature
- $x_{max}$ is the largest value of the feature
The denominator, $x_{max}-x_{min}$, is essentially the range of the feature. By applying this transformation, we can ensure that the following properties hold:

- all values in the scaled feature $x'$ lie between $-1$ and $1$
- the mean of the scaled feature $x'$ is $0$
In practice, mean normalisation is not often used. Instead, either standardisation or min-max scaling is used.
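Accordingly, scikit-learn does not (to my knowledge) ship a dedicated mean-normalisation transformer, but the formula above is easy to apply directly with NumPy - a minimal sketch with made-up sample values:

```python
import numpy as np

X = np.array([[9.0, 4.0],
              [8.0, 3.0],
              [5.0, 7.0],
              [7.0, 2.0]])

# Subtract the column means and divide by the column ranges
normalised_X = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(normalised_X.mean(axis=0))                           # ~0 for each column
print(normalised_X.min(axis=0), normalised_X.max(axis=0))  # all within [-1, 1]
```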
Min-max scaling
The formula for min-max scaling is very similar to that for mean normalisation:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$
After the transformation, we can guarantee that all the values in the scaled feature $x'$ lie between $0$ and $1$.
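A quick check using the toy feature from earlier:

```python
import numpy as np

x1 = np.array([5.0, 3.0, 7.0])

# Min-max scaling: the minimum maps to 0 and the maximum to 1
x1_scaled = (x1 - x1.min()) / (x1.max() - x1.min())
print(x1_scaled)  # [0.5 0.  1. ]
```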
Misconceptions
Scaling the dependent variable
There is no need to perform scaling for dependent variables since the purpose of feature scaling is to ensure that all features are treated equally by our model. This is the reason why "feature scaling" specifically contains the word "feature"!
Scaling training and testing data separately
We should not scale training and testing data using separate scaling parameters. For instance, suppose we want to scale our dataset, which has been partitioned into training and testing sets, using mean normalisation. The scaling parameters for mean normalisation of a particular feature are its:
- mean $\bar{x}$
- minimum $x_{min}$
- maximum $x_{max}$
The correct way of performing mean normalisation is to compute these parameters using only the training data and then, instead of re-computing them separately for the testing data, reuse the parameters obtained from the training data. This means we need to store the parameters for later use.
The reason for this is that feature scaling should be interpreted as part of the model itself. In the same way the model parameter values obtained after training should be used to process the testing data, the same parameter values (e.g. $x_{min}$) obtained for feature scaling should be used for the testing data.
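In scikit-learn terms, this means calling `fit_transform` on the training data only and plain `transform` on the testing data - a sketch with a hypothetical train/test split:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test split
X_train = np.array([[9.0, 4.0],
                    [8.0, 3.0],
                    [5.0, 7.0]])
X_test = np.array([[7.0, 2.0]])

scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)  # learns x_min and x_max from the training data
scaled_X_test = scaler.transform(X_test)        # reuses those stored parameters

print(scaled_X_test)  # [[ 0.5  -0.25]] - test values may fall outside [0, 1]
```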
Best scaling technique
There is no single best scaling technique. That said, either standardisation or min-max scaling is often used in practice instead of mean normalisation. One could also compare the performance between the two to decide which to ultimately choose.
Normalisation and standardisation
The terms normalisation and standardisation are often confused. In machine learning, normalisation typically refers to min-max scaling (scaled features lie between $0$ and $1$), while standardisation refers to the case when the scaled features have a mean of $0$ and a variance of $1$.
Performing feature scaling in Python
Standardisation
To perform standardisation, use the `StandardScaler` class from the `sklearn` library:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 4 samples/observations and 2 features
X = np.array([[9, 4],
              [8, 3],
              [5, 7],
              [7, 2]])

# Fit and transform the data
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
scaled_X
```

```
array([[ 1.18321596,  0.        ],
       [ 0.50709255, -0.53452248],
       [-1.52127766,  1.60356745],
       [-0.16903085, -1.06904497]])
```
Note that a new array is returned and the original `X` is unaffected.
We can confirm that the mean of the features of `scaled_X` is `0`:

```python
scaled_X.mean(axis=0)
```

```
array([-1.38777878e-17,  0.00000000e+00])
```
Note that the mean for the first column is not exactly 0 due to the nature of floating point arithmetic.
To confirm that the variance of the features of `scaled_X` is `1`:

```python
scaled_X.var(axis=0)
```

```
array([1., 1.])
```
Min-max scaling
To perform min-max scaling, use the `MinMaxScaler` class from the `sklearn` library:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# 4 samples/observations and 2 features
X = np.array([[9, 4],
              [8, 3],
              [5, 7],
              [7, 2]])

# Fit and transform the data
scaler = MinMaxScaler()
scaled_X = scaler.fit_transform(X)
scaled_X
```

```
array([[1.  , 0.4 ],
       [0.75, 0.2 ],
       [0.  , 1.  ],
       [0.5 , 0.  ]])
```
Note the following:

- a new array is returned and the original `X` is kept intact
- the column values of `scaled_X` now range from $0$ to $1$
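As with standardisation, we can confirm the second point directly:

```python
print(scaled_X.min(axis=0))  # [0. 0.]
print(scaled_X.max(axis=0))  # [1. 1.]
```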