Comprehensive Guide on LSTM
A comparison between an RNN layer and an LSTM layer: on top of the hidden state $\mathbf{h}_t$, an LSTM layer also carries a cell state $\mathbf{c}_t$ from one time step to the next.
The relationship between $\mathbf{c}_t$ and $\mathbf{h}_t$ is as follows:
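Concretely (leaving the gates aside for the moment), the hidden state is simply the cell state passed through $\mathrm{tanh}$:

$$\mathbf{h}_t=\mathrm{tanh}(\mathbf{c}_t)$$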
Here, we are applying $\mathrm{tanh}$ to each element in vector $\mathbf{c}_t$. This means that the size of $\mathbf{c}_t$ and $\mathbf{h}_t$ is the same, that is, if $\mathbf{c}_t$ had 100 elements, then $\mathbf{h}_t$ would also have 100 elements.
Anatomy of LSTM layer
Output gate
The output gate governs how much of the cell's information is exposed as the hidden state $\mathbf{h}_t$, which is passed on to the next time step and to the next layer.
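In the standard LSTM formulation, the gate's output, say $\mathbf{o}$, is an affine transformation of the current input $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$, squashed by a sigmoid:

$$\mathbf{o}=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{o}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{o}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{o}})$$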
Where,
$\sigma$ is the sigmoid function
$\mathbf{W}^{\mathbf{o}}_{\mathbf{x}}$ and $\mathbf{W}^{\mathbf{o}}_{\mathbf{h}}$ are the weight matrices dedicated to the output gate, applied to the input $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$ respectively, and $\mathbf{b}^{\mathbf{o}}$ is the corresponding bias
Since all the values in the vector pass through the sigmoid function $\sigma$, the output vector $\mathbf{o}$ holds values between 0 and 1, where 0 implies that no information should be passed and 1 implies that all information should be passed.
The hidden state vector $\mathbf{h}_t$ will be computed as follows:
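Since the gate should scale how much of each element of $\mathrm{tanh}(\mathbf{c}_t)$ gets through, $\mathbf{o}$ is applied element-wise:

$$\mathbf{h}_t=\mathbf{o}\odot\mathrm{tanh}(\mathbf{c}_t)$$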
Here, the operation $\odot$ represents element-wise multiplication, which is often known as the Hadamard product.
Forget gate
The forget gate is used to "forget" or discard unneeded information from $\mathbf{c}_{t-1}$. We can obtain the output of the forget gate, say $\mathbf{f}$, like so:
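Following the same pattern as the output gate, with its own set of weights $\mathbf{W}^{\mathbf{f}}_{\mathbf{x}}$, $\mathbf{W}^{\mathbf{f}}_{\mathbf{h}}$ and bias $\mathbf{b}^{\mathbf{f}}$:

$$\mathbf{f}=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{f}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{f}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{f}})$$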
Just like for the other gates, the value of $\mathbf{f}$ changes at every time step. This is important to keep in mind when we explore how back-propagation works in LSTMs.
Memory cell
If we only have the forget gate, then the network is only capable of forgetting information. In order for the network to remember new information, we introduce a new memory cell.
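Writing this new memory as $\mathbf{g}$ (the symbol itself is just a naming convention), it is computed from the current input and the previous hidden state:

$$\mathbf{g}=\mathrm{tanh}(\mathbf{x}_t\mathbf{W}^{\mathbf{g}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{g}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{g}})$$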
This is not a gate because we are not using a sigmoid function here; instead, we use the $\mathrm{tanh}$ curve to encode the new information.
Input gate
The input gate governs how much of the to-be-added information is actually valuable. Without the input gate, the network would add any new information indiscriminately; the input gate allows us to take in only the new information that will benefit our cause.
Again, since this is a gate, we use the sigmoid function.
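Writing the input gate's output as $\mathbf{i}$, and following the same affine pattern as the other gates:

$$\mathbf{i}=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{i}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{i}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{i}})$$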
Summary
The gates and the new memory cell of our LSTM layer are as follows:
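Collecting the equations above (using $\mathbf{f}$, $\mathbf{g}$, $\mathbf{i}$ and $\mathbf{o}$ for the forget gate, new memory cell, input gate and output gate respectively):

$$\begin{aligned}
\mathbf{f}&=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{f}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{f}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{f}})\\
\mathbf{g}&=\mathrm{tanh}(\mathbf{x}_t\mathbf{W}^{\mathbf{g}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{g}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{g}})\\
\mathbf{i}&=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{i}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{i}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{i}})\\
\mathbf{o}&=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{o}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{o}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{o}})
\end{aligned}$$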
The new cell vector and hidden state vector are as follows:
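The forget gate discards part of the old cell state, the input gate scales the new memory, and the output gate controls what is exposed as the hidden state:

$$\mathbf{c}_t=\mathbf{f}\odot\mathbf{c}_{t-1}+\mathbf{g}\odot\mathbf{i}$$

$$\mathbf{h}_t=\mathbf{o}\odot\mathrm{tanh}(\mathbf{c}_t)$$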
Speeding up computation
To speed up computation, we can actually bundle the four affine transformations above into a single one. We do this by combining the individual weight matrices into one giant weight matrix:
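With the combined matrices, the four per-gate affine transformations become a single computation:

$$\mathbf{x}_t\mathbf{W}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}_{\mathbf{h}}+\mathbf{b}$$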
Where:
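$\mathbf{W}_{\mathbf{x}}=\begin{bmatrix}\mathbf{W}^{\mathbf{f}}_{\mathbf{x}} & \mathbf{W}^{\mathbf{g}}_{\mathbf{x}} & \mathbf{W}^{\mathbf{i}}_{\mathbf{x}} & \mathbf{W}^{\mathbf{o}}_{\mathbf{x}}\end{bmatrix}$, $\mathbf{W}_{\mathbf{h}}=\begin{bmatrix}\mathbf{W}^{\mathbf{f}}_{\mathbf{h}} & \mathbf{W}^{\mathbf{g}}_{\mathbf{h}} & \mathbf{W}^{\mathbf{i}}_{\mathbf{h}} & \mathbf{W}^{\mathbf{o}}_{\mathbf{h}}\end{bmatrix}$ and $\mathbf{b}=\begin{bmatrix}\mathbf{b}^{\mathbf{f}} & \mathbf{b}^{\mathbf{g}} & \mathbf{b}^{\mathbf{i}} & \mathbf{b}^{\mathbf{o}}\end{bmatrix}$ are the per-gate weights and biases concatenated side by side. The order of the four blocks is a convention; any fixed order works as long as the combined result is sliced back into the four gates consistently, with each slice passing through its own activation ($\sigma$ or $\mathrm{tanh}$).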
The reason why this is faster than performing four separate matrix operations is that CPUs and GPUs are better at computing one large matrix multiplication in one go than many small ones.
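To make this concrete, here is a minimal NumPy sketch of a single LSTM time step using the fused weights. The names (`lstm_step`, `Wx`, `Wh`, `D` for the input size, `H` for the hidden size) are illustrative choices, not part of any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step with the four gates computed in a single affine transform.

    Shapes (N = batch size, D = input size, H = hidden size):
      x_t: (N, D), h_prev: (N, H), c_prev: (N, H)
      Wx: (D, 4H), Wh: (H, 4H), b: (4H,)  with per-gate blocks ordered [f | g | i | o]
    """
    H = h_prev.shape[1]
    # One big affine transformation instead of four separate ones.
    a = x_t @ Wx + h_prev @ Wh + b        # shape (N, 4H)

    # Slice the result back into the four pre-activations and apply the activations.
    f = sigmoid(a[:, 0 * H:1 * H])        # forget gate
    g = np.tanh(a[:, 1 * H:2 * H])        # new memory cell
    i = sigmoid(a[:, 2 * H:3 * H])        # input gate
    o = sigmoid(a[:, 3 * H:4 * H])        # output gate

    c_t = f * c_prev + g * i              # element-wise (Hadamard) products
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Deep learning frameworks typically store LSTM weights in a fused form like this as well, though the ordering of the four blocks varies between implementations.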
Deep LSTM
One prominent way of increasing the performance of an LSTM network is to stack multiple LSTM layers:
Here, each LSTM layer has its own memory cell $\mathbf{c}$, which is kept within that layer; only the hidden state $\mathbf{h}$ is passed up to the layer above.
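As a rough sketch of what this looks like in practice, PyTorch's `torch.nn.LSTM` stacks layers via its `num_layers` argument (the sizes below are arbitrary):

```python
import torch
import torch.nn as nn

# A 2-layer (deep) LSTM: layer 2 consumes the hidden states produced by layer 1.
lstm = nn.LSTM(input_size=32, hidden_size=100, num_layers=2, batch_first=True)

x = torch.randn(8, 20, 32)            # (batch, time steps, input size)
output, (h_n, c_n) = lstm(x)

print(output.shape)   # torch.Size([8, 20, 100]) - hidden states of the top layer only
print(c_n.shape)      # torch.Size([2, 8, 100])  - one memory cell state per layer
```

Note that `c_n` has one slice per layer: the memory cells stay inside their own layers, while only the top layer's hidden states appear in `output`.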