Comprehensive Guide on LSTM
A comparison between an RNN layer and an LSTM layer: on top of the hidden state $\mathbf{h}_t$, an LSTM layer also carries a cell state $\mathbf{c}_t$ from one time step to the next.
The relationship between $\mathbf{c}_t$ and $\mathbf{h}_t$ is as follows:
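Concretely (leaving the gates aside for the moment), the hidden state is simply the cell state passed through $\mathrm{tanh}$:

$$\mathbf{h}_t=\mathrm{tanh}(\mathbf{c}_t)$$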
Here, we are applying $\mathrm{tanh}$ to each element in vector $\mathbf{c}_t$. This means that the size of $\mathbf{c}_t$ and $\mathbf{h}_t$ is the same, that is, if $\mathbf{c}_t$ had 100 elements, then $\mathbf{h}_t$ would also have 100 elements.
Anatomy of LSTM layer
Output gate
The output gate governs how much of the cell's information is exposed as the hidden state $\mathbf{h}_t$, which is passed on to the next time step and to the next layer.
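In the standard LSTM formulation, the gate's output, say $\mathbf{o}$, is an affine transformation of the current input $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$, squashed by a sigmoid:

$$\mathbf{o}=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{o}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{o}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{o}})$$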
Where,
$\sigma$ is the sigmoid function
$\mathbf{W}^{\mathbf{o}}_{\mathbf{x}}$ and $\mathbf{W}^{\mathbf{o}}_{\mathbf{h}}$ are the weight matrices dedicated to the output gate, applied to the input $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$ respectively, and $\mathbf{b}^{\mathbf{o}}$ is the corresponding bias
Since all the values in the vector pass through the sigmoid function $\sigma$, the output vector $\mathbf{o}$ holds values between 0 and 1, where 0 implies that no information should be passed and 1 implies that all information should be passed.
The hidden state vector $\mathbf{h}_t$ will be computed as follows:
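Since the gate should scale how much of each element of $\mathrm{tanh}(\mathbf{c}_t)$ gets through, $\mathbf{o}$ is applied element-wise:

$$\mathbf{h}_t=\mathbf{o}\odot\mathrm{tanh}(\mathbf{c}_t)$$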
Here, the operation $\odot$ represents element-wise multiplication, which is often known as the Hadamard product.
Forget gate
The forget gate is used to "forget" or discard unneeded information from $\mathbf{c}_{t-1}$. We can obtain the output of the forget gate, say $\mathbf{f}$, like so:
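Following the same pattern as the output gate, with its own set of weights $\mathbf{W}^{\mathbf{f}}_{\mathbf{x}}$, $\mathbf{W}^{\mathbf{f}}_{\mathbf{h}}$ and bias $\mathbf{b}^{\mathbf{f}}$:

$$\mathbf{f}=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{f}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{f}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{f}})$$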
Just like for the other gates, the value of $\mathbf{f}$ changes at every time step. This is important to keep in mind when we explore how back-propagation works in LSTMs.
Memory cell
If we only have the forget gate, then the network is only capable of forgetting information. In order for the network to remember new information, we introduce a new memory cell.
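Writing this new memory as $\mathbf{g}$ (the symbol itself is just a naming convention), it is computed from the current input and the previous hidden state:

$$\mathbf{g}=\mathrm{tanh}(\mathbf{x}_t\mathbf{W}^{\mathbf{g}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{g}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{g}})$$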
This is not a gate because we are not using a sigmoid function here; instead, we use the $\mathrm{tanh}$ curve to encode the new information.
Input gate
The input gate governs how much of the to-be-added information is actually valuable. Without the input gate, the network would add any new information indiscriminately; the input gate allows us to take in only the new information that will benefit our cause.
Again, since this is a gate, we use the sigmoid function.
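Writing the input gate's output as $\mathbf{i}$, and following the same affine pattern as the other gates:

$$\mathbf{i}=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{i}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{i}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{i}})$$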
Summary
The gates and the new memory cell of our LSTM layer are as follows:
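Collecting the equations above (using $\mathbf{f}$, $\mathbf{g}$, $\mathbf{i}$ and $\mathbf{o}$ for the forget gate, new memory cell, input gate and output gate respectively):

$$\begin{aligned}
\mathbf{f}&=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{f}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{f}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{f}})\\
\mathbf{g}&=\mathrm{tanh}(\mathbf{x}_t\mathbf{W}^{\mathbf{g}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{g}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{g}})\\
\mathbf{i}&=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{i}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{i}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{i}})\\
\mathbf{o}&=\sigma(\mathbf{x}_t\mathbf{W}^{\mathbf{o}}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}^{\mathbf{o}}_{\mathbf{h}}+\mathbf{b}^{\mathbf{o}})
\end{aligned}$$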
The new cell vector and hidden state vector are as follows:
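The forget gate discards part of the old cell state, the input gate scales the new memory, and the output gate controls what is exposed as the hidden state:

$$\mathbf{c}_t=\mathbf{f}\odot\mathbf{c}_{t-1}+\mathbf{g}\odot\mathbf{i}$$

$$\mathbf{h}_t=\mathbf{o}\odot\mathrm{tanh}(\mathbf{c}_t)$$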
Speeding up computation
To speed up computation, we can actually bundle the four affine transformations above into a single one. We do this by combining the individual weight matrices into one giant weight matrix:
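With the combined matrices, the four per-gate affine transformations become a single computation:

$$\mathbf{x}_t\mathbf{W}_{\mathbf{x}}+\mathbf{h}_{t-1}\mathbf{W}_{\mathbf{h}}+\mathbf{b}$$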
Where:
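$\mathbf{W}_{\mathbf{x}}=\begin{bmatrix}\mathbf{W}^{\mathbf{f}}_{\mathbf{x}} & \mathbf{W}^{\mathbf{g}}_{\mathbf{x}} & \mathbf{W}^{\mathbf{i}}_{\mathbf{x}} & \mathbf{W}^{\mathbf{o}}_{\mathbf{x}}\end{bmatrix}$, $\mathbf{W}_{\mathbf{h}}=\begin{bmatrix}\mathbf{W}^{\mathbf{f}}_{\mathbf{h}} & \mathbf{W}^{\mathbf{g}}_{\mathbf{h}} & \mathbf{W}^{\mathbf{i}}_{\mathbf{h}} & \mathbf{W}^{\mathbf{o}}_{\mathbf{h}}\end{bmatrix}$ and $\mathbf{b}=\begin{bmatrix}\mathbf{b}^{\mathbf{f}} & \mathbf{b}^{\mathbf{g}} & \mathbf{b}^{\mathbf{i}} & \mathbf{b}^{\mathbf{o}}\end{bmatrix}$ are the per-gate weights and biases concatenated side by side. The order of the four blocks is a convention; any fixed order works as long as the combined result is sliced back into the four gates consistently, with each slice passing through its own activation ($\sigma$ or $\mathrm{tanh}$).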
The reason why this is faster than performing four separate matrix operations is that CPUs and GPUs are better at computing one large matrix multiplication in one go than many small ones.
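To make this concrete, here is a minimal NumPy sketch of a single LSTM time step using the fused weights. The names (`lstm_step`, `Wx`, `Wh`, `D` for the input size, `H` for the hidden size) are illustrative choices, not part of any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step with the four gates computed in a single affine transform.

    Shapes (N = batch size, D = input size, H = hidden size):
      x_t: (N, D), h_prev: (N, H), c_prev: (N, H)
      Wx: (D, 4H), Wh: (H, 4H), b: (4H,)  with per-gate blocks ordered [f | g | i | o]
    """
    H = h_prev.shape[1]
    # One big affine transformation instead of four separate ones.
    a = x_t @ Wx + h_prev @ Wh + b        # shape (N, 4H)

    # Slice the result back into the four pre-activations and apply the activations.
    f = sigmoid(a[:, 0 * H:1 * H])        # forget gate
    g = np.tanh(a[:, 1 * H:2 * H])        # new memory cell
    i = sigmoid(a[:, 2 * H:3 * H])        # input gate
    o = sigmoid(a[:, 3 * H:4 * H])        # output gate

    c_t = f * c_prev + g * i              # element-wise (Hadamard) products
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Deep learning frameworks typically store LSTM weights in a fused form like this as well, though the ordering of the four blocks varies between implementations.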
Deep LSTM
One prominent way of increasing the performance of an LSTM network is to stack multiple LSTM layers:
Here, each LSTM layer has its own memory cell $\mathbf{c}$, which is kept within that layer; only the hidden state $\mathbf{h}$ is passed up to the layer above.
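As a rough sketch of what this looks like in practice, PyTorch's `torch.nn.LSTM` stacks layers via its `num_layers` argument (the sizes below are arbitrary):

```python
import torch
import torch.nn as nn

# A 2-layer (deep) LSTM: layer 2 consumes the hidden states produced by layer 1.
lstm = nn.LSTM(input_size=32, hidden_size=100, num_layers=2, batch_first=True)

x = torch.randn(8, 20, 32)            # (batch, time steps, input size)
output, (h_n, c_n) = lstm(x)

print(output.shape)   # torch.Size([8, 20, 100]) - hidden states of the top layer only
print(c_n.shape)      # torch.Size([2, 8, 100])  - one memory cell state per layer
```

Note that `c_n` has one slice per layer: the memory cells stay inside their own layers, while only the top layer's hidden states appear in `output`.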