Jack Dermody

Sequence

GRU Recurrent Neural Networks

So a sequence that begins with BT must end with TE (with any number of Reber Grammar transitions in between).

However, a simple recurrent neural network's memory is so short-term that it cannot ensure that a sequence that starts with BT ends with TE.

Both GRU and LSTM networks can capture both long- and short-term dependencies in sequences, but GRU networks involve fewer parameters and so are faster to train.

As with an SRNN, the memory buffer is updated with the layer's output at each step in the sequence, and this saved output then flows into the next item in the sequence.
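As an illustration of what such a layer computes, here is a minimal NumPy sketch of a single GRU step and its unrolling over a sequence. The weight names (Wz, Uz and so on) and sizes are illustrative assumptions, not Bright Wire's actual implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    # One GRU step: h_prev is the "memory buffer" that flows in from
    # the previous item in the sequence.
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)               # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)    # candidate state
    return (1 - z) * h_prev + z * h_cand                 # new memory/output

input_size, hidden_size = 7, 16                          # e.g. 7 Reber symbols
rng = np.random.default_rng(0)
params = [rng.normal(0, 0.1, shape) for shape in
          [(hidden_size, input_size), (hidden_size, hidden_size), hidden_size] * 3]
h = np.zeros(hidden_size)                                # initial memory buffer
for x in np.eye(input_size):                             # dummy one-hot sequence
    h = gru_step(x, h, params)                           # saved output feeds the next step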

For this example, we generate 500 extended (embedded) sequences, each 10 characters long, and split them into training and test sets.

Bright Wire includes a helper class to generate Reber Grammar and Embedded Reber Grammar sequences.

So the first item in each sequence is “B”, followed by either “T” or “P”, and so on.

The training data contains the set of possible next state transitions (one-hot encoded) at each point in each sequence.
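For reference, the sketch below shows one way such training pairs could be produced, using the standard Reber Grammar transition graph (the embedded variant wraps a second, matching “T” or “P” around such a string). This is a generic NumPy illustration, not Bright Wire's helper class or its API.

import numpy as np

ALPHABET = "BTSXPVE"
INDEX = {c: i for i, c in enumerate(ALPHABET)}

# Standard Reber Grammar: each state maps a legal symbol to the next state.
GRAMMAR = {
    0: {"T": 1, "P": 2},
    1: {"S": 1, "X": 3},
    2: {"T": 2, "V": 4},
    3: {"X": 2, "S": 5},
    4: {"P": 3, "V": 5},
    5: {"E": None},
}

def one_hot(symbol):
    v = np.zeros(len(ALPHABET))
    v[INDEX[symbol]] = 1.0
    return v

def generate(rng):
    # Returns a sequence and, for each position, the set of symbols that
    # could legally follow it (which becomes the training target).
    sequence, targets, state = ["B"], [set(GRAMMAR[0])], 0
    while state is not None:
        symbol = str(rng.choice(sorted(GRAMMAR[state])))
        state = GRAMMAR[state][symbol]
        sequence.append(symbol)
        targets.append(set(GRAMMAR[state]) if state is not None else set())
    return sequence, targets

rng = np.random.default_rng(42)
sequence, targets = generate(rng)
X = np.stack([one_hot(s) for s in sequence])                     # input at each step
Y = np.stack([sum((one_hot(s) for s in t), np.zeros(len(ALPHABET)))
              for t in targets])                                 # possible next transitions
print("".join(sequence))                                         # a legal Reber string, e.g. BTSXSE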

Convolutional Neural Networks

After the max pooling layer there is another sequence of convolutional, ReLU and max pooling layers.
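As a rough sketch of that repeated pattern (convolution, then ReLU, then max pooling, twice over), here is a minimal single-channel NumPy version with illustrative kernel and input sizes; it is not Bright Wire's convolutional layer implementation.

import numpy as np

def conv2d_valid(image, kernel):
    # Naive "valid" 2D cross-correlation of one channel.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(28, 28))                    # e.g. an MNIST-sized input
k1, k2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
x = max_pool_2x2(relu(conv2d_valid(image, k1)))      # first conv / ReLU / pool block
x = max_pool_2x2(relu(conv2d_valid(x, k2)))          # second block, as described above
print(x.shape)                                       # (5, 5)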

Teaching a Recurrent Neural Net Binary Addition

No single item in the sequence contains enough information by itself to make a successful prediction.

Sometimes we want the computer to be able to make predictions based on a sequence of inputs.

A better solution is to use recurrent neural networks, which have a simple form of memory that allows them to learn from arbitrarily long sequences and to adjust their predictions based on what they have seen earlier in the sequence.

For example, the training sequence for the addition above is (reading left to right):

The sequences that we're going to train on are made up of the bits of the two input numbers at each position, together with the corresponding bit of their sum as the target output.

Bright Wire includes a helper class to generate random sequences of binary integers in the above data format.
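The sketch below is a generic NumPy stand-in for that helper, showing the data layout just described: at each step the input holds one bit from each number and the target holds the matching bit of the sum. Feeding the bits least-significant-bit first is an assumption made here so the carry propagates forward through the sequence.

import numpy as np

def binary_addition_sample(rng, bits=8):
    # At each step the input is the pair of bits from the two numbers
    # and the target is the corresponding bit of their sum.
    a = int(rng.integers(0, 2 ** (bits - 1)))
    b = int(rng.integers(0, 2 ** (bits - 1)))
    total = a + b
    x = np.zeros((bits, 2))            # input sequence
    y = np.zeros((bits, 1))            # target sequence
    for i in range(bits):              # least significant bit first (assumed)
        x[i, 0] = (a >> i) & 1
        x[i, 1] = (b >> i) & 1
        y[i, 0] = (total >> i) & 1
    return x, y

rng = np.random.default_rng(0)
dataset = [binary_addition_sample(rng) for _ in range(1000)]
train, test = dataset[:800], dataset[800:]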

As the next item in the sequence arrives, it is fed into the feed forward layer as above, but instead of loading the initial memory buffer, the second feed forward layer now receives the output that was saved from the previous step.

During backpropagation this all happens in reverse; however, the supplied memory is only updated when backpropagating (through time) through the first item in the sequence.
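A minimal NumPy sketch of those forward-pass mechanics follows: the first step uses the initial memory buffer, and every later step receives the output saved from the previous step. Weight names and sizes are illustrative assumptions, and the backpropagation-through-time pass is omitted.

import numpy as np

def srnn_forward(xs, Wx, Wh, b, memory):
    # Unrolled forward pass of a simple recurrent layer: `memory` is the
    # initial buffer; afterwards each step reuses the previous output.
    outputs, h = [], memory
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        outputs.append(h)              # saved and fed into the next step
    return np.stack(outputs)

rng = np.random.default_rng(0)
hidden = 16
Wx = rng.normal(0, 0.1, (hidden, 2))                  # two input bits per step
Wh = rng.normal(0, 0.1, (hidden, hidden))
b = np.zeros(hidden)
xs = rng.integers(0, 2, size=(8, 2)).astype(float)    # dummy 8-step sequence
hs = srnn_forward(xs, Wx, Wh, b, memory=np.zeros(hidden))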

Sequence to Sequence with LSTM

For machine learning tasks that involve classifying sequences of data, there might not be a one-to-one mapping between input and output classifications.

One way to let the output length differ from the input length is with something called a sequence to sequence (STS) recurrent neural network.

The encoder is tasked with learning to generate a single embedding that effectively summarises the input, and the decoder is tasked with learning to generate a sequence of outputs from that single embedding.

So a sequence to sequence architecture is actually a combination of two other architectures, one to many and many to one, each of which is useful on its own in other learning tasks.

As with a GRU, the memory buffer is updated with the layer's output at each step in the sequence, and this saved output then flows into the next item in the sequence.
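As an illustration, here is a minimal NumPy sketch of a single LSTM step with its two memory buffers (the hidden output and the cell state). The stacked weight layout is just one common convention and an assumption here, not Bright Wire's implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # One LSTM step. W, U, b hold the stacked weights for the input,
    # forget and output gates plus the candidate cell update.
    hidden = h_prev.shape[0]
    gates = W @ x + U @ h_prev + b
    i = sigmoid(gates[0 * hidden:1 * hidden])          # input gate
    f = sigmoid(gates[1 * hidden:2 * hidden])          # forget gate
    o = sigmoid(gates[2 * hidden:3 * hidden])          # output gate
    g = np.tanh(gates[3 * hidden:4 * hidden])          # candidate cell state
    c = f * c_prev + i * g                             # long-term memory
    h = o * np.tanh(c)                                 # output / short-term memory
    return h, c

rng = np.random.default_rng(0)
input_size, hidden = 10, 32
W = rng.normal(0, 0.1, (4 * hidden, input_size))
U = rng.normal(0, 0.1, (4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)              # the memory buffers
for x in np.eye(input_size):                           # dummy one-hot sequence
    h, c = lstm_step(x, h, c, W, U, b)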

The first part of an STS architecture is the encoder, which learns how to encode the most relevant parts of the input sequence into a single embedding.

The single output label “positive” might apply to an entire sentence (which is composed of a sequence of words).

The matrix contains the sequence of input vectors, and the output is the single target vector.

So, for example, if our sequence is {“A”, “B”, “D”}, the first vector will contain “A”, the second “B” and the third “D”, and the output vector will be “ABD”.

To keep things simple in this tutorial, our input matrix will contain a sequence of five one-hot encoded characters and the output vector will contain a union of those characters.

We generate 1,000 sequences and split them into training and test sets.

Also, no character will repeat in any generated sequence.

Our dictionary size (the number of possible characters) is 10, and each generated sequence is of length 5.
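A generic NumPy sketch of that data layout (not Bright Wire's generator) might look like this: each sample is a 5 x 10 matrix of one-hot rows with no repeated characters, and the target is the 10-dimensional union of those rows.

import numpy as np

DICTIONARY = list("ABCDEFGHIJ")      # 10 possible characters
SEQ_LEN = 5

def make_sample(rng):
    # Input: a 5 x 10 matrix of one-hot rows (no repeated characters).
    # Output: a single 10-dim vector that is the union of those characters.
    chars = rng.choice(len(DICTIONARY), size=SEQ_LEN, replace=False)
    x = np.zeros((SEQ_LEN, len(DICTIONARY)))
    x[np.arange(SEQ_LEN), chars] = 1.0
    y = x.max(axis=0)                # union of the one-hot rows
    return x, y

rng = np.random.default_rng(0)
samples = [make_sample(rng) for _ in range(1000)]
train, test = samples[:800], samples[800:]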

An example of this architecture might be learning to generate a sequence of commands after observing the current state of a system.

One to many neural networks generate a sequence of outputs from a single input vector.

Along the same lines as the many to one example above, the following code creates a vector that summarises a sequence, together with the sequence itself encoded as one-hot vectors.
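The original Bright Wire listing is not reproduced here; as a stand-in, the sketch below builds the same kind of data in NumPy: the single input is the vector summarising the sequence (the union of its characters), and the target is the sequence itself as one-hot rows.

import numpy as np

DICTIONARY = list("ABCDEFGHIJ")      # 10 possible characters
SEQ_LEN = 5

def make_one_to_many_sample(rng):
    # Single input: a vector summarising the sequence (union of its characters).
    # Target: the sequence itself as one-hot rows.
    chars = rng.choice(len(DICTIONARY), size=SEQ_LEN, replace=False)
    seq = np.zeros((SEQ_LEN, len(DICTIONARY)))
    seq[np.arange(SEQ_LEN), chars] = 1.0
    summary = seq.max(axis=0)        # the single input vector
    return summary, seq

rng = np.random.default_rng(1)
x, y = make_one_to_many_sample(rng)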

In this scenario, the encoder is learning to encode an input sequence into an embedding and the decoder is learning to decode that embedding into the same input sequence.

In a recurrent autoencoder the input and output sequence lengths are necessarily the same; what we are exploiting is the encoder's ability to find the relevant discriminative features of the input as it creates the single embedding from the input sequence.

This is how we might build a single embedding from a sequence of words (the document) for the purposes of document comparison.

Once the network has converged, we can throw the decoder away and use the encoder to create sequence embeddings.
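A minimal sketch of that last step, assuming a plain recurrent encoder whose final hidden state is used as the sequence embedding; in practice the weights would first be trained jointly with the decoder as described above. The cosine comparison at the end mirrors the document-comparison use case.

import numpy as np

def encode(xs, Wx, Wh, b):
    # Run a simple recurrent encoder and return its final hidden state,
    # which serves as the embedding of the whole sequence.
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vocab, hidden = 10, 32
Wx = rng.normal(0, 0.1, (hidden, vocab))
Wh = rng.normal(0, 0.1, (hidden, hidden))
b = np.zeros(hidden)
doc_a = np.eye(vocab)[[0, 1, 2, 3]]     # two toy "documents" as one-hot rows
doc_b = np.eye(vocab)[[0, 1, 2, 4]]
print(cosine(encode(doc_a, Wx, Wh, b), encode(doc_b, Wx, Wh, b)))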

But purely to demonstrate that the input and output sequences can in fact differ, in this tutorial we will discard the last item of each sequence when training the decoder.

So the output sequence will always be shorter than the input sequence.

Obviously for sequence to sequence, the input and output matrices do not need to contain the same number of rows.

The matrices each contain rows that form the input and output sequences.

As STS networks have been shown to perform better with reversed sequences, the decoder's target output is the input sequence in reverse order.

In Bright Wire, the decoder and encoder are defined in two separate graphs that are stitched together to create the sequence to sequence architecture.
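Bright Wire's graph-stitching API is not shown here; the sketch below is a generic NumPy illustration of the resulting structure: a many to one encoder folds the input into a single embedding, and a one to many decoder unrolls from that embedding for the (shorter) target length. All weight shapes and the constant "go" input are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 10, 32

def rnn_step(x, h, Wx, Wh, b):
    return np.tanh(Wx @ x + Wh @ h + b)

# Separate parameter sets for the two halves of the network.
enc = [rng.normal(0, 0.1, (hidden, vocab)), rng.normal(0, 0.1, (hidden, hidden)), np.zeros(hidden)]
dec = [rng.normal(0, 0.1, (hidden, vocab)), rng.normal(0, 0.1, (hidden, hidden)), np.zeros(hidden)]
Wout = rng.normal(0, 0.1, (vocab, hidden))

input_seq = np.eye(vocab)[[3, 1, 4, 1, 5]]         # length-5 input sequence
target_len = len(input_seq) - 1                    # decoder output is one item shorter

# Encoder: many to one — fold the whole input into a single embedding.
h = np.zeros(hidden)
for x in input_seq:
    h = rnn_step(x, h, *enc)
embedding = h

# Decoder: one to many — start from the embedding (the stitch point) and
# unroll for as many steps as the target sequence needs.
go = np.zeros(vocab)                               # constant "go" input at each step
h, outputs = embedding, []
for _ in range(target_len):
    h = rnn_step(go, h, *dec)
    outputs.append(Wout @ h)                       # per-step character scores
outputs = np.stack(outputs)                        # shape (target_len, vocab)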